Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

Zhongguo Li
Department of Computer Science and Technology, Tsinghua University


Abstract

Many Chinese characters are highly productive: they combine with other material, as prefixes or as suffixes, to form large numbers of structured words. Previous research on Chinese word segmentation has focused mainly on identifying word boundaries, without considering the rich internal structure that many words have. In this paper we argue that this is unsatisfactory in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered as part of morphological analysis. We present an elegant approach for doing this, and the results are promising enough to encourage further work in this direction. Our probability model is trained on the Penn Chinese Treebank and can parse both word and phrase structures in a unified way.
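
To make the contrast between flat segmentation and word-internal structure concrete, the following is a minimal sketch of one possible representation. It is not the parser or probability model described in the paper. The names Word, surface, bracketed and flat_segmentation are hypothetical, and the example word 朋友们 ("friends", 朋友 plus the plural suffix 们) is chosen here only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Word:
    """Hypothetical structured word: parts are plain strings or nested Words."""
    parts: List[Union[str, "Word"]]

    def surface(self) -> str:
        # Concatenate all parts, discarding the internal structure.
        return "".join(p if isinstance(p, str) else p.surface() for p in self.parts)

    def bracketed(self) -> str:
        # Render the internal structure with one bracket pair per Word node.
        inner = " ".join(p if isinstance(p, str) else p.bracketed() for p in self.parts)
        return f"[{inner}]"


def flat_segmentation(sentence: List[Word]) -> List[str]:
    # A conventional segmenter's view: word boundaries only, no structure.
    return [w.surface() for w in sentence]


# Illustrative assumption: the stem 朋友 ("friend") followed by the suffix 们.
friends = Word([Word(["朋友"]), "们"])

print(friends.bracketed())
print(flat_segmentation([friends, Word(["来"]), Word(["了"])]))
```

In this sketch the flat segmentation can always be derived from the structured form, but not the other way around, which is the sense in which boundary-only output loses information.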

Full paper: http://www.aclweb.org/anthology/P/P11/P11-1141.pdf