A Fast and Accurate Method for Approximate String Search

Ziqi Wang1,  Gu Xu2,  Hang Li2,  Ming Zhang1
1Peking University, 2Microsoft Research Asia


Abstract

This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a string, the system finds words in a dictionary, which are most ``similar'' to the string. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for top $k$ candidates retrieval. The log linear model is defined as conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal $k$ candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods on accuracy in different settings.




Full paper: http://www.aclweb.org/anthology/P/P11/P11-1006.pdf