Unsupervised Discovery of Domain-Specific Knowledge from Text

Dirk Hovy¹, Chunliang Zhang¹, Eduard Hovy¹, Anselmo Peñas²
¹ Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA 90292
² UNED NLP and IR Group, Juan del Rosal 16, 28040 Madrid, Spain


Abstract

Learning by Reading (LbR) aims to enable machines to acquire knowledge from and reason about textual input. This requires knowledge about the domain structure (such as entities, classes, and actions) in order to perform inference. We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4 million sentences, we learn roughly 250,000 simple propositions about American football in the form of predicate-argument structures like "quarterbacks throw passes to receivers". Using several statistical measures, we show that our model generalizes and explains the data statistically significantly better than various baseline approaches. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of LbR end-to-end systems.
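To make the abstract's idea concrete, here is a minimal sketch of what "classes with a probability distribution over entities" could look like for context-sensitive labeling. All names, probabilities, and the helper function are hypothetical illustrations, not the paper's actual model or data.

```python
# Hypothetical sketch: each class maps entities to P(entity | class),
# as might be learned from a corpus. Probabilities are illustrative only.
classes = {
    "quarterback": {"Brady": 0.40, "Manning": 0.35, "Rivers": 0.25},
    "receiver":    {"Moss": 0.50, "Owens": 0.30, "Rivers": 0.20},
}

def most_likely_class(entity, candidate_classes):
    """Label an entity with the candidate class that assigns it the
    highest probability; the candidates come from the argument slot
    of a proposition like 'quarterbacks throw passes to receivers'."""
    return max(candidate_classes,
               key=lambda c: classes[c].get(entity, 0.0))

# "Rivers" appears under both classes; the proposition's argument slot
# supplies the candidate set, and the distribution resolves the label.
print(most_likely_class("Rivers", ["quarterback", "receiver"]))  # quarterback
```

The point of the distributions is exactly this kind of disambiguation: the same surface entity can be labeled differently depending on which argument slot of the proposition it fills.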

Full paper: http://www.aclweb.org/anthology/P/P11/P11-1147.pdf