Discovering Sociolinguistic Associations with Structured Sparsity

Jacob Eisenstein, Noah A. Smith, Eric P. Xing
CMU


Abstract

We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite $\ell_{1,\infty}$ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into \emph{features}, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
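
To make the structured-sparsity claim concrete, here is a brief sketch of the regularizer. The notation is an assumption introduced for illustration and does not appear in the abstract: $\mathbf{X}$ is the matrix of predictors, $\mathbf{Y}$ the matrix of outputs, and $\mathbf{B}$ the coefficient matrix with one row per predictor and one column per output, with entries $b_{ij}$. The $\ell_{1,\infty}$ norm sums, over rows, the largest absolute coefficient in each row:

$$
\|\mathbf{B}\|_{1,\infty} = \sum_{i} \max_{j} |b_{ij}|
$$

Assuming a squared-error loss and a regularization weight $\lambda > 0$ (both assumptions here, not stated in the abstract), the multi-output regression could be written as:

$$
\min_{\mathbf{B}} \; \|\mathbf{Y} - \mathbf{X}\mathbf{B}\|_F^2 + \lambda \, \|\mathbf{B}\|_{1,\infty}
$$

The penalty acts like an $\ell_1$ penalty on the per-row maxima. When a row's maximum is driven to zero, every coefficient in that row is zero, so the corresponding predictor is dropped for all outputs at once. This is the row-level sparsity described above.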




Full paper: http://www.aclweb.org/anthology/P/P11/P11-1137.pdf