K-means Clustering with Feature Hashing

Hajime Senuma
University of Tokyo


Abstract

One of the major problems of $K$-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing \cite{Weinberger2009}, a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to $K$-means, by showing how much will the objective of $K$-means be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.




Full paper: http://www.aclweb.org/anthology/P/P11/.pdf