Abstract
This article proposes a short-text clustering method based on text vectorization. The method takes word vectors as its basic features and uses an LSTM-based auto-encoder to compress and encode the word vectors that represent a text, so that the variable-length sequence of word vectors is reduced to a fixed-length text feature vector. Clustering these feature vectors yields the clustering of the short texts. The method was tested on a labeled dataset: Gini impurity measures how closely the resulting clusters agree with human clustering, and the average distance to the cluster center measures the structural similarity among the sentences within a cluster. The results show that the method emphasizes matching the overall structure of a text, and that the sentences within each resulting cluster have relatively high structural similarity.
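The two evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the code below assumes the standard definition of Gini impurity (one minus the sum of squared label proportions), a size-weighted average across clusters, and Euclidean distance to the centroid. The function names are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import dist


def gini_impurity(labels):
    """Gini impurity of the human labels inside one cluster: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(clusters):
    """Size-weighted mean impurity over all clusters.

    Weighting by cluster size is an assumption; the abstract does not say
    how per-cluster values are aggregated.
    """
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * gini_impurity(c) for c in clusters)


def mean_distance_to_center(vectors):
    """Average Euclidean distance from each vector to the cluster centroid."""
    dim = len(vectors[0])
    center = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return sum(dist(v, center) for v in vectors) / len(vectors)
```

Under these assumptions, a lower weighted impurity means the clusters agree more closely with the human labels, and a lower mean distance to the center means the text feature vectors within a cluster are packed more tightly.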
Source
Computing Technology and Automation (《计算技术与自动化》)
2017, No. 3, pp. 75-80 (6 pages)
Keywords
natural language processing
short text
clustering
long short-term memory network
auto-encoder