A Short Text Clustering Method Based on LSTM-autoencoder


Abstract: This article proposes a short text clustering method based on text vectorization. The method takes word vectors as its basic features and uses an LSTM-based autoencoder to compress and encode the word vectors that represent a text, so that the variable-length sequence of word vectors for each text is reduced to a fixed-length text feature vector. Clustering these feature vectors yields the clustering of the short texts. The method was tested on a labeled dataset: Gini impurity measures how closely the resulting clusters agree with human clustering, and the average distance to the cluster center measures the structural similarity among the sentences within a cluster. The results show that the method focuses on matching the overall structure of a text, and that the sentences within each resulting cluster have relatively high structural similarity.

Authors: Huang Jianchong (黄健翀), Deng Meiling (邓玫玲)

Affiliations: 广东东华发思特软件有限公司; Zhuhai People's Hospital (珠海市人民医院)

Source: Computing Technology and Automation (《计算技术与自动化》), 2017, No. 3, pp. 75-80 (6 pages)

Keywords: natural language processing; short text clustering; long short-term memory network; auto-encoder

Classification: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]
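
The abstract names two evaluation measures: Gini impurity against human labels, and the average distance to the cluster center. The following is a minimal sketch of how such measures could be computed, not the authors' implementation. The function names, the size-weighted averaging of per-cluster impurity, and the use of Euclidean distance are all assumptions made for illustration; the record does not specify them.

```python
from collections import Counter, defaultdict
from math import sqrt


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of the human labels inside one cluster."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def group_by_cluster(cluster_ids, items):
    """Collect items into lists keyed by their assigned cluster id."""
    groups = defaultdict(list)
    for cid, item in zip(cluster_ids, items):
        groups[cid].append(item)
    return groups


def weighted_gini(cluster_ids, labels):
    """Size-weighted mean Gini impurity over all clusters.

    Assumption: clusters are weighted by their share of the samples.
    Lower values mean the clusters agree more closely with the labels.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    groups = group_by_cluster(cluster_ids, labels)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())


def mean_center_distance(cluster_ids, vectors):
    """Mean distance from each vector to the centroid of its own cluster.

    Assumption: Euclidean distance on the fixed-length text vectors.
    """
    if not vectors:
        return 0.0
    total = 0.0
    for members in group_by_cluster(cluster_ids, vectors).values():
        dim = len(members[0])
        center = [sum(v[i] for v in members) / len(members) for i in range(dim)]
        for v in members:
            total += sqrt(sum((a - b) ** 2 for a, b in zip(v, center)))
    return total / len(vectors)
```

In this sketch, `cluster_ids` would come from any clustering algorithm applied to the encoder's fixed-length outputs, and `labels` from the annotated dataset. A low weighted impurity indicates agreement with human clustering, while a low mean center distance indicates tight clusters in the encoded space.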

