Abstract
The rapid outbreak of coronavirus disease 2019 (COVID-19) has attracted widespread public attention and poses a major challenge for online public opinion analysis. To address this problem, this paper uses web crawler technology to collect comments on COVID-19 published by official media and arranges the collected comments in chronological order. First, TF-IDF is used to extract the key feature words of the text. Next, the OLDA (online latent Dirichlet allocation) model is used to analyze the evolution of topic words over time, and a word vector model of the comment set is constructed. Finally, K-means is used to cluster the topics, and the clustering results are analyzed through part-of-speech tagging. Experiments show that the method captures how the information in the comments changes over time and can detect information that warrants attention.
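The first step of the pipeline, TF-IDF keyword extraction, can be illustrated with a minimal sketch. It assumes a plain, unsmoothed TF-IDF weighting (term frequency times log inverse document frequency), which may differ from the variant used in the paper. The function name `tfidf_keywords` and the toy time slices are hypothetical and are not data from the study. The later OLDA and K-means steps are not shown.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        # Unsmoothed TF-IDF (an assumption): relative frequency * log(N / df).
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results

# Hypothetical, pre-tokenized comments grouped into chronological slices.
slices = [
    "masks masks shortage hospital".split(),
    "vaccine trial hospital vaccine".split(),
    "school reopening masks school".split(),
]
keywords = tfidf_keywords(slices, top_k=2)
print(keywords)
```

Terms that occur in several slices, such as "hospital" here, receive a lower weight than terms concentrated in one slice, which is why per-slice keywords can hint at how the discussion shifts over time.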
Authors
黄勃
陈欢
方志军
王明胜
刘文竹
HUANG Bo; CHEN Huan; FANG Zhijun; WANG Mingsheng; LIU Wenzhu (School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China; Industrial and Commercial Bank of China Hefei Branch, Hefei 230031, Anhui, China)
Source
《武汉大学学报(理学版)》
CAS
CSCD
PKU Core (北大核心)
2020, Issue 5, pp. 425-432 (8 pages)
Journal of Wuhan University: Natural Science Edition
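Funding
National Natural Science Foundation of China, Young Scientists Fund (61603242, 61802251)
Open Fund of the Jiangxi Collaborative Innovation Center for Economic Crime Investigation and Prevention and Control Technology (JXJZXTCX-030).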