摘要
随着移动互联技术的不断发展,社交媒体成为了公众分享观点和抒发情感的主要平台,在重大社会事件下对社交媒体文本进行情感分析能够有效监控舆情。针对现有中文社交媒体情感分析算法的准确性能和运行效率较低的问题,提出了一种基于Spark分布式系统的集成情感大数据分析方法(Spark Feature Weighted Stacking,S-FWS)。该方法首先基于Jieba库预分词和PMI关联度完成新词发现;然后考虑词语重要度混合提取文本特征,并使用Lasso进行特征选择;最后改进传统Stacking框架忽略特征重要度的缺点,使用初级学习器的准确率信息对类概率特征进行加权处理并构造多项式特征,进而训练次级学习器。分别在单机模式和Spark平台下引入多种算法进行对比实验,实验结果证明所提S-FWS方法的准确性能和耗时性能具备一定优势,并且分布式系统能够大幅提高算法的运行效率,同时随着集群工作节点的增加,算法耗时逐渐降低。
With the development of mobile Internet technology,social media has become the main approach for the public to share views and express their emotions.Sentiment analysis for social media texts in major social events can effectively monitor public opinion.In order to solve the problem of low accuracy and efficiency of existing Chinese social media sentiment analysis algorithms,an ensemble sentiment analysis big data method(S-FWS)based on Spark distributed system is proposed.Firstly,the new words are found by calculating the PMI association degree after pre-segmentation by Jieba library.Then,the text features are extracted by considering the importance of words and feature selection is realized by Lasso.Finally,in order to improve the traditional Stacking framework neglecting the feature importance,the accuracy information of the primary learners is used to weight the probabilistic features,and the polynomial features are constructed to train the secondary learner.A variety of algorithms are introduced in the stand-alone mode and the Spark platform receptively to carry out comparative experiments.Results show that the S-FWS method proposed in this paper has certain advantages in accuracy and time consumption;distributed system can greatly improve the operating efficiency of the algorithms,and with the increase of working nodes,the time consumption of the algorithms gradually decreases.
作者
戴宏亮
钟国金
游志铭
戴宏明
DAI Hong-liang;ZHONG Guo-jin;YOU Zhi-ming;DAI Hong-ming(School of Economics and Statistics,Guangzhou University,Guangzhou 510006,China;School of Software,South China University of Technology,Guangzhou 510006,China)
出处
《计算机科学》
CSCD
北大核心
2021年第9期118-124,共7页
Computer Science
基金
国家社会科学基金项目(18BTJ029)