摘要
从海量数据下的社会化网络中识别出各个领域下产出高质量内容的具有一定影响力的专家,进行具有针对性的广告推荐与决策支持,已经成为微博数据挖掘亟待解决的问题之一。从微博的用户特征和行为特征出发,确定了采集博文的规则与互动量计算公式,并应用PageRank算法对微博用户影响力计算时存在的数据陈旧性和主题不相关性的问题进行了改进,最后分别基于MapReduce和Spark的并行计算框架对算法进行了实现。实验结果表明,该挖掘方法具有较好的准确性,在Spark并行计算框架下表现出较高的性能,尤其适合大规模数据集的场景。
It has been one of the urgent problems of micro-blog mining to identify experts with ability to produce high-quality content and high influence under various fields in social network with massive data, and make targeted advertising recommendation and decision support. In this paper, on the basis of user features and behavior features, the rules of selecting article in miero-blog and interaction calculation formula are determined, and the obsolescence of data and irrelevance of theme have been improved by PageRank algorithm. Finally, the algorithm is implemented respectively in the parallel computing framework of MapReduce and Spark. Experimental results show that the proposed method has high accuracy and great performance under Spark , especially under large-scale dataset scene.
作者
原野
李晨
田丽华
Yuan Ye Li chen Tian Lihua(Software Engineering School, Xi' an Jiaotong University, Xi' an 710049, Shaanxi, China Sina Corporation, Beijing 100000, China)
出处
《计算机应用与软件》
2017年第3期31-37,共7页
Computer Applications and Software
基金
国家自然科学基金项目(61403302)