
A Similar Duplicate Record Detection Algorithm for Big Data Based on MapReduce

Cited by: 41
Abstract: Big data is multi-source, high-dimensional, and large in volume, and traditional algorithms can no longer detect similar duplicate records in it effectively. This paper proposes MP-SYYT, a parallel algorithm for detecting similar duplicate records in big data in a cloud environment. First, the Institute of Computing Technology Chinese Lexical Analysis System (ICTCLAS) word segmentation technology, the Delphi method, and the term frequency-inverse document frequency (TF-IDF) algorithm are used to improve the traditional SimHash algorithm, addressing its slow and imprecise keyword extraction and its inaccurate weight calculation. Second, an inverted index is used to optimize the traditional SimHash algorithm and improve the efficiency of matching similar duplicate records. Finally, Map and Reduce functions based on the improved SimHash algorithm are defined on a cloud platform, so that the MapReduce model detects similar duplicate records in parallel and outputs them directly, and multi-source measured data are analyzed experimentally on a Hadoop platform. The results show that MP-SYYT is efficient and accurate, scales well, achieves a good speedup ratio, and is suitable for detecting similar duplicate records in big data.
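The pipeline summarized in the abstract (weighted SimHash fingerprints, an index that narrows the candidate pairs, and Map/Reduce functions) can be illustrated with a minimal in-process sketch. This is not the authors' MP-SYYT implementation: whitespace tokenization stands in for ICTCLAS segmentation, a plain smoothed TF-IDF replaces the Delphi-adjusted weights, and the constants `BITS`, `BANDS`, and `MAX_DISTANCE` as well as every function name are assumptions chosen for illustration.

```python
import hashlib
import math
from collections import Counter, defaultdict
from itertools import combinations

BITS = 64          # fingerprint width (assumption)
BANDS = 4          # fingerprint split into 4 bands of 16 bits (assumption)
MAX_DISTANCE = 3   # Hamming threshold for "similar duplicate" (assumption)


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF so terms present in every document keep a positive weight.
        weights.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return weights


def simhash(weights):
    """Weighted SimHash: each term votes +w or -w on every bit position."""
    votes = [0.0] * BITS
    for term, w in weights.items():
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            votes[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def map_record(record_id, fingerprint):
    """Map step: emit ((band index, band value), (id, fingerprint)) pairs."""
    width = BITS // BANDS
    mask = (1 << width) - 1
    for band in range(BANDS):
        key = (band, (fingerprint >> (band * width)) & mask)
        yield key, (record_id, fingerprint)


def reduce_bucket(values):
    """Reduce step: compare only records that share a band value."""
    for (id_a, fp_a), (id_b, fp_b) in combinations(values, 2):
        if hamming(fp_a, fp_b) <= MAX_DISTANCE:
            yield tuple(sorted((id_a, id_b)))


def detect_duplicates(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    ids = list(records)
    docs = [records[i].lower().split() for i in ids]
    fingerprints = [simhash(w) for w in tfidf_weights(docs)]
    buckets = defaultdict(list)  # plays the role of the inverted index
    for rid, fp in zip(ids, fingerprints):
        for key, value in map_record(rid, fp):
            buckets[key].append(value)
    pairs = set()
    for values in buckets.values():
        pairs.update(reduce_bucket(values))
    return pairs


if __name__ == "__main__":
    sample = {
        "r1": "parallel detection of similar duplicate records",
        "r2": "parallel detection of similar duplicate records",
        "r3": "an unrelated note about journal indexing",
    }
    print(detect_duplicates(sample))
```

Splitting each 64-bit fingerprint into 4 bands relies on the pigeonhole principle: two fingerprints that differ in at most 3 bits must agree on at least one whole band, so grouping by band value never misses a pair within the threshold while avoiding an all-pairs comparison. On a real Hadoop cluster the grouping done here by `buckets` would be performed by the framework's shuffle phase.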
Source: Journal of Shanghai Jiaotong University (《上海交通大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2018, No. 2, pp. 214-221 (8 pages).
Funding: Supported by the National Natural Science Foundation of China (61271115).
Keywords: cloud environment; big data; similar duplicate records; parallel detection; redundancy identification