
Data deduplication in Web information integration (cited by: 4)
Abstract: Existing data deduplication methods suffer from low time efficiency and low detection accuracy. To address this, and drawing on the characteristics of Web information integration, a Stepwise Clustering Data Elimination (SCDE) method is proposed. The full record set is first divided into small subsets by key-attribute partitioning followed by Canopy clustering, and approximately duplicate records are then detected precisely within each subset. For this detection step, a fuzzy entity matching strategy based on dynamic weights is proposed: weights are assigned dynamically to reduce the influence of missing attributes on the record similarity calculation, and company names receive special handling to improve matching accuracy. Experimental results show that the method outperforms traditional algorithms in both time efficiency and detection accuracy, with precision improved by 12.6%. The method has been applied in a forestry yellow pages system, where it performs well.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2013, No. 9, pp. 2493-2496 (4 pages)
Funding: Supported by the Fundamental Research Funds for the Central Universities (BLYX200928)
Keywords: Web information integration; approximately duplicate record; dynamic weight; fuzzy entity matching
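The abstract names two ideas that can be illustrated in a few lines: Canopy clustering to split the records into small candidate sets, and a record similarity whose attribute weights are adjusted when a field is missing. The following is a minimal sketch of those two ideas only, not the SCDE implementation from the paper. The function names, thresholds, base weights, token-based Jaccard measure, and the use of `difflib.SequenceMatcher` as the field similarity are all assumptions made for illustration. Key-attribute partitioning and the special handling of company names are not shown.

```python
from difflib import SequenceMatcher
from itertools import combinations


def token_jaccard(a: str, b: str) -> float:
    """Cheap similarity for canopy formation: Jaccard over lowercase tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def canopies(records, key="name", loose=0.3, tight=0.7):
    """Group record indices into (possibly overlapping) canopies.

    Simplified variant: members are drawn from the records still in the
    candidate list. `loose` and `tight` are illustrative thresholds.
    """
    remaining = list(range(len(records)))
    result = []
    while remaining:
        center = remaining[0]
        sims = {i: token_jaccard(records[center].get(key, ""),
                                 records[i].get(key, "")) for i in remaining}
        # Everything at least loosely similar to the center joins the canopy.
        result.append([i for i in remaining if i == center or sims[i] >= loose])
        # Tightly similar records (and the center) stop being candidates.
        remaining = [i for i in remaining if i != center and sims[i] < tight]
    return result


def record_similarity(r1, r2, base_weights):
    """Weighted field similarity with weights renormalized over the
    attributes present in both records (the assumed 'dynamic weight')."""
    present = [k for k in base_weights if r1.get(k) and r2.get(k)]
    total = sum(base_weights[k] for k in present)
    if total == 0:
        return 0.0
    return sum(
        base_weights[k] / total * SequenceMatcher(None, r1[k], r2[k]).ratio()
        for k in present
    )


def find_duplicates(records, base_weights, threshold=0.85):
    """Compare records pairwise, but only inside each canopy."""
    pairs = set()
    for canopy in canopies(records):
        for i, j in combinations(canopy, 2):
            if record_similarity(records[i], records[j], base_weights) >= threshold:
                pairs.add((i, j))
    return sorted(pairs)


# Hypothetical yellow-page style records; the second one lacks a phone number.
records = [
    {"name": "Green Forest Timber Co", "phone": "010-1234", "address": "1 Pine Rd"},
    {"name": "Green Forest Timber Co", "phone": "", "address": "1 Pine Rd"},
    {"name": "Blue River Paper", "phone": "020-9999", "address": "7 Lake St"},
]
weights = {"name": 0.5, "phone": 0.3, "address": 0.2}  # assumed base weights
duplicates = find_duplicates(records, weights)
```

In this sketch the first two records are flagged as duplicates even though one has no phone number, because the phone weight is redistributed over the name and address. With fixed weights the same pair would score only 0.7 and fall below the assumed 0.85 threshold.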
