
Web text mining method based on subtopic selection and three-level stratified structure (cited by: 1)
Abstract: To address the vague, overly broad queries and the data sparseness caused by the intent gap between users and their queries, this paper returns a ranked list of possible subtopics based on popularity and diversity, using a hierarchical structure for subtopic selection and ranking to mine Web text. First, starting from noun phrases and alternative partial queries, simple patterns are used to extract a variety of related phrases as candidate subtopics. Then, relevant documents from a Web document collection are used to build a three-level stratified structure of the candidate subtopics. Finally, the candidates are ranked using this structure together with their estimated popularity, so that both popularity and diversity are taken into account. Experiments used 100 Japanese queries from NTCIR-9, 100 English queries from TREC 2009, and the Web Track diversity task. The results indicate that the method can be applied effectively to a variety of searches and that, for highly ranked subtopics, it outperforms mining based on external resources.
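The final ranking step can be illustrated with a minimal sketch. The abstract does not give the paper's ranking formula, so this is not the authors' method. The function name `rank_subtopics`, the `group_of` mapping (standing in for one level of the hierarchy), the popularity scores, and the multiplicative `penalty` are all assumptions made for illustration.

```python
from collections import Counter


def rank_subtopics(candidates, popularity, group_of, penalty=0.5):
    """Greedy ranking that balances popularity and diversity (illustrative only).

    At each step, pick the candidate whose estimated popularity is highest
    after being discounted by `penalty` once for every subtopic already
    selected from the same group of the hierarchy.
    """
    remaining = list(candidates)
    picked_per_group = Counter()
    ranking = []
    while remaining:
        best = max(
            remaining,
            key=lambda c: popularity[c] * penalty ** picked_per_group[group_of[c]],
        )
        ranking.append(best)
        picked_per_group[group_of[best]] += 1
        remaining.remove(best)
    return ranking


# Hypothetical candidate subtopics for the ambiguous query "jaguar".
popularity = {
    "jaguar car price": 0.9,
    "jaguar car dealers": 0.8,
    "jaguar animal habitat": 0.5,
    "jaguar os x": 0.3,
}
group_of = {
    "jaguar car price": "car",
    "jaguar car dealers": "car",
    "jaguar animal habitat": "animal",
    "jaguar os x": "software",
}
ranking = rank_subtopics(list(popularity), popularity, group_of)
```

With these made-up scores, the second most popular candidate ("jaguar car dealers") drops to third place, because a subtopic from the "car" group has already been selected and the less popular "animal" subtopic adds more diversity.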
Source: Telecommunications Science (《电信科学》), Peking University Core Journal, 2016, No. 5, pp. 96-104 (9 pages)
Funding: Key Science and Technology Research Project of the Science and Technology Department of Henan Province (No. 142102210226)
Keywords: data sparseness; text mining; stratified structure; diversity; popularity

