
Search Result Clustering Algorithm Based on Maximal Frequent Itemsets (基于最大频繁项集的搜索引擎查询结果聚类算法)

Cited by: 5
Abstract: Most existing search result clustering algorithms cluster the snippets generated for a user query. Because snippets are short and of uneven quality, clustering quality is hard to improve substantially (e.g., the suffix tree clustering (STC) and Lingo algorithms). Traditional full-text clustering algorithms, on the other hand, have high computational complexity and struggle to produce high-quality cluster labels, so they cannot meet the needs of online clustering (e.g., the k-means algorithm). This paper proposes MFIC (Maximal Frequent Itemset Clustering), an online web page clustering algorithm based on maximal frequent itemsets mined from full text. The algorithm first mines maximal frequent itemsets from the full text, then clusters web pages according to the maximal frequent itemsets shared among sets of web pages, and finally generates cluster labels from the frequent items each cluster contains. Experimental results show that MFIC reduces the time needed for full-text web page clustering, improves clustering precision by about 15%, and generates readable cluster labels.
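The three steps named in the abstract (mine maximal frequent itemsets, group pages by the itemsets they share, label each group with its frequent items) can be illustrated with a minimal sketch. This is not the paper's MFIC implementation: the abstract does not specify the mining algorithm, the support threshold, or the grouping rule, so the Apriori-style level-wise miner, the `min_support` parameter, the rule of merging itemsets that cover exactly the same pages, and all function names below are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(docs, min_support):
    """Level-wise (Apriori-style) mining over a list of term sets.

    Assumption: support is the number of documents containing every term
    of the itemset; the paper may define or mine it differently.
    """
    terms = sorted({t for d in docs for t in d})
    level = [frozenset([t]) for t in terms
             if sum(t in d for d in docs) >= min_support]
    frequent = []
    while level:
        frequent.extend(level)
        size = len(level[0]) + 1
        # Join step: union pairs of k-itemsets that differ in one term.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size}
        # Keep only candidates that still meet the support threshold.
        level = [c for c in candidates
                 if sum(c <= d for d in docs) >= min_support]
    return frequent


def maximal_itemsets(frequent):
    """Keep itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


def cluster_pages(docs, min_support=2):
    """Group pages by maximal frequent itemset and label each group.

    Hypothetical grouping rule: itemsets covering exactly the same pages
    are merged into one cluster, and their terms form the label.
    """
    groups = {}
    for itemset in maximal_itemsets(frequent_itemsets(docs, min_support)):
        members = frozenset(i for i, d in enumerate(docs) if itemset <= d)
        groups.setdefault(members, set()).update(itemset)
    return sorted((" ".join(sorted(label)), sorted(members))
                  for members, label in groups.items())


# Toy "full text" term sets for four result pages of the query "jaguar".
pages = [
    {"jaguar", "car", "engine"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "cat", "habitat"},
    {"jaguar", "cat", "rainforest"},
]
for label, members in cluster_pages(pages, min_support=2):
    print(label, members)
```

On this toy input the two maximal frequent itemsets are {car, jaguar} and {cat, jaguar}, so the pages split into a car cluster and a cat cluster, each labelled by its itemset. A page may fall into more than one cluster under this rule, and a page containing no maximal frequent itemset is left unassigned; how the paper handles either case is not stated in the abstract.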
Source: Journal of Chinese Information Processing (《中文信息学报》), indexed in CSCD and the Peking University Core Journals list (北大核心), 2010, No. 2, pp. 58-67 (10 pages)
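Funding: National High-Tech R&D Program of China (863 Program), goal-oriented special topic project (2006AA01Z197); National Natural Science Foundation of China (60703015)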
Keywords: computer application; Chinese information processing; search engine; Web page clustering; frequent itemset

References (27)

  • 1. Lan Huang. A Survey on Web Information Retrieval Technologies[EB/OL]. ECSL Technical Report, State University of New York, 2000.
  • 2. C. J. van Rijsbergen. Information Retrieval[M]. London: Butterworths, 1979.
  • 3. Oren Zamir, Oren Etzioni. Web Document Clustering: A Feasibility Demonstration[C]//Research and Development in Information Retrieval, 1998: 46-54.
  • 4. Stanislaw Osinski, Jerzy Stefanowski, Dawid Weiss. Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition[C]//Proceedings of the International IIS: Intelligent Information Processing and Web Mining Conference, Advances in Soft Computing, 2004: 359-368.
  • 5. Liping Jing, Michael K. Ng, Joshua Zhexue Huang. An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(8): 1026-1040.
  • 6. Michael Steinbach, George Karypis, Vipin Kumar. A Comparison of Document Clustering Techniques[EB/OL]. Technical Report, University of Minnesota, 2000.
  • 7. Wei Song, Soon Cheol Park. Genetic Algorithm-Based Text Clustering Technique: Automatic Evolution of Clusters with High Efficiency[C]//Seventh International Conference on Web-Age Information Management Workshops. Hong Kong, 2006: 17-17.
  • 8. Richard Freeman, Hujun Yin. Self-Organising Maps for Hierarchical Tree View Document Clustering Using Contextual Information[C]//Proceedings of the IEEE International Joint Conference on Neural Networks. 2002: 123-128.
  • 9. Daniel Crabtree, Xiaoying Gao, Peter Andreae. Improving Web Clustering by Cluster Selection[C]//The 2005 IEEE/WIC/ACM International Conference on Web Intelligence. 2005: 172-178.
  • 10. Hung Chim, Xiaotie Deng. A New Suffix Tree Similarity Measure for Document Clustering[C]//World Wide Web Conference Committee. 2007: 121-129.

