
Web document classification techniques (网页分类技术)
Cited by: 18
Abstract: Web document classification uses machine learning to assign category labels to web pages automatically. This paper reviews the state of research on text classification and analyzes the structural features of web pages. The main difficulty is choosing a page representation and a classification algorithm that make use of a page's structural information, so treating web pages as plain text is not appropriate. Methods based on probabilistic models and relational learning methods are both computationally expensive, although relational learning produces results that are easy to interpret. Support vector machine (SVM) classifiers achieve high accuracy, but kernel construction and training on large-scale datasets remain difficult problems for them. Web document classification algorithms should be evaluated with multiple measures.
Source: Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》), 2004, No. 1, pp. 65-68 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University core journal list.
Funding: National "973" Basic Research Program of China (G1998030414)
Keywords: web document classification; machine learning; text categorization; web mining
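
The abstract argues that a web page should not be represented as plain text alone, because its structure carries information. As a minimal sketch of that idea, and not the representation used in the paper, the Python snippet below builds a weighted bag of words in which terms from the page title and from anchor text count more than terms from the rest of the page. The field weights, the tokenizer, and the names `FieldTextParser` and `weighted_term_counts` are illustrative assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-field weights; the abstract does not specify any values.
FIELD_WEIGHTS = {"title": 3, "a": 2, "body": 1}


class FieldTextParser(HTMLParser):
    """Collects text separately for the title, anchor text, and all other text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open <title> / <a> tags
        self.fields = {"title": [], "a": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text outside <title> and <a> is treated as ordinary body text.
        field = self.stack[-1] if self.stack else "body"
        self.fields[field].append(data)


def weighted_term_counts(html):
    """Return term counts where each occurrence is scaled by its field weight."""
    parser = FieldTextParser()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for field, chunks in parser.fields.items():
        text = " ".join(chunks).lower()
        for token in re.findall(r"[a-z0-9]+", text):
            counts[token] += FIELD_WEIGHTS[field]
    return counts


page = (
    "<html><head><title>Course Homepage</title></head>"
    "<body><p>Welcome to the course.</p>"
    "<a href='/staff'>Teaching staff</a></body></html>"
)
features = weighted_term_counts(page)
print(features.most_common(3))
```

The resulting counts could be passed to any of the classifier families the abstract mentions. Whether title and anchor text deserve higher weights than body text is an empirical question, which is one reason the abstract recommends comparing algorithms on several measures.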

