Abstract
Web page classification uses machine learning methods to assign category labels to web pages automatically. This paper reviews the state of research on text classification techniques and analyzes the structural features of web pages. The main difficulty lies in choosing a page representation and a classification algorithm that make use of the structural information in web pages, so treating web pages with plain text classification techniques alone is not appropriate. Methods based on probabilistic models and relational learning methods are both computationally expensive, although relational learning produces results that are easy to interpret. Support vector machine (SVM) classifiers achieve high classification accuracy, but kernel function construction and training on large-scale datasets remain difficult problems for this algorithm. Multiple measures should be used to evaluate web page classification algorithms; several were investigated on sample datasets to compare algorithm performance.
Source
Journal of Tsinghua University (Science and Technology) (《清华大学学报(自然科学版)》)
EI
CAS
CSCD
PKU Core Journal (北大核心)
2004, No. 1, pp. 65-68 (4 pages)
Funding
National Basic Research Program of China ("973" Program) (G1998030414)
Keywords
web document classification
machine learning
text categorization
web mining