Abstract
Automatic Web page classification is an active research topic in Internet search. Existing approaches classify pages mainly by their text content or by the hyperlink structure between pages. These approaches use only information from the pages themselves and ignore the information provided by the Web site that hosts them. This paper proposes a new algorithm that simplifies the internal topology of a Web site, extracts the site's implicit hierarchy, and builds a hierarchy tree, which enables multi-level classification of the pages within the site. The method has been applied successfully in an intelligent search and mining system for e-commerce.
Source
《计算机应用》
CSCD
Peking University Core Journals (北大核心)
2006, Issue 5, pp. 1134-1136 (3 pages)
Journal of Computer Applications
Funding
Guangdong Province Science and Technology Research Project (2005B10101033, A10202001)
Guangzhou Science and Technology Research Project (2004Z2-D0091)
Keywords
Web page classification
Web site hierarchy
URL clustering