Abstract
In a Web document, the same keyword contributes differently to the main idea depending on the HTML tag in which it appears, so choosing suitable tag influence factors is essential when building a mathematical model of the document. Building on earlier research, this paper proposes a new derivation algorithm. The algorithm introduces the concepts of ttf (term frequency in tag) and itf (inverse tag frequency) and represents each document as a matrix whose rows correspond to HTML tags and whose columns correspond to keywords. The row for one particular tag is extracted from every document's matrix to form a new set of vectors, and the average distance from the vectors in this set to their centroid yields the influence factor of that tag with respect to the training document collection. If the training collection is large enough, this influence factor can be regarded as approximately general. Experiments show that applying the derived influence factors to a new document collection improves retrieval performance to some extent.
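The matrix construction and centroid-distance step described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the paper's implementation: the abstract does not give the ttf or itf formulas, so raw per-tag keyword counts stand in for the weighting, Euclidean distance is assumed, and the function names (`tag_keyword_matrix`, `mean_distance_to_centroid`, `tag_dispersion`) and the input layout are hypothetical. The abstract also does not state how the average distance is converted into an influence factor, so the sketch stops at the average distance itself.

```python
import math
from collections import Counter


def tag_keyword_matrix(doc, tags, keywords):
    """Build a tag-by-keyword matrix for one document.

    `doc` maps a tag name to the list of tokens found inside that tag
    (an assumed input layout). Raw counts are used as a stand-in for
    the ttf/itf weighting, which the abstract does not define.
    """
    matrix = {}
    for tag in tags:
        counts = Counter(doc.get(tag, []))
        matrix[tag] = [float(counts[k]) for k in keywords]
    return matrix


def mean_distance_to_centroid(vectors):
    """Average Euclidean distance from each vector to the centroid."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(math.dist(v, centroid) for v in vectors) / n


def tag_dispersion(docs, tags, keywords):
    """For each tag, collect its row vector from every document and
    compute the average distance of those rows to their centroid."""
    matrices = [tag_keyword_matrix(d, tags, keywords) for d in docs]
    return {
        tag: mean_distance_to_centroid([m[tag] for m in matrices])
        for tag in tags
    }


# Tiny made-up training set: two documents, two tags, two keywords.
docs = [
    {"title": ["web", "web"], "body": ["web"]},
    {"title": [], "body": ["web"]},
]
result = tag_dispersion(docs, ["title", "body"], ["web", "tag"])
print(result)
```

In this toy input the `title` rows differ between the two documents while the `body` rows are identical, so the two tags receive different average distances; any rule that turns those distances into influence factors would have to come from the paper itself.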
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2007, Issue 10, pp. 226-228 (3 pages)
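Funding
This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 60403009).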