
An Algorithm for Deriving HTML Tag Influence Factors from Document Collections
Abstract In Web documents, the same keyword contributes differently to a document's central idea depending on which HTML tag encloses it. Choosing suitable tag influence factors is therefore essential when building a mathematical model of a document. Building on earlier work, this paper proposes a new derivation algorithm. It defines ttf (term frequency in tag) and itf (inverse tag frequency), and represents each document as a matrix whose rows correspond to tags and whose columns correspond to keywords. The row for a given tag is extracted from every document to form a new set of vectors; the average distance from these vectors to their centroid yields the influence factor of that tag with respect to the training collection. If the training collection is large enough, the factor can be regarded as approximately general. Experiments show that applying the derived factors to a new document collection improves retrieval performance to some extent.
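Source Computer Science (《计算机科学》), CSCD, Peking University Core Journal, 2007, No. 10, pp. 226-228 (3 pages)
Funding This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 60403009).
Keywords ttf; itf; normalization factor; centroid; average distance; tag influence factor vector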
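
The pipeline outlined in the abstract (per-document tag-by-keyword matrix, one row per tag collected across documents, average distance to the centroid) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so several pieces here are assumptions rather than the authors' method: ttf is taken as the relative frequency of a term inside a tag, itf weighting and the normalization factor are left out, distance is Euclidean, and the mapping from spread to factor (`1 / (1 + spread)`, rescaled to sum to 1) is purely hypothetical. The function names `ttf_row`, `mean_distance_to_centroid`, and `influence_factors` are illustrative.

```python
import math
from collections import Counter


def ttf_row(doc, tag, vocab):
    """One row of the tag-by-keyword matrix for a single document.

    Assumption: ttf is the term count inside the tag divided by the
    number of terms in that tag (the abstract gives no formula).
    """
    counts = Counter(doc.get(tag, []))
    total = sum(counts.values())
    return [counts[term] / total if total else 0.0 for term in vocab]


def mean_distance_to_centroid(vectors):
    """Average Euclidean distance from each vector to the set's centroid."""
    n = len(vectors)
    centroid = [sum(column) / n for column in zip(*vectors)]
    return sum(math.dist(v, centroid) for v in vectors) / n


def influence_factors(docs, tags):
    """Map each tag to a factor; the factors sum to 1.

    Assumption (hypothetical mapping): a tag whose row vectors cluster
    tightly gets a larger factor, computed as 1 / (1 + spread).
    """
    vocab = sorted({term for doc in docs for terms in doc.values() for term in terms})
    raw = {}
    for tag in tags:
        rows = [ttf_row(doc, tag, vocab) for doc in docs]
        raw[tag] = 1.0 / (1.0 + mean_distance_to_centroid(rows))
    total = sum(raw.values())
    return {tag: value / total for tag, value in raw.items()}


# Toy training collection: each document maps a tag name to its keywords.
docs = [
    {"title": ["web", "search"], "body": ["web", "page", "page", "link"]},
    {"title": ["web", "search"], "body": ["search", "index", "query", "query"]},
]
factors = influence_factors(docs, ["title", "body"])
print(factors)
```

In this toy collection the two `title` rows are identical, so their spread is zero and `title` receives the larger factor. With a realistic training collection the ordering would depend on the data and on the weighting the paper actually uses.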