An Algorithm for Deriving HTML Tag Influence Factors from Document Collections

Abstract: In Web documents, the same keyword contributes differently to a document's main idea depending on the HTML tag in which it appears. Choosing suitable tag influence factors is therefore essential when building a mathematical model of a document. Building on earlier work, this paper proposes a new derivation algorithm. It defines ttf (Term Frequency in Tag) and itf (Inverse Tag Frequency), and represents each document as a matrix whose rows correspond to HTML tags and whose columns correspond to keywords. Taking the row for one particular tag from every document yields a new set of vectors; the average distance from these vectors to their centroid gives the influence factor of that tag with respect to the training collection. If the training collection is large enough, this factor can be regarded as approximately general. Experiments show that applying the derived factors to a new document collection improves retrieval performance to some extent.
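The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the formulas for ttf and itf, so raw term counts per tag stand in for the weighting, rows are L2-normalized, and Euclidean distance is assumed. The function names (`tag_term_matrix`, `tag_spread`) and the input shape (a dict from tag name to token list) are hypothetical. The sketch stops at the average distance to the centroid, because the abstract does not state how that distance is converted into the final influence factor.

```python
import math
from collections import Counter


def tag_term_matrix(doc, tags, vocab):
    """Build a tag-by-keyword matrix for one document.

    doc maps a tag name to the list of tokens found inside that tag.
    Raw counts are an assumption standing in for the paper's ttf/itf weights.
    """
    rows = []
    for tag in tags:
        counts = Counter(doc.get(tag, []))
        rows.append([float(counts[term]) for term in vocab])
    return rows


def l2_normalize(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def average_distance_to_centroid(vectors):
    """Mean Euclidean distance from each vector to the centroid of the set."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(math.dist(v, centroid) for v in vectors) / n


def tag_spread(docs, tags, vocab):
    """For each tag, collect its row from every document and measure the spread."""
    matrices = [tag_term_matrix(doc, tags, vocab) for doc in docs]
    return {
        tag: average_distance_to_centroid([l2_normalize(m[i]) for m in matrices])
        for i, tag in enumerate(tags)
    }
```

A call such as `tag_spread(docs, ["title", "body"], vocab)` returns one spread value per tag for the training collection; turning those values into influence factors would follow the rule given in the full paper.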

Authors: 邓剑勋 (Deng Jianxun), 邢永康 (Xing Yongkang)

Affiliation: College of Computer Science, Chongqing University (重庆大学计算机学院)

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2007, No. 10, pp. 226-228 (3 pages)

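Funding: This research was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 60403009).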

Keywords: ttf, itf, normalization factor, centroid, average distance, tag influence factor, vector

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
