Abstract
Defining high-frequency keywords or subject terms is an important foundation for information analysis based on word frequency analysis. Drawing on a statistical review of the literature, this paper first summarizes four methods currently used to determine high-frequency words: the TOPN method, the WF>=M method, the %WF=P method, and the T formula. These methods suffer from reliance on experience, arbitrariness, and weaknesses in theoretical basis and applicability. The paper then verifies empirically that the distribution of keywords and subject terms in literature databases follows a normal distribution and, based on the properties of that distribution, proposes the F formula for calculating the high-frequency word threshold. Finally, it compares the F formula with the T formula on multiple data samples and concludes that the normal-distribution-based F formula performs well in both theoretical basis and applicability.
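The abstract names three of the simpler cutoff rules without defining them. The sketch below shows one plausible reading of each, in Python. The function names, the sample counts, and in particular the interpretation of %WF=P as a cumulative share of all word occurrences are assumptions made for illustration, not definitions taken from the paper. The T and F formulas are not sketched because the abstract does not state them.

```python
from collections import Counter


def top_n(freqs, n):
    """TOPN (assumed reading): keep the n most frequent words."""
    return [word for word, _ in Counter(freqs).most_common(n)]


def min_frequency(freqs, m):
    """WF>=M (assumed reading): keep words that occur at least m times."""
    return [word for word, count in Counter(freqs).most_common() if count >= m]


def cumulative_share(freqs, p):
    """%WF=P (assumed reading): keep the most frequent words until their
    combined occurrences reach the fraction p of all occurrences."""
    total = sum(freqs.values())
    kept, running = [], 0
    for word, count in Counter(freqs).most_common():
        if running >= p * total:
            break
        kept.append(word)
        running += count
    return kept


# Hypothetical keyword counts, invented purely for illustration.
sample = {"a": 10, "b": 5, "c": 3, "d": 1, "e": 1}
print(top_n(sample, 2))
print(min_frequency(sample, 3))
print(cumulative_share(sample, 0.75))
```

In each rule the cutoff (N, M, or P) is chosen by the analyst, which is the dependence on experience and the arbitrariness that the abstract criticizes.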
Source
《情报杂志》
CSSCI
PKU Core (Peking University core journal list)
2014, No. 10, pp. 129-136 (8 pages)
Journal of Intelligence
Keywords
word frequency analysis
normal distribution
high-frequency words
Zipf's law