Abstract
Describing the features of a text is one of the basic tasks in automatic text processing. At present, text features are usually described with a weighted vector space model (VSM), which mostly relies on statistical or empirical term-weighting algorithms. Such weighting makes it easier for a computer to compute the similarity of Chinese texts, but it does not reveal the relations between the words in a text well. To address this shortcoming, a term-weighting algorithm based on word co-occurrence frequency is proposed, so that the feature vector of a text carries information about the relations between words. Finally, experimental results are given to demonstrate the effectiveness of the algorithm and to compare it with results obtained using other algorithms.
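The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that co-occurrence is counted per sentence and that a term's frequency is boosted by the logarithm of its total co-occurrence count. The function names, the `alpha` parameter, and the boosting formula are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count, for each unordered word pair, the sentences containing both words."""
    pair_counts = Counter()
    for sentence in sentences:
        # sorted(set(...)) gives each distinct pair once, in a stable order
        for a, b in combinations(sorted(set(sentence)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def weighted_vector(sentences, alpha=0.5):
    """Build a feature vector for one document (a list of tokenized sentences).

    Hypothetical weighting, assumed for illustration only:
        weight(w) = tf(w) * (1 + alpha * log(1 + cooc(w)))
    where cooc(w) sums the co-occurrence counts of all pairs containing w.
    """
    tf = Counter(word for sentence in sentences for word in sentence)
    cooc = Counter()
    for (a, b), n in cooccurrence_counts(sentences).items():
        cooc[a] += n
        cooc[b] += n
    return {w: tf[w] * (1 + alpha * math.log1p(cooc[w])) for w in tf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, a word that never shares a sentence with another word keeps its plain term frequency, while words that frequently appear together receive larger weights, so two documents that share related word groups score higher under `cosine` than they would with raw frequencies alone.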
Source
《计算机工程与设计》
CSCD
PKU Core Journal (北大核心)
2005, No. 8, pp. 2180-2182 (3 pages)
Computer Engineering and Design
Keywords
VSM (vector space model)
text mining
word co-occurrence frequency
term weighting
matching