Abstract
Automatic summarization (automatic abstracting) is a major research topic in computational linguistics, and its study and application have attracted wide attention from related disciplines such as computer science, linguistics, and information science. This paper first introduces the LexRank-based approach to automatic summarization. To address its shortcomings, the method is improved in three respects: sentence similarity computation, sentence weight computation, and redundancy handling, so that the relevant influence factors can be adjusted dynamically according to the content of the input documents. The resulting system supports single-document and multi-document summarization in both Chinese and English. Experiments on the HIT (Harbin Institute of Technology) and DUC evaluation corpora show that the system improves summary quality to some extent over the original LexRank algorithm, and that in multi-document summarization it is fairly insensitive to noise in the data, such as noise resulting from an imperfect topical clustering of documents. Finally, open problems and research trends in automatic summarization are discussed.
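For orientation, the three components named in the abstract (sentence similarity, sentence weighting, redundancy handling) can be sketched as a baseline LexRank-style pipeline. This is a minimal illustration only, not the improved method of the paper: the whitespace tokenizer, the bag-of-words cosine similarity, the PageRank-style damping, the greedy redundancy filter, and all parameter values (`threshold`, `damping`, `max_sim`) are assumptions chosen for the example. Chinese input would additionally need word segmentation before this step.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iters=50):
    """Score sentences by power iteration over a thresholded similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]  # assumed tokenizer
    n = len(bags)
    if n == 0:
        return [], bags
    # Unweighted edge wherever similarity exceeds the threshold (self-links included).
    adj = [[1.0 if cosine(bags[i], bags[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(adj[j][i] * scores[j] / deg[j]
                                  for j in range(n) if deg[j])
                  for i in range(n)]
    return scores, bags

def summarize(sentences, k=2, max_sim=0.5):
    """Pick up to k top-scoring sentences, skipping near-duplicates of chosen ones."""
    scores, bags = lexrank(sentences)
    chosen = []
    for i in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        if all(cosine(bags[i], bags[j]) < max_sim for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return [sentences[i] for i in sorted(chosen)]  # restore document order
```

In this sketch a sentence that is similar to many others accumulates a higher score, and the `max_sim` check is one simple way to keep two nearly identical sentences out of the same summary.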
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2010, Issue 5, pp. 151-154, 218 (5 pages in total)
Computer Science
Funding
Supported by the National Natural Science Foundation of China (60573057, 60473057, 90604007)
Keywords
Automatic abstracting
LexRank
Sentence similarity
Dynamic adjustment
Redundancy resolution