Abstract
Automatic summarization aims to compress lengthy documents into a few short paragraphs, presenting information to users comprehensively and concisely and improving the efficiency and accuracy with which they obtain it. The proposed method builds on Latent Dirichlet Allocation (LDA): Gibbs sampling is used to estimate the probability distribution of topics over words and of sentences over topics, and the LDA parameters are combined with a spectral clustering algorithm to extract a multi-document summary. The method uses a linear formula to integrate the sentence weights and extracts a 400-word multi-document summary. Summary quality is evaluated on the DUC2002 data set with the ROUGE automatic summarization evaluation toolkit, and the results show that the method effectively improves summary quality.
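The final step described above (linearly integrating sentence weights, then extracting a summary within a word budget) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formula: the abstract does not give the features or coefficients, so the two feature columns (a topic-based score and a cluster-based score), the equal weights, the greedy selection, and the function names `combine_scores` and `select_summary` are all hypothetical.

```python
from typing import List, Sequence


def combine_scores(features: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Linearly combine per-sentence feature scores: score_i = sum_k w_k * f_ik."""
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def select_summary(sentences: Sequence[str],
                   scores: Sequence[float],
                   max_words: int = 400) -> List[str]:
    """Greedily keep the highest-scoring sentences that still fit the word budget."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n_words = len(sentences[i].split())
        if used + n_words <= max_words:
            chosen.append(i)
            used += n_words
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


# Toy data: each row holds two assumed features for one sentence
# (hypothetical topic-based score, hypothetical cluster-based score).
sentences = [
    "Topic models assign words to latent topics.",
    "The weather was nice.",
    "Spectral clustering groups similar sentences.",
]
features = [[0.9, 0.7], [0.1, 0.2], [0.6, 0.8]]
weights = [0.5, 0.5]  # assumed coefficients, not taken from the paper

scores = combine_scores(features, weights)
summary = select_summary(sentences, scores, max_words=12)
```

With the default `max_words=400`, the same call would match the summary length reported in the abstract; the small budget here only keeps the toy example readable.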
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
2013, No. 16, pp. 142-145, 154 (5 pages)
Funding
National High Technology Research and Development Program of China (863 Program) (No. 2007AA01Z151)