Abstract
To help users quickly grasp the core content of a topic from massive volumes of microblog posts, a microblog topic summarization method incorporating a sentential semantic structure model is proposed. The method uses the sentential semantic structure model to extract the semantic cases of each sentence, yielding sentential semantic features. Based on the latent Dirichlet allocation (LDA) topic model, it uses the sentential semantic structure to compute pairwise semantic similarities between sentences, constructs a similarity matrix, and clusters the sentences into subtopics, yielding sentential relationship features. The semantic features and relationship features are then combined, and the most informative sentences in each subtopic are selected as the summary. At compression ratios of 0.5%, 1.0%, and 1.5%, the ROUGE values are clearly better than those of the baseline systems; at a compression ratio of 1.5%, ROUGE-1 reaches 51.30% and ROUGE-SU* reaches 25.27%. The experimental results indicate that analysis incorporating the sentential semantic structure model deepens sentence-level semantic analysis, and that the extracted sentential semantic features strengthen the expression of semantic information. A sentence-weighting scheme that considers both semantic features and relationship features enriches the feature representation, reduces the loss of semantic information, increases the semantic relevance among data of the same class, and effectively reduces the impact of noise, thereby improving the relevance of the summary to the topic. In addition, the proposed method generalizes well across different topics and is widely applicable.
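The pipeline described in the abstract (pairwise similarity matrix, subtopic clustering, per-subtopic selection by a combined sentence weight, ROUGE evaluation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: bag-of-words cosine similarity stands in for the LDA-based semantic similarity, the count of distinct tokens stands in for the sentential semantic feature, threshold-based grouping stands in for the unspecified clustering step, whitespace tokenization stands in for Chinese word segmentation, and the names `summarize`, `threshold`, `alpha`, and `rouge_1_recall` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(sim, threshold):
    """Group sentences linked by similarity >= threshold (assumed clustering rule)."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and sim[u][v] >= threshold:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


def summarize(sentences, threshold=0.3, alpha=0.5):
    """Pick one sentence per subtopic using a combined weight (illustrative only)."""
    if not sentences:
        return []
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(vectors)
    # Similarity matrix over all sentence pairs.
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    labels = cluster(sim, threshold)
    max_len = max(len(v) for v in vectors) or 1
    summary = []
    for label in sorted(set(labels)):
        members = [i for i, l in enumerate(labels) if l == label]

        def weight(i):
            # Stand-in "semantic" feature: normalized count of distinct tokens.
            semantic = len(vectors[i]) / max_len
            # Stand-in "relationship" feature: mean similarity within the subtopic.
            relation = sum(sim[i][j] for j in members) / len(members)
            return alpha * semantic + (1 - alpha) * relation

        summary.append(sentences[max(members, key=weight)])
    return summary


def rouge_1_recall(candidate, reference):
    """Unigram recall: clipped overlapping unigrams / unigrams in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Calling `summarize(posts)` on a list of post strings returns one representative sentence per detected subtopic, and `rouge_1_recall(summary_text, reference_text)` scores a candidate summary against a reference. In the method the abstract describes, the summary length would instead be governed by the compression ratio, and the features would come from the sentential semantic structure model and LDA rather than from these stand-ins.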
Source
Journal of Zhejiang University (Engineering Science)
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2015, No. 12, pp. 2316-2325 (10 pages)
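Funding
National "242" Information Security Program of China (2005C48)
Beijing Institute of Technology Science and Technology Innovation Program, Special Fund for Major Project Cultivation (2011CX01015)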
Keywords
microblog
topic summarization
sentential semantic structure model
natural language processing