
Microblog topics summarization algorithm merging sentential semantic structure model (融合句义结构模型的微博话题摘要算法)
Cited by: 5
Abstract: To extract the core content of a topic from massive volumes of microblog posts more quickly, a microblog topic summarization method that merges a sentential semantic structure model is proposed. The method uses the sentential semantic structure model to extract the semantic cases of each sentence, which yields the sentential semantic features. Based on the latent Dirichlet allocation (LDA) topic model, it then uses the sentential semantic structure to compute pairwise semantic similarities between sentences, constructs a similarity matrix, and partitions the sentences into subtopic classes, which yields the sentential relationship features. The semantic features and relationship features are combined, and the most informative sentences within each subtopic are selected as the summary. At compression ratios of 0.5%, 1.0%, and 1.5%, the ROUGE values are all clearly better than those of the comparison systems. At a compression ratio of 1.5%, ROUGE-1 reaches 51.30% and ROUGE-SU* reaches 25.27%. The experimental results indicate that analysis with the sentential semantic structure model deepens the level of sentence-level semantic analysis, and that the extracted sentential semantic features strengthen the expression of semantic information. Weighting sentences by both semantic features and relationship features enriches the feature representation, reduces the loss of semantic information, increases the semantic relevance among data of the same class, and reduces the impact of noise, which in turn improves the relevance of the summary to the topic. In addition, the proposed method generalizes well across different topics and is applicable to a wide range of them.
Source: Journal of Zhejiang University: Engineering Science (《浙江大学学报(工学版)》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2015, No. 12, pp. 2316-2325 (10 pages)
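Funding: National "242" Information Security Program of China (2005C48); Beijing Institute of Technology Science and Technology Innovation Program, Major Project Cultivation Fund (2011CX01015)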
Keywords: microblog; topic summarization; sentential semantic structure model; natural language processing
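
The pipeline outlined in the abstract (pairwise similarity matrix, subtopic partition, then per-subtopic selection by a combined weight) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each sentence already has a topic distribution (for example from an LDA model) and a precomputed semantic feature score. The cosine similarity, the threshold-based connected-components clustering, the mean-similarity relationship score, and the names `threshold` and `alpha` are all illustrative assumptions that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(topic_vectors):
    """Pairwise similarity between sentence topic distributions."""
    n = len(topic_vectors)
    return [[cosine(topic_vectors[i], topic_vectors[j]) for j in range(n)]
            for i in range(n)]


def cluster_subtopics(sim, threshold):
    """Assumed clustering: connected components of the graph that links
    two sentences whenever their similarity is at least `threshold`."""
    n = len(sim)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters


def summarize(topic_vectors, semantic_scores, threshold=0.8, alpha=0.5):
    """Return the index of the highest-weighted sentence in each subtopic.

    The weight mixes the (assumed, precomputed) semantic score with a
    relationship score: the mean similarity to the other sentences of
    the same subtopic. `alpha` is a hypothetical mixing parameter.
    """
    sim = similarity_matrix(topic_vectors)
    summary = []
    for cluster in cluster_subtopics(sim, threshold):
        def weight(i):
            others = [sim[i][j] for j in cluster if j != i]
            relational = sum(others) / len(others) if others else 0.0
            return alpha * semantic_scores[i] + (1 - alpha) * relational
        summary.append(max(cluster, key=weight))
    return sorted(summary)


# Toy input: four sentences over two topics, with made-up semantic scores.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
semantic_scores = [0.3, 0.7, 0.6, 0.4]
selected = summarize(topic_vectors, semantic_scores)
print(selected)
```

In this toy input the sentences fall into two subtopics, and one sentence index is returned per subtopic. A real system would replace the toy scores with features derived from the sentential semantic structure model and would cap the number of selected sentences according to the target compression ratio.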

