Abstract
Case public opinion timeline generation organizes public opinion news about the same case into topic clusters in chronological order, which helps users understand how the case developed. In essence, it can be regarded as an unsupervised clustering task under time constraints. However, news items describing the same case often share many elements, which causes their representations to overlap in the clustering space. To generate more discriminative text representations, a case public opinion timeline generation method enhanced with differential case elements is proposed, built on an auto-encoding framework. First, a case-related public opinion timeline dataset is constructed and the differential elements of each Weibo text are generated. Then the differential elements, the Weibo text, and the case time are fed into a BERT encoder, and a low-dimensional feature vector of the text is generated within the auto-encoding framework. Finally, based on this feature vector and K-Means clustering, soft clustering is used to generate the case public opinion timeline. Experimental results show that, on the constructed case-related public opinion timeline dataset, the proposed method achieves considerable improvements on both clustering metrics, ACC and NMI.
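The last two steps (soft clustering over low-dimensional features, then scoring with ACC) can be illustrated with a minimal sketch. The abstract does not give the exact soft-clustering formula, so the Student's t kernel below is an assumption borrowed from deep embedded clustering, not the authors' confirmed method. The names `soft_assign` and `clustering_accuracy`, the toy 2-D features, and the hand-picked centroids (which in practice would come from K-Means on the auto-encoder output) are all hypothetical.

```python
from itertools import permutations


def soft_assign(features, centroids, alpha=1.0):
    """Soft assignment of each feature vector to each centroid.

    Assumption: a Student's t kernel stands in for the paper's soft
    clustering step, whose exact form the abstract does not state.
    """
    rows = []
    for z in features:
        weights = []
        for mu in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        # Normalize so each row is a probability distribution over clusters.
        rows.append([w / total for w in weights])
    return rows


def clustering_accuracy(true_labels, pred_labels):
    """ACC: best match rate over one-to-one mappings of predicted to true labels.

    Brute force over permutations, so only suitable for a handful of clusters.
    """
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so extra predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for p, t in zip(pred_labels, true_labels))
        best = max(best, hits)
    return best / len(true_labels)


# Toy stand-ins for low-dimensional text features and K-Means centroids.
features = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.1, 0.05], [2.95, 3.05]]

q = soft_assign(features, centroids)
hard = [row.index(max(row)) for row in q]  # most probable cluster per text
acc = clustering_accuracy([1, 1, 0, 0], hard)
```

Each row of `q` sums to 1, so a post that sits between two stages of a case keeps partial membership in both instead of being forced into one. ACC is computed after matching cluster ids to ground-truth ids, because cluster numbering is arbitrary in unsupervised clustering.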
Authors
GAO Sheng-xiang
ZHAO Yao
YU Zheng-tao
HUANG Yu-xin
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal
2022, No. 9, pp. 1902-1907 (6 pages)
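Funding
Supported by the National Key Research and Development Program of China (2018YFC0830105, 2018YFC0830101, 2018YFC0830100)
Supported by the National Natural Science Foundation of China (61972186, 61762056, 61472168)
Supported by the Yunnan Provincial Major Science and Technology Special Plan Project (202002AD080001)
Supported by the Yunnan Provincial Basic Research Special Program, General Project (202001AT070046)
Supported by the Yunnan Provincial High-Tech Industry Special Project (201606).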
Keywords
case public opinion timeline
differential case elements
auto-encoder
soft clustering