Text Multi-label Classification Combining Resampling and Ensemble Learning

Multi-label Classification Based on Resampling and Ensemble Learning
Abstract Multi-label classification of medical dispute judgment documents is the basis for their efficient retrieval and management. However, class imbalance and label co-occurrence in medical dispute datasets directly affect the quality of that classification. To address this, the paper proposes a text multi-label classification scheme that combines resampling and ensemble learning. First, a sample resampling algorithm based on the average sparsity of the label set is proposed to reduce the impact of label co-occurrence on resampling and thereby improve the class imbalance of the dataset. Second, a multi-label classification algorithm based on ensemble learning is proposed: it trains multiple base classifiers on the datasets obtained after resampling and combines them with a one-vote-veto voting strategy, further improving multi-label classification performance. Experimental results show that the proposed scheme is suitable not only for medical dispute judgment documents but also for other text datasets with class imbalance and label co-occurrence problems.
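The two ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the function names are hypothetical, "average sparsity" is approximated as the mean of (1 - label frequency) over a sample's labels, and "one-vote veto" is read as keeping a label only when every base classifier predicts it. The paper's actual definitions may differ.

```python
from collections import Counter


def label_frequencies(label_sets):
    """Return the fraction of samples that carry each label."""
    counts = Counter(label for labels in label_sets for label in set(labels))
    total = len(label_sets)
    return {label: count / total for label, count in counts.items()}


def mean_sparsity(labels, freq):
    """Hypothetical score: average of (1 - frequency) over a sample's labels.

    A sample whose labels are all rare scores close to 1; a sample that
    mixes rare labels with very common ones scores lower, so co-occurring
    majority labels pull the score down.
    """
    labels = set(labels)
    if not labels:
        return 0.0
    return sum(1.0 - freq[label] for label in labels) / len(labels)


def veto_vote(predictions):
    """Combine 0/1 predictions from several base classifiers.

    predictions[k][j] is classifier k's decision for label j. Under the
    assumed reading of one-vote veto, a label is kept only if no
    classifier rejects it.
    """
    return [int(all(column)) for column in zip(*predictions)]


# Toy data: label "a" is on every sample, "b" and "c" are rare.
label_sets = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}]
freq = label_frequencies(label_sets)
scores = [mean_sparsity(labels, freq) for labels in label_sets]

# Three base classifiers voting on three labels.
combined = veto_vote([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
```

In this toy data, samples with a higher score would be the candidates for oversampling, and the veto rule trades recall for precision because a single dissenting classifier removes a label.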
Authors WANG Tianhao; ZHANG Pei; ZHANG Zhao; CHEN Xihai; WANG Jing; ZHANG Baili (School of Computer Science and Engineering, Southeast University, Nanjing 211189, China; State Grid Zaozhuang Power Supply Company, Zaozhuang, Shandong 277099, China; State Key Laboratory of Smart Grid Protection and Control, Nanjing 211106, China; Nari Group Corporation, Nanjing 211106, China)
Source Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), indexed in CSCD and the Peking University Core Journals list, 2023, No. 4, pp. 892-901 (10 pages)
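Funding Project of the State Key Laboratory of Smart Grid Protection and Control; National Key Research and Development Program of China (2021YFC3340305); Fundamental Research Funds for the Central Universities (2242018S30023, 2242017S30025).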
Keywords class imbalance; multi-label classification; ensemble learning; resampling algorithm; label co-occurrence
