Abstract
When imbalanced data are labeled by conventional clustering analysis, the clustering result tends to favor the classes with more samples, which introduces labeling errors. To address this problem, we propose a labeling method built on an improved clustering algorithm with a balancing constraint. The method consists of four steps: initial clustering, expert-knowledge adjustment, data balancing, and balance-constrained clustering. Initial clustering assigns preliminary class labels to the imbalanced data. Expert knowledge is then used to correct the labels of part of the mislabeled data. The data are balanced to obtain a balanced dataset, and balance-constrained clustering on that dataset makes the final label assignment. Simulation results indicate that the proposed method improves labeling accuracy on imbalanced data.
Authors
赵俊杰
黄四牛
吴正午
王帅
ZHAO Jun-jie; HUANG Si-niu; WU Zheng-wu; WANG Shuai (Science and Technology on Information System Engineering Laboratory, Beijing 100038, China)
Source
《计算机仿真》
Peking University Core Journal (北大核心)
2020, No. 2, pp. 476-480 (5 pages)
Computer Simulation
Funding
Supported by the National Defense Science and Technology Innovation Special Zone Project.
Keywords
Imbalanced data
Data labeling
Clustering analysis
Balance processing
Simulation verification