Abstract
To address clustering with data that are missing not at random, this paper analyzes the efficiency of an adapted expectation-maximization (EM) algorithm for the Gaussian mixture clustering model with censored data, and reveals how the likelihood function of the censored data affects the clustering model and its estimation algorithm. The proposed algorithm is compared with the standard EM algorithm using Akaike's information criterion and the Kullback-Leibler divergence. Based on a partition of the censored data set and its indicator variables, the paper derives the posterior probability and likelihood function of the model parameters, and adjusts the first and second moments of the truncated normal functions. According to the principles of the efficient influence function, the optimal parameters of the algorithm are obtained from an equation on the expectation of the score vector. On the same censored data set, the proposed clustering algorithm estimates the cluster centroids more accurately. The experimental results demonstrate that, for Gaussian mixture clustering, the proposed algorithm outperforms the standard EM algorithm, which was designed for data missing at random.
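The first and second moments of a truncated normal, which the abstract says are adjusted for censored observations, can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation: the function name `truncated_normal_moments` and its parameters are hypothetical, and the formulas are the standard closed-form moments of a univariate normal restricted to an interval.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_moments(mu, sigma, lower=-math.inf, upper=math.inf):
    """Return (E[X], E[X^2]) for X ~ N(mu, sigma^2) given lower < X < upper.

    Hypothetical helper for illustration only; it assumes sigma > 0 and
    lower < upper, and the interval must carry non-negligible probability.
    """
    a = (lower - mu) / sigma
    b = (upper - mu) / sigma
    # Handle infinite bounds explicitly so that inf * 0 never appears.
    if math.isinf(a):
        pdf_a, cdf_a, a_pdf_a = 0.0, 0.0, 0.0
    else:
        pdf_a, cdf_a, a_pdf_a = _pdf(a), _cdf(a), a * _pdf(a)
    if math.isinf(b):
        pdf_b, cdf_b, b_pdf_b = 0.0, 1.0, 0.0
    else:
        pdf_b, cdf_b, b_pdf_b = _pdf(b), _cdf(b), b * _pdf(b)
    mass = cdf_b - cdf_a                 # probability of the interval
    shift = (pdf_a - pdf_b) / mass
    mean = mu + sigma * shift            # first moment
    var = sigma ** 2 * (1.0 + (a_pdf_a - b_pdf_b) / mass - shift ** 2)
    return mean, var + mean ** 2         # second moment = variance + mean^2


# Example: a value right-censored at c = 0 under a standard normal component.
first, second = truncated_normal_moments(0.0, 1.0, lower=0.0)
```

In an EM iteration of this kind, an observation known only to exceed a censoring point `c` would plausibly contribute these conditional moments (with `lower=c`) in place of the unobserved value and its square, evaluated under each component's current mean and standard deviation.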
Authors
余海燕
陈京京
邱航
王永
王若凡
YU Hai-Yan; CHEN Jing-Jing; QIU Hang; WANG Yong; WANG Ruo-Fan (Chongqing Key Laboratory of Electronic Commerce and Modern Logistics, Chongqing University of Posts and Telecommunications, Chongqing 404615; School of Computer Science and Engineering, University of Electronic Science and Technology, Chengdu 611731; Big Data Research Center, University of Electronic Science and Technology, Chengdu 611731; School of Information Technology Engineering, Tianjin University of Technology and Education, Tianjin 300222)
Source
《自动化学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2021, No. 6, pp. 1302-1314 (13 pages)
Acta Automatica Sinica
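Funding
National Natural Science Foundation of China (71601026, 61601331, 71571105)
Chongqing Major Theme Special Project for Industry (cstc2017zdcy-zdzxX0013)
Sichuan Province Key Research and Development Program (2018SZ0114, 2019YFS0271)
Tianjin Natural Science Foundation, Youth Project (18JCQNJC04700)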
Keywords
Gaussian mixture clustering
censored data
expectation-maximization algorithm
truncated normal function
second-order moment