Abstract
This paper proposes a missing-data imputation method based on nearest neighbor imputation and association rules. Variables without missing data serve as auxiliary variables, and a distance function defined on them identifies the complete sample units that lie close to the unit with missing data. The reciprocal of the product of the support and the lift of the mined association rules is then used as a weight on the distance between sample units. The resulting weighted distance reflects the dependency relationships between complete samples and samples with missing data, and the attribute value of the sample unit with the smallest weighted distance is used to impute the missing value. This resolves the problem that different sample units at the same nearest distance can yield different imputed values. The implementation steps of the method and a worked example are presented.
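The procedure summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the distance function, how rules are mined, or which rule is attached to each candidate donor, so everything below is an assumption made for illustration: categorical data, a Hamming distance over the auxiliary variables, and, for each donor, the rule "auxiliary values shared with the recipient => donor's value of the missing attribute". The names `hamming`, `support_and_lift`, and `impute`, and the toy data, are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]  # attribute name -> categorical value


def hamming(a: Record, b: Record, aux: List[str]) -> int:
    """Assumed distance: number of auxiliary attributes on which a and b differ."""
    return sum(a[k] != b[k] for k in aux)


def support_and_lift(data: List[Record], antecedent: Record,
                     target: str, value: str) -> Tuple[float, float]:
    """Support and lift of the rule antecedent => (target == value) over data.

    Called only with rules derived from a record in data, so the antecedent
    and the consequent each occur at least once (no division by zero).
    """
    n = len(data)
    with_a = [r for r in data if all(r[k] == v for k, v in antecedent.items())]
    n_both = sum(r[target] == value for r in with_a)
    n_c = sum(r[target] == value for r in data)
    support = n_both / n
    confidence = n_both / len(with_a)
    lift = confidence / (n_c / n)
    return support, lift


def impute(complete: List[Record], recipient: Record,
           aux: List[str], target: str) -> Tuple[str, float]:
    """Return (imputed value, smallest weighted distance)."""
    best_value, best_wd = None, float("inf")
    for donor in complete:
        d = hamming(recipient, donor, aux)
        # Assumed rule: values the donor shares with the recipient => donor's target value.
        shared = {k: donor[k] for k in aux if donor[k] == recipient[k]}
        support, lift = support_and_lift(complete, shared, target, donor[target])
        # Weight from the abstract: reciprocal of support * lift, applied to the distance.
        wd = d / (support * lift)
        if wd < best_wd:
            best_value, best_wd = donor[target], wd
    return best_value, best_wd


# Hypothetical toy data: "sector" is missing for the recipient.
complete = [
    {"size": "small", "region": "north", "sector": "retail"},
    {"size": "small", "region": "south", "sector": "retail"},
    {"size": "small", "region": "north", "sector": "retail"},
    {"size": "large", "region": "east", "sector": "mining"},
    {"size": "large", "region": "east", "sector": "retail"},
    {"size": "large", "region": "north", "sector": "mining"},
]
recipient = {"size": "small", "region": "east"}
print(impute(complete, recipient, ["size", "region"], "sector"))
```

In this toy data five donors tie at unweighted distance 1 but carry different sector values; after weighting, the donors that share `size = small` have the smallest weighted distance, so `retail` is imputed. One limitation of a purely multiplicative weight is that donors at distance 0 all stay at weighted distance 0, so such ties would need separate handling.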
Source
《统计与信息论坛》
CSSCI
PKU Core Journal (北大核心)
2015, Issue 1, pp. 35-40 (6 pages)
Journal of Statistics and Information
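Funding
National Statistical Science Research Key Project, "Research on Sampling Survey Issues for Small and Micro Industrial Enterprises" (2013LZ34)
Beijing Social Science Foundation Key Project, "Research on Spatial Sampling Design Based on the Geographic Distribution of Beijing" (14JGA022)
Beijing Humanities and Social Sciences Project for Supervisors of Outstanding Doctoral Dissertations (20121000202)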
Keywords
association rules (关联规则)
missing data (缺失数据)
nearest neighbor imputation (最近邻插补)
weighted distance (加权距离)