Abstract
This paper addresses telecom insolvency mining. Given the imbalanced nature of telecom insolvency data, it focuses on how missing values and outliers affect classification results, and proposes a Data Quality Assessment System for Telecom Insolvency Mining (TIM-DQAS). For missing-value assessment, an attribute weighting algorithm based on differences in class distribution is proposed to measure the missing cost of each input attribute. For outlier assessment, the effect of outliers in imbalanced data on classification results is analyzed, and the concept of outlier degree is introduced to quantify that effect. Comparative experiments on telecom Personal Handy-phone System (小灵通) data from one city provide reference values for the assessment results and verify the effectiveness of the assessment strategy.
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
2011, No. 12, pp. 220-224, 233 (6 pages)
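Funding
National High Technology Research and Development Program of China (863 Program), Nos. 2008AA042902 and 2009AA04Z162
Programme of Introducing Talents of Discipline to Universities (111 Project), No. B07031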
Keywords
telecom
data mining
insolvency
data quality assessment
missing value
imbalance
outlier degree