Abstract
Using five groups of genes with similar expression profiles from the yeast diauxic shift experiment, we studied how the similarity metric, the data preprocessing method, and the centroid initialization scheme affect K-means clustering. The results show that, for K-means clustering of gene expression data, centroids are best initialized with vectors that reflect the structural characteristics of the data. If the centroids are initialized randomly, clustering the relative expression levels under the Euclidean distance metric gives the best results. Under the Euclidean distance, normalizing the data leads to worse clustering results, because it may destroy the amplitude characteristics of the original data. Under the Pearson correlation coefficient metric, the choice of preprocessing method makes no clear difference.
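The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, "relative expression" is assumed here to mean the log2 ratio to the first time point, "normalization" is assumed to be a per-gene z-score, and initialization from structure-reflecting vectors is modeled simply by passing chosen profiles as `init`.

```python
import numpy as np

def relative_expression(x):
    """Assumed preprocessing: log2 ratio of each time point to the first one."""
    return np.log2(x / x[:, [0]])

def standardize(x):
    """Assumed normalization: per-gene z-score across time points."""
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

def euclidean(x, c):
    """Pairwise Euclidean distances between rows of x and rows of c."""
    return np.linalg.norm(x[:, None, :] - c[None, :, :], axis=2)

def pearson_distance(x, c):
    """1 - Pearson correlation between rows of x and rows of c.
    Undefined (NaN) for constant profiles; this sketch does not guard against that."""
    xc = x - x.mean(axis=1, keepdims=True)
    cc = c - c.mean(axis=1, keepdims=True)
    xc = xc / np.linalg.norm(xc, axis=1, keepdims=True)
    cc = cc / np.linalg.norm(cc, axis=1, keepdims=True)
    return 1.0 - xc @ cc.T

def kmeans(x, init, dist, n_iter=100):
    """Plain K-means: assign each gene to the nearest centroid, then re-average."""
    centroids = np.array(init, dtype=float)
    for _ in range(n_iter):
        labels = dist(x, centroids).argmin(axis=1)
        new = np.array([
            x[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))  # an empty cluster keeps its old centroid
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    labels = dist(x, centroids).argmin(axis=1)
    return labels, centroids

# Toy profiles (made-up values): two induced genes and two repressed genes.
x = np.array([[1.0, 2.0, 4.0, 8.0],
              [1.0, 2.2, 4.1, 7.9],
              [8.0, 4.0, 2.0, 1.0],
              [8.1, 3.9, 2.0, 1.1]])
rel = relative_expression(x)

# Initialization from profiles chosen to represent each pattern.
labels, _ = kmeans(rel, rel[[0, 2]], euclidean)

# Random initialization: pick k distinct genes as starting centroids.
rng = np.random.default_rng(0)
random_init = rel[rng.choice(len(rel), size=2, replace=False)]
labels_random, _ = kmeans(rel, random_init, euclidean)
```

Under these assumptions, the z-score maps any two profiles that differ only by a positive scale factor to the same vector, which is one way to see how normalization can discard amplitude information that the Euclidean distance would otherwise use.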
Source
《仪器仪表学报》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2003, Issue z1, pp. 189-192, 209 (5 pages)
Chinese Journal of Scientific Instrument
Keywords
Gene expression
Clustering analysis
K-means clustering
Data preprocessing