Effects of Data Preprocessing and Initialization on K-means Clustering (cited by: 4)
Abstract: Using five groups of genes with similar expression profiles during the yeast diauxic shift, we studied how different similarity metrics, data preprocessing methods, and centroid initialization schemes affect K-means clustering. The results show that, when clustering gene expression data with K-means, the centroids are best initialized with vectors that reflect the structural features of the dataset. If the centroids are initialized randomly, preprocessing the data into relative expression levels and using Euclidean distance as the similarity metric gives the best clustering results. Under Euclidean distance, normalization worsens the results, because it may destroy the amplitude features of the original data. Under the Pearson correlation coefficient, by contrast, the different preprocessing methods make no clear difference.
Source: Chinese Journal of Scientific Instrument (《仪器仪表学报》; indexed in EI, CAS, CSCD, and the Peking University Core Journals list), 2003, Issue z1, pp. 189-192, 209 (5 pages)
Keywords: gene expression; clustering analysis; K-means clustering; data preprocessing
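
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, "relative expression" is assumed here to mean the ratio to the first time point, and the Pearson variant simply uses 1 - r as a distance while still updating centroids by the arithmetic mean. Passing `init` stands in for initializing centroids with vectors that reflect the data structure; leaving it as `None` draws random rows.

```python
import numpy as np

def zscore_rows(X):
    """Normalize each gene (row) to zero mean and unit variance."""
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True)
    return (X - mu) / np.where(sd == 0, 1.0, sd)

def relative_expression(X):
    """Assumed form of 'relative expression': ratio to the first time point."""
    return X / X[:, [0]]

def distances(X, C, metric):
    """Distance from every row of X to every centroid (row of C)."""
    if metric == "euclidean":
        return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    if metric == "pearson":
        # For z-scored rows, the dot product divided by the length equals r.
        return 1.0 - zscore_rows(X) @ zscore_rows(C).T / X.shape[1]
    raise ValueError(f"unknown metric: {metric}")

def kmeans(X, k, metric="euclidean", init=None, n_iter=100, seed=0):
    """Plain K-means; `init` supplies k starting centroids, else random rows."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    if init is None:
        C = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        C = np.asarray(init, dtype=float).copy()
    labels = None
    for _ in range(n_iter):
        new_labels = distances(X, C, metric).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                C[j] = members.mean(axis=0)
    return labels, C

# Toy profiles (made up for illustration): two rising and two falling genes.
X = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 3.1],
              [3.0, 2.0, 1.0], [3.1, 2.1, 1.1]])
labels_informed, _ = kmeans(relative_expression(X), 2, init=relative_expression(X)[[0, 2]])
labels_random, _ = kmeans(zscore_rows(X), 2, metric="pearson")
```

Swapping `relative_expression` for `zscore_rows` under the Euclidean metric is the kind of change the abstract reports as harmful, since z-scoring discards each profile's amplitude.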
