摘要
有效分析蛋白质家族是生物信息学的一项重要挑战,聚类成为解决这一问题的主要途径之一.基于传统序列比对方法定义蛋白质序列间相似关系时,假设了同源片断间的邻接保守性,与遗传重组相冲突.为更好地识别蛋白质家族,提出了一种蛋白质序列家族挖掘算法ProFaM.ProFaM首先采用前缀投影策略挖掘表征蛋白质序列的模式,然后基于模式及其权重信息构造相似度度量函数,并采用共享最近邻方法,实现了蛋白质序列家族聚类.解决了以往方法在蛋白质模式挖掘及相似度设计中的不足.在蛋白质家族数据库Pfam上的实验结果证实了ProFaM算法在蛋白质家族分析上有良好的结果.
Reliable identification of protein families is a major challenge in bioinformatics. Clustering protein sequences may help to identify novel relationships among proteins. However, many clustering algorithms cannot be readily applied to protein sequences. One of the main problems is that the similarity between two protein sequences cannot be easily defined. A similarity analysis method based on traditional sequence alignment, which assumes conservation of contiguity between homologous segments, is inconsistent with genetic recombination. Especially for remote homology protein family members which possess similar structure or related function, this method cannot achieve correct results. Information about protein motifs is very important to the analysis of protein family sequences. In this paper, a novel protein sequence family mining algorithm called ProFaM is proposed. The ProFaM algorithm is a two-step method. In the first step, conserved motifs across protein sequences are mined using efficient prefix-projected strategy without candidate, and then based on these result motifs, combined with weight of motifs, a novel similarity measure function is constructed. In the second step, protein family sequences are clustered using a shared nearest neighbor method according to new similarity measure. Experiments on protein family sequences database Pfam show that the ProFaM algorithm improves performance. The satisfactory experimental results suggest that ProFaM may be applied to protein family analysis.
出处
《计算机研究与发展》
EI
CSCD
北大核心
2007年第7期1160-1168,共9页
Journal of Computer Research and Development
基金
国家"八六三"高技术研究发展计划基金项目(2006AA02Z329)
国家自然科学基金项目(60573093)