
ProFaM: An Efficient Algorithm for Protein Sequence Family Mining (cited by: 2)
Abstract: Reliable identification of protein families is a major challenge in bioinformatics, and clustering is one of the main approaches to it, since clustering protein sequences can help identify novel relationships among proteins. However, many clustering algorithms cannot be applied directly to protein sequences, mainly because the similarity between two protein sequences is not easy to define. Similarity measures based on traditional sequence alignment assume that contiguity is conserved between homologous segments, an assumption that conflicts with genetic recombination. For remotely homologous family members that share a similar structure or related function, such measures cannot give correct results. Information about protein motifs is therefore very important for analyzing protein family sequences. This paper proposes ProFaM, a protein sequence family mining algorithm that works in two steps. In the first step, conserved motifs that characterize the protein sequences are mined with an efficient prefix-projection strategy that requires no candidate generation, and a new similarity measure is constructed from the mined motifs and their weights. In the second step, the protein sequences are clustered into families with a shared nearest neighbor method that uses the new similarity measure. The approach addresses shortcomings of earlier methods in both protein motif mining and similarity design. Experiments on the Pfam protein family database show that ProFaM improves performance and gives good results, which suggests that it may be applied to protein family analysis.
Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2007, No. 7, pp. 1160-1168 (9 pages)
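Funding: National High Technology Research and Development Program of China ("863" Program) (2006AA02Z329); National Natural Science Foundation of China (60573093)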
Keywords: protein sequence; protein family; motif (sequence pattern); clustering; data mining; bioinformatics
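
The two-step procedure described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not give the weighting scheme, the similarity formula, or the clustering parameters, so the code assumes a pattern weight equal to the pattern length, a weighted Jaccard similarity, and a Jarvis-Patrick style shared nearest neighbor rule. The function names `prefix_span`, `weighted_similarity`, and `snn_clusters` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def prefix_span(sequences, min_support, max_len=3):
    """Mine frequent subsequence patterns by prefix projection.

    Patterns are grown one symbol at a time from projected suffixes, so no
    candidate set is generated. Returns {pattern: set of sequence indices}.
    """
    results = {}

    def grow(prefix, projected):
        # projected: list of (sequence index, start of the remaining suffix)
        if len(prefix) >= max_len:
            return
        occurrences = defaultdict(dict)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            for pos in range(start, len(seq)):
                # keep only the first occurrence of each symbol in the suffix
                occurrences[seq[pos]].setdefault(seq_id, pos + 1)
        for symbol, hits in sorted(occurrences.items()):
            if len(hits) >= min_support:
                pattern = prefix + symbol
                results[pattern] = set(hits)
                grow(pattern, list(hits.items()))

    grow("", [(i, 0) for i in range(len(sequences))])
    return results


def weighted_similarity(patterns, n):
    """Weighted Jaccard similarity between sequences (assumed measure)."""
    weight = {p: len(p) for p in patterns}  # assumption: weight = pattern length
    owned = [set() for _ in range(n)]
    for pattern, ids in patterns.items():
        for i in ids:
            owned[i].add(pattern)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = owned[i] | owned[j]
            if union:
                shared = owned[i] & owned[j]
                sim[i][j] = (sum(weight[p] for p in shared)
                             / sum(weight[p] for p in union))
    return sim


def snn_clusters(sim, k, min_shared):
    """Link mutual k-nearest neighbours that share enough neighbours,
    then return the connected components as clusters."""
    n = len(sim)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: (-sim[i][j], j))
        neighbours.append(set(ranked[:k]))

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbours[j] and j in neighbours[i]
            if mutual and len(neighbours[i] & neighbours[j]) >= min_shared:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy data (hypothetical): two "families" with different conserved motifs.
toy = ["ACDE", "ACDF", "ACDG", "KLMN", "KLMP", "KLMQ"]
toy_patterns = prefix_span(toy, min_support=2)
toy_sim = weighted_similarity(toy_patterns, len(toy))
print(snn_clusters(toy_sim, k=2, min_shared=1))  # → [[0, 1, 2], [3, 4, 5]]
```

Because the mined patterns are subsequences rather than aligned columns, the similarity does not depend on the conserved segments being adjacent, which is the limitation of alignment-based measures that the abstract points out. The thresholds `min_support`, `max_len`, `k`, and `min_shared` are illustrative values only.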