ProFaM: A Protein Sequence Family Mining Algorithm (cited by: 2)

ProFaM: An Efficient Algorithm for Protein Sequence Family Mining

Abstract: Reliable identification of protein families is a major challenge in bioinformatics, and clustering is one of the main ways to address it, since clustering protein sequences may help to identify novel relationships among proteins. However, many clustering algorithms cannot be readily applied to protein sequences, chiefly because the similarity between two protein sequences is hard to define. Similarity measures based on traditional sequence alignment assume that contiguity between homologous segments is conserved, an assumption that conflicts with genetic recombination. For remotely homologous family members, which share similar structure or related function, such measures cannot produce correct results. Information about protein motifs is very important to the analysis of protein family sequences.

This paper proposes ProFaM, a protein sequence family mining algorithm that works in two steps. In the first step, conserved motifs across protein sequences are mined with an efficient prefix-projection strategy that requires no candidate generation, and a new similarity measure is constructed from the resulting motifs and their weights. In the second step, the sequences are clustered into families with a shared nearest neighbor method under the new similarity measure. This addresses the shortcomings of earlier methods in motif mining and similarity design. Experiments on the Pfam protein family database show that ProFaM achieves good results, which suggests that it may be applied to protein family analysis.
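The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the motif weighting nor the similarity formula, so the choices below are assumptions made only for illustration. Motifs are taken to be gapped subsequences, a motif's weight is assumed to be its length, similarity is assumed to be a weighted Jaccard ratio, and the shared nearest neighbor step links two sequences when each is among the other's `k` nearest neighbors and they share at least `min_shared` neighbors. The names `mine_motifs`, `similarity`, and `snn_clusters` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mine_motifs(seqs, min_support, max_len=4):
    """Prefix-projection mining of gapped subsequence patterns, with no
    candidate generation. Returns {pattern: ids of sequences containing it}."""
    results = {}

    def grow(prefix, projected):
        # projected holds (seq_id, start) pairs: seqs[seq_id][start:] is the
        # suffix left after the earliest occurrence of `prefix`.
        if len(prefix) >= max_len:
            return
        occurrences = defaultdict(dict)  # symbol -> {seq_id: next start}
        for sid, start in projected:
            seq = seqs[sid]
            for pos in range(start, len(seq)):
                occurrences[seq[pos]].setdefault(sid, pos + 1)
        for symbol, where in sorted(occurrences.items()):
            if len(where) >= min_support:
                pattern = prefix + symbol
                results[pattern] = set(where)
                grow(pattern, sorted(where.items()))

    grow("", [(sid, 0) for sid in range(len(seqs))])
    return results


def similarity(motifs, n):
    """Weighted Jaccard similarity matrix; weight = motif length (assumption)."""
    owned = [set() for _ in range(n)]
    for pattern, sids in motifs.items():
        for sid in sids:
            owned[sid].add(pattern)

    def weight(patterns):
        return sum(len(p) for p in patterns)

    sim = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        union = weight(owned[i] | owned[j])
        shared = weight(owned[i] & owned[j])
        sim[i][j] = sim[j][i] = shared / union if union else 0.0
    return sim


def snn_clusters(sim, k, min_shared):
    """Shared nearest neighbor clustering via connected components."""
    n = len(sim)
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (-sim[i][j], j))
        knn.append(set(others[:k]))

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        mutual = i in knn[j] and j in knn[i]
        if mutual and len(knn[i] & knn[j]) >= min_shared:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy input: two made-up "families" that share no residues.
seqs = ["ACDEFG", "ACDEFH", "ACDEFK", "WYWYPQ", "WYWYPR", "WYWYPT"]
motifs = mine_motifs(seqs, min_support=2, max_len=3)
sim = similarity(motifs, len(seqs))
print(snn_clusters(sim, k=2, min_shared=1))
```

On this toy input the two groups of three sequences end up in separate clusters. Real protein data would need a tuned support threshold and a pattern-length limit, since the number of gapped patterns grows quickly with sequence length.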

Authors: Xiong Yun (熊赟), Chen Yue (陈越), Zhu Yangyong (朱扬勇)

Affiliation: Department of Computing and Information Technology, Fudan University

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2007, No. 7, pp. 1160-1168 (9 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

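Funding: National High Technology Research and Development Program of China (863 Program), grant 2006AA02Z329; National Natural Science Foundation of China, grant 60573093.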

Keywords: protein sequence; protein family; motif; clustering; data mining; bioinformatics

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
