Bioinformatics Features Based DNA Sequence Data Compression Algorithm
Abstract: This paper introduces biological features and biological meaning into the compression of DNA sequence data and proposes BioLZMA, a compression algorithm based on bioinformatics features. In BioLZMA, a DNA sequence is split and regrouped into four sets according to the biological meaning of its components: the coding sequence (CDS) set, the intron set, the RNA set, and the set of remaining sequences. Each set is preprocessed with a compression strategy tailored to the biological features of its sequences and is then encoded with the LZMA algorithm. Experimental results show that BioLZMA outperforms earlier DNA sequence compression methods on benchmark sequences. For long sequences with clear bioinformatics features in particular, the algorithm achieves a high compression ratio in a relatively short time.
Source: Acta Electronica Sinica (《电子学报》), 2011, No. 5, pp. 991-995 (5 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
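Funding: National Natural Science Foundation of China (No. 60872125); Fok Ying Tung Education Foundation Fund for Young Teachers in Higher Education Institutions (basic research project); Shenzhen Basic Research Project (Outstanding Young Scholar award); Natural Science Foundation of Guangdong Province.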
Keywords: DNA sequence data compression; bioinformatics; sequence regrouping; approximate repeat fragment; Lempel-Ziv-Markov chain algorithm (LZMA)
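
The split-and-regroup step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' BioLZMA implementation: the abstract does not specify the per-set preprocessing strategies, so they are omitted here, and the annotation format (non-overlapping, half-open `(start, end, kind)` intervals), the function names, and the `layout` record used to restore the original order are all assumptions made for illustration. Only the standard-library `lzma` module is used.

```python
import lzma

# The four sets named in the abstract; the key names are illustrative.
CLASSES = ("cds", "intron", "rna", "residual")


def split_by_annotation(seq, features):
    """Regroup `seq` into four strings, one per class.

    `features` is a hypothetical annotation list of non-overlapping,
    half-open intervals (start, end, kind) with kind in
    {"cds", "intron", "rna"}. Everything not covered goes to "residual".
    Returns the grouped strings and a layout that records the original order.
    """
    groups = {k: [] for k in CLASSES}
    layout = []  # (kind, length) pairs in original sequence order
    pos = 0
    for start, end, kind in sorted(features):
        if start > pos:  # unannotated gap before this feature
            groups["residual"].append(seq[pos:start])
            layout.append(("residual", start - pos))
        groups[kind].append(seq[start:end])
        layout.append((kind, end - start))
        pos = end
    if pos < len(seq):  # unannotated tail
        groups["residual"].append(seq[pos:])
        layout.append(("residual", len(seq) - pos))
    return {k: "".join(v) for k, v in groups.items()}, layout


def compress_groups(groups):
    """Compress each set separately with LZMA (no set-specific preprocessing)."""
    return {k: lzma.compress(s.encode("ascii")) for k, s in groups.items()}


def restore(blobs, layout):
    """Decompress each set and interleave the pieces back into the original order."""
    streams = {k: lzma.decompress(b).decode("ascii") for k, b in blobs.items()}
    offsets = {k: 0 for k in streams}
    parts = []
    for kind, length in layout:
        o = offsets[kind]
        parts.append(streams[kind][o:o + length])
        offsets[kind] = o + length
    return "".join(parts)


# Toy usage with a made-up sequence and made-up annotations.
seq = "ACGTTGCA" * 5
features = [(4, 12, "cds"), (12, 20, "intron"), (30, 36, "rna")]
groups, layout = split_by_annotation(seq, features)
blobs = compress_groups(groups)
```

Grouping sequences of the same biological type into one stream gives the dictionary coder more homogeneous input. In a real pipeline the layout record would also have to be stored alongside the compressed sets so the sequence can be reassembled.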