This article is part of the series Information Theoretic Methods for Bioinformatics.

Open Access Research Article

Aligning Sequences by Minimum Description Length

JohnS Conery

Author Affiliations

Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA

EURASIP Journal on Bioinformatics and Systems Biology 2007, 2007:72936 doi:10.1155/2007/72936


The electronic version of this article is the complete one and can be found online at: http://bsb.eurasipjournals.com/content/2007/1/72936


Received:26 February 2007
Revisions received:6 August 2007
Accepted:16 November 2007
Published:2 January 2008

© 2007 John S. Conery.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from . A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.

Research Article

References

  1. EW Myers, The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005). PubMed Abstract | Publisher Full Text OpenURL

  2. SF Altschul, TL Madden, AA Schaffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17), 3389–3402 (1997). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. AJ Phillips, Homology assessment and molecular sequence alignment. Journal of Biomedical Informatics 39(1), 18–33 (2006). PubMed Abstract | Publisher Full Text OpenURL

  4. JO Wrabl, NV Grishin, Gaps in structurally similar proteins: towards improvement of multiple sequence alignment. Proteins 54(1), 71–87 (2004). PubMed Abstract | Publisher Full Text OpenURL

  5. K Sjölander, Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 20(2), 170–179 (2004). PubMed Abstract | Publisher Full Text OpenURL

  6. B-JM Webb, JS Liu, CE Lawrence, BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Research 30(5), 1268–1277 (2002). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  7. J Rissanen, Modelling by the shortest data description. Automatica 14(5), 465–471 (1978). Publisher Full Text OpenURL

  8. P Grünwald, A minimum description length approach to grammar inference. Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Lecture Notes in Computer Science (Springer, Berlin, Germany, 1996) 1040, pp. 203–216

  9. A Brazma, I Jonassen, J Vilo, E Ukkonen, Pattern discovery in biosequences. in International Conference on Grammar Inference (ICGI '98), Lecture Notes in Artificial Intelligence, vol. 1433, ed. by Honavar V, Slutski G (Springer, Ames, Iowa, USA, 1998), pp. 257–270

  10. L Cai, RL Malmberg, Y Wu, Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics 19(suppl. 1), i66–i73 (2003). PubMed Abstract | Publisher Full Text OpenURL

  11. DB Searls, The computational linguistics of biological sequences. Artificial Intelligence and Molecular Biology, Menlo Park, Calif, USA (American Association for Artificial Intelligence, 1993), pp. 47–120

  12. D Bsearls, Linguistic approaches to biological sequences. Computer Applications in the Biosciences 13(4), 333–344 (1997). PubMed Abstract OpenURL

  13. A Bairoch, PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 20, 2013–2018 (1992). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. M Vingron, MS Waterman, Sequence alignment and penalty choice. Review of concepts, case studies and implications. Journal of Molecular Biology 235(1), 1–12 (1994). PubMed Abstract | Publisher Full Text OpenURL

  15. S Henikoff, Scores for sequence searches and alignments. Current Opinion in Structural Biology 6(3), 353–360 (1996). PubMed Abstract | Publisher Full Text OpenURL

  16. G Giribet, WC Wheeler, On gaps. Molecular Phylogenetics and Evolution 13(1), 132–143 (1999). PubMed Abstract | Publisher Full Text OpenURL

  17. Y Nozaki, M Bellgard, Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties. Bioinformatics 21(8), 1421–1428 (2005). PubMed Abstract | Publisher Full Text OpenURL

  18. JT Reese, WR Pearson, Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 18(11), 1500–1507 (2002). PubMed Abstract | Publisher Full Text OpenURL

  19. L Allison, CS Wallace, CN Yee, Finite-state models in the alignment of macromolecules. Journal of Molecular Evolution 35(1), 77–89 (1992). PubMed Abstract | Publisher Full Text OpenURL

  20. JP Schmidt, An information theoretic view of gapped and other alignments. Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98), Maui, Hawaii, USA, January 1998, 561–572

  21. T Aynechi, ID Kuntz, An information theoretic approach to macromolecular modeling: I. Sequence alignments. Biophysical Journal 89(5), 2998–3007 (2005). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. B Morgenstern, DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15(3), 211–218 (1999). PubMed Abstract | Publisher Full Text OpenURL

  23. M Brudno, M Chapman, B Göttgens, S Batzoglou, B Morgenstern, Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 4, 66 (2003). PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  24. TD Schneider, Information content of individual genetic sequences. Journal of Theoretical Biology 189(4), 427–441 (1997). PubMed Abstract | Publisher Full Text OpenURL

  25. N Krasnogor, DA Pelta, Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20(7), 1015–1021 (2004). PubMed Abstract | Publisher Full Text OpenURL

  26. JS Conery, Realign: grammar-based sequence alignment (University of Oregon, http://teleost), . cs.uoregon.edu/realign webcite

  27. MO Dayhoff, RM Schwartz, BC Orcutt, A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, Washington, DC, USA, 1978 5(suppl. 3), 345–352

  28. DW Mount, Bioinformatics: Sequence and Genome Analysis, 2nd edn. (Cold Spring Harbor Laboratory Press, New York, NY, USA, 2004)

  29. S Henikoff, JG Henikoff, Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 89(22), 10915–10919 (1992). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  30. GH Gonnet, MA Cohen, SA Benner, Exhaustive matching of the entire protein sequence database. Science 256(5062), 1443–1445 (1992). PubMed Abstract | Publisher Full Text OpenURL

  31. S Karlin, SF Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the United States of America 87(6), 2264–2268 (1990). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  32. SR Eddy, Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22(8), 1035–1036 (2004). PubMed Abstract | Publisher Full Text OpenURL

  33. C Aurrecoechea, M Heiges, H Wang, et al. ApiDB: integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Research 35, D427–D430 (2007). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  34. JD Thompson, DG Higgins, TJ Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22(22), 4673–4680 (1994). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  35. JD Thompson, F Plewniak, O Poch, A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27(13), 2682–2690 (1999). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  36. R Carter, Speculations on the origins of Plasmodium vivax malaria. Trends in Parasitology 19(5), 214–219 (2003). PubMed Abstract | Publisher Full Text OpenURL

  37. M Cline, R Hughey, K Karplus, Predicting reliable regions in protein sequence alignments. Bioinformatics 18(2), 306–314 (2002). PubMed Abstract | Publisher Full Text OpenURL

  38. JS Conery, M Lynch, Nucleotide substitutions and the evolution of duplicate genes. Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), Big Island of Hawaii, Hawaii, USA, January 2001, 167–178

  39. WR Pearson, Comparison of methods for searching protein sequence databases. Protein Science 4(6), 1145–1160 (1995). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  40. N Hulo, A Bairoch, V Bulliard, et al. The PROSITE database. Nucleic Acids Research 34, D227–D230 (2006). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL