This paper presents a new information theoretic framework for aligning sequences in
bioinformatics. A transmitter compresses a set of sequences by constructing a regular
expression that describes the regions of similarity in the sequences. To retrieve
the original set of sequences, a receiver generates all strings that match the expression.
An alignment algorithm uses minimum description length to encode and explore alternative
expressions; the expression with the shortest encoding provides the best overall alignment.
When two substrings contain letters that are similar according to a substitution matrix,
a code length function based on conditional probabilities defined by the matrix will
encode the substrings with fewer bits. In one experiment, alignments produced with
this new method were found to be comparable to alignments from
. A second experiment measured the accuracy of the new method on pairwise alignments
of sequences from the BAliBASE alignment benchmark.
Research Article
References
-
EW Myers, The fragment assembly string graph. Bioinformatics 21(suppl. 2), ii79–ii85 (2005). PubMed Abstract | Publisher Full Text
-
SF Altschul, TL Madden, AA Schaffer, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25(17), 3389–3402 (1997). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
AJ Phillips, Homology assessment and molecular sequence alignment. Journal of Biomedical Informatics 39(1), 18–33 (2006). PubMed Abstract | Publisher Full Text
-
JO Wrabl, NV Grishin, Gaps in structurally similar proteins: towards improvement of multiple sequence alignment. Proteins 54(1), 71–87 (2004). PubMed Abstract | Publisher Full Text
-
K Sjölander, Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 20(2), 170–179 (2004). PubMed Abstract | Publisher Full Text
-
B-JM Webb, JS Liu, CE Lawrence, BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Research 30(5), 1268–1277 (2002). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
J Rissanen, Modelling by the shortest data description. Automatica 14(5), 465–471 (1978). Publisher Full Text
-
P Grünwald, A minimum description length approach to grammar inference. Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Lecture Notes in Computer Science (Springer, Berlin, Germany, 1996) 1040, pp. 203–216
-
A Brazma, I Jonassen, J Vilo, E Ukkonen, Pattern discovery in biosequences. in International Conference on Grammar Inference (ICGI '98), Lecture Notes in Artificial Intelligence, vol. 1433, ed. by Honavar V, Slutski G (Springer, Ames, Iowa, USA, 1998), pp. 257–270
-
L Cai, RL Malmberg, Y Wu, Stochastic modeling of RNA pseudoknotted structures: a grammatical approach. Bioinformatics 19(suppl. 1), i66–i73 (2003). PubMed Abstract | Publisher Full Text
-
DB Searls, The computational linguistics of biological sequences. Artificial Intelligence and Molecular Biology, Menlo Park, Calif, USA (American Association for Artificial Intelligence, 1993), pp. 47–120
-
D Bsearls, Linguistic approaches to biological sequences. Computer Applications in the Biosciences 13(4), 333–344 (1997). PubMed Abstract
-
A Bairoch, PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research 20, 2013–2018 (1992). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
M Vingron, MS Waterman, Sequence alignment and penalty choice. Review of concepts, case studies and implications. Journal of Molecular Biology 235(1), 1–12 (1994). PubMed Abstract | Publisher Full Text
-
S Henikoff, Scores for sequence searches and alignments. Current Opinion in Structural Biology 6(3), 353–360 (1996). PubMed Abstract | Publisher Full Text
-
G Giribet, WC Wheeler, On gaps. Molecular Phylogenetics and Evolution 13(1), 132–143 (1999). PubMed Abstract | Publisher Full Text
-
Y Nozaki, M Bellgard, Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties. Bioinformatics 21(8), 1421–1428 (2005). PubMed Abstract | Publisher Full Text
-
JT Reese, WR Pearson, Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 18(11), 1500–1507 (2002). PubMed Abstract | Publisher Full Text
-
L Allison, CS Wallace, CN Yee, Finite-state models in the alignment of macromolecules. Journal of Molecular Evolution 35(1), 77–89 (1992). PubMed Abstract | Publisher Full Text
-
JP Schmidt, An information theoretic view of gapped and other alignments. Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98), Maui, Hawaii, USA, January 1998, 561–572
-
T Aynechi, ID Kuntz, An information theoretic approach to macromolecular modeling: I. Sequence alignments. Biophysical Journal 89(5), 2998–3007 (2005). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
B Morgenstern, DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15(3), 211–218 (1999). PubMed Abstract | Publisher Full Text
-
M Brudno, M Chapman, B Göttgens, S Batzoglou, B Morgenstern, Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 4, 66 (2003). PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text
-
TD Schneider, Information content of individual genetic sequences. Journal of Theoretical Biology 189(4), 427–441 (1997). PubMed Abstract | Publisher Full Text
-
N Krasnogor, DA Pelta, Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 20(7), 1015–1021 (2004). PubMed Abstract | Publisher Full Text
-
JS Conery, Realign: grammar-based sequence alignment (University of Oregon, http://teleost), . cs.uoregon.edu/realign webcite
-
MO Dayhoff, RM Schwartz, BC Orcutt, A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, Washington, DC, USA, 1978 5(suppl. 3), 345–352
-
DW Mount, Bioinformatics: Sequence and Genome Analysis, 2nd edn. (Cold Spring Harbor Laboratory Press, New York, NY, USA, 2004)
-
S Henikoff, JG Henikoff, Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 89(22), 10915–10919 (1992). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
GH Gonnet, MA Cohen, SA Benner, Exhaustive matching of the entire protein sequence database. Science 256(5062), 1443–1445 (1992). PubMed Abstract | Publisher Full Text
-
S Karlin, SF Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the United States of America 87(6), 2264–2268 (1990). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
SR Eddy, Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 22(8), 1035–1036 (2004). PubMed Abstract | Publisher Full Text
-
C Aurrecoechea, M Heiges, H Wang, et al. ApiDB: integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Research 35, D427–D430 (2007). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
JD Thompson, DG Higgins, TJ Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22(22), 4673–4680 (1994). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
JD Thompson, F Plewniak, O Poch, A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27(13), 2682–2690 (1999). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
R Carter, Speculations on the origins of Plasmodium vivax malaria. Trends in Parasitology 19(5), 214–219 (2003). PubMed Abstract | Publisher Full Text
-
M Cline, R Hughey, K Karplus, Predicting reliable regions in protein sequence alignments. Bioinformatics 18(2), 306–314 (2002). PubMed Abstract | Publisher Full Text
-
JS Conery, M Lynch, Nucleotide substitutions and the evolution of duplicate genes. Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), Big Island of Hawaii, Hawaii, USA, January 2001, 167–178
-
WR Pearson, Comparison of methods for searching protein sequence databases. Protein Science 4(6), 1145–1160 (1995). PubMed Abstract | Publisher Full Text | PubMed Central Full Text
-
N Hulo, A Bairoch, V Bulliard, et al. The PROSITE database. Nucleic Acids Research 34, D227–D230 (2006). PubMed Abstract | Publisher Full Text | PubMed Central Full Text




