

This article is part of the series Information Theoretic Methods for Bioinformatics.

Open Access Research Article

Compressing Proteomes: The Relevance of Medium Range Correlations

Dario Benedetto1, Emanuele Caglioti1 and Claudia Chica2*

Author Affiliations

1 Dipartimento di Matematica, Università di Roma "La Sapienza", Piazzale Aldo Moro 5, Roma 00185, Italy

2 Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, Heidelberg 69117, Germany


EURASIP Journal on Bioinformatics and Systems Biology 2007, 2007:60723 doi:10.1155/2007/60723


The electronic version of this article is the complete one and can be found online at: http://bsb.eurasipjournals.com/content/2007/1/60723


Received: 14 January 2007
Revisions received: 28 May 2007
Accepted: 10 September 2007
Published: 30 October 2007

© 2007 Dario Benedetto et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and medium range, that is, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that account for both types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.
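As a rough illustration of what a correlation "between amino acids located 10 or 100 residues apart" can mean in practice, the sketch below computes a plug-in estimate of the mutual information between residues separated by a fixed distance `d`. This is not the statistical model used in the article, which is not described in this excerpt; the function name `mutual_information` and the choice of mutual information as the correlation measure are assumptions made purely for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, d: int) -> float:
    """Plug-in estimate, in bits, of the mutual information between
    residues that are d positions apart in seq.

    Illustrative helper (hypothetical, not taken from the article).
    """
    # All ordered pairs (residue at i, residue at i + d).
    pairs = [(seq[i], seq[i + d]) for i in range(len(seq) - d)]
    n = len(pairs)
    if n == 0:
        return 0.0

    joint = Counter(pairs)                 # counts of (a, b)
    left = Counter(a for a, _ in pairs)    # counts of a at position i
    right = Counter(b for _, b in pairs)   # counts of b at position i + d

    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), expressed with raw counts.
        mi += (c / n) * log2(c * n / (left[a] * right[b]))
    return mi
```

For example, `mutual_information(proteome, 10)` and `mutual_information(proteome, 100)` would give one number per range for a concatenated proteome string. On finite data the plug-in estimate is biased upward, so in a real analysis it would normally be compared against the same quantity computed on a shuffled copy of the sequence.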

Research Article

[1–31]

References

  1. JC Wootton, Non-globular domains in protein sequences: automated segmentation using complexity measures. Computers & Chemistry 18(3), 269–285 (1994)

  2. BE Blaisdell, A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences. Journal of Molecular Evolution 19(2), 122–133 (1983)

  3. Y Almirantis, A Provata, An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome. BioEssays 23(7), 647–656 (2001)

  4. O Weiss, MA Jiménez-Montaño, H Herzel, Information content of protein sequences. Journal of Theoretical Biology 206(3), 379–386 (2000)

  5. CG Nevill-Manning, IH Witten, Protein is incompressible. Proceedings of the Data Compression Conference (DCC '99), Snowbird, Utah, USA, March 1999, 257–266

  6. T Matsumoto, K Sadakane, H Imai, Biological sequence compression algorithms. Genome Informatics 11, 43–52 (2000)

  7. MD Cao, TI Dix, L Allison, C Mears, A simple statistical algorithm for biological sequence compression. Proceedings of the Data Compression Conference (DCC '07), Snowbird, Utah, USA, March 2007, 43–52

  8. A Hategan, I Tabus, Protein is compressible. Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG '04), Espoo, Finland, June 2004, 192–195

  9. D Adjeroh, F Nan, On compressibility of protein sequences. Proceedings of the Data Compression Conference (DCC '06), Snowbird, Utah, USA, March 2006, 422–434

  10. G Sampath, A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae. Proceedings of the IEEE Bioinformatics Conference (CSB '03), Stanford, Calif, USA, August 2003, 287–293

  11. CE Shannon, A mathematical theory of communication. Bell System Technical Journal 27, 379–423 and 623–656 (1948)

  12. J Cleary, I Witten, Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications 32(4), 396–402 (1984)

  13. FMJ Willems, YM Shtarkov, TJ Tjalkens, The context-tree weighting method: basic properties. IEEE Transactions on Information Theory 41(3), 653–664 (1995)

  14. Integr8 web portal. [ftp://ftp.ebi.ac.uk/pub/databases/integr8/]

  15. J Abel, The data compression resource on the internet. [http://www.datacompression.info/]

  16. CA Orengo, JM Thornton, Protein families and their evolution—a structural perspective. Annual Review of Biochemistry 74, 867–900 (2005)

  17. J Heringa, The evolution and recognition of protein sequence repeats. Computers & Chemistry 18(3), 233–243 (1994)

  18. MA Andrade, C Petosa, SI O'Donoghue, CW Müller, P Bork, Comparison of ARM and HEAT protein repeats. Journal of Molecular Biology 309(1), 1–18 (2001)

  19. S Kirkpatrick, CD Gelatt Jr., MP Vecchi, Optimization by simulated annealing. Science 220(4598), 671–680 (1983)

  20. LA Mirny, EI Shakhnovich, Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. Journal of Molecular Biology 291(1), 177–196 (1999)

  21. MA Huynen, PF Stadler, W Fontana, Smoothness within ruggedness: the role of neutrality in adaptation. Proceedings of the National Academy of Sciences of the United States of America 93(1), 397–401 (1996)

  22. S Karlin, Statistical signals in bioinformatics. Proceedings of the National Academy of Sciences of the United States of America 102(38), 13355–13362 (2005)

  23. KA Dill, Dominant forces in protein folding. Biochemistry 29(31), 7133–7155 (1990)

  24. B Rost, Did evolution leap to create the protein universe? Current Opinion in Structural Biology 12(3), 409–416 (2002)

  25. J Rissanen, GG Langdon Jr., Arithmetic coding. IBM Journal of Research and Development 23(2), 149–162 (1979)

  26. SL Salzberg, AL Delcher, S Kasif, O White, Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26(2), 544–548 (1998)

  27. VP Turutina, AA Laskin, NA Kudryashov, KG Skryabin, EV Korotkov, Identification of latent periodicity in amino acid sequences of protein families. Biochemistry (Moscow) 71(1), 18–31 (2006)

  28. EV Korotkov, MA Korotkova, Enlarged similarity of nucleic acid sequences. DNA Research 3(3), 157–164 (1996)

  29. AC Camproux, P Tufféry, Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity. Biochimica et Biophysica Acta 1724(3), 394–403 (2005)

  30. SD Bentley, J Parkhill, Comparative genomic structure of prokaryotes. Annual Review of Genetics 38, 771–791 (2004)

  31. J Raes, JO Korbel, MJ Lercher, C von Mering, P Bork, Prediction of effective genome size in metagenomic samples. Genome Biology 8(1), R10 (2007)