This article is part of the series Information Theoretic Methods for Bioinformatics.

Open Access Research Article

Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates

Hasan M Aktulga (1)*, Ioannis Kontoyiannis (2), L Alex Lyznik (3), Lukasz Szpankowski (4), Ananth Y Grama (1) and Wojciech Szpankowski (1)

Author Affiliations

1 Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA

2 Department of Informatics, Athens University of Economics & Business, Patission 76, Athens 10434, Greece

3 Pioneer Hi-Bred International, Johnston, IA, USA

4 Bioinformatics Program, University of California, San Diego, CA 92093, USA


EURASIP Journal on Bioinformatics and Systems Biology 2007, 2007:14741 doi:10.1155/2007/14741


The electronic version of this article is the complete one and can be found online at: http://bsb.eurasipjournals.com/content/2007/1/14741


Received: 26 February 2007
Accepted: 25 September 2007
Published: 5 December 2007

© 2007 Hasan Metin Aktulga et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used to identify correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited to the problem of discovering short tandem repeats, an application of importance in genetic profiling.
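As a rough illustration of the kind of computation the abstract describes, the following Python sketch estimates the mutual information between two aligned, equal-length segments with the plug-in (empirical frequency) estimator and derives a significance cutoff from shuffled copies of one segment. The function names, the position-by-position pairing of symbols, and the permutation-based cutoff are assumptions made for illustration only; the abstract does not specify the form of the authors' threshold function, and this is not their implementation.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in bits) between two equal-length
    symbol sequences, paired position by position."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of symbol pairs (a, b)
    px = Counter(x)             # marginal counts for x
    py = Counter(y)             # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) p(b))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


def permutation_threshold(x, y, trials=200, quantile=0.95, seed=0):
    """Hypothetical significance cutoff (an assumption, not the paper's
    threshold): the given quantile of the estimates obtained after
    shuffling y, which destroys any positional dependence on x."""
    rng = random.Random(seed)
    shuffled = list(y)
    null = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        null.append(mutual_information(x, shuffled))
    null.sort()
    return null[min(trials - 1, int(quantile * trials))]


if __name__ == "__main__":
    seg_a = "ACGT" * 25
    seg_b = "ACGT" * 25
    estimate = mutual_information(seg_a, seg_b)
    cutoff = permutation_threshold(seg_a, seg_b)
    print(f"estimate = {estimate:.3f} bits, cutoff = {cutoff:.3f} bits")
    print("dependent" if estimate > cutoff else "no evidence of dependence")
```

An estimate above the cutoff is treated as evidence of dependence between the two segments. The plug-in estimator is biased upward for short segments, which is one reason an explicit threshold is needed rather than a comparison against zero.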

