Understanding and quantifying how information is represented in organisms, and how much
of it they carry, has become a central part of biological research, as these questions
potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic
tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically
correlated. We develop a precise and reliable methodology, based on the notion of
mutual information, for finding and extracting statistical as well as structural dependencies. A simple
threshold function is defined, and its use in quantifying the level of significance
of dependencies between biological segments is explored. These tools are used in two
specific applications. First, they are used for the identification of correlations
between different parts of the maize zmSRp32 gene. There, we find significant dependencies
between the untranslated region in zmSRp32 and its alternatively spliced exons. This observation
may indicate the presence of as-yet unknown alternative splicing mechanisms or structural
scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we
demonstrate that our approach is particularly well suited to the problem of discovering
short tandem repeats, an application of importance in genetic profiling.
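
To make the idea of a mutual-information-based dependence measure concrete, the following is a minimal sketch, not the methodology or threshold function of the paper. It computes the plug-in (empirical) mutual information between two aligned symbol sequences and, as a stand-in for a significance threshold, estimates a p-value by shuffling one of them. The function names `mutual_information` and `permutation_p_value`, the choice of a permutation test, and the parameters `trials` and `seed` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two equal-length symbol sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of aligned symbol pairs
    count_x = Counter(x)        # marginal counts for x
    count_y = Counter(y)        # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (count_x[a] * count_y[b])
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


def permutation_p_value(x, y, trials=1000, seed=0):
    """Hypothetical significance check (an assumption, not the paper's threshold):
    the fraction of shuffles of y whose estimate is at least the observed one."""
    rng = random.Random(seed)
    observed = mutual_information(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # destroys any positional dependence on x
        if mutual_information(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    seg_a = "ACGT" * 10
    seg_b = "ACGT" * 10
    mi, p = permutation_p_value(seg_a, seg_b, trials=200)
    print(f"estimated MI = {mi:.3f} bits, permutation p-value = {p:.4f}")
```

A small p-value here only says that the observed estimate is unlikely under random re-pairing of positions. The plug-in estimate is biased upward for short segments, which is one reason a carefully chosen threshold matters in practice.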
Research Article




