Open Access Research

Integrating multi-platform genomic data using hierarchical Bayesian relevance vector machines

Sanvesh Srivastava (1), Wenyi Wang (2), Ganiraju Manyam (2), Carlos Ordonez (3) and Veerabhadran Baladandayuthapani (4)*

Author Affiliations

1 Department of Statistics, Purdue University, 250 N. University Street, West Lafayette, IN 47907, USA

2 Department of Bioinformatics and Computational Biology, Division of Quantitative Sciences, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Unit 1411, Houston, Texas, USA

3 Department of Computer Science, University of Houston, 4800 Calhoun, Houston, Texas, USA

4 Department of Biostatistics, Division of Quantitative Sciences, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Unit 1411, Houston, Texas, USA

EURASIP Journal on Bioinformatics and Systems Biology 2013, 2013:9 doi:10.1186/1687-4153-2013-9

Published: 28 June 2013

Abstract

Background

Recent advances in genome technologies and the subsequent collection of genomic information at various molecular resolutions hold promise to accelerate the discovery of new therapeutic targets. A critical step in achieving this goal is to develop efficient clinical prediction models that integrate these diverse sources of high-throughput data. This step is challenging because the data are high-dimensional and contain complex interactions. For predicting relevant clinical outcomes, we propose a flexible statistical machine learning approach that models the interaction between platform-specific measurements through nonlinear kernel machines and borrows information within and between platforms through a hierarchical Bayesian framework. The parameters of our model have direct interpretations in terms of the effects of platforms and of data interactions within and across platforms. Parameter estimation uses a computationally efficient variational Bayes approach that scales well to large high-throughput datasets.
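To make the idea of platform-specific kernels with an interaction term concrete, the following is a minimal sketch and not the authors' implementation. It assumes Gaussian (RBF) kernels, an elementwise product of the two platform kernels as the between-platform interaction term, and fixed illustrative weights. The function names `rbf_kernel` and `combined_kernel`, the bandwidth default, and the simulated data are all hypothetical. In the proposed model the corresponding effects would be estimated within the hierarchical Bayesian framework rather than fixed by hand.

```python
import numpy as np

def rbf_kernel(X, gamma=None):
    """Gaussian (RBF) kernel matrix between the rows (samples) of X."""
    X = np.asarray(X, dtype=float)
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # assumed default bandwidth, not from the paper
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    return np.exp(-gamma * sq_dists)

def combined_kernel(X_mrna, X_mirna, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of two platform kernels and their interaction kernel."""
    K_mrna = rbf_kernel(X_mrna)
    K_mirna = rbf_kernel(X_mirna)
    # The elementwise product of two valid kernels is again a valid kernel
    # (Schur product theorem); it is used here as an assumed interaction term.
    K_inter = K_mrna * K_mirna
    w_mrna, w_mirna, w_inter = weights
    return w_mrna * K_mrna + w_mirna * K_mirna + w_inter * K_inter

# Simulated stand-in data: 20 samples, 50 mRNA and 10 microRNA features.
rng = np.random.default_rng(0)
X_mrna = rng.normal(size=(20, 50))
X_mirna = rng.normal(size=(20, 10))
K = combined_kernel(X_mrna, X_mirna)
```

The resulting matrix `K` has one row and one column per sample and could serve as the design matrix of a kernel-based regression on survival time. Setting the third weight to zero removes the between-platform interaction, which gives an additive integrative model for comparison.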

Results

We apply our methods to integrate gene/mRNA expression and microRNA profiles for predicting patient survival times in the glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas (TCGA). In terms of prediction accuracy, we show that our non-linear and interaction-based integrative methods perform better than linear alternatives and than non-integrative methods that do not account for interactions between the platforms. We also find several prognostic mRNAs and microRNAs that are related to tumor invasion and are known to drive tumor metastasis and severe inflammatory response in GBM. In addition, our analysis reveals several mRNA and microRNA interactions that have known implications in the etiology of GBM.

Conclusions

Our approach gains its flexibility and power by modeling the non-linear interaction structures between and within the platforms. Our framework is a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers. Our software is freely available at http://odin.mdacc.tmc.edu/~vbaladan.

Keywords:
Bayesian modeling; Multiple kernel learning; Genomics; High-dimensional data analysis; Prediction; Variational inference