Open Access Research Article

Extraction of Protein Interaction Data: A Comparative Analysis of Methods in Use

Hena Jose, Thangavel Vadivukarasi and Jyothi Devakumar*

Author Affiliations

Jubilant Biosys Ltd., #96, Industrial Suburb, 2nd Stage, Yeshwanthpur, Bangalore 560 022, India

EURASIP Journal on Bioinformatics and Systems Biology 2007, 2007:53096 doi:10.1155/2007/53096


The electronic version of this article is the complete one and can be found online at: http://bsb.eurasipjournals.com/content/2007/1/53096


Received: 31 March 2007
Accepted: 8 October 2007
Published: 9 December 2007

© 2007 Hena Jose et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Several natural language processing (NLP) tools, both commercial and freely available, are used to extract protein interactions from publications. The methods used by these tools range from pattern matching to dynamic programming, each with its own recall and precision rates. A methodical survey that compares these tools with manual analysis, keeping in mind the minimum interaction information a researcher would need, has not been carried out. We compared data generated by a selection of NLP tools with manually curated protein interaction data (PathArt and IMaps) to determine their recall and precision rates. The rates were found to be lower than the published scores when a normalized definition of interaction was considered. Each data point that a tool captured wrongly or failed to pick up was analyzed. Our evaluation brings out critical failures of NLP tools and provides pointers for the development of an ideal NLP tool.
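As an illustration of how recall and precision can be computed against a manually curated reference set, the following is a minimal sketch and not the procedure used in this study. It assumes that an interaction is represented as a triple of two protein names and an interaction type, and that the pair is treated as undirected; both are assumptions made for the example, since the normalized definition of interaction is not spelled out here. The function names and the toy protein names are hypothetical.

```python
def normalize(interaction):
    """Normalize a (protein_a, protein_b, kind) triple.

    Assumption for this sketch: names are compared case-insensitively and
    the protein pair is undirected, so the two names are sorted.
    """
    a, b, kind = interaction
    first, second = sorted((a.strip().lower(), b.strip().lower()))
    return (first, second, kind.strip().lower())


def precision_recall(extracted, curated):
    """Return (precision, recall) of extracted triples against curated ones."""
    extracted_set = {normalize(i) for i in extracted}
    curated_set = {normalize(i) for i in curated}
    true_positives = len(extracted_set & curated_set)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(curated_set) if curated_set else 0.0
    return precision, recall


# Toy data with hypothetical protein names, not taken from PathArt or IMaps.
curated = [
    ("ProtA", "ProtB", "binding"),
    ("ProtC", "ProtD", "phosphorylation"),
    ("ProtE", "ProtF", "binding"),
]
extracted = [
    ("protb", "ProtA", "binding"),   # matches after normalization
    ("ProtC", "ProtD", "binding"),   # right pair, wrong interaction type
]

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one of the two extracted triples matches the curated set, and one of the three curated triples is recovered. Requiring the interaction type to match, rather than only the protein pair, is the kind of stricter definition under which scores would be expected to drop.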
