Open Access Review

Scientific knowledge is possible with small-sample classification

Edward R Dougherty 1,2* and Lori A Dalton 3

Author Affiliations

1 Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA

2 Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA

3 Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA

EURASIP Journal on Bioinformatics and Systems Biology 2013, 2013:10  doi:10.1186/1687-4153-2013-10

Published: 20 August 2013

Abstract

A typical small-sample biomarker classification paper discriminates between types of pathology based on, say, 30,000 genes and a small labeled sample of fewer than 100 points. Some classification rule is used to design the classifier from these data, but we are given no good reason or conditions under which this algorithm should perform well. An error estimation rule is used to estimate the classification error on the population from the same data, but once again we are given no good reason or conditions under which this error estimator should produce a good estimate, and thus we do not know how well the classifier should be expected to perform. In fact, in virtually all such papers the error estimate is expected to be highly inaccurate. In short, we are given no justification for any claims.
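The inaccuracy of data-driven error estimates in small samples is easy to demonstrate by simulation. The sketch below (not from the paper; the nearest-centroid classifier, leave-one-out cross-validation, and all parameter choices are illustrative assumptions) draws both classes from the same distribution, so the true error of any classifier is 0.5, and shows how widely the leave-one-out estimate scatters from sample to sample at n = 20:

```python
import numpy as np

rng = np.random.default_rng(0)

def loocv_error(X, y):
    """Leave-one-out error of a simple nearest-centroid classifier."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        m0 = Xtr[ytr == 0].mean(axis=0)
        m1 = Xtr[ytr == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - m1) < np.linalg.norm(X[i] - m0))
        errors += pred != y[i]
    return errors / n

# Both classes are drawn from the SAME distribution, so the true error
# of any classifier is 0.5.  With only n = 20 labeled points, the
# leave-one-out estimate varies widely across repeated samples.
estimates = []
for _ in range(200):
    X = rng.normal(size=(20, 5))
    y = np.repeat([0, 1], 10)
    estimates.append(loocv_error(X, y))

print(f"mean = {np.mean(estimates):.2f}, std = {np.std(estimates):.2f}")
```

The spread of these estimates, not their mean, is the epistemological problem: any single small-sample study reports one draw from this distribution with no indication of how far it may lie from the true error.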

Given the ubiquity of vacuous small-sample classification papers in the literature, one could easily conclude that scientific knowledge is impossible in small-sample settings. It is not that thousands of papers overtly claim that scientific knowledge is impossible in regard to their content; rather, it is that they use methods that preclude scientific knowledge. In this paper, we argue, to the contrary, that scientific knowledge in small-sample classification is possible provided there is sufficient prior knowledge. A natural way to proceed, discussed herein, is via a paradigm for pattern recognition in which we incorporate prior knowledge into the whole classification procedure (classifier design and error estimation), optimize each step of the procedure given the available information, and obtain theoretical measures of performance for both classifiers and error estimators, the latter being the critical epistemological issue. In sum, we can achieve scientific validation for a proposed small-sample classifier and its error estimate.
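To make the prior-knowledge paradigm concrete, the following sketch illustrates one simple instance of a posterior expected (Bayesian MMSE-style) error estimate in a discrete model: the class-conditional bin probabilities carry Dirichlet priors, and the expected error of the resulting histogram classifier has a closed form under the posterior. All names, the toy counts, and the pseudo-count priors here are hypothetical illustrations, not the paper's model:

```python
import numpy as np

def bayes_mmse_error(counts0, counts1, alpha, beta, c):
    """Posterior expected error of a histogram classifier in a discrete
    model with Dirichlet priors on the class-conditional bin
    probabilities.  Prior knowledge enters through alpha, beta, and the
    known class prior c = P(Y=0)."""
    # Posterior means of the bin probabilities for each class.
    p = (alpha + counts0) / (alpha.sum() + counts0.sum())
    q = (beta + counts1) / (beta.sum() + counts1.sum())
    # Histogram classifier: label each bin by the larger posterior mass.
    psi = (c * p < (1 - c) * q).astype(int)  # 1 where class 1 wins
    # Each bin contributes the probability mass of the losing class.
    return float(np.where(psi == 1, c * p, (1 - c) * q).sum())

# Toy data: 3 bins, 10 points per class, mildly informative priors.
counts0 = np.array([6, 3, 1])
counts1 = np.array([1, 3, 6])
alpha = np.array([2.0, 1.0, 1.0])  # hypothetical prior pseudo-counts
beta = np.array([1.0, 1.0, 2.0])
err = bayes_mmse_error(counts0, counts1, alpha, beta, c=0.5)
print(err)  # posterior expected error = 4/14, about 0.286
```

The key point of the paradigm is visible even in this toy: the error estimate is optimal relative to an explicitly stated prior over the feature-label distribution, so its accuracy can be analyzed theoretically rather than merely hoped for.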