<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1687-4153-2007-16354</ui>
   <ji>1687-4153</ji>
   <fm>
      <dochead>Research Article</dochead>
      <bibl>
         <title>
            <p>Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation</p>
         </title>
         <aug>
            <au ca="yes" id="A1"><snm>Xiao</snm><fnm>Yufei</fnm><insr iid="I1"/><email>fei@neo.tamu.edu</email></au>
            <au id="A2"><snm>Hua</snm><fnm>Jianping</fnm><insr iid="I2"/><email>jhua@tgen.org</email></au>
            <au id="A3"><snm>Dougherty</snm><mi>R</mi><fnm>Edward</fnm><insr iid="I1"/><insr iid="I2"/><email>e-dougherty@tamu.edu</email></au>
         </aug>
         <insg>
            <ins id="I1"><p>Department of Electrical and Computer Engineering, Texas A&amp;M University, College Station, TX 77843, USA</p></ins>
            <ins id="I2"><p>Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA</p></ins>
         </insg>
         <source>EURASIP Journal on Bioinformatics and Systems Biology</source>
         <issn>1687-4153</issn>
         <pubdate>2007</pubdate>
         <volume>2007</volume>
         <issue>1</issue>
         <fpage>16354</fpage>
         <url>http://bsb.eurasipjournals.com/content/2007/1/16354</url>
         <xrefbib><pubid idtype="doi">10.1155/2007/16354</pubid></xrefbib>
      </bibl>
      <history><rec><date><day>7</day><month>8</month><year>2006</year></date></rec><revrec><date><day>21</day><month>12</month><year>2006</year></date></revrec><acc><date><day>26</day><month>12</month><year>2006</year></date></acc><pub><date><day>19</day><month>2</month><year>2007</year></date></pub></history>
      <cpyrt><year>2007</year><collab>Yufei Xiao et al.</collab><note>This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
      <abs>
         <sec>
            <st>
               <p/>
            </st>
            <p>Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, that is, the distribution of the difference between the estimated and true errors. Past studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution beyond the variation present in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, accounting for over half of the variance in many of the cases studied. We consider linear-discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the <inline-formula><graphic file="1687-4153-2007-16354-i1.gif"/></inline-formula>-test for feature selection; and <inline-formula><graphic file="1687-4153-2007-16354-i2.gif"/></inline-formula>-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection than when cross-validation is performed on a given feature set. This is reflected in the observed positive values of the CRIDD, which quantifies the contribution of feature selection to the deviation variance.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p/>
         </st>
         <p>[<abbr bid="B1">1</abbr>, <abbr bid="B2">2</abbr>, <abbr bid="B3">3</abbr>, <abbr bid="B4">4</abbr>, <abbr bid="B5">5</abbr>, <abbr bid="B6">6</abbr>, <abbr bid="B7">7</abbr>, <abbr bid="B8">8</abbr>, <abbr bid="B9">9</abbr>, <abbr bid="B10">10</abbr>, <abbr bid="B11">11</abbr>, <abbr bid="B12">12</abbr>, <abbr bid="B13">13</abbr>]</p>
      </sec>
   </bdy>
   <bm>
      <refgrp><bibl id="B1"><aug><au><snm>Devroye</snm><fnm>L</fnm></au><au><snm>Gyorfi</snm><fnm>L</fnm></au><au><snm>Lugosi</snm><fnm>G</fnm></au></aug><source>A Probabilistic Theory of Pattern Recognition</source><publisher>Springer, New York, NY, USA</publisher><pubdate>1996</pubdate></bibl><bibl id="B2"><title><p>Is cross-validation valid for small-sample microarray classification?</p></title><aug><au><snm>Braga-Neto</snm><fnm>U</fnm></au><au><snm>Dougherty</snm><fnm>ER</fnm></au></aug><source>Bioinformatics</source><pubdate>2004</pubdate><volume>20</volume><issue>3</issue><fpage>374</fpage><lpage>380</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btg419</pubid><pubid idtype="pmpid" link="fulltext">14960464</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Bolstered error estimation</p></title><aug><au><snm>Braga-Neto</snm><fnm>U</fnm></au><au><snm>Dougherty</snm><fnm>ER</fnm></au></aug><source>Pattern Recognition</source><pubdate>2004</pubdate><volume>37</volume><issue>6</issue><fpage>1267</fpage><lpage>1281</lpage><xrefbib><pubid idtype="doi">10.1016/j.patcog.2003.08.017</pubid></xrefbib></bibl><bibl id="B4"><title><p>Superior feature-set ranking for small samples using bolstered error estimation</p></title><aug><au><snm>Sima</snm><fnm>C</fnm></au><au><snm>Braga-Neto</snm><fnm>U</fnm></au><au><snm>Dougherty</snm><fnm>ER</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>7</issue><fpage>1046</fpage><lpage>1054</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti081</pubid><pubid idtype="pmpid" link="fulltext">15514003</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Impact of error estimation on feature 
selection</p></title><aug><au><snm>Sima</snm><fnm>C</fnm></au><au><snm>Attoor</snm><fnm>S</fnm></au><au><snm>Braga-Neto</snm><fnm>U</fnm></au><au><snm>Lowey</snm><fnm>J</fnm></au><au><snm>Suh</snm><fnm>E</fnm></au><au><snm>Dougherty</snm><fnm>ER</fnm></au></aug><source>Pattern Recognition</source><pubdate>2005</pubdate><volume>38</volume><issue>12</issue><fpage>2472</fpage><lpage>2482</lpage><xrefbib><pubid idtype="doi">10.1016/j.patcog.2005.03.026</pubid></xrefbib></bibl><bibl id="B6"><title><p>Prediction error estimation: a comparison of resampling methods</p></title><aug><au><snm>Molinaro</snm><fnm>AM</fnm></au><au><snm>Simon</snm><fnm>R</fnm></au><au><snm>Pfeiffer</snm><fnm>RM</fnm></au></aug><source>Bioinformatics</source><pubdate>2005</pubdate><volume>21</volume><issue>15</issue><fpage>3301</fpage><lpage>3307</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/bti499</pubid><pubid idtype="pmpid" link="fulltext">15905277</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Floating search methods in feature selection</p></title><aug><au><snm>Pudil</snm><fnm>P</fnm></au><au><snm>Novovicova</snm><fnm>J</fnm></au><au><snm>Kittler</snm><fnm>J</fnm></au></aug><source>Pattern Recognition Letters</source><pubdate>1994</pubdate><volume>15</volume><issue>11</issue><fpage>1119</fpage><lpage>1125</lpage><xrefbib><pubid idtype="doi">10.1016/0167-8655(94)90127-9</pubid></xrefbib></bibl><bibl id="B8"><title><p>Feature selection increases cross-validation imprecision</p></title><aug><au><snm>Xiao</snm><fnm>Y</fnm></au><au><snm>Hua</snm><fnm>J</fnm></au><au><snm>Dougherty</snm><fnm>ER</fnm></au></aug><source>Proceedings of the 4th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS &apos;06), College Station, Tex, USA, May 2006</source></bibl><bibl id="B9"><title><p>Gene expression profiling predicts clinical outcome of breast cancer</p></title><aug><au><snm>van&apos;t 
Veer</snm><fnm>LJ</fnm></au><au><snm>Dai</snm><fnm>H</fnm></au><au><snm>van de Vijver</snm><fnm>MJ</fnm></au><etal/></aug><source>Nature</source><pubdate>2002</pubdate><volume>415</volume><issue>6871</issue><fpage>530</fpage><lpage>536</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/415530a</pubid><pubid idtype="pmpid" link="fulltext">11823860</pubid></pubidlist></xrefbib></bibl><bibl id="B10"><title><p>A gene-expression signature as a predictor of survival in breast cancer</p></title><aug><au><snm>van de Vijver</snm><fnm>MJ</fnm></au><au><snm>He</snm><fnm>YD</fnm></au><au><snm>van&apos;t Veer</snm><fnm>LJ</fnm></au><etal/></aug><source>New England Journal of Medicine</source><pubdate>2002</pubdate><volume>347</volume><issue>25</issue><fpage>1999</fpage><lpage>2009</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1056/NEJMoa021967</pubid><pubid idtype="pmpid" link="fulltext">12490681</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Genetic test bed for feature selection</p></title><aug><au><snm>Choudhary</snm><fnm>A</fnm></au><au><snm>Brun</snm><fnm>M</fnm></au><au><snm>Hua</snm><fnm>J</fnm></au><au><snm>Lowey</snm><fnm>J</fnm></au><au><snm>Suh</snm><fnm>E</fnm></au><au><snm>Dougherty</snm><fnm>ER</fnm></au></aug><source>Bioinformatics</source><pubdate>2006</pubdate><volume>22</volume><issue>7</issue><fpage>837</fpage><lpage>842</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btl008</pubid><pubid idtype="pmpid" link="fulltext">16428263</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><title><p>Feature selection: evaluation, application, and small sample performance</p></title><aug><au><snm>Jain</snm><fnm>A</fnm></au><au><snm>Zongker</snm><fnm>D</fnm></au></aug><source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source><pubdate>1997</pubdate><volume>19</volume><issue>2</issue><fpage>153</fpage><lpage>158</lpage><xrefbib><pubid idtype="doi">10.1109/34.574797</pubid></xrefbib></bibl><bibl 
id="B13"><title><p>Comparison of algorithms that select features for pattern classifiers</p></title><aug><au><snm>Kudo</snm><fnm>M</fnm></au><au><snm>Sklansky</snm><fnm>J</fnm></au></aug><source>Pattern Recognition</source><pubdate>2000</pubdate><volume>33</volume><issue>1</issue><fpage>25</fpage><lpage>41</lpage><xrefbib><pubid idtype="doi">10.1016/S0031-3203(99)00041-2</pubid></xrefbib></bibl></refgrp>
   </bm>
</art>