An empirical study of feature selection in binary classification with DNA microarray data

dc.contributor.advisor: Hess, Kenneth
dc.creator: Lecocke, Michael Louis
dc.date.accessioned: 2009-06-04T08:09:34Z
dc.date.available: 2009-06-04T08:09:34Z
dc.date.issued: 2005
dc.description.abstract: Motivation. Binary classification is a common problem in many types of research, including clinical applications of gene expression microarrays. This research comprises a large-scale empirical study involving a rigorous and systematic comparison of classifiers, in terms of supervised learning methods and both univariate and multivariate feature selection approaches. Other principal areas of investigation are the use of cross-validation (CV) and how to guard against the effects of optimism and selection bias when assessing candidate classifiers via CV. Selection bias is taken into account by ensuring that feature selection is performed during training of the classification rule at each stage of the CV process ("external CV"), which to date has not been the traditional approach to performing cross-validation.
Results. A large-scale empirical comparison study is presented, in which a 10-fold CV procedure is applied internally and externally to a univariate feature selection process as well as to two genetic algorithm (GA)-based feature selection processes. These procedures are used in conjunction with six supervised learning algorithms across six published two-class clinical microarray datasets. External CV generally provided more realistic and honest misclassification error rates than internal CV. Also, although the more sophisticated multivariate feature subset selection (FSS) approaches were able to select gene subsets that could not be found by combining genes from even the top 100 of the univariately ranked gene list, neither of the two GA-based methods led to significantly better 10-fold internal or external CV error rates. Considering all the selection bias estimates together across all subset sizes, learning algorithms, and datasets, the average bias estimates from each of the GA-based methods were roughly 2.5 times that of the univariate-based method.
Ultimately, this research has put to the test the more traditional implementations of the statistical learning aspects of cross-validation and feature selection, and has provided a solid foundation on which these issues can and should be further investigated when performing limited-sample classification studies using high-dimensional gene expression data.
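The distinction between internal and external CV described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the pure-noise dataset, the t-like ranking score, the nearest-centroid classifier, and every function name below are assumptions chosen only to keep the example self-contained. Because the simulated features carry no class signal, any apparent accuracy of the internal variant reflects selection bias alone.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def make_noise_data(n_samples=40, n_features=500, seed=0):
    """Hypothetical data: Gaussian noise features, labels unrelated to them."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [0 if i < n_samples // 2 else 1 for i in range(n_samples)]
    return X, y


def top_k_features(X, y, k):
    """Univariate ranking by a t-like score (an assumed stand-in statistic)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        ma, mb = mean(a), mean(b)
        va = mean((v - ma) ** 2 for v in a)
        vb = mean((v - mb) ** 2 for v in b)
        scores.append((abs(ma - mb) / (va + vb + 1e-12) ** 0.5, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_predict(X_train, y_train, X_test, features):
    """Assign each test row to the class with the closer centroid."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    preds = []
    for r in X_test:
        dist = {
            label: sum((r[j] - c) ** 2 for j, c in zip(features, cen))
            for label, cen in centroids.items()
        }
        preds.append(min(dist, key=dist.get))
    return preds


def cv_error(X, y, k_features, n_folds=10, external=True):
    """Misclassification rate from n_folds-fold CV.

    external=True : features are re-selected on each training fold.
    external=False: features are selected once on ALL samples (internal CV).
    """
    n = len(X)
    # Interleaved folds; class-balanced for the default 40 samples, 10 folds.
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    fixed = None if external else top_k_features(X, y, k_features)
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        feats = top_k_features(X_tr, y_tr, k_features) if external else fixed
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in test_idx], feats)
        errors += sum(p != y[i] for p, i in zip(preds, test_idx))
    return errors / n


if __name__ == "__main__":
    X, y = make_noise_data()
    print("internal CV error:", cv_error(X, y, 10, external=False))
    print("external CV error:", cv_error(X, y, 10, external=True))
```

The only difference between the two modes is where `top_k_features` is called: once on all samples (internal), or inside the loop on the training fold only (external). On noise data such as this, the internal estimate would be expected to look optimistic relative to the external one.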
dc.format.extent: 200 p.
dc.format.mimetype: application/pdf
dc.identifier.callno: THESIS STAT. 2005 LECOCKE
dc.identifier.citation: Lecocke, Michael Louis. "An empirical study of feature selection in binary classification with DNA microarray data." (2005) Diss., Rice University. https://hdl.handle.net/1911/18776.
dc.identifier.uri: https://hdl.handle.net/1911/18776
dc.language.iso: eng
dc.rights: Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
dc.subject: Statistics
dc.title: An empirical study of feature selection in binary classification with DNA microarray data
dc.type: Thesis
dc.type.material: Text
thesis.degree.department: Statistics
thesis.degree.discipline: Engineering
thesis.degree.grantor: Rice University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
Files
Original bundle
Name: 3168098.PDF
Size: 12.62 MB
Format: Adobe Portable Document Format