An empirical study of feature selection in binary classification with DNA microarray data

Lecocke, Michael Louis

dc.contributor.advisor	Hess, Kenneth	en_US
dc.creator	Lecocke, Michael Louis	en_US
dc.date.accessioned	2009-06-04T08:09:34Z	en_US
dc.date.available	2009-06-04T08:09:34Z	en_US
dc.date.issued	2005	en_US
dc.description.abstract	Motivation. Binary classification is a common problem in many types of research, including clinical applications of gene expression microarrays. This research comprises a large-scale empirical study involving a rigorous and systematic comparison of classifiers, in terms of supervised learning methods and both univariate and multivariate feature selection approaches. Other principal areas of investigation involve the use of cross-validation (CV) and how to guard against the effects of optimism and selection bias when assessing candidate classifiers via CV. This is taken into account by ensuring that the feature selection is performed during training of the classification rule at each stage of a CV process ("external CV"), which to date has not been the traditional approach to performing cross-validation. Results. A large-scale empirical comparison study is presented, in which a 10-fold CV procedure is applied internally and externally to a univariate feature selection process as well as two genetic algorithm (GA)-based feature selection processes. These procedures are used in conjunction with six supervised learning algorithms across six published two-class clinical microarray datasets. It was found that external CV generally provided more realistic and honest misclassification error rates than internal CV. Also, although the more sophisticated multivariate feature subset selection (FSS) approaches were able to select gene subsets that could not be found by combining genes from even the top 100 univariately ranked genes, neither of the two GA-based methods led to significantly better 10-fold internal or external CV error rates. Considering all the selection bias estimates together across all subset sizes, learning algorithms, and datasets, the average bias estimates from each of the GA-based methods were roughly 2.5 times that of the univariate-based method.
Ultimately, this research has put to the test the more traditional implementations of the statistical learning aspects of cross-validation and feature selection, and has provided a solid foundation on which these issues can and should be further investigated when performing limited-sample classification studies using high-dimensional gene expression data.	en_US
dc.format.extent	200 p.	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.callno	THESIS STAT. 2005 LECOCKE	en_US
dc.identifier.citation	Lecocke, Michael Louis. "An empirical study of feature selection in binary classification with DNA microarray data." (2005) Diss., Rice University. https://hdl.handle.net/1911/18776.	en_US
dc.identifier.uri	https://hdl.handle.net/1911/18776	en_US
dc.language.iso	eng	en_US
dc.rights	Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.	en_US
dc.subject	Statistics	en_US
dc.title	An empirical study of feature selection in binary classification with DNA microarray data	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Statistics	en_US
thesis.degree.discipline	Engineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US

Files

Original bundle

Name: 3168098.PDF
Size: 12.62 MB
Format: Adobe Portable Document Format

Collections

Rice University Theses and Dissertations