Browsing by Author "Scott, David W"
Now showing 1 - 4 of 4
Item: Association Studies in Human Cancers: Metabolic Expression Subtypes and Somatic Mutations/Germline Variations (2019-07-29)
Chen, Zhongyuan; Scott, David W; Liang, Han; Wei, Peng

Cancer is a highly complex genetic disease caused by certain gene mutations. This thesis focuses on two critical categories of association studies for human cancers: associations between tumor metabolic subtypes and various other cancer aspects, and associations between somatic mutations and germline variations. In the first category, we classify metabolic expression subtypes in multiple TCGA (The Cancer Genome Atlas) cancer types, identify consistent prognostic patterns, and analyze master regulators of metabolic subtypes. We apply various statistical methods to study the associations between the metabolic expression subtypes and patients' survival, somatic mutations, copy number variations, and hallmark pathways. The results show that the metabolic expression subtypes are extensively correlated with patients' survival. The work gives a systematic view of metabolic heterogeneity and indicates the value of metabolic expression subtypes as predictive, prognostic, and therapeutic markers. In the second category, we design data-adaptive and pathway-based large-sample score test methods for association studies between somatic mutations and germline variations. A combination of multiple statistical techniques is used, with extensive information aggregation at both the SNP and gene levels. The p-values from different parameters are combined to yield data-adaptive tests for somatic mutations and germline variations. To reduce computational cost by avoiding an excessive number of parameters, a randomized low-rank parameter preselection strategy is proposed to identify parameters that are likely to be more effective. Compared with some commonly used methods, our data-adaptive somatic mutation/germline variation tests are much more flexible, can be applied to multiple germline SNPs, genes, or pathways, and generally have much higher statistical power. The test models are applied to both simulations and real-world ICGC (International Cancer Genome Consortium) datasets. For the ICGC data, a sequence of filtering, screening, and processing techniques is applied, followed by extensive association studies with our models. Our studies systematically identify associations between various germline variations and somatic mutations across different cancer types. This research provides valuable statistical tools for cancer risk prediction, leads to a deeper understanding of the molecular mechanisms of specific cancer genes, and brings new insights into the development of novel cancer therapies.
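As a loose illustration of the p-value combination step mentioned in the abstract above (not the thesis's actual data-adaptive test statistic), score-test p-values obtained under several candidate parameter settings might be pooled with a Fisher-type combination, assuming independence, as in the following sketch:

    # Hypothetical illustration only: pool score-test p-values from several
    # candidate parameter settings into one overall p-value via Fisher's method.
    # The data-adaptive tests developed in the thesis are not reproduced here.
    import numpy as np
    from scipy.stats import chi2

    def fisher_combine(p_values):
        """Combine independent p-values with Fisher's method."""
        p = np.asarray(p_values, dtype=float)
        stat = -2.0 * np.sum(np.log(p))      # chi-square with 2k df under H0
        return chi2.sf(stat, df=2 * len(p))

    # Example: p-values from score tests at four candidate parameter choices
    print(fisher_combine([0.04, 0.20, 0.01, 0.33]))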
Item: Forecast Aggregation and Binned Sequential Testing in a Streaming Environment (2017-08-08)
Cross, Daniel Mishael; Scott, David W

This thesis report covers two separate projects. The sequential probability ratio test is a statistical test of one simple hypothesis against another. Oftentimes a parametric form is assumed for the underlying density or (discrete) probability function, and the two hypotheses are specified by two different values of the parameter. In this case the sequential probability ratio test consists of taking observations sequentially and, after each observation, comparing the updated likelihood ratio to two constants chosen to achieve specified type I and type II error probabilities. When the likelihood ratio crosses one of the constants, the density corresponding to that constant is chosen as the true density. For proposed densities that overlap more closely, the expected number of steps to a decision may be large, but it is generally smaller than that of a fixed-n design with the same type I and type II errors. In this situation, data compression may be necessary to reduce data storage requirements. In other situations, the data may arrive in, or be transformed into, bins. In this paper we explore the effects of binning sequential data in two cases: (1) the exact binned (histogram) densities are known; and (2) only finite-sample approximations of the exact histogram densities are known. We show the effects of binning in both cases on the expected number of steps to a decision and on the type I and type II errors. Optimal binning parameter choices for common densities, as well as formulae for general densities, are also given.

The Good Judgment Team, led by psychologists P. Tetlock and B. Mellers of the University of Pennsylvania, was the most successful of five research projects sponsored through 2015 by IARPA to develop improved group forecast aggregation algorithms. Each team had at least ten algorithms under continuous development and evaluation over the four-year project. The mean Brier score was used to rank the algorithms on approximately 130 questions concerning categorical geopolitical events each year. An algorithm would return aggregate probabilities for each question based on the probabilities provided per question by thousands of individuals, who had been recruited by the Good Judgment Team. This paper summarizes the theorized basis and implementation of one of the two most accurate algorithms at the conclusion of the Good Judgment Project. The algorithm incorporated a number of pre- and post-processing steps, and relied upon a minimum distance robust regression method called L2E; see Scott (2001). The algorithm was just edged out by a variation of logistic regression, which has been described elsewhere; see Mellers et al. (2014) and GJP (2015a). Work since the official conclusion of the project has led to an even smaller gap.
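The classical (unbinned) sequential probability ratio test described in the abstract above can be sketched directly from Wald's formulation, using the usual threshold approximations A = log((1 - beta)/alpha) and B = log(beta/(1 - alpha)). The following is a minimal illustration for two fully specified densities, not the binned variants studied in the thesis:

    # Minimal sketch of Wald's sequential probability ratio test for two simple
    # hypotheses, with the standard approximate thresholds. f0 and f1 are
    # density functions supplied by the caller.
    import math

    def sprt(observations, f0, f1, alpha=0.05, beta=0.05):
        """Return ('H0'|'H1'|'undecided', number of observations used)."""
        upper = math.log((1.0 - beta) / alpha)   # cross above: accept H1
        lower = math.log(beta / (1.0 - alpha))   # cross below: accept H0
        llr = 0.0
        n = 0
        for n, x in enumerate(observations, start=1):
            llr += math.log(f1(x)) - math.log(f0(x))
            if llr >= upper:
                return "H1", n
            if llr <= lower:
                return "H0", n
        return "undecided", n

    # Example: testing N(0, 1) against N(1, 1) on data drawn from N(1, 1)
    import numpy as np
    from scipy.stats import norm
    rng = np.random.default_rng(0)
    data = rng.normal(1.0, 1.0, size=1000)
    print(sprt(data, norm(0, 1).pdf, norm(1, 1).pdf))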
Item: Functional Data Analysis on Spectroscopic Data (2016-01-28)
Wang, Lu; Cox, Dennis D.; Scott, David W; Zhang, Yin

Cervical cancer is a very common type of cancer that is highly curable if treated early. We are investigating spectroscopic devices that make in-vivo cervical tissue measurements to detect pre-cancerous and cancerous lesions. This dissertation focuses on new methods and algorithms to improve the performance of the device, treating the spectroscopic measurements as functional data. The first project calibrates the device measurements using correction factors from a log-additive model, based on results from a carefully designed experiment. The second project is a peak-finding algorithm that uses local polynomial regression to obtain accurate estimates of peak location and height for measurements of one of the standards (Rhodamine B) from the experiment. We propose a plug-in bandwidth selection method to estimate curve peak location and height, and present simulation results and asymptotic properties. The third project is based on patient measurements, particularly when the diseased and non-diseased cases are highly unbalanced. A marginalized corruption methodology is introduced to improve the classification results, and the performance of several classification methods is compared.

Item: Skewers, the Carnegie Classification, and the Hybrid Bootstrap (2017-11-30)
Kosar, Robert; Scott, David W

Principal component analysis is an important statistical technique for dimension reduction and exploratory data analysis. However, it is not robust to outliers and may obscure important data structure such as clustering. We propose a version of principal component analysis based on the robust L2E method. The technique seeks the principal components of the potentially highly non-spherical components of a Gaussian mixture model. The algorithm requires neither specification of the number of clusters nor estimation of a full covariance matrix. The Carnegie classification is a decades-old taxonomy for research universities, updated approximately every five years. However, it is based on questionable statistical methodology and suffers from a number of issues. We present a criticism of the Carnegie methodology and offer two alternatives that are designed to be consistent with Carnegie's goals but more statistically sound. We also present a visualization application in which users can explore both the Carnegie system and our proposed systems. Preventing overfitting is an important topic in machine learning, where it is common, even mundane, to fit models with millions of parameters. One of the most popular algorithms for preventing overfitting is dropout. We present a drop-in replacement for dropout that offers superior performance on standard benchmark datasets and is relatively insensitive to hyperparameter choice.
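The skewers work above, like the forecast-aggregation algorithm in the earlier item, builds on the L2E criterion (Scott, 2001), which fits a parametric model by minimizing its integrated squared distance from the unknown density. As a rough, self-contained sketch of that criterion alone, and not of the skewers PCA or aggregation algorithms themselves, a robust univariate Gaussian L2E fit could look like the following:

    # Rough illustration of the L2E criterion for a single Gaussian N(mu, sigma^2):
    # L2E(mu, sigma) = 1/(2*sigma*sqrt(pi)) - (2/n) * sum_i phi(x_i; mu, sigma).
    # Minimizing this tends to fit the main Gaussian bulk and downweight outliers.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def l2e_gaussian(x):
        """Robustly fit a Gaussian to x by minimizing the L2E criterion."""
        x = np.asarray(x, dtype=float)

        def criterion(theta):
            mu, log_sigma = theta
            sigma = np.exp(log_sigma)        # keep sigma positive
            return (1.0 / (2.0 * sigma * np.sqrt(np.pi))
                    - 2.0 * np.mean(norm.pdf(x, loc=mu, scale=sigma)))

        start = np.array([np.median(x), np.log(x.std() + 1e-8)])
        res = minimize(criterion, start, method="Nelder-Mead")
        return res.x[0], np.exp(res.x[1])

    # Example: contaminated sample; the fit tracks the N(0, 1) bulk, not the outliers
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 1, 900), rng.normal(8, 1, 100)])
    print(l2e_gaussian(data))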