Browsing by Author "Jermaine, Chris M"
Now showing 1 - 2 of 2
Item: A SNP Calling And Genotyping Method For Single-cell Sequencing Data (2015-04-23)
Zafar, Hamim; Nakhleh, Luay K.; Kavraki, Lydia E; Jermaine, Chris M; Chen, Ken
In this thesis, we propose a single nucleotide polymorphism (SNP) calling and genotyping algorithm for data generated by the recently developed single-cell sequencing (SCS) technologies. SCS methods promise to address several key issues in cancer research that could not previously be resolved with data from second-generation, or next-generation sequencing (NGS), technologies. SCS can resolve the cancer genome at the single-cell level and characterize genomic alterations that differ from one cell to another. SNPs are the most commonly occurring genomic variations that alter gene function in cancer. Several methods exist for calling SNPs from NGS data. However, these methods are not suitable for SCS data because they do not account for the various amplification errors associated with it. As a result, existing SNP calling methods perform poorly, producing a large number of false positives when applied to SCS data. To the best of our knowledge, no SNP calling method exists that is specifically designed for SCS data. Our SNP calling algorithm is specifically designed for SCS data, and its underlying statistical model handles the inherent errors of SCS, such as allelic dropout, a high bias for C:G>T:A substitutions, and other amplification errors. This results in a ~50% reduction in the number of false positives and a ~30% increase in precision in calling SNPs compared to GATK, a state-of-the-art SNP calling method for NGS data. Our algorithm also employs an improved genotyping method to properly genotype the individual cells by avoiding sequencing errors (e.g., base-calling errors). Our method is the first SCS-specific SNP calling method, and it can be used to characterize the SNPs present in individual cancer cells.
Potentially, it can be applied as a first step in the genealogical analysis of tumor cells for tracing the evolutionary history of a tumor.

Item: Declarative Machine Learning with Einsummable (2024-07-23)
Bourgeois, Daniel Christopher; Jermaine, Chris M
Modern tensor-based machine learning (ML) systems such as PyTorch and TensorFlow have high performance but significant limitations for large-scale ML. These systems require a programmer to manually decompose ML computations so that they can run on multiple machines. Not only is this challenging for end users, but moving from one hardware setup to the next requires writing a lot of code. We introduce a new end-to-end ML system called "Einsummable" that automatically decomposes computations to match the available hardware. Unlike existing systems, we are guided by one fundamental design principle: at all costs, the user may only say what they want to compute, not how it is to be computed. Instead of painstakingly building a "model parallel" or "data parallel" implementation, a user of Einsummable need only build their computation in our Einsummable language. To make Einsummable a reality, we designed the Einsummable language, which users employ to create what we call EinGraphs. The Einsummable language is built on extended Einstein summation notation, familiar to many ML practitioners. Our language is expressive enough to represent state-of-the-art generative ML models, including Llama. In addition, we support automatic differentiation. On the other end of the abstraction spectrum, we created a compute graph specification for machines to execute, called TaskGraphs. TaskGraphs are designed to be executed by distributed, asynchronous compute engines. For our experiments, we built a distributed CPU execution engine, scaling to 32 machines, each with 64 processors. Even though we targeted CPU clusters, the TaskGraph abstraction is also suitable for clusters of GPUs.
Most importantly, given hardware parameters, we compile EinGraphs into TaskGraphs without user intervention. The discovered TaskGraph solution may very well include the common model or data parallel solutions. Our main algorithm for this is called EinDecomp, which decomposes EinGraphs so that the computation exposes enough parallelism to keep all processors busy without introducing an undue communication burden.
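
The idea that one Einstein-summation expression admits several equivalent decompositions can be illustrated with a small sketch. The code below is not the Einsummable language, EinGraphs, TaskGraphs, or EinDecomp; `einsum2`, `get`, and the sample data are hypothetical helpers written in plain Python for illustration only. It evaluates the expression "bi,io->bo" once as a whole, once split along the batch index `b` (a data-parallel style split, whose pieces are simply stitched together), and once split along the summed index `i` (whose partial results must be added, which in a distributed setting would mean communication).

```python
from itertools import product


def get(tensor, idx):
    """Index a nested list with a sequence of integer positions."""
    for i in idx:
        tensor = tensor[i]
    return tensor


def einsum2(spec, a, b, sizes):
    """Tiny two-operand Einstein summation over nested lists (illustrative only).

    spec  -- e.g. "bi,io->bo"
    sizes -- maps each index letter to its extent
    Returns a dict keyed by output index tuples.
    """
    inputs, out = spec.split("->")
    ia, ib = inputs.split(",")
    # Indices that appear in the inputs but not in the output are summed over.
    summed = [c for c in dict.fromkeys(ia + ib) if c not in out]
    result = {}
    for o in product(*(range(sizes[c]) for c in out)):
        env = dict(zip(out, o))
        total = 0
        for s in product(*(range(sizes[c]) for c in summed)):
            env.update(zip(summed, s))
            total += get(a, [env[c] for c in ia]) * get(b, [env[c] for c in ib])
        result[o] = total
    return result


SPEC = "bi,io->bo"
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # shape (b=4, i=3)
W = [[1, 0], [0, 1], [2, 3]]                         # shape (i=3, o=2)

# 1. Undecomposed computation.
full = einsum2(SPEC, X, W, {"b": 4, "i": 3, "o": 2})

# 2. Split along the batch index b: each half is independent,
#    and the results are concatenated (no summation across pieces).
top = einsum2(SPEC, X[:2], W, {"b": 2, "i": 3, "o": 2})
bottom = einsum2(SPEC, X[2:], W, {"b": 2, "i": 3, "o": 2})
data_parallel = dict(top)
for (b, o), value in bottom.items():
    data_parallel[(b + 2, o)] = value

# 3. Split along the summed index i: each piece yields a partial result
#    for every output entry, and the partials must be added together.
left = einsum2(SPEC, [row[:2] for row in X], W[:2], {"b": 4, "i": 2, "o": 2})
right = einsum2(SPEC, [row[2:] for row in X], W[2:], {"b": 4, "i": 1, "o": 2})
contraction_split = {key: left[key] + right[key] for key in left}

assert data_parallel == full
assert contraction_split == full
```

Both decompositions reproduce the undecomposed result; they differ in how the pieces are recombined, which is the kind of parallelism-versus-communication trade-off the abstract describes.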