Browsing by Author "Chen, Ken"

A SNP Calling And Genotyping Method For Single-cell Sequencing Data (2015-04-23)
Zafar, Hamim; Nakhleh, Luay K.; Kavraki, Lydia E; Jermaine, Chris M; Chen, Ken
In this thesis, we propose a single nucleotide polymorphism (SNP) calling and genotyping algorithm for single-cell sequencing data generated by the recently developed single-cell sequencing (SCS) technologies. SCS methods promise to address several key issues in cancer research which previously could not be resolved with data obtained from second generation or next-generation sequencing (NGS) technologies. SCS has the power to resolve the cancer genome at a single-cell level and can characterize the genomic alterations that might differ from one cell to another. SNPs are the most commonly occurring genomic variations that alter the gene functions in cancer. Several methods exist for calling SNPs from NGS data. However, these methods are not suitable in the SCS scenario because they do not account for the various amplification errors associated with the SCS data. As a result, the existing SNP calling methods perform poorly, producing a large number of false positives when applied to SCS data. To the best of our knowledge, no SNP calling method exists that is specifically designed for SCS data. Our SNP calling algorithm is specifically designed for SCS data and the underlying statistical model deals with the inherent errors of SCS like allelic dropout, high bias for C:G>T:A and other amplification errors. This results in ~50% reduction in the number of false positives and ~30% increase in precision in calling SNPs as compared to GATK, a state-of-the-art SNP calling method for NGS data. Our algorithm also employs an improved genotyping method to properly genotype the individual cells by avoiding the sequencing errors (e.g., base calling error). Our method is the first SCS-specific SNP calling method and it can be used to characterize the SNPs present in individual cancer cells.
Potentially, it can be applied as a first step in the genealogical analysis of tumor cells for tracing the evolutionary history of a tumor.

Calculating Variant Allele Fraction of Structural Variation in Next Generation Sequencing by Maximum Likelihood (2015-04-23)
Fan, Xian; Nakhleh, Luay K.; Kavraki, Lydia; Jermaine, Chris; Chen, Ken
Cancer cells are intrinsically heterogeneous. Multiple clones with their unique variants co-exist in tumor tissues. The variants include point mutations and structural variations. Point mutations, or single nucleotide variants, are variants affecting a single base; structural variations are variations involving sequences at least 50 bases long. Approaches to estimate the number of clones and their respective percentages from point mutations have been recently proposed. However, structural variations, although involving more reads than point mutations, have not been quantitatively studied in characterizing cancer heterogeneity. I describe in this thesis a maximum likelihood approach to estimate the variant allele fraction of a putative structural variation, as a step towards the characterization of tumor heterogeneity. A software tool, BreakDown, implemented in Perl and realizing this statistical model, is publicly available. I studied the performance of BreakDown on both simulated and real data, and found that BreakDown outperformed other methods such as THetA in estimating variant allele fractions.

Decoupling Lineage-Associated Genes in Acute Myeloid Leukemia Reveals Inflammatory and Metabolic Signatures Associated With Outcomes (Frontiers, 2021)
Abbas, Hussein A.; Mohanty, Vakul; Wang, Ruiping; Huang, Yuefan; Liang, Shaoheng; Wang, Feng; Zhang, Jianhua; Qiu, Yihua; Hu, Chenyue W.; Qutub, Amina A.; Dail, Monique; Bolen, Christopher R.; Daver, Naval; Konopleva, Marina; Futreal, Andrew; Chen, Ken; Wang, Linghua; Kornblau, Steven M.
Acute myeloid leukemia (AML) is a heterogeneous disease with variable responses to therapy.
Cytogenetic and genomic features are used to classify AML patients into prognostic and treatment groups. However, these molecular characteristics harbor significant patient-to-patient variability and do not fully account for AML heterogeneity. RNA-based classifications have also been applied in AML as an alternative approach, but transcriptomic grouping is strongly associated with AML morphologic lineages. We used a training cohort of newly diagnosed AML patients and conducted unsupervised RNA-based classification after excluding lineage-associated genes. We identified three AML patient groups that have distinct biological pathways associated with outcomes. Enrichment of inflammatory pathways and downregulation of HOX pathways were associated with improved outcomes, and this was validated in 2 independent cohorts. We also identified a group of AML patients who harbored high metabolic and mTOR pathway activity, and this was associated with worse clinical outcomes. Using a comprehensive reverse phase protein array, we identified higher mTOR protein expression in the highly metabolic group. We also identified a positive correlation between degree of resistance to venetoclax and mTOR activation in myeloid and lymphoid cell lines. Our approach of integrating RNA, protein, and genomic data uncovered lineage-independent AML patient groups that share biologic mechanisms and can inform outcomes independent of commonly used clinical and demographic variables; these groups could be used to guide therapeutic strategies.

Detecting Structural Variations with Illumina, PacBio and Optical Maps Data by Computational Approaches (2018-04-20)
Fan, Xian; Nakhleh, Luay; Chen, Ken
Detecting structural variations (SV) is important in deciphering variations in human DNA and the cause of genetic disease such as cancer. Computational approaches to detect SVs are made possible by sequencing technologies.
As different sequencing technologies render data with different characteristics, computational approaches are designed in a way that is specific to a certain technology. In this thesis I studied three technologies: Illumina, PacBio and Optical Maps. As Illumina and PacBio reads have complementary advantages and disadvantages of read length and error rate, I proposed a new approach, HySA, that combines Illumina and PacBio to detect SV. HySA was able to detect SVs that cannot be detected by the approaches for either only Illumina or only PacBio. However, due to the repetitiveness of the human DNA as well as the existence of complex SVs, it is still challenging for HySA to detect some SVs on the repetitive regions or complex SVs. To overcome that, I proposed a new approach to detect SVs by Optical Maps data, which is advantageous over Illumina and PacBio in read length, despite its lack of sequence and unique error profile. The SVs detected by Optical Maps alone complement those from Illumina and PacBio. In all, the two approaches I proposed help push towards a more complete characterization of SVs in human DNA.

Interpretable and Efficient Machine Learning in Cancer Biology (2022-12-01)
Liang, Shaoheng; Nakhleh, Luay; Chen, Ken
The past decade witnessed the advance of machine learning and cancer biology. In therapeutics, chimeric antigen receptor (CAR) treatments and cancer vaccines give new hope for ending cancer. Single-cell sequencing and mass spectrometry enable personalized high-resolution observations of cancer cell behavior and immune response. Computational cancer biology is no different; the continuous evolution of machine learning models, especially neural networks, provides unprecedented potential in making predictions. However, efforts are still needed to tailor the models to interpret specific biological processes.
My research explores how knowledge-informed adaptation of machine learning techniques, such as neural networks, metric learning, and probabilistic classifiers, helps answer questions in cancer biology. For example, periodicity in the cell cycle and other biological processes inspired our use of a sinusoidal activation function in an autoencoder to discover the periodicity in single-cell transcriptomic data. To efficiently predict biomarkers driving tumorigenesis and immune cell differentiation, we adapted UMAP with L1 regularization and our implementation of the OWLQN (Orthant-Wise Limited-memory Quasi-Newton) optimizer. Inspired by structural motifs in antigen presentation, our white-box positive-example-only classifier based on a Naïve Bayes formulation and mutual-information-based combinatorial feature selection achieves state-of-the-art accuracy in antigen presentation prediction, helping design cancer vaccines and understand the antigen presentation process. The differences among patient samples, referred to as the batch effect, informed the development of a power analysis web and a differential expression analysis tool to better identify changes in cell type abundances and omics features. Increasingly large omics data also call for more efficient computational methods. My research utilized multiple modeling and computing techniques, such as conjugate priors, quasi-Newton methods, parallelism, and GPU acceleration, to address this need. For wider usage by different user groups including method developers, bench scientists, and clinicians, we developed the tools as Python or R packages, or web applications.
Overall, my research shows that knowledge-informed interpretable modeling of complex biological processes helps make accurate, clinically relevant predictions and generate new knowledge, both important for cancer biology and broader biomedical applications.

Monovar: single-nucleotide variant detection in single cells (Springer Nature, 2016)
Zafar, Hamim; Wang, Yong; Nakhleh, Luay; Navin, Nicholas; Chen, Ken
Current variant callers are not suitable for single-cell DNA sequencing, as they do not account for allelic dropout, false-positive errors and coverage nonuniformity. We developed Monovar (https://bitbucket.org/hamimzafar/monovar), a statistical method for detecting and genotyping single-nucleotide variants in single-cell data. Monovar exhibited superior performance over standard algorithms on benchmarks and in identifying driver mutations and delineating clonal substructure in three different human tumor data sets.

SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data (Cold Spring Harbor Laboratory Press, 2019)
Zafar, Hamim; Navin, Nicholas; Chen, Ken; Nakhleh, Luay
Accumulation and selection of somatic mutations in a Darwinian framework result in intra-tumor heterogeneity (ITH) that poses significant challenges to the diagnosis and clinical therapy of cancer. Identification of the tumor cell populations (clones) and reconstruction of their evolutionary relationship can elucidate this heterogeneity. Recently developed single-cell DNA sequencing (SCS) technologies promise to resolve ITH to a single-cell level. However, technical errors in SCS data sets, including false-positives (FP) and false-negatives (FN) due to allelic dropout, and cell doublets, significantly complicate these tasks.
Here, we propose a nonparametric Bayesian method that reconstructs the clonal populations as clusters of single cells, genotypes of each clone, and the evolutionary relationship between the clones. It employs a tree-structured Chinese restaurant process as the prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-site model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and cell doublets are modeled by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental data sets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-site model of evolution leads to more accurate inferences. Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support of the inferences it makes.

SiFit: inferring tumor trees from single-cell sequencing data under finite-sites models (BioMed Central, 2017-09-19)
Zafar, Hamim; Tzen, Anthony; Navin, Nicholas; Chen, Ken; Nakhleh, Luay
Single-cell sequencing enables the inference of tumor phylogenies that provide insights on intra-tumor heterogeneity and evolutionary trajectories. Recently introduced methods perform this task under the infinite-sites assumption, violations of which, due to chromosomal deletions and loss of heterozygosity, necessitate the development of inference methods that utilize finite-sites models. We propose a statistical inference method for tumor phylogenies from noisy single-cell sequencing data under a finite-sites model.
The performance of our method on synthetic and experimental data sets from two colorectal cancer patients to trace evolutionary lineages in primary and metastatic tumors suggests that employing a finite-sites model leads to improved inference of tumor phylogenies.

Statistical Methods for Elucidating Tumor Heterogeneity and Evolution from Single-cell DNA Sequencing Data (2018-08-08)
Zafar, Hamim; Nakhleh, Luay; Chen, Ken
Intra-tumor heterogeneity, as caused by a combination of mutation and selection, poses significant challenges to the diagnosis and clinical therapy of cancer. Resolving this heterogeneity to identify the tumor cell populations (clones) and delineate their evolutionary history is of critical importance in improving cancer diagnosis and therapy. This heterogeneity can be readily elucidated and understood through the reconstruction of the clonal genotypes and evolutionary history of the tumor cells. These tasks are challenging since genomic data is most often collected from one snapshot during the evolution of the tumor's constituent cells. Consequently, using computational methods that infer the tumor phylogeny and tumor subpopulations from sequence data is the approach of choice. Recently emerged single-cell DNA sequencing (SCS) technologies promise to resolve intra-tumor heterogeneity to a single-cell level. However, inherent technical errors in SCS datasets, including false-positive (FP) errors, false-negatives (FN) due to allelic dropout, cell doublets and coverage non-uniformity significantly complicate these tasks. In this thesis, we first develop a likelihood-based approach for inferring tumor trees from imperfect SCS genotype data with potentially missing entries, under a finite-sites model of evolution. Our model of evolution introduces a continuous time Markov chain that accounts for the effects of different events in tumor evolution including point mutations, loss of heterozygosity, deletion and recurrent mutations on genomic sites.
Our method probabilistically accounts for false positive and false negative errors and missing entries in SCS datasets. With the help of a heuristic search algorithm, our method finds a maximum-likelihood solution for the phylogenetic tree that best describes the evolutionary history of the tumor cells in the SCS dataset. In doing so, our method also estimates the error rates associated with the datasets. Another contribution of this method is to infer the order of the mutations on the branches of the inferred tumor phylogeny. This is done using a maximum-likelihood-based dynamic programming algorithm. The performance of our method on synthetic and experimental datasets from two colorectal cancer patients to trace evolutionary lineages in primary and metastatic tumors suggests that employing a finite-sites model leads to an improved inference of tumor phylogenies. Secondly, we develop a non-parametric Bayesian method that simultaneously reconstructs the clonal populations as clusters of single cells, mutations associated with each clone, and the genealogical relationships between the clonal populations. It employs a tree-structured Chinese restaurant process as a prior on the number and composition of clonal populations. The evolution of the clonal populations is modeled by a clonal phylogeny and a finite-sites model of evolution to account for potential mutation recurrence and losses. We probabilistically account for FP and FN errors, and cell doublets are modeled by employing a Beta-binomial distribution. We develop a Gibbs sampling algorithm comprising partial reversible-jump and partial Metropolis-Hastings updates to explore the joint posterior space of all parameters. The performance of our method on synthetic and experimental datasets suggests that joint reconstruction of tumor clones and clonal phylogeny under a finite-sites model of evolution leads to more accurate inferences.
Our method is the first to enable this joint reconstruction in a fully Bayesian framework, thus providing measures of support of the inferences it makes.

The International Conference on Intelligent Biology and Medicine (ICIBM) 2016: summary and innovation in genomics (BioMed Central, 2017-10-03)
Zhao, Zhongming; Liu, Zhandong; Chen, Ken; Guo, Yan; Allen, Genevera I; Zhang, Jiajie; Jim Zheng, W.; Ruan, Jianhua
In this editorial, we first summarize the 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) that was held on December 8–10, 2016 in Houston, Texas, USA, and then briefly introduce the ten research articles included in this supplement issue. ICIBM 2016 included four workshops or tutorials, four keynote lectures, four conference invited talks, eight concurrent scientific sessions and a poster session for 53 accepted abstracts, covering current topics in bioinformatics, systems biology, intelligent computing, and biomedical informatics. Through our call for papers, a total of 77 original manuscripts were submitted to ICIBM 2016. After peer review, 11 articles were selected in this special issue, covering topics such as single cell RNA-seq analysis method, genome sequence and variation analysis, bioinformatics method for vaccine development, and cancer genomics.

Towards accurate characterization of clonal heterogeneity based on structural variation (BioMed Central, 2014)
Fan, Xian; Zhou, Wanding; Chong, Zechen; Nakhleh, Luay; Chen, Ken
Recent advances in deep digital sequencing have unveiled an unprecedented degree of clonal heterogeneity within a single tumor DNA sample. Resolving such heterogeneity depends on accurate estimation of fractions of alleles that harbor somatic mutations. Unlike substitutions or small indels, structural variants such as deletions, duplications, inversions and translocations involve segments of DNAs and are potentially more accurate for allele fraction estimations.
However, no systematic method exists that can support such analysis. In this paper, we present a novel maximum-likelihood method that estimates allele fractions of structural variants integratively from various forms of alignment signals. We develop a tool, BreakDown, to estimate the allele fractions of most structural variants including medium size (from 1 kilobase to 1 megabase) deletions and duplications, and balanced inversions and translocations. Evaluation based on both simulated and real data indicates that our method systematically enables structural variants for clonal heterogeneity analysis and can greatly enhance the characterization of genomically instable tumors.
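
Illustrative note: the idea of estimating an allele fraction by maximum likelihood, as described in the abstract above, can be sketched in a few lines of Python. This is not BreakDown's model, which the abstract says integrates several kinds of alignment signals. The sketch assumes a much simpler setting in which every read is independent and unambiguously supports either the variant or the reference allele, so the read counts follow a binomial distribution. The function names and the grid search are hypothetical choices made for illustration.

```python
import math


def log_likelihood(fraction, variant_reads, reference_reads):
    """Binomial log-likelihood of an allele fraction (constant term dropped).

    Assumption: each read independently supports the variant allele with
    probability `fraction` and the reference allele otherwise.
    """
    return (variant_reads * math.log(fraction)
            + reference_reads * math.log(1.0 - fraction))


def estimate_allele_fraction(variant_reads, reference_reads, steps=9999):
    """Grid-search the fraction in (0, 1) that maximizes the likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(
        grid,
        key=lambda f: log_likelihood(f, variant_reads, reference_reads),
    )


# Hypothetical counts: 30 variant-supporting and 70 reference-supporting reads.
vaf = estimate_allele_fraction(variant_reads=30, reference_reads=70)
print(round(vaf, 3))
```

Under this simplified binomial model the maximum also has a closed form, variant reads divided by total reads, so the grid search is only there to make the likelihood maximization explicit.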