Browsing by Author "Nakhleh, Luay"
Item
An Evaluation of Methods for Inferring Boolean Networks from Time-Series Data (Public Library of Science, 2013)
Berestovsky, Natalie; Nakhleh, Luay
Regulatory networks play a central role in cellular behavior and decision making. Learning these regulatory networks is a major task in biology, and devising computational methods and mathematical models for this task is a major endeavor in bioinformatics. Boolean networks have been used extensively for modeling regulatory networks. In this model, the state of each gene can be either 'on' or 'off', and the next state of a gene is updated, synchronously or asynchronously, according to a Boolean rule that is applied to the current state of the entire system. Inferring a Boolean network from a set of experimental data entails two main steps: first, the experimental time-series data are discretized into Boolean trajectories, and then, a Boolean network is learned from these Boolean trajectories. In this paper, we consider three methods for data discretization, including a new one we propose, and three methods for learning Boolean networks, and study the performance of all nine possible combinations on four regulatory systems of varying dynamics complexities. We find that employing the right combination of methods for data discretization and network learning results in Boolean networks that capture the dynamics well and provide predictive power. Our findings are in contrast to a recent survey that placed Boolean networks on the low end of the "faithfulness to biological reality" and "ability to model dynamics" spectra. Further, contrary to the common argument in favor of Boolean networks, we find that a relatively large number of time points in the time-series data is required to learn good Boolean networks for certain data sets. Last but not least, while methods have been proposed for inferring Boolean networks, as discussed above, missing still are publicly available implementations thereof.
Here, we make our implementation of the methods available publicly in open source at http://bioinfo.cs.rice.edu/.

Item
Annotation-free delineation of prokaryotic homology groups (Public Library of Science, 2022)
Yin, Yongze; Ogilvie, Huw A.; Nakhleh, Luay
Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and its overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers cause incorrect inferences as to the overall relationship between bacterial taxa.

Item
Approximate Modeling of Recombination within the Multispecies Coalescent (2017-04-19)
Elworth, Ryan Leo; Nakhleh, Luay
The coalescent with recombination is a powerful stochastic process for modeling genome evolution.
However, statistical inference under this process, particularly sampling the graphical structures that arise due to recombination, is very challenging. To address this challenge, approximations of this stochastic process have been introduced based on a process that operates along the genomes and that can be naturally captured by a hidden Markov model. Parameterizing such hidden Markov models based on the coalescent process and population parameters is very challenging. In this thesis, we propose using gene tree topologies with integrated likelihoods for the states, and parameterize the transition probabilities based on topological differences of the gene trees. This approximation, which overcomes the issues of introducing too many states and has an automated procedure for parameterizing transitions, provides good results, as we demonstrate on simulated and biological data. Furthermore, we show how the approximation can be modified slightly to account for cases of gene flow. The work in this thesis provides a general framework for approximating coalescent-based computations.

Item
Assessing the performance of methods for copy number aberration detection from single-cell DNA sequencing data (Public Library of Science, 2020)
Mallory, Xian F.; Edrisi, Mohammadamin; Navin, Nicholas; Nakhleh, Luay
Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods (Ginkgo, HMMcopy, and CopyNumber) on simulated as well as real datasets.
To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.

Item
Bayesian Inference of Phylogenetic Networks (2016-04-08)
Wen, Dingqiao Ellie; Nakhleh, Luay
The multispecies coalescent (MSC) is a statistical framework that models how gene genealogies grow within the branches of a species tree. The field of computational phylogenetics has witnessed an explosion in the development of methods for species tree inference under the MSC, owing mainly to the accumulating evidence of incomplete lineage sorting in phylogenomic analyses. However, the evolutionary history of a set of genomes, or species, could be reticulate due to the occurrence of evolutionary processes such as hybridization or horizontal gene transfer. We devised a novel method for Bayesian inference of genome and species phylogenies under the multispecies network coalescent (MSNC). This framework models gene evolution within the branches of a phylogenetic network, thus incorporating reticulate evolutionary processes, such as hybridization, in addition to incomplete lineage sorting. As phylogenetic networks with different numbers of reticulation events correspond to points of different dimensions in the space of models, we devised a reversible-jump Markov chain Monte Carlo (RJMCMC) technique for sampling the posterior distribution of phylogenetic networks under the MSNC.
Given the reticulate evolutionary histories for the whole genome, we devised a method to quantify introgression which would elucidate how each gene evolves. We implemented the methods in the publicly available, open-source software package PhyloNet and studied their performance on simulated and biological data. The work extends the reach of Bayesian inference to phylogenetic networks and enables new evolutionary analyses that account for reticulation.

Item
Boosting forward-time population genetic simulators through genotype compression (BioMed Central, 2013)
Ruths, Troy; Nakhleh, Luay
Background: Forward-time population genetic simulations play a central role in deriving and testing evolutionary hypotheses. Such simulations may be data-intensive, depending on the settings to the various parameters controlling them. In particular, for certain settings, the data footprint may quickly exceed the memory of a single compute node. Results: We develop a novel and general method for addressing the memory issue inherent in forward-time simulations by compressing and decompressing, in real-time, active and ancestral genotypes, while carefully accounting for the time overhead. We propose a general graph data structure for compressing the genotype space explored during a simulation run, along with efficient algorithms for constructing and updating compressed genotypes which support both mutation and recombination. We tested the performance of our method in very large-scale simulations. Results show that our method not only scales well, but that it also overcomes memory issues that would cripple existing tools. Conclusions: As evolutionary analyses are being increasingly performed on genomes, pathways, and networks, particularly in the era of systems biology, scaling population genetic simulators to handle large-scale simulations is crucial. We believe our method offers a significant step in that direction.
Further, the techniques we provide are generic and can be integrated with existing population genetic simulators to boost their performance in terms of memory usage.

Item
Co-estimating Reticulate Phylogenies and Gene Trees from Multi-locus Sequence Data (2017-10-31)
Wen, Dingqiao Ellie; Nakhleh, Luay
The multispecies network coalescent (MSNC) is a stochastic process that captures how gene trees grow within the branches of a phylogenetic network. Coupling the MSNC with a stochastic mutational process that operates along the branches of the gene trees gives rise to a generative model of how multiple loci from within and across species evolve in the presence of both incomplete lineage sorting (ILS) and reticulation (e.g., hybridization). We report on a Bayesian method for sampling the parameters of this generative model, including the species phylogeny, gene trees, divergence times, and population sizes, from DNA sequences of multiple independent loci. We demonstrate the utility of our method by analyzing simulated data and reanalyzing three biological data sets. Our results demonstrate the significance of not only co-estimating species phylogenies and gene trees, but also accounting for reticulation and ILS simultaneously. In particular, we show that when gene flow occurs, our method accurately estimates the evolutionary histories, coalescence times, and divergence times. Tree inference methods, on the other hand, underestimate divergence times and overestimate coalescence times when the evolutionary history is reticulate. While the MSNC corresponds to an abstract model of “intermixture,” we study the performance of the model and method on simulated data generated under a gene flow model. We show that the method accurately infers the most recent time at which gene flow occurs. For genotype data, our method adopts a phasing procedure that integrates over all possible phasing of diploid genotypes, providing accurate estimates of divergence times and parameters.
In contrast, the common practice of random phasing would result in failure to detect intermixture events and in inaccurate divergence times and population sizes, especially at low time scales, as demonstrated by our simulation results.

Item
Computational approaches to species phylogeny inference and gene tree reconciliation (Elsevier, 2013)
Nakhleh, Luay
An intricate relationship exists between gene trees and species phylogenies, due to evolutionary processes that act on the genes within and across the branches of the species phylogeny. From an analytical perspective, gene trees serve as character states for inferring accurate species phylogenies, and species phylogenies serve as a backdrop against which gene trees are contrasted for elucidating evolutionary processes and parameters. In a 1997 paper, Maddison discussed this relationship, reviewed the signatures left by three major evolutionary processes on the gene trees, and surveyed parsimony and likelihood criteria for utilizing these signatures to computationally elucidate this relationship. Here, we review progress that has been made on developing computational methods for analyses under these two criteria, and survey remaining challenges.

Item
Computational Methods for Analyses of Single-cell DNA Sequencing Data in Cancer (2024-04-16)
Edrisi, Mohammadamin; Nakhleh, Luay
The study of cancer using single-cell sequencing technology has opened up exciting new avenues for understanding the genomic complexity and heterogeneity of this disease. However, the analysis of such data presents computational challenges both in terms of designing novel mathematical models for biological discovery as well as devising new methods that are scalable to the newly emerged large-scale single-cell sequencing data. Throughout my Ph.D. studies, I focused on multiple research projects, each of which aimed to address such computational challenges in analyzing single-cell sequencing data in the context of cancer.
In this thesis, I present my contributions to three studies and their corresponding methods, including Phylovar for phylogeny-aware detection of single-nucleotide variations (SNVs), MoTERNN for classifying the mode of cancer evolution, and MaCroDNA for integrating high-throughput single-cell DNA and RNA sequencing data. In Phylovar, I improved the joint inference of cancer cells' SNVs (a common type of mutation in cancer) and their phylogeny, an approach known as phylogeny-aware SNV detection. Although this approach is highly accurate, its scalability to large-scale single-cell sequencing datasets was limited. To address this, I introduced a novel vectorized formulation for computing the likelihood function of this model, achieving very good improvement in calculation speed, enabling us to scale up accurate SNV detection from hundreds to millions of genomic loci suitable for the fast-expanding datasets from single-cell whole-genome and whole-exome sequencing technologies. MoTERNN is aimed at determining modes of cancer evolution—linear, branching, neutral, or punctuated—each indicative of specific evolution patterns critical for diagnosis, prognosis, and treatment strategies. I treated this as a graph classification problem, using phylogenetic trees as graphs and evolution modes as classes, and employed Recursive Neural Networks (RvNNs) for classification. As the first application of RvNNs to phylogenetics, MoTERNN demonstrated very high accuracy in both the training and testing phases, showcasing the potential of RvNNs for learning on phylogenetic trees. In the MaCroDNA project, I aimed to link DNA mutations to their impacts on RNA changes by pairing the cells that have been sequenced for either DNA or RNA data alone. In this work, I employed a maximum weighted bipartite matching algorithm for assigning the cells from the two data domains so that the sum of the Pearson correlation between all pairs is maximized. 
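The pairing idea described above (assign cells one-to-one across the DNA and RNA domains so that the summed Pearson correlation is maximal) can be illustrated with a toy sketch. This is not the MaCroDNA implementation: the function names, the equal-sized cell sets, and the exhaustive search over permutations are assumptions made for illustration only. For realistic sizes one would replace the exhaustive search with a polynomial-time assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def best_pairing(dna_cells, rna_cells):
    """Return pairing[i] = index of the RNA cell matched to DNA cell i.

    Hypothetical helper: tries every one-to-one pairing, so it is only
    usable for a handful of cells, and it assumes both sets are equal in size.
    """
    n = len(dna_cells)
    corr = [[pearson(d, r) for r in rna_cells] for d in dna_cells]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(corr[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Toy per-gene profiles (made-up numbers, three genes per cell).
dna = [[1, 2, 3], [3, 2, 1], [1, 3, 2]]
rna = [[6, 4, 2], [2, 6, 4], [2, 4, 6]]
pairing = best_pairing(dna, rna)
```

In this toy input each DNA profile is perfectly correlated with exactly one RNA profile, so the maximizing assignment is unambiguous.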
MaCroDNA achieved very good accuracy and outperformed the state-of-the-art method by a large margin.

Item
Convergent evolution of modularity in metabolic networks through different community structures (BioMed Central, 2012)
Zhou, Wanding; Nakhleh, Luay; Bioengineering; Computer Science
Background: It has been reported that the modularity of metabolic networks of bacteria is closely related to the variability of their living habitats. However, given the dependency of the modularity score on the community structure, it remains unknown whether organisms achieve certain modularity via similar or different community structures. Results: In this work, we studied the relationship between similarities in modularity scores and similarities in community structures of the metabolic networks of 1021 species. Both similarities are then compared against the genetic distances. We revisited the association between modularity and variability of the microbial living environments and extended the analysis to other aspects of their lifestyle such as temperature and oxygen requirements. We also tested both topological and biological intuition of the community structures identified and investigated the extent of their conservation with respect to the taxonomy. Conclusions: We find that similar modularities are realized by different community structures. We find that such convergent evolution of modularity is closely associated with the number of (distinct) enzymes in the organism's metabolome, a consequence of different lifestyles of the species. We find that the order of modularity is the same as the order of the number of the enzymes under the classification based on the temperature preference but not on the oxygen requirement. Besides, inspection of modularity-based communities reveals that these communities are graph-theoretically meaningful yet not reflective of specific biological functions.
From an evolutionary perspective, we find that the community structures are conserved only at the level of kingdoms. Our results call for more investigation into the interplay between evolution and modularity: how evolution shapes modularity, and how modularity affects evolution (mainly in terms of fitness and evolvability). Further, our results call for exploring new measures of modularity and network communities that better correspond to functional categorizations.

Item
Current progress and open challenges for applying deep learning across the biosciences (Springer Nature, 2022)
Sapoval, Nicolae; Aghazadeh, Amirali; Nute, Michael G.; Antunes, Dinler A.; Balaji, Advait; Baraniuk, Richard; Barberan, C.J.; Dannenfelser, Ruth; Dun, Chen; Edrisi, Mohammadamin; Elworth, R.A. Leo; Kille, Bryce; Kyrillidis, Anastasios; Nakhleh, Luay; Wolfe, Cameron R.; Yan, Zhi; Yao, Vicky; Treangen, Todd J.; Bioengineering; Computer Science
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL on five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.

Item
Detecting Structural Variations with Illumina, PacBio and Optical Maps Data by Computational Approaches (2018-04-20)
Fan, Xian; Nakhleh, Luay; Chen, Ken
Detecting structural variations (SVs) is important in deciphering variations in human DNA and the cause of genetic diseases such as cancer.
Computational approaches to detect SVs are made possible by sequencing technologies. As different sequencing technologies render data with different characteristics, computational approaches are designed in a way that is specific to a certain technology. In this thesis I studied three technologies: Illumina, PacBio and Optical Maps. As Illumina and PacBio reads have complementary advantages and disadvantages of read length and error rate, I proposed a new approach, HySA, that combines Illumina and PacBio to detect SVs. HySA was able to detect SVs that cannot be detected by the approaches for either only Illumina or only PacBio. However, due to the repetitiveness of the human DNA as well as the existence of complex SVs, it is still challenging for HySA to detect some SVs in repetitive regions or complex SVs. To overcome that, I proposed a new approach to detect SVs by Optical Maps data, which is advantageous over Illumina and PacBio in read length, despite its lack of sequence and unique error profile. The SVs detected by Optical Maps alone complement those from Illumina and PacBio. In all, the two approaches I proposed help push towards a more complete characterization of SVs in human DNA.

Item
Distributed Algorithms for Computing Very Large Thresholded Covariance Matrices (2014-09-26)
Gao, Zekai; Jermaine, Christopher; Nakhleh, Luay; Subramanian, Devika
Computation of covariance matrices from observed data is an important problem, as such matrices are used in applications such as PCA, LDA, and increasingly in the learning and application of probabilistic graphical models. One of the most challenging aspects of constructing and managing covariance matrices is that they can be huge and the size makes them expensive to compute. For a p-dimensional data set with n rows, the covariance matrix will have p(p-1)/2 entries and the naive algorithm to compute the matrix will take O(np^2) time. For large p (greater than 10,000) and n much greater than p, this is debilitating.
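To make the cost statement concrete, the following is a minimal single-machine sketch of the naive computation: every one of the p(p-1)/2 distinct off-diagonal pairs is visited, and each visit scans all n rows, giving O(np^2) time. The function name and the dictionary output format are illustrative assumptions, not taken from the thesis, and nothing here is distributed or sampled.

```python
def naive_covariance(rows):
    """Sample covariance for every distinct column pair (i < j).

    rows: list of n equal-length lists, each with p numeric values.
    Returns a dict mapping (i, j) to the covariance of columns i and j.
    Cost: p(p-1)/2 pairs, each requiring a pass over n rows.
    """
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    cov = {}
    for i in range(p):
        for j in range(i + 1, p):
            total = sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows)
            cov[(i, j)] = total / (n - 1)  # unbiased estimator, needs n >= 2
    return cov


# Made-up toy data: n = 3 rows, p = 3 columns.
data = [[1, 2, 3], [2, 4, 1], [3, 6, 2]]
cov = naive_covariance(data)
```

With p = 10,000 the double loop alone touches roughly 5 x 10^7 pairs, which is why the per-pair pass over n rows becomes the bottleneck.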
In this thesis, we consider the problem of computing a large covariance matrix efficiently in a distributed fashion over a large data set. We begin by considering the naive algorithm in detail, pointing out where it will and will not be feasible. We then consider reducing the time complexity using sampling-based methods to compute an approximate, thresholded version of the covariance matrix. Here “thresholding” means that all of the unimportant values in the matrix have been dropped and replaced with zeroes. Our algorithms have probabilistic bounds which imply that with high probability, all of the top K entries in the matrix have been retained.

Item
Evaluation of Existing Methods and Development of New Ones for Phylogenomic Analyses (2024-04-17)
Yan, Zhi; Nakhleh, Luay
Despite the revolution brought by phylogenomics, accurately reconstructing the Tree of Life remains a challenge due to discordance between gene and species histories. These incongruences arise from biological processes like incomplete lineage sorting (ILS). The multispecies coalescent (MSC) model, a cornerstone for species tree inference, accounts for ILS but assumes strict orthology, no recombination within loci, and free recombination between loci. The multispecies network coalescent (MSNC) extends the MSC to accommodate diploid hybridization, but real biological complexities often require further refinement. This thesis addresses these limitations by investigating the impact of MSC assumption violations on phylogenetic inference. We explore three key areas: (1) the potential of utilizing paralogs for species tree reconstruction, (2) the influence of recombination on population parameter estimation, and (3) the effectiveness of existing gene tree correction methods. We then introduce two novel methods, MPAllopp and Polyphest, specifically designed to infer phylogenetic networks that account for both ILS and polyploidy, a prevalent phenomenon in evolution.
These methods are validated through extensive simulations and real data analyses. Overall, this thesis contributes to enhancing the accuracy of phylogenetic inference by critically evaluating existing methods and developing novel approaches that can handle the complexities of real-world data.

Item
Evolution After Whole-genome Duplication: A Network Perspective (Genetics Society of America, 2013)
Zhu, Yun; Lin, Zhenguo; Nakhleh, Luay
Gene duplication plays an important role in the evolution of genomes and interactomes. Elucidating how evolution after gene duplication interplays at the sequence and network level is of great interest. In this paper, we analyze a data set of gene pairs that arose through whole-genome duplication (WGD) in yeast. All these pairs have the same duplication time, making them ideal for evolutionary investigation. We investigated the interplay between evolution after WGD at the sequence and network levels, and correlated these two levels of divergence with gene expression and fitness data. We find that molecular interactions involving WGD genes evolve at rates that are three orders of magnitude slower than the rates of evolution of the corresponding sequences. Further, we find that divergence of WGD pairs correlates strongly with gene expression and fitness data. Owing to the role of gene duplication in determining redundancy in biological systems and particularly at the network level, we investigated the role of interaction networks in elucidating the evolutionary fate of duplicated genes. We find that gene neighborhoods in interaction networks provide a mechanism for inferring these fates, and we developed an algorithm for achieving this task. Further epistasis analysis of WGD pairs categorized by their inferred evolutionary fates demonstrated the utility of these techniques.
Finally, we find that WGD pairs and other pairs of paralogous genes of small-scale duplication origin share similar properties, giving good support for generalizing our results from WGD pairs to evolution after gene duplication in general.

Item
Gene Duplicability-Connectivity-Complexity across Organisms and a Neutral Evolutionary Explanation (Public Library of Science, 2012)
Zhu, Yun; Du, Peng; Nakhleh, Luay
Gene duplication has long been acknowledged by biologists as a major evolutionary force shaping genomic architectures and characteristics across the Tree of Life. Major research has been conducted on elucidating the fate of duplicated genes in a variety of organisms, as well as factors that affect a gene's duplicability, that is, the tendency of certain genes to retain more duplicates than others. In particular, two studies have looked at the correlation between gene duplicability and its degree in a protein-protein interaction network in yeast, mouse, and human, and another has looked at the correlation between gene duplicability and its complexity (length, number of domains, etc.) in yeast. In this paper, we extend these studies to six species, and two trends emerge. There is an increase in the duplicability-connectivity correlation that agrees with the increase in the genome size as well as the phylogenetic relationship of the species. Further, the duplicability-complexity correlation seems to be constant across the species. We argue that the observed correlations can be explained by neutral evolutionary forces acting on the genomic regions containing the genes. For the duplicability-connectivity correlation, we show through simulations that an increasing trend can be obtained by adjusting parameters to approximate genomic characteristics of the respective species.
Our results call for more research into factors, adaptive and non-adaptive alike, that determine a gene's duplicability.

Item
Gene Tree Distributions under Duplication, Loss and Deep Coalescence (2017-01-05)
Ye, Dan; Nakhleh, Luay
Gene duplication and loss are two evolutionary processes that occur across all three domains of life. These two processes result in different loci, across a set of related genomes, having different gene trees. Inferring the phylogeny of the genomes from data sets of such gene trees is a central task in phylogenomics. Furthermore, when the evolutionary history of the genomes includes relatively close divergence events, as in cases of closely related organisms or rapid radiations, deep coalescence of gene copies could be at play, in addition to duplication and loss, further adding to the complexity of gene/genome relationships. In this work, we develop a probabilistic model of gene evolution that incorporates duplications and loss, and accounts for deep coalescence. We formulate the models in terms of Markov chains, and provide algorithms for computing gene tree distributions for the two cases of gene trees with and without branch lengths. We illustrate the use of our work on simulated and biological data by assessing the accuracy of species tree inferences under our models (topology and branch lengths) and contrasting them to inferences under cases of deep coalescence alone. It is important to highlight that our models sidestep the issue of hidden paralogy by "integrating out" the possible orthology assignments of gene copies.
Our work enables new statistical phylogenomic analyses, particularly when hidden paralogy and deep coalescence could be at play.

Item
An HMM-Based Comparative Genomic Framework for Detecting Introgression in Eukaryotes (Public Library of Science, 2014)
Liu, Kevin J.; Dai, Jingxuan; Truong, Kathy; Song, Ying; Kohn, Michael H.; Nakhleh, Luay
One outcome of interspecific hybridization and subsequent effects of evolutionary forces is introgression, which is the integration of genetic material from one species into the genome of an individual in another species. The evolution of several groups of eukaryotic species has involved hybridization, and cases of adaptation through introgression have been already established. In this work, we report on PhyloNet-HMM, a new comparative genomic framework for detecting introgression in genomes. PhyloNet-HMM combines phylogenetic networks with hidden Markov models (HMMs) to simultaneously capture the (potentially reticulate) evolutionary history of the genomes and dependencies within genomes. A novel aspect of our work is that it also accounts for incomplete lineage sorting and dependence across loci. Application of our model to variation data from chromosome 7 in the mouse (Mus musculus domesticus) genome detected a recently reported adaptive introgression event involving the rodent poison resistance gene Vkorc1, in addition to other newly detected introgressed genomic regions. Based on our analysis, it is estimated that about 9% of all sites within chromosome 7 are of introgressive origin (these cover about 13 Mbp of chromosome 7, and over 300 genes). Further, our model detected no introgression in a negative control data set. We also found that our model accurately detected introgression and other evolutionary processes from synthetic data sets simulated under the coalescent model with recombination, isolation, and migration.
Our work provides a powerful framework for systematic analysis of introgression while simultaneously accounting for dependence across sites, point mutations, recombination, and ancestral polymorphism.

Item
Inference of reticulate evolutionary histories by maximum likelihood: the performance of information criteria (BioMed Central, 2012)
Park, Hyun Jung; Nakhleh, Luay
Background: Maximum likelihood has been widely used for over three decades to infer phylogenetic trees from molecular data. When reticulate evolutionary events occur, several genomic regions may have conflicting evolutionary histories, and a phylogenetic network may provide a more adequate model for representing the evolutionary history of the genomes or species. A maximum likelihood (ML) model has been proposed for this case and accounts for both mutation within a genomic region and reticulation across the regions. However, the performance of this model in terms of inferring information about reticulate evolution and properties that affect this performance have not been studied. Results: In this paper, we study the effect of the evolutionary diameter and height of a reticulation event on its identifiability under ML. We find both of them, particularly the diameter, have a significant effect. Further, we find that the number of genes (which can be generalized to the concept of "non-recombining genomic regions") that are transferred across a reticulation edge affects its detectability. Last but not least, a fundamental challenge with phylogenetic networks is that they allow an arbitrary level of complexity, giving rise to the model selection problem. We investigate the performance of two information criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), for addressing this problem. We find that BIC performs well in general for controlling the model complexity and preventing ML from grossly overestimating the number of reticulation events.
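As a reminder of how the two criteria trade fit against complexity, the sketch below applies the standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, to a set of candidate models. How the parameter count k and the sample size n are defined for a phylogenetic network is a modeling choice that the abstract does not specify, so the log-likelihoods, parameter counts, and n used here are made-up values for illustration only.

```python
from math import log


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return k * log(n) - 2 * log_likelihood


def select_model(candidates, n, criterion="bic"):
    """Return the key of the candidate with the lowest criterion value.

    candidates: dict mapping a label (here, a hypothetical number of
    reticulation events) to a (log_likelihood, num_parameters) pair.
    """
    def score(item):
        _, (loglik, k) = item
        return bic(loglik, k, n) if criterion == "bic" else aic(loglik, k)

    return min(candidates.items(), key=score)[0]


# Hypothetical ML fits: more reticulations always fit at least as well.
fits = {0: (-1050.0, 10), 1: (-1040.0, 13), 2: (-1036.0, 16)}
n_sites = 500  # assumed sample size

chosen_by_aic = select_model(fits, n_sites, criterion="aic")
chosen_by_bic = select_model(fits, n_sites, criterion="bic")
```

Because ln n exceeds 2 once n is larger than about 7.4, BIC charges more per added parameter than AIC, which is why in this toy example BIC settles on fewer reticulation events than AIC does.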
Conclusion: Our results demonstrate that BIC provides a good framework for inferring reticulate evolutionary histories. Nevertheless, the results call for caution when interpreting the accuracy of the inference particularly for data sets with particular evolutionary features.

Item
Interpretable and Efficient Machine Learning in Cancer Biology (2022-12-01)
Liang, Shaoheng; Nakhleh, Luay; Chen, Ken
The past decade witnessed the advance of machine learning and cancer biology. In therapeutics, chimeric antigen receptor (CAR) treatments and cancer vaccines give new hope for ending cancer. Single-cell sequencing and mass spectrometry enable personalized high-resolution observations of cancer cell behavior and immune response. Computational cancer biology is no different; the continuous evolution of machine learning models, especially neural networks, provides unprecedented potential in making predictions. However, efforts are still needed to tailor the models to interpret specific biological processes. My research explores how knowledge-informed adaptation of machine learning techniques, such as neural networks, metric learning, and probabilistic classifiers helps answer questions in cancer biology. For example, periodicity in the cell cycle and other biological processes inspired our use of a sinusoidal activation function in an autoencoder to discover the periodicity in single-cell transcriptomic data. To efficiently predict biomarkers driving tumorigenesis and immune cell differentiation, we adapted UMAP with L1 regularization and our implementation of OWLQN (Orthant-Wise Limited-memory Quasi-Newton) optimizer. Inspired by structural motifs in antigen presentation, our white-box positive-example-only classifier based on Naïve Bayes formulation and mutual-information-based combinatorial feature selection achieves state-of-the-art accuracy in antigen presentation prediction, helping design cancer vaccines and understand the antigen presentation process.
The differences among patient samples, referred to as the batch effect, informed the development of a power analysis web and a differential expression analysis tool to better identify changes in cell type abundances and omics features. Increasingly large omics data also call for more efficient computational methods. My research utilized multiple modeling and computing techniques, such as conjugate priors, quasi-Newton methods, parallelism, and GPU acceleration, to address this need. For wider usage by different user groups including method developers, bench scientists, and clinicians, we developed the tools as Python or R packages, or web applications. Overall, my research shows that knowledge-informed interpretable modeling of complex biological processes helps make accurate, clinically relevant predictions and generate new knowledge, both important for cancer biology and broader biomedical applications.