Scalable Methods for Phylogenetic Network Inference

Nakhleh, Luay
2019-05-17; 2019-05; 2019-04-10; May 2019
Zhu, Jiafan. "Scalable Methods for Phylogenetic Network Inference." (2019) Diss., Rice University. https://hdl.handle.net/1911/105957.

Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks, which take the shape of rooted, directed, acyclic graphs. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other evolutionary processes, such as incomplete lineage sorting (ILS). These methods use two different types of input: unlinked bi-allelic markers (e.g., single nucleotide polymorphism data), and sequence alignments of multiple, unlinked loci. While these methods estimate the network and its parameters with good accuracy, likelihood computations and convergence remain major computational bottlenecks that limit the methods’ applicability and scalability. The contributions of this thesis are threefold. First, I explore the challenges of viewing a phylogenetic network as an underlying phylogenetic tree with an additional set of “horizontal” edges, and I demonstrate why likelihood computations take orders of magnitude more time on networks than on trees. Second, I develop an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. I demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data, and I demonstrate aspects of the method's robustness to violations of the underlying assumptions of the employed statistical model. Third, I introduce a novel divide-and-conquer method for scalable inference of phylogenetic networks from the sequence data of multiple, unlinked loci.
The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of subproblems on which subnetworks must be inferred, the problem of finding a small number of subsets is formulated as a version of the Hitting Set problem, and a simple heuristic is implemented to solve it. I demonstrate the performance of the two-step algorithm, in terms of both running time and accuracy, on simulated as well as biological data sets. The divide-and-conquer method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. I implemented all the algorithms and made them available to the community in the publicly available software package PhyloNet. The contributions of my thesis provide a significant and promising step towards accurate, large-scale phylogenetic network inference.

application/pdf
eng
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.
Phylogenetic networks; Scalability; Divide-and-conquer; Bayesian inference; multi-locus phylogenomics
Thesis