Phylogeny Inference in the Presence of Incomplete Lineage Sorting, Gene Duplication and Loss and Hybridization

Date
2019-04-10
Abstract

A species phylogeny captures how a set of extant species split and diverged from their most recent common ancestral species. A gene tree captures the evolutionary history of an individual gene or, more generally, of a non-recombining genomic region. The relationship between the phylogeny of a set of species and the trees of genes in the genomes of those species is complex. The complexity arises from processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and hybridization, all of which can give rise to gene trees whose topologies disagree with each other as well as with that of the species phylogeny. Species phylogeny inference in the post-genomic era, also known as phylogenomic inference, requires models and methods that account for these processes in order to relate how individual loci (genomic regions) evolve within and across the branches of species phylogenies. For example, the multispecies coalescent (MSC) was introduced to model ILS, and statistical species tree inference methods based on it have been developed. This model was later extended to allow for reticulation events (e.g., hybridization), and statistical methods for inferring phylogenetic networks were developed. Birth-death models of gene evolution have also been introduced to capture gene duplications and losses, and species tree inference methods that utilize them have been developed. In this thesis, I address two computational problems that arise in this domain. The first problem concerns the inference of species trees from multiple loci assuming that only ILS and GDL, but not reticulation, are at play. The second problem concerns the inference of species (phylogenetic) networks from multiple loci when all three processes (ILS, GDL, and reticulation) are at play. My contribution to the first problem is twofold. First, I developed and implemented a heuristic for maximum a posteriori (MAP) estimation of the species tree from the sequence alignments of multiple independent loci. Second, based on a study of the accuracy of MSC-based inference methods on data where GDL is at play, I proposed a method for efficient inference of the topology of a species tree in the presence of both ILS and GDL. My contribution to the second problem is twofold as well. First, I developed the first three-piece model (phylogenetic network, locus network, and gene tree), which accurately captures the three aforementioned processes and yields a generative model of genomic sequence data from a phylogenetic network. Second, I developed a heuristic for inferring phylogenetic networks from multi-locus data under this generative model. I studied the accuracy of all methods on both simulated and biological data sets. The contributions of this thesis advance the field of phylogenomics by providing methods that incorporate more of the biological complexity of evolution than existing methods do. Consequently, my methods allow more of the genomic data (and signal) to be used for more accurate inference of not only the species phylogeny, but also the processes that acted upon the individual loci within the genomes of those species.

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Phylogenetics, Gene Duplication and Loss, Incomplete Lineage Sorting, Hybridization, Statistical Inference
Citation

Du, Peng. "Phylogeny Inference in the Presence of Incomplete Lineage Sorting, Gene Duplication and Loss and Hybridization." (2019) Diss., Rice University. https://hdl.handle.net/1911/105380.

Rights
Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.