Browsing by Author "Zhu, Jiafan"
Item A divide-and-conquer method for scalable phylogenetic network inference from multilocus data (Oxford University Press, 2019) Zhu, Jiafan; Liu, Xinhao; Ogilvie, Huw A.; Nakhleh, Luay K.
Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes.
Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied the method's performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.

Item Bayesian inference of phylogenetic networks from bi-allelic genetic markers (Public Library of Science, 2018) Zhu, Jiafan; Wen, Dingqiao; Yu, Yun; Meudt, Heidi M.; Nakhleh, Luay K.
Phylogenetic networks are rooted, directed, acyclic graphs that model reticulate evolutionary histories. Recently, statistical methods were devised for inferring such networks from either gene tree estimates or the sequence alignments of multiple unlinked loci. Bi-allelic markers, most notably single nucleotide polymorphisms (SNPs) and amplified fragment length polymorphisms (AFLPs), provide a powerful source of genome-wide data.
In a recent paper, a method called SNAPP was introduced for statistical inference of species trees from unlinked bi-allelic markers. The generative process assumed by the method combined both a model of evolution for the bi-allelic markers, as well as the multispecies coalescent. A novel component of the method was a polynomial-time algorithm for exact computation of the likelihood of a fixed species tree via integration over all possible gene trees for a given marker. Here we report on a method for Bayesian inference of phylogenetic networks from bi-allelic markers. Our method significantly extends the algorithm for exact computation of phylogenetic network likelihood via integration over all possible gene trees. Unlike the case of species trees, the algorithm is no longer polynomial-time on all instances of phylogenetic networks. Furthermore, the method utilizes a reversible-jump MCMC technique to sample the posterior of phylogenetic networks given bi-allelic marker data. Our method has a very good performance in terms of accuracy and robustness as we demonstrate on simulated data, as well as a data set of multiple New Zealand species of the plant genus Ourisia (Plantaginaceae). We implemented the method in the publicly available, open-source PhyloNet software package.

Item In the light of deep coalescence: revisiting trees within networks (BioMed Central, 2016) Zhu, Jiafan; Yu, Yun; Nakhleh, Luay K.
Background: Phylogenetic networks model reticulate evolutionary histories. The last two decades have seen an increased interest in establishing mathematical results and developing computational methods for inferring and analyzing these networks. A salient concept underlying a great majority of these developments has been the notion that a network displays a set of trees and those trees can be used to infer, analyze, and study the network.
Results: In this paper, we show that in the presence of coalescence effects, the set of displayed trees is not sufficient to capture the network. We formally define the set of parental trees of a network and make three contributions based on this definition. First, we extend the notion of anomaly zone to phylogenetic networks and report on anomaly results for different networks. Second, we demonstrate how coalescence events could negatively affect the ability to infer a species tree that could be augmented into the correct network. Third, we demonstrate how a phylogenetic network can be viewed as a mixture model that lends itself to a novel inference approach via gene tree clustering.
Conclusions: Our results demonstrate the limitations of focusing on the set of trees displayed by a network when analyzing and inferring the network. Our findings can form the basis for achieving higher accuracy when inferring phylogenetic networks and open up new venues for research in this area, including new problem formulations based on the notion of a network’s parental trees.

Item Inference of species phylogenies from bi-allelic markers using pseudo-likelihood (Oxford University Press, 2018) Zhu, Jiafan; Nakhleh, Luay K.
MOTIVATION: Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g. single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method's applicability.
RESULTS: In this article, we first demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger datasets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss.
AVAILABILITY AND IMPLEMENTATION: The methods have been implemented in PhyloNet (http://bioinfocs.rice.edu/phylonet).

Item Scalable Methods for Phylogenetic Network Inference (2019-04-10) Zhu, Jiafan; Nakhleh, Luay
Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks, which take the shape of rooted, directed, acyclic graphs. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other evolutionary processes, such as incomplete lineage sorting (ILS). These methods use two different types of input: unlinked bi-allelic markers (e.g., single nucleotide polymorphism data), and sequence alignments of multiple, unlinked loci. While these methods have good accuracy in terms of estimating the network and its parameters, likelihood computations and convergence remain major computational bottlenecks and limit the methods’ applicability and scalability. The contributions of this thesis are threefold.
First, I explore the challenge with viewing a phylogenetic network as an underlying phylogenetic tree with an additional set of “horizontal” edges. Furthermore, I demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. Second, I develop an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. I demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data, and I demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Third, I introduce a novel divide-and-conquer method for scalable inference of phylogenetic networks from the sequence data of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of subproblems on which to infer subnetworks, a Hitting Set version of the problem of finding a small number of subsets is formulated, and a simple heuristic is implemented to solve it. I demonstrate the performance of the two-step algorithm, in terms of both running time and accuracy, on simulated as well as on biological data sets. The divide-and-conquer method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. I implemented all the algorithms and made them available to the community in the publicly available software package PhyloNet. The contributions of my thesis provide a significant and promising step towards accurate, large-scale phylogenetic network inference.

Item Statistical Inference of Phylogenetic Networks from Unlinked Bi-allelic Markers (2018-04-18) Zhu, Jiafan; Nakhleh, Luay
Phylogenetic networks are rooted, directed, acyclic graphs that model reticulate evolutionary histories.
Recently, statistical methods were devised for inferring such networks from either gene tree estimates or the sequence alignments of multiple unlinked loci. Bi-allelic markers, most notably single nucleotide polymorphisms (SNPs) and amplified fragment length polymorphisms (AFLPs), provide a powerful source of genome-wide data. In a recent paper, a method called SNAPP was introduced for statistical inference of species trees from unlinked bi-allelic markers. The generative process assumed by the method combined both a model of evolution for the bi-allelic markers, as well as the multispecies coalescent. A novel component of the method was a polynomial-time algorithm for exact computation of the likelihood of a fixed species tree via integration over all possible gene trees for a given marker. Here we report on a method for Bayesian inference of phylogenetic networks from bi-allelic markers. Our method significantly extends the algorithm for exact computation of phylogenetic network likelihood via integration over all possible gene trees. Unlike the case of species trees, the algorithm is no longer polynomial-time on all instances of phylogenetic networks. Furthermore, the method utilizes a reversible-jump MCMC technique to sample the posterior of phylogenetic networks given bi-allelic marker data. Our method has a very good performance in terms of accuracy and robustness as we demonstrate on simulated data, as well as a data set of multiple New Zealand species of the plant genus Ourisia (Plantaginaceae). We implemented the method in the publicly available, open-source PhyloNet software package.