Towards Accurate and Scalable Phylogenetic Inference Under Complex Evolutionary Models Using Neural Networks

Nakhleh, Luay K; Ogilvie, Huw A
2023-08-09; 2023-05; 2023-04-21; May 2023

Yakici, Berk Alp. "Towards Accurate and Scalable Phylogenetic Inference Under Complex Evolutionary Models Using Neural Networks." (2023) Master’s Thesis, Rice University. https://hdl.handle.net/1911/115163

Classical phylogenetic tree inference methods assume sequences evolve under a stationary, reversible, and homogeneous (SRH) model. This assumption is often violated in real data, sometimes severely, depending on the studied biological system. Furthermore, the inference of species trees also needs to account for population-level processes, such as recombination, which poses unique challenges. With a deluge of whole-genome data sets becoming increasingly available, accurately inferring species tree topologies under complex evolutionary models remains an open problem. In this thesis, I introduce a supervised learning (SL) method that uses multi-layer perceptrons (MLPs) for accurately inferring phylogenetic tree topologies from genome-scale data under complex evolutionary processes. We train our model with sequences simulated under the multispecies coalescent model with recombination (MSC-R), and we vary both clock rate and base frequency content across lineages. This enables our model to account for the effects and interactions of multiple complex processes. Utilizing a divide-and-conquer and supertree construction approach, we demonstrate that the inference scales beyond five taxa while remaining accurate. Using a simulation study, we show that our model can outperform classical supermatrix methods, such as neighbor-joining, maximum parsimony, and maximum likelihood, when the SRH assumption of sequence evolution is violated. Additionally, we demonstrate that the amalgamation of quintets is more accurate than that of quartets.
Further, we re-analyze a whole-genome alignment of 33 avian species using our MLP model, estimating a species tree that supports the hypothesis of a single origin of non-galliform waterbirds. The accuracy and scalability of our model demonstrate that supervised learning methods can be an important tool for phylogenetic analyses.

Format: application/pdf
Language: eng

Copyright is held by the author, unless otherwise indicated. Permission to reuse, publish, or reproduce the work beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder.

Keywords: species trees; supertrees; phylogenetic inference; likelihood-free inference; neural networks; multilayer perceptron; divide-and-conquer; multispecies coalescent; recombination; incomplete lineage sorting; non-stationarity; compositional bias

Thesis
2023-08-09
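The divide-and-conquer step mentioned in the abstract splits a taxon set into small subsets (quartets or quintets) whose trees are later amalgamated into a supertree. The following is a minimal Python sketch of the combinatorics involved, not the thesis's implementation: the function names are hypothetical, exhaustive enumeration of subsets is an assumption (the abstract does not say how subsets are chosen), and the topology count applies to unrooted binary trees, which may differ from the representation used in the thesis.

```python
from itertools import combinations
from math import comb


def num_unrooted_topologies(n: int) -> int:
    """Number of unrooted binary tree topologies on n labelled taxa: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n - 5 + 1, 2):
        count *= k
    return count


def taxon_subsets(taxa, k):
    """All k-taxon subsets of `taxa` (hypothetical helper; exhaustive by assumption)."""
    return list(combinations(taxa, k))


if __name__ == "__main__":
    # Candidate topologies per subset size.
    print("quartet topologies:", num_unrooted_topologies(4))
    print("quintet topologies:", num_unrooted_topologies(5))
    # Number of possible subsets for a 33-taxon data set.
    print("quartets among 33 taxa:", comb(33, 4))
    print("quintets among 33 taxa:", comb(33, 5))
    # Small worked example with placeholder taxon labels.
    print(taxon_subsets(["A", "B", "C", "D", "E", "F"], 5))
```

A quintet carries more topological information than a quartet (15 possible unrooted topologies versus 3), at the cost of many more possible subsets, so a practical pipeline might sample subsets rather than enumerate them all.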