S.O.S. for cost-effective and computationally efficient targeted microbial pathogen detection in clinical and public health settings

dc.contributor.advisor | Treangen, Todd J. | en_US
dc.contributor.advisor | Veiseh, Omid | en_US
dc.creator | Wang, Michael Xiangjiang | en_US
dc.date.accessioned | 2025-05-29T19:46:21Z | en_US
dc.date.created | 2025-05 | en_US
dc.date.issued | 2025-04-25 | en_US
dc.date.submitted | May 2025 | en_US
dc.date.updated | 2025-05-29T19:46:21Z | en_US
dc.description.abstract | The COVID-19 pandemic underscored the importance of rapid, accurate, and cost-effective pathogen diagnostics for public health and clinical settings. Thanks to recent advancements in DNA sequencing, microbial pathogens are now studied at unprecedented scale and depth. However, in environmental (e.g. wastewater) and clinical (e.g. blood) samples, microbial pathogens often represent a tiny fraction of the total nucleic acids, rendering untargeted sequencing impractical with respect to cost. Additionally, untargeted approaches require 24 to 72 hours for sequencing and thus cannot fully replace fluorescence probe-based assays, which offer rapid turnaround times, simple experimental workflows, lower sample quality requirements, and accessible instrumentation. Targeted amplification techniques, such as polymerase chain reaction (PCR) or hybrid capture, address these limitations by selectively enriching nucleic acid sequences of interest by millions- or billions-fold, significantly reducing sequencing costs and enabling compatibility with fluorescence probe-based detection technologies. In the first part of this thesis, I present SADDLE, a stochastic algorithm for the design of multiplex PCR primer sets that minimizes primer dimer formation. One major challenge in the design of highly multiplexed PCR primer sets is that the number of potential primer dimer species grows quadratically with the number of primers to be designed. At the same time, there are exponentially many choices for multiplex primer sequence selection, which makes systematic evaluation computationally intractable. SADDLE tackles these problems by implementing a novel algorithm that estimates the dimer likelihood of the whole primer set in linear time. Combined with simulated annealing, SADDLE efficiently explores the potential primer sets and reduces dimer formation by more than 10-fold in real experimental settings.
Second, I present Olivar, a first step towards a fully automated, variant-aware design of tiled amplicons for pathogen genomes. Existing methods take a one-shot approach to tiled amplicon design, resulting in semi-optimized primer sets that need further manual optimization. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided, including variations and repetitive sequences. It then evaluates thousands of possible primer combinations with a highly efficient loss function. In a direct comparison with ARTIC, the most widely used primer set for SARS-CoV-2 sequencing, Olivar achieves up to 3-fold higher mapping rates on real wastewater samples while retaining similar coverage. Lastly, I present Seqwin, an annotation-free method for relaxed search of clade-specific marker sequences based on minimizer graphs. These markers are critical for pathogen identification, disease surveillance, and taxonomic classification. Earlier methods search for maximal unique matches with suffix trees, but they are susceptible to sequence variations and are limited by rapidly expanding genomic databases. More recent solutions improve scalability by clustering protein-coding genes, but they require genome annotation and restrict marker discovery to coding regions. Seqwin takes a novel approach of clustering minimizers into graph nodes, eliminating the expensive annotation step while achieving linear scalability, which allows it to run on computing systems with limited resources. In summary, the primary contributions of this thesis are open-source computational approaches for designing targeted assays for pathogen detection, minimizing manual labor, automating optimization, and simplifying the end-to-end design-to-deployment process.
All of these contributions are freely available to public health and clinical labs across the globe, providing an S.O.S. for rapid and accurate targeted detection of microbial pathogens. | en_US
dc.embargo.lift | 2026-05-01 | en_US
dc.embargo.terms | 2026-05-01 | en_US
dc.format.mimetype | application/pdf | en_US
dc.identifier.uri | https://hdl.handle.net/1911/118430 | en_US
dc.language.iso | en | en_US
dc.subject | targeted detection, pathogen, nucleic acid | en_US
dc.title | S.O.S. for cost-effective and computationally efficient targeted microbial pathogen detection in clinical and public health settings | en_US
dc.type | Thesis | en_US
dc.type.material | Text | en_US
thesis.degree.department | Bioengineering | en_US
thesis.degree.discipline | Bioengineering | en_US
thesis.degree.grantor | Rice University | en_US
thesis.degree.level | Doctoral | en_US
thesis.degree.name | Doctor of Philosophy | en_US