S.O.S. for cost-effective and computationally efficient targeted microbial pathogen detection in clinical and public health settings

Wang, Michael Xiangjiang

dc.contributor.advisor	Treangen, Todd J.	en_US
dc.contributor.advisor	Veiseh, Omid	en_US
dc.creator	Wang, Michael Xiangjiang	en_US
dc.date.accessioned	2025-05-29T19:46:21Z	en_US
dc.date.created	2025-05	en_US
dc.date.issued	2025-04-25	en_US
dc.date.submitted	May 2025	en_US
dc.date.updated	2025-05-29T19:46:21Z	en_US
dc.description.abstract	The COVID-19 pandemic underscored the importance of rapid, accurate, and cost-effective pathogen diagnostics for public health and clinical settings. Thanks to recent advancements in DNA sequencing, microbial pathogens are now studied at unprecedented scale and depth. However, in environmental (e.g., wastewater) and clinical (e.g., blood) samples, microbial pathogens often represent a tiny fraction of the total nucleic acids, rendering untargeted sequencing impractical with respect to cost. Additionally, untargeted approaches require 24 to 72 hours for sequencing and thus cannot fully replace fluorescence probe-based assays, which offer rapid turnaround times, simple experimental workflows, lower sample quality requirements, and accessible instrumentation. Targeted amplification techniques, such as polymerase chain reaction (PCR) or hybrid capture, address these limitations by selectively enriching nucleic acid sequences of interest by a factor of millions or billions, significantly reducing sequencing costs and enabling compatibility with fluorescence probe-based detection technologies. In the first part of this thesis, I present SADDLE, a stochastic algorithm for the design of multiplex PCR primer sets that minimizes primer dimer formation. One major challenge in the design of highly multiplexed PCR primer sets is that the number of potential primer dimer species grows quadratically with the number of primers to be designed. At the same time, there are exponentially many choices for multiplex primer sequence selection, making systematic evaluation computationally intractable. SADDLE tackles these problems by implementing a novel algorithm that estimates the dimer likelihood of the whole primer set in linear time. Combined with simulated annealing, SADDLE efficiently explores the potential primer sets and reduces dimer formation by more than 10-fold in real experimental settings. 
Second, I present Olivar, a first step towards fully automated, variant-aware design of tiled amplicons for pathogen genomes. Existing methods take a one-shot approach to tiled amplicon design, resulting in semi-optimized primer sets that need further manual optimization. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided, including variations and repetitive sequences. It then evaluates thousands of possible primer combinations with a highly efficient loss function. In a direct comparison with ARTIC, the most widely used primer set for SARS-CoV-2 sequencing, Olivar has up to 3-fold higher mapping rates on real wastewater samples while retaining similar coverage. Lastly, I present Seqwin, an annotation-free method for relaxed search of clade-specific marker sequences based on minimizer graphs. These markers are critical for pathogen identification, disease surveillance, and taxonomic classification. Earlier methods search for maximal unique matches with suffix trees, but they are susceptible to sequence variations and scale poorly as genomic databases rapidly expand. More recent solutions improve scalability by clustering protein-coding genes, but they require genome annotation and restrict marker discovery to coding regions. Seqwin takes a novel approach of clustering minimizers into graph nodes, eliminating the expensive annotation step while achieving linear scalability, allowing it to run on computing systems with limited resources. In summary, the primary contributions of this thesis are open-source computational approaches for designing targeted assays for pathogen detection, minimizing manual labor, automating optimization, and simplifying the end-to-end design-to-deployment process. 
All of these contributions are open source and freely available to public health and clinical labs across the globe, providing an S.O.S. for rapid and accurate targeted detection of microbial pathogens.	en_US
dc.embargo.lift	2026-05-01	en_US
dc.embargo.terms	2026-05-01	en_US
dc.format.mimetype	application/pdf	en_US
dc.identifier.uri	https://hdl.handle.net/1911/118430	en_US
dc.language.iso	en	en_US
dc.subject	targeted detection, pathogen, nucleic acid	en_US
dc.title	S.O.S. for cost-effective and computationally efficient targeted microbial pathogen detection in clinical and public health settings	en_US
dc.type	Thesis	en_US
dc.type.material	Text	en_US
thesis.degree.department	Bioengineering	en_US
thesis.degree.discipline	Bioengineering	en_US
thesis.degree.grantor	Rice University	en_US
thesis.degree.level	Doctoral	en_US
thesis.degree.name	Doctor of Philosophy	en_US
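
Illustrative note on the minimizers mentioned in the abstract: the abstract does not spell out how Seqwin selects or clusters minimizers, so the following is only a minimal Python sketch of the standard (w, k)-minimizer definition, not Seqwin's implementation. The function names `minimizers` and `shared_minimizers`, the lexicographic k-mer ordering, and the toy sequences are all assumptions made for illustration. The sketch shows one reason minimizer-based comparison can tolerate sequence variation: a single substitution only affects minimizers whose k-mers overlap it.

```python
from __future__ import annotations


def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs of (w, k)-minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (leftmost on ties). Consecutive windows that pick the
    same occurrence are reported once.
    Assumption: plain lexicographic order on a single strand (toy setup).
    """
    if k <= 0 or w <= 0:
        raise ValueError("k and w must be positive")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        # min() returns the first smallest index, so ties go to the left
        best = min(range(start, start + w), key=lambda i: kmers[i])
        hit = (best, kmers[best])
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


def shared_minimizers(a: str, b: str, k: int, w: int) -> set[str]:
    """Return minimizer k-mers that occur in both sequences (hypothetical helper)."""
    in_a = {kmer for _, kmer in minimizers(a, k, w)}
    in_b = {kmer for _, kmer in minimizers(b, k, w)}
    return in_a & in_b


# Toy example: the second sequence differs by one substitution (C -> G).
reference = "ACGTTGCA"
variant = "ACGTTGGA"
print(minimizers(reference, k=3, w=3))
# [(0, 'ACG'), (1, 'CGT'), (2, 'GTT'), (5, 'GCA')]
print(sorted(shared_minimizers(reference, variant, k=3, w=3)))
# ['ACG', 'CGT', 'GTT']
```

In this toy case three of the four minimizers survive the substitution, whereas an exact-match search over the full sequence would fail. Production tools commonly order k-mers by a hash value rather than lexicographically and consider both strands; those refinements are omitted here.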

Collections

Rice University Theses and Dissertations