Recent Submissions
Impact and characterization of serial structural variations across humans and great apes (Springer Nature, 2024)
Höps, Wolfram; Rausch, Tobias; Jendrusch, Michael; Korbel, Jan O.; Sedlazeck, Fritz J.
Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals (https://github.com/WHops/NAHRwhals), a method to infer repeat-mediated series of SVs in long-read genomic assemblies. Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.

Machine Learning to Enhance Electronic Detection of Diagnostic Errors (American Medical Association, 2024)
Zimolzak, Andrew J.; Wei, Li; Mir, Usman; Gupta, Ashish; Vaghani, Viralkumar; Subramanian, Devika; Singh, Hardeep

Polyphest: fast polyploid phylogeny estimation (Oxford University Press, 2024)
Yan, Zhi; Cao, Zhen; Nakhleh, Luay
Despite the widespread occurrence of polyploids across the Tree of Life, especially in the plant kingdom, very few computational methods have been developed to handle the specific complexities introduced by polyploids in phylogeny estimation.
Furthermore, methods that are designed to account for polyploidy often disregard incomplete lineage sorting (ILS), a major source of heterogeneous gene histories, or are computationally very demanding. Therefore, there is a great need for efficient and robust methods to accurately reconstruct polyploid phylogenies. We introduce Polyphest (POLYploid PHylogeny ESTimation), a new method for efficiently and accurately inferring species phylogenies in the presence of both polyploidy and ILS. Polyphest bypasses the need for extensive network space searches by first generating a multilabeled tree based on gene trees, which is then converted into a (uniquely labeled) species phylogeny. We compare the performance of Polyphest to that of two polyploid phylogeny estimation methods, one of which does not account for ILS, namely PADRE, and another that accounts for ILS, namely MPAllopp. Polyphest is more accurate than PADRE and achieves comparable accuracy to MPAllopp, while being significantly faster. We also demonstrate the application of Polyphest to empirical data from the hexaploid bread wheat and confirm the allopolyploid origin of bread wheat along with the closest relatives for each of its subgenomes. Polyphest is available at https://github.com/NakhlehLab/Polyphest.

Reference-free structural variant detection in microbiomes via long-read co-assembly graphs (Oxford University Press, 2024)
Curry, Kristen D; Yu, Feiqiao Brian; Vance, Summer E; Segarra, Santiago; Bhaya, Devaki; Chikhi, Rayan; Rocha, Eduardo P C; Treangen, Todd J
Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations.
While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining.
Results: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux.
Availability and implementation: rhea is open source and available at: https://github.com/treangenlab/rhea.

CrysFormer: Protein structure determination via Patterson maps, deep learning, and partial structure attention (AIP Publishing LLC, 2024)
Pan, Tom; Dun, Chen; Jin, Shikai; Miller, Mitchell D.; Kyrillidis, Anastasios; Phillips, George N., Jr.
Determining the atomic-level structure of a protein has been a decades-long challenge. However, recent advances in transformers and related neural network architectures have enabled researchers to significantly improve solutions to this problem. These methods use large datasets of sequence information and corresponding known protein template structures, if available.
Yet, such methods only focus on sequence information. Other available prior knowledge could also be utilized, such as constructs derived from x-ray crystallography experiments and the known structures of the most common conformations of amino acid residues, which we refer to as partial structures. To the best of our knowledge, we propose the first transformer-based model that directly utilizes experimental protein crystallographic data and partial structure information to calculate electron density maps of proteins. In particular, we use Patterson maps, which can be directly obtained from x-ray crystallography experimental data, thus bypassing the well-known crystallographic phase problem. We demonstrate that our method, CrysFormer, achieves precise predictions on two synthetic datasets of peptide fragments in crystalline forms, one with two residues per unit cell and the other with fifteen. These predictions can then be used to generate accurate atomic models using established crystallographic refinement programs.

Profiling complex repeat expansions in RFC1 in Parkinson’s disease (Springer Nature, 2024)
Alvarez Jerez, Pilar; Daida, Kensuke; Miano-Burkhardt, Abigail; Iwaki, Hirotaka; Malik, Laksh; Cogan, Guillaume; Makarious, Mary B.; Sullivan, Roisin; Vandrovcova, Jana; Ding, Jinhui; Gibbs, J. Raphael; Markham, Androo; Nalls, Mike A.; Kesharwani, Rupesh K.; Sedlazeck, Fritz J.; Casey, Bradford; Hardy, John; Houlden, Henry; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
A biallelic (AAGGG) expansion in the poly(A) tail of an AluSx3 transposable element within the gene RFC1 is a frequent cause of cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS), and more recently, has been reported as a rare cause of Parkinson’s disease (PD) in the Finnish population.
Here, we investigate the prevalence of RFC1 (AAGGG) expansions in PD patients of non-Finnish European ancestry in 1609 individuals from the Parkinson’s Progression Markers Initiative study. We identified four PD patients carrying the biallelic RFC1 (AAGGG) expansion and did not identify any carriers in controls.

A Machine Learning Model for Risk Stratification of Postdiagnosis Diabetic Ketoacidosis Hospitalization in Pediatric Type 1 Diabetes: Retrospective Study (JMIR, 2024)
Subramanian, Devika; Sonabend, Rona; Singh, Ila
Background: Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Despite multiple risk factors for postdiagnosis DKA, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D. Objective: We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time-series of electronic health record (EHR) data. Methods: We conducted a retrospective case-control study using EHR data from 1787 patients from among 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018. We trained a state-of-the-art, explainable, gradient-boosted ensemble (XGBoost) of decision trees with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model’s predictive performance using the area under the receiver operating characteristic curve, weighted F1-score, weighted precision, and recall, in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions. Results: Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001).
It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F1-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05) respectively, using a relatively short history of data from routine clinic follow-ups post diagnosis. On analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA both at the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA. Conclusions: We have built an explainable, predictive, machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels. Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. 
The clinical importance of our work is that the model can predict patients most at risk for postdiagnosis DKA and identify preventive interventions based on mitigation of individualized risk factors.

MethPhaser: methylation-based long-read haplotype phasing of human genomes (Springer Nature, 2024)
Fu, Yilei; Aganezov, Sergey; Mahmoud, Medhat; Beaulaurier, John; Juul, Sissel; Treangen, Todd J.; Sedlazeck, Fritz J.
The assignment of variants across haplotypes, phasing, is crucial for predicting the consequences, interaction, and inheritance of mutations and is a key step in improving our understanding of phenotype and disease. However, phasing is limited by read length and stretches of homozygosity along the genome. To overcome this limitation, we designed MethPhaser, a method that utilizes methylation signals from Oxford Nanopore Technologies to extend Single Nucleotide Variation (SNV)-based phasing. We demonstrate that haplotype-specific methylations extensively exist in human genomes and the advent of long-read technologies enabled direct report of methylation signals. For ONT R9 and R10 cell line data, we increase the phase length N50 by 78%-151% at a phasing accuracy of 83.4-98.7%. To assess the impact of tissue purity and random methylation signals due to inactivation, we also applied MethPhaser on blood samples from 4 patients, still showing improvements over SNV-only phasing. MethPhaser further improves phasing across HLA and multiple other medically relevant genes, improving our understanding of how mutations interact across multiple phenotypes. The concept of MethPhaser can also be extended to non-human diploid genomes. MethPhaser is available at https://github.com/treangenlab/methphaser.

Olivar: towards automated variant aware primer design for multiplex tiled amplicon sequencing of pathogens (Springer Nature, 2024)
Wang, Michael X.; Lou, Esther G.; Sapoval, Nicolae; Kim, Eddie; Kalvapalle, Prashant; Kille, Bryce; Elworth, R. A. Leo; Liu, Yunxi; Fu, Yilei; Stadler, Lauren B.; Treangen, Todd J.
Tiled amplicon sequencing has served as an essential tool for tracking the spread and evolution of pathogens. Over 15 million complete SARS-CoV-2 genomes are now publicly available, most sequenced and assembled via tiled amplicon sequencing. While computational tools for tiled amplicon design exist, they require downstream manual optimization both computationally and experimentally, which is slow and costly. Here we present Olivar, a first step towards a fully automated, variant-aware design of tiled amplicons for pathogen genomes. Olivar converts each nucleotide of the target genome into a numeric risk score, capturing undesired sequence features that should be avoided. In a direct comparison with PrimalScheme, we show that Olivar has fewer mismatches overlapping with primers and predicted PCR byproducts. We also compare Olivar head-to-head with ARTIC v4.1, the most widely used primer set for SARS-CoV-2 sequencing, and show that Olivar yields read mapping rates similar to (~90%), and coverage better than, the manually designed ARTIC v4.1 amplicons. We also evaluated Olivar on real wastewater samples and found that Olivar has up to 3-fold higher mapping rates while retaining similar coverage. In summary, Olivar automates and accelerates the generation of tiled amplicons, even in situations of high mutation frequency and/or density. Olivar is available online as a web application at https://olivar.rice.edu and can be installed locally as a command line tool with Bioconda.
Source code, installation guide, and usage are available at https://github.com/treangenlab/Olivar.

Detection of diffusely abnormal white matter in multiple sclerosis on multiparametric brain MRI using semi-supervised deep learning (Springer Nature, 2024)
Musall, Benjamin C.; Gabr, Refaat E.; Yang, Yanyu; Kamali, Arash; Lincoln, John A.; Jacobs, Michael A.; Ly, Vi; Luo, Xi; Wolinsky, Jerry S.; Narayana, Ponnada A.; Hasan, Khader M.
In addition to focal lesions, diffusely abnormal white matter (DAWM) is seen on brain MRI of multiple sclerosis (MS) patients and may represent early or distinct disease processes. The role of MRI-observed DAWM is understudied due to a lack of automated assessment methods. Supervised deep learning (DL) methods are highly capable in this domain, but require large sets of labeled data. To overcome this challenge, a DL-based network (DAWM-Net) was trained using semi-supervised learning on a limited set of labeled data for segmentation of DAWM, focal lesions, and normal-appearing brain tissues on multiparametric MRI. DAWM-Net segmentation performance was compared to a previous intensity thresholding-based method on an independent test set from expert consensus (N = 25). Segmentation overlap by Dice Similarity Coefficient (DSC) and Spearman correlation of DAWM volumes were assessed. DAWM-Net showed DSC > 0.93 for normal-appearing brain tissues and DSC > 0.81 for focal lesions. For DAWM-Net, the DAWM DSC was 0.49 ± 0.12 with a moderate volume correlation (ρ = 0.52, p < 0.01). The previous method showed lower DAWM DSC of 0.26 ± 0.08 and lacked a significant volume correlation (ρ = 0.23, p = 0.27). These results demonstrate the feasibility of DL-based DAWM auto-segmentation with semi-supervised learning.
This tool may facilitate future investigation of the role of DAWM in MS.

Characterizing a complex CT-rich haplotype in intron 4 of SNCA using large-scale targeted amplicon long-read sequencing (Springer Nature, 2024)
Alvarez Jerez, Pilar; Daida, Kensuke; Grenn, Francis P.; Malik, Laksh; Miano-Burkhardt, Abigail; Makarious, Mary B.; Ding, Jinhui; Gibbs, J. Raphael; Moore, Anni; Reed, Xylena; Nalls, Mike A.; Shah, Syed; Mahmoud, Medhat; Sedlazeck, Fritz J.; Dolzhenko, Egor; Park, Morgan; Iwaki, Hirotaka; Casey, Bradford; Ryten, Mina; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
Parkinson’s disease (PD) is a common neurodegenerative disorder with a significant risk proportion driven by genetics. While much progress has been made, most of the heritability remains unknown. This is in part because previous genetic studies have focused on the contribution of single nucleotide variants. More complex forms of variation, such as structural variants and tandem repeats, are already associated with several synucleinopathies. However, because more sophisticated sequencing methods are usually required to detect these regions, little is understood regarding their contribution to PD. One example is a polymorphic CT-rich region in intron 4 of the SNCA gene. This haplotype has been suggested to be associated with risk of Lewy Body (LB) pathology in Alzheimer’s Disease and SNCA gene expression, but is yet to be investigated in PD. Here, we attempt to resolve this CT-rich haplotype and investigate its role in PD. We performed targeted PacBio HiFi sequencing of the region in 1375 PD cases and 959 controls. We replicate the previously reported associations and a novel association between two PD risk SNVs (rs356182 and rs5019538) and haplotype 4, the largest haplotype.
Through quantitative trait locus analyses we identify a significant haplotype 4 association with alternative CAGE transcriptional start site usage, not leading to significant differential SNCA gene expression in post-mortem frontal cortex brain tissue. Therefore, disease association in this locus might not be biologically driven by this CT-rich repeat region. Our data demonstrate the complexity of this SNCA region and highlight that further follow-up functional studies are warranted.

Inverted triplications formed by iterative template switches generate structural variant diversity at genomic disorder loci (Elsevier, 2024)
Grochowski, Christopher M.; Bengtsson, Jesse D.; Du, Haowei; Gandhi, Mira; Lun, Ming Yin; Mehaffey, Michele G.; Park, KyungHee; Höps, Wolfram; Benito, Eva; Hasenfeld, Patrick; Korbel, Jan O.; Mahmoud, Medhat; Paulin, Luis F.; Jhangiani, Shalini N.; Hwang, James Paul; Bhamidipati, Sravya V.; Muzny, Donna M.; Fatih, Jawid M.; Gibbs, Richard A.; Pendleton, Matthew; Harrington, Eoghan; Juul, Sissel; Lindstrand, Anna; Sedlazeck, Fritz J.; Pehlivan, Davut; Lupski, James R.; Carvalho, Claudia M. B.
The duplication-triplication/inverted-duplication (DUP-TRP/INV-DUP) structure is a complex genomic rearrangement (CGR). Although it has been identified as an important pathogenic DNA mutation signature in genomic disorders and cancer genomes, its architecture remains unresolved. Here, we studied the genomic architecture of DUP-TRP/INV-DUP by investigating the DNA of 24 patients identified by array comparative genomic hybridization (aCGH) on whom we found evidence for the existence of 4 out of 4 predicted structural variant (SV) haplotypes. Using a combination of short-read genome sequencing (GS), long-read GS, optical genome mapping, and single-cell DNA template strand sequencing (strand-seq), the haplotype structure was resolved in 18 samples.
The point of template switching in 4 samples was shown to be a segment of ∼2.2–5.5 kb of 100% nucleotide similarity within inverted repeat pairs. These data provide experimental evidence that inverted low-copy repeats act as recombinant substrates. This type of CGR can result in multiple conformers generating diverse SV haplotypes in susceptible dosage-sensitive loci.

Crykey: Rapid identification of SARS-CoV-2 cryptic mutations in wastewater (Springer Nature, 2024)
Liu, Yunxi; Sapoval, Nicolae; Gallego-García, Pilar; Tomás, Laura; Posada, David; Treangen, Todd J.; Stadler, Lauren B.
Wastewater surveillance for SARS-CoV-2 provides early warnings of emerging variants of concern and can be used to screen for novel cryptic linked-read mutations, which are co-occurring single nucleotide mutations that are rare, or entirely missing, in existing SARS-CoV-2 databases. While previous approaches have focused on specific regions of the SARS-CoV-2 genome, there is a need for computational tools capable of efficiently tracking cryptic mutations across the entire genome and investigating their potential origin. We present Crykey, a tool for rapidly identifying rare linked-read mutations across the genome of SARS-CoV-2. We evaluated the utility of Crykey on over 3,000 wastewater and over 22,000 clinical samples; our findings are three-fold: i) we identify hundreds of cryptic mutations that cover the entire SARS-CoV-2 genome, ii) we track the presence of these cryptic mutations across multiple wastewater treatment plants and over three years of sampling in Houston, and iii) we find that a handful of cryptic mutations in wastewater mirror cryptic mutations in clinical samples and investigate their potential to represent real cryptic lineages.
In summary, Crykey enables large-scale detection of cryptic mutations in wastewater that represent potential circulating cryptic lineages, serving as a new computational tool for wastewater surveillance of SARS-CoV-2.

Depletion of lamins B1 and B2 promotes chromatin mobility and induces differential gene expression by a mesoscale-motion-dependent mechanism (Springer Nature, 2024)
Pujadas Liwag, Emily M.; Wei, Xiaolong; Acosta, Nicolas; Carter, Lucas M.; Yang, Jiekun; Almassalha, Luay M.; Jain, Surbhi; Daneshkhah, Ali; Rao, Suhas S. P.; Seker-Polat, Fidan; MacQuarrie, Kyle L.; Ibarra, Joe; Agrawal, Vasundhara; Aiden, Erez Lieberman; Kanemaki, Masato T.; Backman, Vadim; Adli, Mazhar; Center for Theoretical Biological Physics
B-type lamins are critical nuclear envelope proteins that interact with the three-dimensional genomic architecture. However, identifying the direct roles of B-lamins on dynamic genome organization has been challenging as their joint depletion severely impacts cell viability. To overcome this, we engineered mammalian cells to rapidly and completely degrade endogenous B-type lamins using Auxin-inducible degron technology.

Differentially Private Medians and Interior Points for Non-Pathological Data (Schloss Dagstuhl - Leibniz Center for Informatics, 2024)
Aliakbarpour, Maryam; Silver, Rose; Steinke, Thomas; Ullman, Jonathan
We construct sample-efficient differentially private estimators for the approximate-median and interior-point problems that can be applied to arbitrary input distributions over ℝ satisfying very mild statistical assumptions. Our results stand in contrast to the surprising negative result of Bun et al.
(FOCS 2015), which showed that private estimators with finite sample complexity cannot produce interior points on arbitrary distributions.

Automatic Active Lesion Tracking in Multiple Sclerosis Using Unsupervised Machine Learning (MDPI, 2024)
Uwaeze, Jason; Narayana, Ponnada A.; Kamali, Arash; Braverman, Vladimir; Jacobs, Michael A.; Akhbardeh, Alireza
Background: Identifying active lesions in magnetic resonance imaging (MRI) is crucial for the diagnosis and treatment planning of multiple sclerosis (MS). Active lesions on MRI are identified following the administration of Gadolinium-based contrast agents (GBCAs). However, recent studies have reported that repeated administration of GBCA results in the accumulation of Gd in tissues. In addition, GBCA administration increases health care costs. Thus, reducing or eliminating GBCA administration for active lesion detection is important for improved patient safety and reduced healthcare costs. Current state-of-the-art methods for identifying active lesions in brain MRI without GBCA administration utilize data-intensive deep learning methods. Objective: To implement nonlinear dimensionality reduction (NLDR) methods, locally linear embedding (LLE) and isometric feature mapping (Isomap), which are less data-intensive, for automatically identifying active lesions on brain MRI in MS patients, without the administration of contrast agents. Materials and Methods: Fluid-attenuated inversion recovery (FLAIR), T2-weighted, proton density-weighted, and pre- and post-contrast T1-weighted images were included in the multiparametric MRI dataset used in this study. Subtracted pre- and post-contrast T1-weighted images were labeled by experts as active lesions (ground truth). Unsupervised methods, LLE and Isomap, were used to reconstruct multiparametric brain MR images into a single embedded image. Active lesions were identified on the embedded images and compared with ground truth lesions.
The performance of NLDR methods was evaluated by calculating the Dice similarity (DS) index between the observed and identified active lesions in embedded images. Results: LLE and Isomap were applied to 40 MS patients, achieving median DS scores of 0.74 ± 0.1 and 0.78 ± 0.09, respectively, outperforming current state-of-the-art methods. Conclusions: NLDR methods, Isomap and LLE, are viable options for the identification of active MS lesions on non-contrast images, and potentially could be used as a clinical decision tool.

Exploring the Relation between Contextual Social Determinants of Health and COVID-19 Occurrence and Hospitalization (MDPI, 2024)
Chen, Aokun; Zhao, Yunpeng; Zheng, Yi; Hu, Hui; Hu, Xia; Fishe, Jennifer N.; Hogan, William R.; Shenkman, Elizabeth A.; Guo, Yi; Bian, Jiang
It is prudent to take a unified approach to exploring how contextual social determinants of health (SDoH) relate to COVID-19 occurrence and outcomes. Poor geographically represented data and a small number of contextual SDoH examined in most previous research studies have left a knowledge gap in the relationships between contextual SDoH and COVID-19 outcomes. In this study, we linked 199 contextual SDoH factors covering 11 domains of social and built environments with electronic health records (EHRs) from a large clinical research network (CRN) in the National Patient-Centered Clinical Research Network (PCORnet) to explore the relation between contextual SDoH and COVID-19 occurrence and hospitalization. We identified 15,890 COVID-19 patients and 63,560 matched non-COVID-19 patients in Florida between January 2020 and May 2021. We adopted a two-phase multiple linear regression approach modified from that in the exposome-wide association (ExWAS) study. After removing the highly correlated SDoH variables, 86 contextual SDoH variables were included in the data analysis.
Adjusting for race, ethnicity, and comorbidities, we found six contextual SDoH variables (i.e., hospital available beds and utilization, percent of vacant property, number of golf courses, and percent of minority) related to the occurrence of COVID-19, and three variables (i.e., farmers market, low access, and religion) related to the hospitalization of COVID-19. To the best of our knowledge, this is the first study to explore the relationship between contextual SDoH and COVID-19 occurrence and hospitalization using EHRs in a major PCORnet CRN. As an exploratory study, the causal effect of SDoH on COVID-19 outcomes will be evaluated in future studies.

A scientific machine learning framework to understand flash graphene synthesis (Royal Society of Chemistry, 2023)
Sattari, Kianoosh; Eddy, Lucas; Beckham, Jacob L.; Wyss, Kevin M.; Byfield, Richard; Qian, Long; Tour, James M.; Lin, Jian; NanoCarbon Center; Welch Institute for Advanced Materials
Flash Joule heating (FJH) is a far-from-equilibrium (FFE) processing method for converting low-value carbon-based materials to flash graphene (FG). Despite its promises in scalability and performance, attempts to explore the reaction mechanism have been limited due to the complexities involved in the FFE process. Data-driven machine learning (ML) models effectively account for the complexities, but the model training requires a considerable amount of experimental data. To tackle this challenge, we constructed a scientific ML (SML) framework trained by using both direct processing variables and indirect, physics-informed variables to predict the FG yield. The indirect variables include current-derived features (final current, maximum current, and charge density) predicted from the proxy ML models and reaction temperatures simulated from multi-physics modeling.
With the combined indirect features, the final ML model achieves an average R² score of 0.81 ± 0.05 and an average RMSE of 12.1% ± 2.0% in predicting the FG yield, which is significantly higher than the model trained without them (R² of 0.73 ± 0.05 and an RMSE of 14.3% ± 2.0%). Feature importance analysis validates the key roles of these indirect features in determining the reaction outcome. These results illustrate the promise of this SML to elucidate FFE material synthesis outcomes, thus paving a new avenue to processing other datasets from the materials systems involving the same or different FFE processes.

Joint embedding of biological networks for cross-species functional alignment (Oxford University Press, 2023)
Li, Lechuan; Dannenfelser, Ruth; Zhu, Yu; Hejduk, Nathaniel; Segarra, Santiago; Yao, Vicky
Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein–protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem. We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within and between species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance.
In addition, ETNA’s embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies.
https://github.com/ylaboratory/ETNA

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation (Oxford University Press, 2023)
Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M
The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. MashMap3 is available at https://github.com/marbl/MashMap.
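
For readers unfamiliar with the quantity that the minmer abstract above refers to, the following is a minimal sketch of the exact Jaccard similarity between the k-mer sets of two sequences. It is not MashMap's implementation and does not perform minimizer or minmer winnowing; the function names `kmers` and `jaccard` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of the k-mer sets of a and b.

    Illustrative only: sketching tools estimate this value from a sample
    of k-mers instead of computing it over the full sets.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k, so no k-mers exist.
        return 0.0
    return len(ka & kb) / len(union)
```

Sketching schemes such as minimizers or minmers approximate this value from a subset of k-mers, which is what makes very large numbers of pairwise comparisons feasible.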