Recent Submissions
Item Distilling the knowledge from large-language model for health event prediction
(Springer Nature, 2024) Ding, Sirui; Ye, Jiancheng; Hu, Xia; Zou, Na
Health event prediction is empowered by the rapid and wide application of electronic health records (EHR). In the Intensive Care Unit (ICU), precisely predicting health-related events in advance is essential for providing treatment and intervention to improve patient outcomes. EHR is a kind of multi-modal data containing clinical text, time series, structured data, etc. Most health event prediction works focus on a single modality, e.g., text or tabular EHR. How to effectively learn from multi-modal EHR for health event prediction remains a challenge. Inspired by the strong text-processing capability of large language models (LLMs), we propose the framework CKLE for health event prediction by distilling the knowledge from an LLM and learning from multi-modal EHR. There are two challenges in applying LLMs to health event prediction. The first is that most LLMs can only handle text data rather than other modalities, e.g., structured data. The second is that the privacy requirements of health applications require the LLM to be locally deployed, which may be limited by computational resources. CKLE solves the challenges of LLM scalability and portability in the healthcare domain by distilling the cross-modality knowledge from the LLM into the health event predictive model. To fully take advantage of the strong power of the LLM, the raw clinical text is refined and augmented with prompt learning. The embeddings of the clinical text are generated by the LLM. To effectively distill the knowledge of the LLM into the predictive model, we design a cross-modality knowledge distillation (KD) method. A specially designed training objective is used for the KD process, taking into account multiple modalities and patient similarity. The KD loss function consists of two parts.
The first is a cross-modality contrastive loss function, which models the correlation of different modalities from the same patient. The second is a patient similarity learning loss function, which models the correlations between similar patients. The cross-modality knowledge distillation can distill the rich information in clinical text and the knowledge of the LLM into the predictive model on structured EHR data. To demonstrate the effectiveness of CKLE, we evaluate it on two health event prediction tasks in the field of cardiology: heart failure prediction and hypertension prediction. We select 7125 patients from the MIMIC-III dataset and split them into train/validation/test sets. We achieve a maximum 4.48% improvement in accuracy compared to a state-of-the-art predictive model designed for health event prediction. The results demonstrate that CKLE can surpass the baseline prediction models significantly in both normal and limited-label settings. We also conduct a case study on cardiology disease analysis in heart failure and hypertension prediction. Through feature importance calculation, we analyse the salient features related to cardiology disease, which correspond to medical domain knowledge. The superior performance and interpretability of CKLE pave a promising way to leverage the power and knowledge of LLMs in health event prediction in real-world clinical settings.

Item MoleQCage: Geometric High-Throughput Screening for Molecular Caging Prediction
(American Chemical Society, 2024) Kravberg, Alexander; Devaurs, Didier; Varava, Anastasiia; Kavraki, Lydia E.; Kragic, Danica
Although being able to determine whether a host molecule can enclose a guest molecule and form a caging complex could benefit numerous chemical and medical applications, the experimental discovery of molecular caging complexes has not yet been achieved at scale.
Here, we propose MoleQCage, a simple tool for the high-throughput screening of host and guest candidates based on an efficient robotics-inspired geometric algorithm for molecular caging prediction, providing theoretical guarantees and robustness assessment. MoleQCage is distributed as Linux-based software with a graphical user interface and is available online at https://hub.docker.com/r/dantrigne/moleqcage in the form of a Docker container. Documentation and examples are available as Supporting Information and online at https://hub.docker.com/r/dantrigne/moleqcage.

Item Laser-induced high-entropy alloys as long-duration bifunctional electrocatalysts for seawater splitting
(Royal Society of Chemistry, 2024) Xie, Yunchao; Xu, Shichen; Meng, Andrew C.; Zheng, Bujingda; Chen, Zhenru; Tour, James M.; Lin, Jian; NanoCarbon Center; Rice Advanced Materials Institute
Electrocatalytic seawater splitting has garnered significant attention as a promising approach for eco-friendly, large-scale green hydrogen production. Development of high-efficiency and cost-effective electrocatalysts remains a frontier in this field. Herein, we report a rapid in situ synthesis of FeNiCoCrRu high-entropy alloy nanoparticles (HEA NPs) by direct CO2 laser induction of metal precursors on carbon paper under ambient conditions. Due to the induced ultrahigh temperature and ultrafast heating/quenching rates, FeNiCoCrRu HEA NPs with sizes ranging from 5 to 40 nm possess uniform phase homogeneity. FeNiCoCrRu HEA NPs exhibit exceptional bifunctional electrocatalytic activities, delivering overpotentials of 0.148 V at 600 mA cm−2 for the hydrogen evolution reaction and 0.353 V at 300 mA cm−2 for the oxygen evolution reaction in alkaline seawater. When the FeNiCoCrRu HEA NPs are assembled into an electrolyzer, it shows a negligible voltage increase at 250 mA cm−2 even after more than 3000 hours of operation.
This superior performance can be attributed to the high-entropy design, large electrochemical specific area, and excellent chemical and structural stability. An operando Raman spectroscopy study discloses that the Ni and Ru sites serve as active sites for hydrogen evolution, while the Ni site acts as an active site for oxygen evolution. This work demonstrates a laser-induced eco-friendly nanomaterial synthesis. The systematic studies offer an in-depth understanding of HEA design and its correlation with high-efficiency seawater splitting.

Item High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation
(Cold Spring Harbor Laboratory Press, 2024) Gustafson, Jonas A.; Gibson, Sophia B.; Damaraju, Nikhita; Zalusky, Miranda P. G.; Hoekzema, Kendra; Twesigomwe, David; Yang, Lei; Snead, Anthony A.; Richmond, Phillip A.; De Coster, Wouter; Olson, Nathan D.; Guarracino, Andrea; Li, Qiuhui; Miller, Angela L.; Goffena, Joy; Anderson, Zachary B.; Storz, Sophie H. R.; Ward, Sydney A.; Sinha, Maisha; Gonzaga-Jauregui, Claudia; Clarke, Wayne E.; Basile, Anna O.; Corvelo, André; Reeves, Catherine; Helland, Adrienne; Musunuri, Rajeeva Lochan; Revsine, Mahler; Patterson, Karynne E.; Paschal, Cate R.; Zakarian, Christina; Goodwin, Sara; Jensen, Tanner D.; Robb, Esther; The 1000 Genomes ONT Sequencing Consortium; University of Washington Center for Rare Disease Research (UW-CRDR); Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium; McCombie, William Richard; Sedlazeck, Fritz J.; Zook, Justin M.; Montgomery, Stephen B.; Garrison, Erik; Kolmogorov, Mikhail; Schatz, Michael C.; McLaughlin, Richard N.; Dashnow, Harriet; Zody, Michael C.; Loose, Matt; Jain, Miten; Eichler, Evan E.; Miller, Danny E.
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing.
Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. 
All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.

Item Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps
(Cold Spring Harbor Laboratory Press, 2024) Saether, Kristine Bilgrav; Eisfeldt, Jesper; Bengtsson, Jesse D.; Lun, Ming Yin; Grochowski, Christopher M.; Mahmoud, Medhat; Chao, Hsiao-Tuan; Rosenfeld, Jill A.; Liu, Pengfei; Ek, Marlene; Schuy, Jakob; Ameur, Adam; Dai, Hongzheng; Undiagnosed Diseases Network; Hwang, James Paul; Sedlazeck, Fritz J.; Bi, Weimin; Marom, Ronit; Wincent, Josephine; Nordgren, Ann; Carvalho, Claudia M. B.; Lindstrand, Anna
Chromosomal inversions (INVs) are particularly challenging to detect due to their copy-number neutral state and association with repetitive regions. Inversions represent about 1/20 of all balanced structural chromosome aberrations and can lead to disease by gene disruption or altering regulatory regions of dosage-sensitive genes in cis. Short-read genome sequencing (srGS) can only resolve ∼70% of cytogenetically visible inversions referred to clinical diagnostic laboratories, likely due to breakpoints in repetitive regions. Here, we study 12 inversions by long-read genome sequencing (lrGS) (n = 9) or srGS (n = 3) and resolve nine of them. In four cases, the inversion breakpoint region was missing from at least one of the human reference genomes (GRCh37, GRCh38, T2T-CHM13) and a reference-agnostic analysis was needed. One of these cases, an INV9 mappable only in de novo assembled lrGS data using T2T-CHM13, disrupts EHMT1, consistent with a Mendelian diagnosis (Kleefstra syndrome 1; MIM#610253).
Next, by pairwise comparison between T2T-CHM13, GRCh37, and GRCh38, as well as the chimpanzee and bonobo, we show that hundreds of megabases of sequence are missing from at least one human reference, highlighting that primate genomes contribute to genomic diversity. Aligning population genomic data to these regions indicated that these regions are variable between individuals. Our analysis emphasizes that T2T-CHM13 is necessary to maximize the value of lrGS for optimal inversion detection in clinical diagnostics. These results highlight the importance of leveraging diverse and comprehensive reference genomes to resolve unsolved molecular cases in rare diseases.

Item The GIAB genomic stratifications resource for human reference genomes
(Springer Nature, 2024) Dwarshuis, Nathan; Kalra, Divya; McDaniel, Jennifer; Sanio, Philippe; Alvarez Jerez, Pilar; Jadhav, Bharati; Huang, Wenyu (Eddy); Mondal, Rajarshi; Busby, Ben; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Majidian, Sina; Zook, Justin M.
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13.
Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.

Item Single-cell somatic copy number variants in brain using different amplification methods and reference genomes
(Springer Nature, 2024) Kalef-Ezra, Ester; Turan, Zeliha Gozde; Perez-Rodriguez, Diego; Bomann, Ida; Behera, Sairam; Morley, Caoimhe; Scholz, Sonja W.; Jaunmuktane, Zane; Demeulemeester, Jonas; Sedlazeck, Fritz J.; Proukakis, Christos
The presence of somatic mutations, including copy number variants (CNVs), in the brain is well recognized. Comprehensive study requires single-cell whole genome amplification, with several methods available, prior to sequencing. Here we compare PicoPLEX with two recent adaptations of multiple displacement amplification (MDA): primary template-directed amplification (PTA) and droplet MDA, across 93 human brain cortical nuclei. We demonstrate different properties for each, with PTA providing the broadest amplification, PicoPLEX the most even, and distinct chimeric profiles. Furthermore, we perform CNV calling on two brains with multiple system atrophy and one control brain using different reference genomes. We find that 20.6% of brain cells have at least one Mb-scale CNV, with some supported by bulk sequencing or single cells from other brain regions.
Our study highlights the importance of selecting whole genome amplification method and reference genome for CNV calling, while supporting the existence of somatic CNVs in healthy and diseased human brain.

Item StratoMod: predicting sequencing and variant calling errors with interpretable machine learning
(Springer Nature, 2024) Dwarshuis, Nathan; Tonner, Peter; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Zook, Justin M.
Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using HiFi or Illumina and leverage StratoMod’s interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously-inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines which only filter variants likely to be false.
We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines.

Item A best-match approach for gene set analyses in embedding spaces
(Cold Spring Harbor Laboratory Press, 2024) Li, Lechuan; Dannenfelser, Ruth; Cruz, Charlie; Yao, Vicky
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein–protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems.
Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.

Item Association of education attainment, smoking status, and alcohol use disorder with dementia risk in older adults: a longitudinal observational study
(Springer Nature, 2024) Tang, Huilin; Shaaban, C. Elizabeth; DeKosky, Steven T.; Smith, Glenn E.; Hu, Xia; Jaffee, Michael; Salloum, Ramzi G.; Bian, Jiang; Guo, Jingchuan
Previous research on the risk of dementia associated with education attainment, smoking status, and alcohol use disorder (AUD) has yielded inconsistent results, indicating potential heterogeneous treatment effects (HTEs) of these factors on dementia risk. Thus, this study aimed to identify the important variables that may contribute to HTEs of these factors in older adults.

Item Sampling-Based Motion Planning: A Comparative Review
(Annual Reviews, 2024) Orthey, Andreas; Chamzas, Constantinos; Kavraki, Lydia E.
Sampling-based motion planning is one of the fundamental paradigms to generate robot motions, and a cornerstone of robotics research. This comparative review provides an up-to-date guide and reference manual for the use of sampling-based motion planning algorithms. It includes a history of motion planning, an overview of the most successful planners, and a discussion of their properties. It also shows how planners can handle special cases and how extensions of motion planning can be accommodated. To put sampling-based motion planning into a larger context, a discussion of alternative motion generation frameworks highlights their respective differences from sampling-based motion planning. Finally, a set of sampling-based motion planners are compared on 24 challenging planning problems in order to provide insights into which planners perform well in which situations and where future research would be required.
This comparative review thereby provides not only a useful reference manual for researchers in the field but also a guide for practitioners to make informed algorithmic decisions.

Item Singly exponential translation of alternating weak Büchi automata to unambiguous Büchi automata
(Elsevier, 2024) Li, Yong; Schewe, Sven; Vardi, Moshe Y.
We introduce a method for translating an alternating weak Büchi automaton (AWA), which corresponds to a Linear Dynamic Logic (LDL) formula, to an unambiguous Büchi automaton (UBA). Our translations generalize constructions for Linear Temporal Logic (LTL), a less expressive specification language than LDL. In classical constructions, LTL formulas are first translated to alternating very weak Büchi automata (AVAs), automata that have only singleton strongly connected components (SCCs); these AVAs are then handled by efficient disambiguation procedures. However, general AWAs can have larger SCCs, which complicates disambiguation. Currently, the only available disambiguation procedure has to go through an intermediate construction of nondeterministic Büchi automata (NBAs), which would incur an exponential blow-up of its own. We introduce a translation from general AWAs to UBAs with a singly exponential blow-up, which also immediately provides a singly exponential translation from LDL to UBAs. Interestingly, the complexity of our translation is smaller than the best known disambiguation algorithm for NBAs (broadly (0.53n)^n vs. (0.76n)^n), while the input of our construction can be exponentially more succinct.

Item Machine Learning to Enhance Electronic Detection of Diagnostic Errors
(American Medical Association, 2024) Zimolzak, Andrew J.; Wei, Li; Mir, Usman; Gupta, Ashish; Vaghani, Viralkumar; Subramanian, Devika; Singh, Hardeep

Item Impact and characterization of serial structural variations across humans and great apes
(Springer Nature, 2024) Höps, Wolfram; Rausch, Tobias; Jendrusch, Michael; Korbel, Jan O.; Sedlazeck, Fritz J.
Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals (https://github.com/WHops/NAHRwhals), a method to infer repeat-mediated series of SVs in long-read genomic assemblies. Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.

Item Polyphest: fast polyploid phylogeny estimation
(Oxford University Press, 2024) Yan, Zhi; Cao, Zhen; Nakhleh, Luay
Despite the widespread occurrence of polyploids across the Tree of Life, especially in the plant kingdom, very few computational methods have been developed to handle the specific complexities introduced by polyploids in phylogeny estimation.
Furthermore, methods that are designed to account for polyploidy often disregard incomplete lineage sorting (ILS), a major source of heterogeneous gene histories, or are computationally very demanding. Therefore, there is a great need for efficient and robust methods to accurately reconstruct polyploid phylogenies. We introduce Polyphest (POLYploid PHylogeny ESTimation), a new method for efficiently and accurately inferring species phylogenies in the presence of both polyploidy and ILS. Polyphest bypasses the need for extensive network space searches by first generating a multilabeled tree based on gene trees, which is then converted into a (uniquely labeled) species phylogeny. We compare the performance of Polyphest to that of two polyploid phylogeny estimation methods, one of which does not account for ILS, namely PADRE, and another that accounts for ILS, namely MPAllopp. Polyphest is more accurate than PADRE and achieves comparable accuracy to MPAllopp, while being significantly faster. We also demonstrate the application of Polyphest to empirical data from the hexaploid bread wheat and confirm the allopolyploid origin of bread wheat along with the closest relatives for each of its subgenomes. Polyphest is available at https://github.com/NakhlehLab/Polyphest.

Item Reference-free structural variant detection in microbiomes via long-read co-assembly graphs
(Oxford University Press, 2024) Curry, Kristen D; Yu, Feiqiao Brian; Vance, Summer E; Segarra, Santiago; Bhaya, Devaki; Chikhi, Rayan; Rocha, Eduardo P C; Treangen, Todd J
Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations.
While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. Results: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. Availability and implementation: rhea is open source and available at: https://github.com/treangenlab/rhea.

Item CrysFormer: Protein structure determination via Patterson maps, deep learning, and partial structure attention
(AIP Publishing LLC, 2024) Pan, Tom; Dun, Chen; Jin, Shikai; Miller, Mitchell D.; Kyrillidis, Anastasios; Phillips, George N., Jr.
Determining the atomic-level structure of a protein has been a decades-long challenge. However, recent advances in transformers and related neural network architectures have enabled researchers to significantly improve solutions to this problem. These methods use large datasets of sequence information and corresponding known protein template structures, if available.
Yet, such methods only focus on sequence information. Other available prior knowledge could also be utilized, such as constructs derived from x-ray crystallography experiments and the known structures of the most common conformations of amino acid residues, which we refer to as partial structures. To the best of our knowledge, we propose the first transformer-based model that directly utilizes experimental protein crystallographic data and partial structure information to calculate electron density maps of proteins. In particular, we use Patterson maps, which can be directly obtained from x-ray crystallography experimental data, thus bypassing the well-known crystallographic phase problem. We demonstrate that our method, CrysFormer, achieves precise predictions on two synthetic datasets of peptide fragments in crystalline forms, one with two residues per unit cell and the other with fifteen. These predictions can then be used to generate accurate atomic models using established crystallographic refinement programs.

Item Profiling complex repeat expansions in RFC1 in Parkinson’s disease
(Springer Nature, 2024) Alvarez Jerez, Pilar; Daida, Kensuke; Miano-Burkhardt, Abigail; Iwaki, Hirotaka; Malik, Laksh; Cogan, Guillaume; Makarious, Mary B.; Sullivan, Roisin; Vandrovcova, Jana; Ding, Jinhui; Gibbs, J. Raphael; Markham, Androo; Nalls, Mike A.; Kesharwani, Rupesh K.; Sedlazeck, Fritz J.; Casey, Bradford; Hardy, John; Houlden, Henry; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
A biallelic (AAGGG) expansion in the poly(A) tail of an AluSx3 transposable element within the gene RFC1 is a frequent cause of cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS), and more recently, has been reported as a rare cause of Parkinson’s disease (PD) in the Finnish population.
Here, we investigate the prevalence of RFC1 (AAGGG) expansions in PD patients of non-Finnish European ancestry in 1609 individuals from the Parkinson’s Progression Markers Initiative study. We identified four PD patients carrying the biallelic RFC1 (AAGGG) expansion and did not identify any carriers in controls.

Item A Machine Learning Model for Risk Stratification of Postdiagnosis Diabetic Ketoacidosis Hospitalization in Pediatric Type 1 Diabetes: Retrospective Study
(JMIR, 2024) Subramanian, Devika; Sonabend, Rona; Singh, Ila
Background: Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Despite multiple risk factors for postdiagnosis DKA, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D. Objective: We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time-series of electronic health record (EHR) data. Methods: We conducted a retrospective case-control study using EHR data from 1787 patients from among 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018. We trained a state-of-the-art, explainable, gradient-boosted ensemble (XGBoost) of decision trees with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model’s predictive performance using the area under the receiver operating characteristic curve, weighted F1-score, weighted precision, and recall, in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions. Results: Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001).
It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F1-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05) respectively, using a relatively short history of data from routine clinic follow-ups post diagnosis. On analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA both at the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA. Conclusions: We have built an explainable, predictive, machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels. Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. 
The clinical importance of our work is that the model can predict patients most at risk for postdiagnosis DKA and identify preventive interventions based on mitigation of individualized risk factors.

Item MethPhaser: methylation-based long-read haplotype phasing of human genomes
(Springer Nature, 2024) Fu, Yilei; Aganezov, Sergey; Mahmoud, Medhat; Beaulaurier, John; Juul, Sissel; Treangen, Todd J.; Sedlazeck, Fritz J.
The assignment of variants across haplotypes, phasing, is crucial for predicting the consequences, interaction, and inheritance of mutations and is a key step in improving our understanding of phenotype and disease. However, phasing is limited by read length and stretches of homozygosity along the genome. To overcome this limitation, we designed MethPhaser, a method that utilizes methylation signals from Oxford Nanopore Technologies to extend Single Nucleotide Variation (SNV)-based phasing. We demonstrate that haplotype-specific methylations extensively exist in human genomes and the advent of long-read technologies enabled direct report of methylation signals. For ONT R9 and R10 cell line data, we increase the phase length N50 by 78%-151% at a phasing accuracy of 83.4-98.7%. To assess the impact of tissue purity and random methylation signals due to inactivation, we also applied MethPhaser on blood samples from 4 patients, still showing improvements over SNV-only phasing. MethPhaser further improves phasing across HLA and multiple other medically relevant genes, improving our understanding of how mutations interact across multiple phenotypes. The concept of MethPhaser can also be extended to non-human diploid genomes. MethPhaser is available at https://github.com/treangenlab/methphaser.