Computer Science Publications

Permanent URI for this collection

https://hdl.handle.net/1911/37143

Distilling the knowledge from large-language model for health event prediction
(Springer Nature, 2024) Ding, Sirui; Ye, Jiancheng; Hu, Xia; Zou, Na
Health event prediction is empowered by the rapid and wide application of electronic health records (EHR). In the Intensive Care Unit (ICU), precisely predicting health-related events in advance is essential for providing treatment and intervention to improve patients' outcomes. EHR is a kind of multi-modal data containing clinical text, time series, structured data, etc. Most health event prediction works focus on a single modality, e.g., text or tabular EHR. How to effectively learn from multi-modal EHR for health event prediction remains a challenge. Inspired by the strong text-processing capability of large language models (LLMs), we propose the framework CKLE for health event prediction by distilling the knowledge from an LLM and learning from multi-modal EHR. There are two challenges in applying LLMs to health event prediction. The first is that most LLMs can only handle text data rather than other modalities, e.g., structured data. The second is that the privacy requirements of health applications call for the LLM to be deployed locally, which may be limited by the available computational resources. CKLE solves the challenges of LLM scalability and portability in the healthcare domain by distilling the cross-modality knowledge from the LLM into the health event predictive model. To take full advantage of the LLM, the raw clinical text is refined and augmented with prompt learning. The embeddings of the clinical text are generated by the LLM. To effectively distill the knowledge of the LLM into the predictive model, we design a cross-modality knowledge distillation (KD) method. A specially designed training objective is used for the KD process, taking into account multiple modalities and patient similarity. The KD loss function consists of two parts. The first is a cross-modality contrastive loss function, which models the correlation of different modalities from the same patient. The second is a patient similarity learning loss function, which models the correlations between similar patients. The cross-modality knowledge distillation can distill the rich information in clinical text and the knowledge of the LLM into the predictive model on structured EHR data. To demonstrate the effectiveness of CKLE, we evaluate it on two health event prediction tasks in the field of cardiology: heart failure prediction and hypertension prediction. We select 7125 patients from the MIMIC-III dataset and split them into train/validation/test sets. We achieve a maximum 4.48% improvement in accuracy compared to state-of-the-art predictive models designed for health event prediction. The results demonstrate that CKLE can surpass the baseline prediction models significantly in both normal and limited-label settings. We also conduct a case study on cardiology disease analysis in heart failure and hypertension prediction. Through feature importance calculation, we analyse the salient features related to cardiology disease, which correspond to medical domain knowledge. The superior performance and interpretability of CKLE pave a promising way to leverage the power and knowledge of LLMs for health event prediction in real-world clinical settings.
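As an illustration of the cross-modality contrastive idea described in the abstract above, the following is a minimal sketch of an InfoNCE-style loss in plain Python. It is not the CKLE objective: the abstract does not give the exact formula, so the function name `cross_modal_info_nce`, the cosine-similarity scoring, and the `temperature` parameter are all assumptions chosen for illustration. Row `i` of each input is assumed to hold the two modality embeddings of the same patient.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def cross_modal_info_nce(text_emb, struct_emb, temperature=0.1):
    """Mean InfoNCE-style loss over patients (hypothetical helper).

    text_emb[i] and struct_emb[i] are assumed to be embeddings of the
    same patient; every other row in struct_emb acts as a negative.
    """
    text = [_normalize(v) for v in text_emb]
    struct = [_normalize(v) for v in struct_emb]
    total = 0.0
    for i, t in enumerate(text):
        logits = [_dot(t, s) / temperature for s in struct]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the matching patient.
        total += log_denominator - logits[i]
    return total / len(text)
```

The loss is small when each patient's text embedding is closest to that patient's own structured-data embedding, and large when the pairing is scrambled.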
MoleQCage: Geometric High-Throughput Screening for Molecular Caging Prediction
(American Chemical Society, 2024) Kravberg, Alexander; Devaurs, Didier; Varava, Anastasiia; Kavraki, Lydia E.; Kragic, Danica
Although being able to determine whether a host molecule can enclose a guest molecule and form a caging complex could benefit numerous chemical and medical applications, the experimental discovery of molecular caging complexes has not yet been achieved at scale. Here, we propose MoleQCage, a simple tool for the high-throughput screening of host and guest candidates based on an efficient robotics-inspired geometric algorithm for molecular caging prediction, providing theoretical guarantees and robustness assessment. MoleQCage is distributed as Linux-based software with a graphical user interface and is available online at https://hub.docker.com/r/dantrigne/moleqcage in the form of a Docker container. Documentation and examples are available as Supporting Information and online at https://hub.docker.com/r/dantrigne/moleqcage.
High-coverage nanopore sequencing of samples from the 1000 Genomes Project to build a comprehensive catalog of human genetic variation
(Cold Spring Harbor Laboratory Press, 2024) Gustafson, Jonas A.; Gibson, Sophia B.; Damaraju, Nikhita; Zalusky, Miranda P. G.; Hoekzema, Kendra; Twesigomwe, David; Yang, Lei; Snead, Anthony A.; Richmond, Phillip A.; De Coster, Wouter; Olson, Nathan D.; Guarracino, Andrea; Li, Qiuhui; Miller, Angela L.; Goffena, Joy; Anderson, Zachary B.; Storz, Sophie H. R.; Ward, Sydney A.; Sinha, Maisha; Gonzaga-Jauregui, Claudia; Clarke, Wayne E.; Basile, Anna O.; Corvelo, André; Reeves, Catherine; Helland, Adrienne; Musunuri, Rajeeva Lochan; Revsine, Mahler; Patterson, Karynne E.; Paschal, Cate R.; Zakarian, Christina; Goodwin, Sara; Jensen, Tanner D.; Robb, Esther; The 1000 Genomes ONT Sequencing Consortium; University of Washington Center for Rare Disease Research (UW-CRDR); Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium; McCombie, William Richard; Sedlazeck, Fritz J.; Zook, Justin M.; Montgomery, Stephen B.; Garrison, Erik; Kolmogorov, Mikhail; Schatz, Michael C.; McLaughlin, Richard N.; Dashnow, Harriet; Zody, Michael C.; Loose, Matt; Jain, Miten; Eichler, Evan E.; Miller, Danny E.
Fewer than half of individuals with a suspected Mendelian or monogenic condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control data sets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project (1KGP) Oxford Nanopore Technologies Sequencing Consortium aims to generate LRS data from at least 800 of the 1KGP samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37× and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
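The read N50 quoted in the abstract above is a length-weighted summary: it is the read length L such that reads of length L or longer together contain at least half of all sequenced bases. A minimal sketch of that calculation follows; the function name `read_n50` and the plain list-of-integers input are assumptions for illustration, not part of the consortium's pipeline.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths (hypothetical helper).

    Reads are visited from longest to shortest; the N50 is the length of
    the read at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For example, `read_n50([2, 3, 4, 5, 6])` visits 6 and then 5, at which point 11 of the 20 total bases are covered, so the N50 is 5.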
Leveraging the T2T assembly to resolve rare and pathogenic inversions in reference genome gaps
(Cold Spring Harbor Laboratory Press, 2024) Saether, Kristine Bilgrav; Eisfeldt, Jesper; Bengtsson, Jesse D.; Lun, Ming Yin; Grochowski, Christopher M.; Mahmoud, Medhat; Chao, Hsiao-Tuan; Rosenfeld, Jill A.; Liu, Pengfei; Ek, Marlene; Schuy, Jakob; Ameur, Adam; Dai, Hongzheng; Undiagnosed Diseases Network; Hwang, James Paul; Sedlazeck, Fritz J.; Bi, Weimin; Marom, Ronit; Wincent, Josephine; Nordgren, Ann; Carvalho, Claudia M. B.; Lindstrand, Anna
Chromosomal inversions (INVs) are particularly challenging to detect due to their copy-number neutral state and association with repetitive regions. Inversions represent about 1/20 of all balanced structural chromosome aberrations and can lead to disease by gene disruption or altering regulatory regions of dosage-sensitive genes in cis. Short-read genome sequencing (srGS) can only resolve ∼70% of cytogenetically visible inversions referred to clinical diagnostic laboratories, likely due to breakpoints in repetitive regions. Here, we study 12 inversions by long-read genome sequencing (lrGS) (n = 9) or srGS (n = 3) and resolve nine of them. In four cases, the inversion breakpoint region was missing from at least one of the human reference genomes (GRCh37, GRCh38, T2T-CHM13) and a reference agnostic analysis was needed. One of these cases, an INV9 mappable only in de novo assembled lrGS data using T2T-CHM13 disrupts EHMT1 consistent with a Mendelian diagnosis (Kleefstra syndrome 1; MIM#610253). Next, by pairwise comparison between T2T-CHM13, GRCh37, and GRCh38, as well as the chimpanzee and bonobo, we show that hundreds of megabases of sequence are missing from at least one human reference, highlighting that primate genomes contribute to genomic diversity. Aligning population genomic data to these regions indicated that these regions are variable between individuals. Our analysis emphasizes that T2T-CHM13 is necessary to maximize the value of lrGS for optimal inversion detection in clinical diagnostics. These results highlight the importance of leveraging diverse and comprehensive reference genomes to resolve unsolved molecular cases in rare diseases.
The GIAB genomic stratifications resource for human reference genomes
(Springer Nature, 2024) Dwarshuis, Nathan; Kalra, Divya; McDaniel, Jennifer; Sanio, Philippe; Alvarez Jerez, Pilar; Jadhav, Bharati; Huang, Wenyu (Eddy); Mondal, Rajarshi; Busby, Ben; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Majidian, Sina; Zook, Justin M.
Despite the growing variety of sequencing and variant-calling tools, no workflow performs equally well across the entire human genome. Understanding context-dependent performance is critical for enabling researchers, clinicians, and developers to make informed tradeoffs when selecting sequencing hardware and software. Here we describe a set of “stratifications,” which are BED files that define distinct contexts throughout the genome. We define these for GRCh37/38 as well as the new T2T-CHM13 reference, adding many new hard-to-sequence regions which are critical for understanding performance as the field progresses. Specifically, we highlight the increase in hard-to-map and GC-rich stratifications in CHM13 relative to the previous references. We then compare the benchmarking performance with each reference and show the performance penalty brought about by these additional difficult regions in CHM13. Additionally, we demonstrate how the stratifications can track context-specific improvements over different platform iterations, using Oxford Nanopore Technologies as an example. The means to generate these stratifications are available as a snakemake pipeline at https://github.com/usnistgov/giab-stratifications. We anticipate this being useful in enabling precise risk-reward calculations when building sequencing pipelines for any of the commonly-used reference genomes.
Single-cell somatic copy number variants in brain using different amplification methods and reference genomes
(Springer Nature, 2024) Kalef-Ezra, Ester; Turan, Zeliha Gozde; Perez-Rodriguez, Diego; Bomann, Ida; Behera, Sairam; Morley, Caoimhe; Scholz, Sonja W.; Jaunmuktane, Zane; Demeulemeester, Jonas; Sedlazeck, Fritz J.; Proukakis, Christos
The presence of somatic mutations, including copy number variants (CNVs), in the brain is well recognized. Comprehensive study requires single-cell whole genome amplification, with several methods available, prior to sequencing. Here we compare PicoPLEX with two recent adaptations of multiple displacement amplification (MDA): primary template-directed amplification (PTA) and droplet MDA, across 93 human brain cortical nuclei. We demonstrate different properties for each, with PTA providing the broadest amplification, PicoPLEX the most even, and distinct chimeric profiles. Furthermore, we perform CNV calling on two brains with multiple system atrophy and one control brain using different reference genomes. We find that 20.6% of brain cells have at least one Mb-scale CNV, with some supported by bulk sequencing or single-cells from other brain regions. Our study highlights the importance of selecting whole genome amplification method and reference genome for CNV calling, while supporting the existence of somatic CNVs in healthy and diseased human brain.
StratoMod: predicting sequencing and variant calling errors with interpretable machine learning
(Springer Nature, 2024) Dwarshuis, Nathan; Tonner, Peter; Olson, Nathan D.; Sedlazeck, Fritz J.; Wagner, Justin; Zook, Justin M.
Despite the variety in sequencing platforms, mappers, and variant callers, no single pipeline is optimal across the entire human genome. Therefore, developers, clinicians, and researchers need to make tradeoffs when designing pipelines for their application. Currently, assessing such tradeoffs relies on intuition about how a certain pipeline will perform in a given genomic context. We present StratoMod, which addresses this problem using an interpretable machine-learning classifier to predict germline variant calling errors in a data-driven manner. We show StratoMod can precisely predict recall using HiFi or Illumina and leverage StratoMod’s interpretability to measure contributions from difficult-to-map and homopolymer regions for each respective outcome. Furthermore, we use StratoMod to assess the effect of mismapping on predicted recall using linear vs. graph-based references, and identify the hard-to-map regions where graph-based methods excelled and by how much. For these we utilize our draft benchmark based on the Q100 HG002 assembly, which contains previously inaccessible difficult regions. Furthermore, StratoMod presents a new method of predicting clinically relevant variants likely to be missed, which is an improvement over current pipelines, which only filter variants likely to be false. We anticipate this being useful for performing precise risk-reward analyses when designing variant calling pipelines.
A best-match approach for gene set analyses in embedding spaces
(Cold Spring Harbor Laboratory Press, 2024) Li, Lechuan; Dannenfelser, Ruth; Cruz, Charlie; Yao, Vicky
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein–protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
Association of education attainment, smoking status, and alcohol use disorder with dementia risk in older adults: a longitudinal observational study
(Springer Nature, 2024) Tang, Huilin; Shaaban, C. Elizabeth; DeKosky, Steven T.; Smith, Glenn E.; Hu, Xia; Jaffee, Michael; Salloum, Ramzi G.; Bian, Jiang; Guo, Jingchuan
Previous research on the risk of dementia associated with education attainment, smoking status, and alcohol use disorder (AUD) has yielded inconsistent results, indicating potential heterogeneous treatment effects (HTEs) of these factors on dementia risk. Thus, this study aimed to identify the important variables that may contribute to HTEs of these factors in older adults.
Sampling-Based Motion Planning: A Comparative Review
(Annual Reviews, 2024) Orthey, Andreas; Chamzas, Constantinos; Kavraki, Lydia E.
Sampling-based motion planning is one of the fundamental paradigms to generate robot motions, and a cornerstone of robotics research. This comparative review provides an up-to-date guide and reference manual for the use of sampling-based motion planning algorithms. It includes a history of motion planning, an overview of the most successful planners, and a discussion of their properties. It also shows how planners can handle special cases and how extensions of motion planning can be accommodated. To put sampling-based motion planning into a larger context, a discussion of alternative motion generation frameworks highlights their respective differences from sampling-based motion planning. Finally, a set of sampling-based motion planners are compared on 24 challenging planning problems in order to provide insights into which planners perform well in which situations and where future research would be required. This comparative review thereby provides not only a useful reference manual for researchers in the field but also a guide for practitioners to make informed algorithmic decisions.
Singly exponential translation of alternating weak Büchi automata to unambiguous Büchi automata
(Elsevier, 2024) Li, Yong; Schewe, Sven; Vardi, Moshe Y.
We introduce a method for translating an alternating weak Büchi automaton (AWA), which corresponds to a Linear Dynamic Logic (LDL) formula, to an unambiguous Büchi automaton (UBA). Our translations generalize constructions for Linear Temporal Logic (LTL), a less expressive specification language than LDL. In classical constructions, LTL formulas are first translated to alternating very weak Büchi automata (AVAs), that is, automata that have only singleton strongly connected components (SCCs); these AVAs are then handled by efficient disambiguation procedures. However, general AWAs can have larger SCCs, which complicates disambiguation. Currently, the only available disambiguation procedure has to go through an intermediate construction of nondeterministic Büchi automata (NBAs), which would incur an exponential blow-up of its own. We introduce a translation from general AWAs to UBAs with a singly exponential blow-up, which also immediately provides a singly exponential translation from LDL to UBAs. Interestingly, the complexity of our translation is smaller than the best known disambiguation algorithm for NBAs (broadly (0.53n)^n vs. (0.76n)^n), while the input of our construction can be exponentially more succinct.
Machine Learning to Enhance Electronic Detection of Diagnostic Errors
(American Medical Association, 2024) Zimolzak, Andrew J.; Wei, Li; Mir, Usman; Gupta, Ashish; Vaghani, Viralkumar; Subramanian, Devika; Singh, Hardeep
Impact and characterization of serial structural variations across humans and great apes
(Springer Nature, 2024) Höps, Wolfram; Rausch, Tobias; Jendrusch, Michael; Korbel, Jan O.; Sedlazeck, Fritz J.
Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals (https://github.com/WHops/NAHRwhals), a method to infer repeat-mediated series of SVs in long-read genomic assemblies. Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.
Reference-free structural variant detection in microbiomes via long-read co-assembly graphs
(Oxford University Press, 2024) Curry, Kristen D; Yu, Feiqiao Brian; Vance, Summer E; Segarra, Santiago; Bhaya, Devaki; Chikhi, Rayan; Rocha, Eduardo P C; Treangen, Todd J
Motivation: The study of bacterial genome dynamics is vital for understanding the mechanisms underlying microbial adaptation, growth, and their impact on host phenotype. Structural variants (SVs), genomic alterations of 50 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) by encompassing all metagenomic samples in a series (time or other metric) into a single co-assembly graph. The log fold change in graph coverage between successive samples is then calculated to call SVs that are thriving or declining. Results: We show rhea to outperform existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, particularly as the simulated reads diverge from reference genomes and an increase in strain diversity is incorporated. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between successive time and temperature samples, suggesting host advantage. Our approach leverages previous work in assembly graph structural and coverage patterns to provide versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial gene flux. Availability and implementation: rhea is open source and available at: https://github.com/treangenlab/rhea.
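To make the coverage comparison in the abstract above concrete, here is a minimal sketch of a log fold change between successive samples for a single graph element. It is not rhea's implementation: the function names, the base-2 logarithm, the `pseudocount` that guards against zero coverage, and the `threshold` used to label a change are all assumptions made for illustration.

```python
import math


def log_fold_changes(coverage, pseudocount=1.0):
    """Log2 fold change between each pair of successive samples.

    `coverage` is the coverage of one graph element across an ordered
    series of samples. The pseudocount is an assumed safeguard so that
    zero coverage does not produce a division by zero or log of zero.
    """
    return [
        math.log2((later + pseudocount) / (earlier + pseudocount))
        for earlier, later in zip(coverage, coverage[1:])
    ]


def label_change(lfc, threshold=1.0):
    """Label one log fold change; the threshold is an assumed cutoff."""
    if lfc >= threshold:
        return "thriving"
    if lfc <= -threshold:
        return "declining"
    return "stable"
```

With coverages `[3.0, 7.0, 1.0]` and the default pseudocount, the two successive changes are `log2(8/4)` and `log2(2/8)`, a doubling followed by a fourfold drop.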
Profiling complex repeat expansions in RFC1 in Parkinson’s disease
(Springer Nature, 2024) Alvarez Jerez, Pilar; Daida, Kensuke; Miano-Burkhardt, Abigail; Iwaki, Hirotaka; Malik, Laksh; Cogan, Guillaume; Makarious, Mary B.; Sullivan, Roisin; Vandrovcova, Jana; Ding, Jinhui; Gibbs, J. Raphael; Markham, Androo; Nalls, Mike A.; Kesharwani, Rupesh K.; Sedlazeck, Fritz J.; Casey, Bradford; Hardy, John; Houlden, Henry; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
A biallelic (AAGGG) expansion in the poly(A) tail of an AluSx3 transposable element within the gene RFC1 is a frequent cause of cerebellar ataxia, neuropathy, vestibular areflexia syndrome (CANVAS), and more recently, has been reported as a rare cause of Parkinson’s disease (PD) in the Finnish population. Here, we investigate the prevalence of RFC1 (AAGGG) expansions in PD patients of non-Finnish European ancestry in 1609 individuals from the Parkinson’s Progression Markers Initiative study. We identified four PD patients carrying the biallelic RFC1 (AAGGG) expansion and did not identify any carriers in controls.
A Machine Learning Model for Risk Stratification of Postdiagnosis Diabetic Ketoacidosis Hospitalization in Pediatric Type 1 Diabetes: Retrospective Study
(JMIR, 2024) Subramanian, Devika; Sonabend, Rona; Singh, Ila
Background: Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Despite multiple risk factors for postdiagnosis DKA, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D. Objective: We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time-series of electronic health record (EHR) data. Methods: We conducted a retrospective case-control study using EHR data from 1787 patients from among 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018. We trained a state-of-the-art, explainable, gradient-boosted ensemble (XGBoost) of decision trees with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model’s predictive performance using the area under the receiver operating characteristic curve, weighted F1-score, and weighted precision and recall, in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions. Results: Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001). It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F1-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05), respectively, using a relatively short history of data from routine clinic follow-ups post diagnosis. On analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA both at the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA. Conclusions: We have built an explainable, predictive, machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels. Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. The clinical importance of our work is that the model can predict patients most at risk for postdiagnosis DKA and identify preventive interventions based on mitigation of individualized risk factors.
Detection of diffusely abnormal white matter in multiple sclerosis on multiparametric brain MRI using semi-supervised deep learning
(Springer Nature, 2024) Musall, Benjamin C.; Gabr, Refaat E.; Yang, Yanyu; Kamali, Arash; Lincoln, John A.; Jacobs, Michael A.; Ly, Vi; Luo, Xi; Wolinsky, Jerry S.; Narayana, Ponnada A.; Hasan, Khader M.
In addition to focal lesions, diffusely abnormal white matter (DAWM) is seen on brain MRI of multiple sclerosis (MS) patients and may represent early or distinct disease processes. The role of MRI-observed DAWM is understudied due to a lack of automated assessment methods. Supervised deep learning (DL) methods are highly capable in this domain, but require large sets of labeled data. To overcome this challenge, a DL-based network (DAWM-Net) was trained using semi-supervised learning on a limited set of labeled data for segmentation of DAWM, focal lesions, and normal-appearing brain tissues on multiparametric MRI. DAWM-Net segmentation performance was compared to a previous intensity thresholding-based method on an independent test set from expert consensus (N = 25). Segmentation overlap by Dice Similarity Coefficient (DSC) and Spearman correlation of DAWM volumes were assessed. DAWM-Net showed DSC > 0.93 for normal-appearing brain tissues and DSC > 0.81 for focal lesions. For DAWM-Net, the DAWM DSC was 0.49 ± 0.12 with a moderate volume correlation (ρ = 0.52, p < 0.01). The previous method showed lower DAWM DSC of 0.26 ± 0.08 and lacked a significant volume correlation (ρ = 0.23, p = 0.27). These results demonstrate the feasibility of DL-based DAWM auto-segmentation with semi-supervised learning. This tool may facilitate future investigation of the role of DAWM in MS.
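The Dice Similarity Coefficient used in the abstract above measures overlap between a predicted mask and a reference mask as twice the number of shared positive voxels divided by the total number of positive voxels in both. A minimal sketch on flattened binary masks follows; the function name `dice_coefficient`, the flat-list input, and the convention of returning 1.0 when both masks are empty are assumptions for illustration and are not taken from the DAWM-Net evaluation code.

```python
def dice_coefficient(predicted, reference):
    """Dice overlap of two equal-length binary masks (hypothetical helper).

    Each mask is a flat sequence of truthy/falsy voxel labels.
    Returns 2 * |A and B| / (|A| + |B|), or 1.0 if both masks are empty
    (an assumed convention; some tools report this case as undefined).
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    size = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if size == 0:
        return 1.0
    return 2.0 * overlap / size
```

For instance, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one positive voxel out of four positives in total, giving a coefficient of 0.5.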
Characterizing a complex CT-rich haplotype in intron 4 of SNCA using large-scale targeted amplicon long-read sequencing
(Springer Nature, 2024) Alvarez Jerez, Pilar; Daida, Kensuke; Grenn, Francis P.; Malik, Laksh; Miano-Burkhardt, Abigail; Makarious, Mary B.; Ding, Jinhui; Gibbs, J. Raphael; Moore, Anni; Reed, Xylena; Nalls, Mike A.; Shah, Syed; Mahmoud, Medhat; Sedlazeck, Fritz J.; Dolzhenko, Egor; Park, Morgan; Iwaki, Hirotaka; Casey, Bradford; Ryten, Mina; Blauwendraat, Cornelis; Singleton, Andrew B.; Billingsley, Kimberley J.
Parkinson’s disease (PD) is a common neurodegenerative disorder with a significant risk proportion driven by genetics. While much progress has been made, most of the heritability remains unknown. This is in part because previous genetic studies have focused on the contribution of single nucleotide variants. More complex forms of variation, such as structural variants and tandem repeats, are already associated with several synucleinopathies. However, because more sophisticated sequencing methods are usually required to detect these regions, little is understood regarding their contribution to PD. One example is a polymorphic CT-rich region in intron 4 of the SNCA gene. This haplotype has been suggested to be associated with risk of Lewy Body (LB) pathology in Alzheimer’s Disease and SNCA gene expression, but is yet to be investigated in PD. Here, we attempt to resolve this CT-rich haplotype and investigate its role in PD. We performed targeted PacBio HiFi sequencing of the region in 1375 PD cases and 959 controls. We replicate the previously reported associations and a novel association between two PD risk SNVs (rs356182 and rs5019538) and haplotype 4, the largest haplotype. Through quantitative trait locus analyses we identify a significant haplotype 4 association with alternative CAGE transcriptional start site usage, not leading to significant differential SNCA gene expression in post-mortem frontal cortex brain tissue. Therefore, disease association in this locus might not be biologically driven by this CT-rich repeat region. Our data demonstrate the complexity of this SNCA region and highlight that further follow-up functional studies are warranted.
Inverted triplications formed by iterative template switches generate structural variant diversity at genomic disorder loci
(Elsevier, 2024) Grochowski, Christopher M.; Bengtsson, Jesse D.; Du, Haowei; Gandhi, Mira; Lun, Ming Yin; Mehaffey, Michele G.; Park, KyungHee; Höps, Wolfram; Benito, Eva; Hasenfeld, Patrick; Korbel, Jan O.; Mahmoud, Medhat; Paulin, Luis F.; Jhangiani, Shalini N.; Hwang, James Paul; Bhamidipati, Sravya V.; Muzny, Donna M.; Fatih, Jawid M.; Gibbs, Richard A.; Pendleton, Matthew; Harrington, Eoghan; Juul, Sissel; Lindstrand, Anna; Sedlazeck, Fritz J.; Pehlivan, Davut; Lupski, James R.; Carvalho, Claudia M. B.
The duplication-triplication/inverted-duplication (DUP-TRP/INV-DUP) structure is a complex genomic rearrangement (CGR). Although it has been identified as an important pathogenic DNA mutation signature in genomic disorders and cancer genomes, its architecture remains unresolved. Here, we studied the genomic architecture of DUP-TRP/INV-DUP by investigating the DNA of 24 patients identified by array comparative genomic hybridization (aCGH) on whom we found evidence for the existence of 4 out of 4 predicted structural variant (SV) haplotypes. Using a combination of short-read genome sequencing (GS), long-read GS, optical genome mapping, and single-cell DNA template strand sequencing (strand-seq), the haplotype structure was resolved in 18 samples. The point of template switching in 4 samples was shown to be a segment of ∼2.2–5.5 kb of 100% nucleotide similarity within inverted repeat pairs. These data provide experimental evidence that inverted low-copy repeats act as recombinant substrates. This type of CGR can result in multiple conformers generating diverse SV haplotypes in susceptible dosage-sensitive loci.
Depletion of lamins B1 and B2 promotes chromatin mobility and induces differential gene expression by a mesoscale-motion-dependent mechanism
(Springer Nature, 2024) Pujadas Liwag, Emily M.; Wei, Xiaolong; Acosta, Nicolas; Carter, Lucas M.; Yang, Jiekun; Almassalha, Luay M.; Jain, Surbhi; Daneshkhah, Ali; Rao, Suhas S. P.; Seker-Polat, Fidan; MacQuarrie, Kyle L.; Ibarra, Joe; Agrawal, Vasundhara; Aiden, Erez Lieberman; Kanemaki, Masato T.; Backman, Vadim; Adli, Mazhar; Center for Theoretical Biological Physics
B-type lamins are critical nuclear envelope proteins that interact with the three-dimensional genomic architecture. However, identifying the direct roles of B-lamins on dynamic genome organization has been challenging as their joint depletion severely impacts cell viability. To overcome this, we engineered mammalian cells to rapidly and completely degrade endogenous B-type lamins using Auxin-inducible degron technology.
