Computer Science Publications

Recent Submissions

  • Item
    Differentially Private Medians and Interior Points for Non-Pathological Data
    (Schloss Dagstuhl - Leibniz Center for Informatics, 2024) Aliakbarpour, Maryam; Silver, Rose; Steinke, Thomas; Ullman, Jonathan
    We construct sample-efficient differentially private estimators for the approximate-median and interior-point problems that can be applied to arbitrary input distributions over ℝ satisfying very mild statistical assumptions. Our results stand in contrast to the surprising negative result of Bun et al. (FOCS 2015), which showed that private estimators with finite sample complexity cannot produce interior points on arbitrary distributions.
  • Item
    Automatic Active Lesion Tracking in Multiple Sclerosis Using Unsupervised Machine Learning
    (MDPI, 2024) Uwaeze, Jason; Narayana, Ponnada A.; Kamali, Arash; Braverman, Vladimir; Jacobs, Michael A.; Akhbardeh, Alireza
    Background: Identifying active lesions in magnetic resonance imaging (MRI) is crucial for the diagnosis and treatment planning of multiple sclerosis (MS). Active lesions on MRI are identified following the administration of Gadolinium-based contrast agents (GBCAs). However, recent studies have reported that repeated administration of GBCA results in the accumulation of Gd in tissues. In addition, GBCA administration increases health care costs. Thus, reducing or eliminating GBCA administration for active lesion detection is important for improved patient safety and reduced healthcare costs. Current state-of-the-art methods for identifying active lesions in brain MRI without GBCA administration utilize data-intensive deep learning methods. Objective: To implement nonlinear dimensionality reduction (NLDR) methods, locally linear embedding (LLE) and isometric feature mapping (Isomap), which are less data-intensive, for automatically identifying active lesions on brain MRI in MS patients, without the administration of contrast agents. Materials and Methods: Fluid-attenuated inversion recovery (FLAIR), T2-weighted, proton density-weighted, and pre- and post-contrast T1-weighted images were included in the multiparametric MRI dataset used in this study. Subtracted pre- and post-contrast T1-weighted images were labeled by experts as active lesions (ground truth). Unsupervised methods, LLE and Isomap, were used to reconstruct multiparametric brain MR images into a single embedded image. Active lesions were identified on the embedded images and compared with ground truth lesions. The performance of NLDR methods was evaluated by calculating the Dice similarity (DS) index between the observed and identified active lesions in embedded images. Results: LLE and Isomap were applied to 40 MS patients, achieving median DS scores of 0.74 ± 0.1 and 0.78 ± 0.09, respectively, outperforming current state-of-the-art methods. Conclusions: NLDR methods, Isomap and LLE, are viable options for the identification of active MS lesions on non-contrast images, and potentially could be used as a clinical decision tool.
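    The Dice similarity index named in this abstract can be illustrated with a minimal sketch. The function name `dice_similarity`, the flat 0/1 mask representation, and the toy masks are assumptions made for illustration; they are not taken from the paper, which works on embedded MR images.

```python
def dice_similarity(mask_a, mask_b):
    """Dice index 2|A ∩ B| / (|A| + |B|) for two equally sized binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same length")
    a = [bool(v) for v in mask_a]
    b = [bool(v) for v in mask_b]
    # Count positions marked as lesion in both masks.
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total


# Hypothetical flattened masks: 1 marks a lesion voxel, 0 marks background.
ground_truth = [0, 1, 1, 1, 0, 0]
predicted = [0, 0, 1, 1, 1, 0]
score = dice_similarity(ground_truth, predicted)
print(f"Dice similarity: {score:.3f}")
```

    A score of 1 means the two masks coincide exactly, and 0 means they share no lesion voxels.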
  • Item
    Exploring the Relation between Contextual Social Determinants of Health and COVID-19 Occurrence and Hospitalization
    (MDPI, 2024) Chen, Aokun; Zhao, Yunpeng; Zheng, Yi; Hu, Hui; Hu, Xia; Fishe, Jennifer N.; Hogan, William R.; Shenkman, Elizabeth A.; Guo, Yi; Bian, Jiang
    It is prudent to take a unified approach to exploring how contextual social determinants of health (SDoH) relate to COVID-19 occurrence and outcomes. Data with poor geographic representation and the small number of contextual SDoH examined in most previous research studies have left a knowledge gap in the relationships between contextual SDoH and COVID-19 outcomes. In this study, we linked 199 contextual SDoH factors covering 11 domains of social and built environments with electronic health records (EHRs) from a large clinical research network (CRN) in the National Patient-Centered Clinical Research Network (PCORnet) to explore the relation between contextual SDoH and COVID-19 occurrence and hospitalization. We identified 15,890 COVID-19 patients and 63,560 matched non-COVID-19 patients in Florida between January 2020 and May 2021. We adopted a two-phase multiple linear regression approach modified from that in the exposome-wide association (ExWAS) study. After removing the highly correlated SDoH variables, 86 contextual SDoH variables were included in the data analysis. Adjusting for race, ethnicity, and comorbidities, we found six contextual SDoH variables (i.e., hospital available beds and utilization, percent of vacant property, number of golf courses, and percent of minority) related to the occurrence of COVID-19, and three variables (i.e., farmers market, low access, and religion) related to the hospitalization of COVID-19. To the best of our knowledge, this is the first study to explore the relationship between contextual SDoH and COVID-19 occurrence and hospitalization using EHRs in a major PCORnet CRN. Because this is an exploratory study, the causal effect of SDoH on COVID-19 outcomes will be evaluated in future studies.
  • Item
    A scientific machine learning framework to understand flash graphene synthesis
    (Royal Society of Chemistry, 2023) Sattari, Kianoosh; Eddy, Lucas; Beckham, Jacob L.; Wyss, Kevin M.; Byfield, Richard; Qian, Long; Tour, James M.; Lin, Jian; NanoCarbon Center; Welch Institute for Advanced Materials
    Flash Joule heating (FJH) is a far-from-equilibrium (FFE) processing method for converting low-value carbon-based materials to flash graphene (FG). Despite its promise for scalability and performance, attempts to explore the reaction mechanism have been limited due to the complexities involved in the FFE process. Data-driven machine learning (ML) models effectively account for the complexities, but the model training requires a considerable amount of experimental data. To tackle this challenge, we constructed a scientific ML (SML) framework trained by using both direct processing variables and indirect, physics-informed variables to predict the FG yield. The indirect variables include current-derived features (final current, maximum current, and charge density) predicted from the proxy ML models and reaction temperatures simulated from multi-physics modeling. With the combined indirect features, the final ML model achieves an average R² score of 0.81 ± 0.05 and an average RMSE of 12.1% ± 2.0% in predicting the FG yield, which is significantly better than the model trained without them (R² of 0.73 ± 0.05 and an RMSE of 14.3% ± 2.0%). Feature importance analysis validates the key roles of these indirect features in determining the reaction outcome. These results illustrate the promise of this SML to elucidate FFE material synthesis outcomes, thus paving a new avenue to processing other datasets from the materials systems involving the same or different FFE processes.
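    The two regression metrics quoted in this abstract can be computed as in the sketch below. The helper names and the yield values are hypothetical and chosen only to show the arithmetic; they are not data from the paper.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical measured and predicted yields, in percent.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
r2 = r2_score(measured, predicted)
error = rmse(measured, predicted)
print(f"R^2 = {r2:.3f}, RMSE = {error:.2f}%")
```

    A higher R² and a lower RMSE both indicate a better fit, which is why the two numbers move in opposite directions when a model improves.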
  • Item
    Joint embedding of biological networks for cross-species functional alignment
    (Oxford University Press, 2023) Li, Lechuan; Dannenfelser, Ruth; Zhu, Yu; Hejduk, Nathaniel; Segarra, Santiago; Yao, Vicky
    Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein–protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem. We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within and between species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA’s embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies. https://github.com/ylaboratory/ETNA
  • Item
    Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
    (Oxford University Press, 2023) Kille, Bryce; Garrison, Erik; Treangen, Todd J; Phillippy, Adam M
    The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. MashMap3 is available at https://github.com/marbl/MashMap.
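    For orientation, the quantity that the abstract's sketching schemes estimate, the exact Jaccard similarity between two k-mer sets, can be computed directly on short sequences. This is a plain illustration only: it does not implement minimizer or minmer winnowing and is not MashMap code, and the function names and sequences are made up for the example.

```python
def kmer_set(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(set_a, set_b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    union = set_a | set_b
    # Convention assumed here: two empty sets are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


# Two hypothetical sequences differing by a single substitution.
seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"
similarity = jaccard(kmer_set(seq_a, 3), kmer_set(seq_b, 3))
print(f"Jaccard similarity on 3-mers: {similarity:.3f}")
```

    Computing this exactly requires every k-mer of both sequences, which is the cost that sampling schemes such as minimizers and minmers are designed to avoid.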
  • Item
    Supervised convex clustering
    (Wiley, 2023) Wang, Minjie; Yao, Tianyi; Allen, Genevera I.
    Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.
  • Item
    Real-time, deep-learning aided lensless microscope
    (Optica Publishing Group, 2023) Wu, Jimin; Boominathan, Vivek; Veeraraghavan, Ashok; Robinson, Jacob T.
    Traditional miniaturized fluorescence microscopes are critical tools for modern biology. Invariably, they struggle to simultaneously image with a high spatial resolution and a large field of view (FOV). Lensless microscopes offer a solution to this limitation. However, real-time visualization of samples is not possible with lensless imaging, as image reconstruction can take minutes to complete. This poses a challenge for usability, as real-time visualization is a crucial feature that assists users in identifying and locating the imaging target. The issue is particularly pronounced in lensless microscopes that operate at close imaging distances. Imaging at close distances requires shift-varying deconvolution to account for the variation of the point spread function (PSF) across the FOV. Here, we present a lensless microscope that achieves real-time image reconstruction by eliminating the use of an iterative reconstruction algorithm. The neural network-based reconstruction method we present achieves a more than 10,000-fold increase in reconstruction speed compared to iterative reconstruction. The increased reconstruction speed allows us to visualize the results of our lensless microscope at more than 25 frames per second (fps), while achieving better than 7 µm resolution over a FOV of 10 mm². This ability to reconstruct and visualize samples in real time enables a more user-friendly interaction with lensless microscopes. Users are able to operate these microscopes much as they currently do conventional microscopes.
  • Item
    An automated respiratory data pipeline for waveform characteristic analysis
    (Wiley, 2023) Lusk, Savannah; Ward, Christopher S.; Chang, Andersen; Twitchell-Heyne, Avery; Fattig, Shaun; Allen, Genevera; Jankowsky, Joanna L.; Ray, Russell S.
    Comprehensive and accurate analysis of respiratory and metabolic data is crucial to modelling congenital, pathogenic and degenerative diseases converging on autonomic control failure. A lack of tools for high-throughput analysis of respiratory datasets remains a major challenge. We present Breathe Easy, a novel open-source pipeline for processing raw recordings and associated metadata into operative outcomes, publication-worthy graphs and robust statistical analyses including QQ and residual plots for assumption queries and data transformations. This pipeline uses a facile graphical user interface for uploading data files, setting waveform feature thresholds and defining experimental variables. Breathe Easy was validated against manual selection by experts, which represents the current standard in the field. We demonstrate Breathe Easy's utility by examining a 2-year longitudinal study of an Alzheimer's disease mouse model to assess contributions of forebrain pathology in disordered breathing. Whole body plethysmography has become an important experimental outcome measure for a variety of diseases with primary and secondary respiratory indications. Respiratory dysfunction, while not an initial symptom in many of these disorders, often drives disability or death in patient outcomes. Breathe Easy provides an open-source respiratory analysis tool for all respiratory datasets and represents a necessary improvement upon current analytical methods in the field.
    Key points: Respiratory dysfunction is a common endpoint for disability and mortality in many disorders throughout life. Whole body plethysmography in rodents represents a high face-value method for measuring respiratory outcomes in rodent models of these diseases and disorders. Analysis of key respiratory variables remains hindered by manual annotation and analysis that leads to low throughput results that often exclude a majority of the recorded data. Here we present a software suite, Breathe Easy, that automates the process of data selection from raw recordings derived from plethysmography experiments and the analysis of these data into operative outcomes and publication-worthy graphs with statistics. We validate Breathe Easy with a terabyte-scale Alzheimer's dataset that examines the effects of forebrain pathology on respiratory function over 2 years of degeneration.
  • Item
    Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes
    (Springer Nature, 2023) Chin, Chen-Shan; Behera, Sairam; Khalak, Asif; Sedlazeck, Fritz J.; Sudmant, Peter H.; Wagner, Justin; Zook, Justin M.
    Advancements in sequencing technologies and assembly methods enable the regular production of high-quality genome assemblies characterizing complex regions. However, challenges remain in efficiently interpreting variation at various scales, from smaller tandem repeats to megabase rearrangements, across many human genomes. We present a PanGenome Research Tool Kit (PGR-TK) enabling analyses of complex pangenome structural and haplotype variation at multiple scales. We apply the graph decomposition methods in PGR-TK to the class II major histocompatibility complex demonstrating the importance of the human pangenome for analyzing complicated regions. Moreover, we investigate the Y-chromosome genes, DAZ1/DAZ2/DAZ3/DAZ4, of which structural variants have been linked to male infertility, and X-chromosome genes OPN1LW and OPN1MW linked to eye disorders. We further showcase PGR-TK across 395 complex repetitive medically important genes. This highlights the power of PGR-TK to resolve complex variation in regions of the genome that were previously too complex to analyze.
  • Item
    Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree
    (Springer Nature, 2023) Dylus, David; Altenhoff, Adrian; Majidian, Sina; Sedlazeck, Fritz J.; Dessimoz, Christophe
    Current methods for inference of phylogenetic trees require running complex pipelines at substantial computational and labor costs, with additional constraints in sequencing coverage, assembly and annotation quality, especially for large datasets. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy. In a benchmark encompassing a broad variety of datasets, Read2Tree is 10–100 times faster than assembly-based approaches and in most cases more accurate—the exception being when sequencing coverage is high and reference species very distant. Here, to illustrate the broad applicability of the tool, we reconstruct a yeast tree of life of 435 species spanning 590 million years of evolution. We also apply Read2Tree to >10,000 Coronaviridae samples, accurately classifying highly diverse animal samples and near-identical severe acute respiratory syndrome coronavirus 2 sequences on a single tree. The speed, accuracy and versatility of Read2Tree enable comparative genomics at scale.
  • Item
    Genomic variant benchmark: if you cannot measure it, you cannot improve it
    (Springer Nature, 2023) Majidian, Sina; Agustinho, Daniel Paiva; Chin, Chen-Shan; Sedlazeck, Fritz J.; Mahmoud, Medhat
    Genomic benchmark datasets are essential to driving the field of genomics and bioinformatics. They provide a snapshot of the performances of sequencing technologies and analytical methods and highlight future challenges. However, they depend on sequencing technology, reference genome, and available benchmarking methods. Thus, creating a genomic benchmark dataset is laborious and highly challenging, often involving multiple sequencing technologies, different variant calling tools, and laborious manual curation. In this review, we discuss the available benchmark datasets and their utility. Additionally, we focus on the most recent benchmark of genes with medical relevance and challenging genomic complexity.
  • Item
    Charge-based interactions through peptide position 4 drive diversity of antigen presentation by human leukocyte antigen class I molecules
    (Oxford University Press, 2022) Jackson, Kyle R; Antunes, Dinler A; Talukder, Amjad H; Maleki, Ariana R; Amagai, Kano; Salmon, Avery; Katailiha, Arjun S; Chiu, Yulun; Fasoulis, Romanos; Rigo, Maurício Menegatti; Abella, Jayvee R; Melendez, Brenda D; Li, Fenge; Sun, Yimo; Sonnemann, Heather M; Belousov, Vladislav; Frenkel, Felix; Justesen, Sune; Makaju, Aman; Liu, Yang; Horn, David; Lopez-Ferrer, Daniel; Huhmer, Andreas F; Hwu, Patrick; Roszik, Jason; Hawke, David; Kavraki, Lydia E; Lizée, Gregory
    Human leukocyte antigen class I (HLA-I) molecules bind and present peptides at the cell surface to facilitate the induction of appropriate CD8+ T cell-mediated immune responses to pathogen- and self-derived proteins. The HLA-I peptide-binding cleft contains dominant anchor sites in the B and F pockets that interact primarily with amino acids at peptide position 2 and the C-terminus, respectively. Nonpocket peptide–HLA interactions also contribute to peptide binding and stability, but these secondary interactions are thought to be unique to individual HLA allotypes or to specific peptide antigens. Here, we show that two positively charged residues located near the top of the peptide-binding cleft facilitate interactions with negatively charged residues at position 4 of presented peptides, which occur at elevated frequencies across most HLA-I allotypes. Loss of these interactions was shown to impair HLA-I/peptide binding and complex stability, as demonstrated by both in vitro and in silico experiments. Furthermore, mutation of these Arginine-65 (R65) and/or Lysine-66 (K66) residues in HLA-A*02:01 and A*24:02 significantly reduced HLA-I cell surface expression while also reducing the diversity of the presented peptide repertoire by up to 5-fold. The impact of the R65 mutation demonstrates that nonpocket HLA-I/peptide interactions can constitute anchor motifs that exert an unexpectedly broad influence on HLA-I-mediated antigen presentation. These findings provide fundamental insights into peptide antigen binding that could broadly inform epitope discovery in the context of viral vaccine development and cancer immunotherapy.
  • Item
    Birational Quadratic Planar Maps with Generalized Complex Rational Representations
    (MDPI, 2023) Wang, Xuhui; Han, Yuhao; Ni, Qian; Li, Rui; Goldman, Ron
    Complex rational maps have been used to construct birational quadratic maps based on two special syzygies of degree one. Similar to complex rational curves, rational curves over generalized complex numbers have also been constructed by substituting the imaginary unit with a new independent quantity. We first establish the relationship between degree one, generalized, complex rational Bézier curves and quadratic rational Bézier curves. Then we provide conditions to determine when a quadratic rational planar map has a generalized complex rational representation. Thus, a rational quadratic planar map can be made birational by suitably choosing the middle Bézier control points and their corresponding weights. In contrast to the edges of complex rational maps of degree one, which are circular arcs, the edges of the planar maps can be generalized to hyperbolic and parabolic arcs by invoking the hyperbolic and parabolic numbers.
  • Item
    Stratification of Pediatric COVID-19 Cases Using Inflammatory Biomarker Profiling and Machine Learning
    (MDPI, 2023) Subramanian, Devika; Vittala, Aadith; Chen, Xinpu; Julien, Christopher; Acosta, Sebastian; Rusin, Craig; Allen, Carl; Rider, Nicholas; Starosolski, Zbigniew; Annapragada, Ananth; Devaraj, Sridevi
    While pediatric COVID-19 is rarely severe, a small fraction of children infected with SARS-CoV-2 go on to develop multisystem inflammatory syndrome (MIS-C), with substantial morbidity. An objective method with high specificity and high sensitivity to identify current or imminent MIS-C in children infected with SARS-CoV-2 is highly desirable. The aim was to learn about an interpretable novel cytokine/chemokine assay panel providing such an objective classification. This retrospective study was conducted on four groups of pediatric patients seen at multiple sites of Texas Children’s Hospital, Houston, TX who consented to provide blood samples to our COVID-19 Biorepository. Standard laboratory markers of inflammation and a novel cytokine/chemokine array were measured in blood samples of all patients. Group 1 consisted of 72 COVID-19, 70 MIS-C and 63 uninfected control patients seen between May 2020 and January 2021 and predominantly infected with pre-alpha variants. Group 2 consisted of 29 COVID-19 and 43 MIS-C patients seen between January and May 2021 infected predominantly with the alpha variant. Group 3 consisted of 30 COVID-19 and 32 MIS-C patients seen between August and October 2021 infected with alpha and/or delta variants. Group 4 consisted of 20 COVID-19 and 46 MIS-C patients seen between October 2021 and January 2022 infected with delta and/or omicron variants. Group 1 was used to train an L1-regularized logistic regression model which was tested using five-fold cross validation, and then separately validated against the remaining naïve groups. The area under receiver operating curve (AUROC) and F1-score were used to quantify the performance of the cytokine/chemokine assay-based classifier. Standard laboratory markers predict MIS-C with a five-fold cross-validated AUROC of 0.86 ± 0.05 and an F1 score of 0.78 ± 0.07, while the cytokine/chemokine panel predicted MIS-C with a five-fold cross-validated AUROC of 0.95 ± 0.02 and an F1 score of 0.91 ± 0.04, with only sixteen of the forty-five cytokines/chemokines sufficient to achieve this performance. Tested on Group 2 the cytokine/chemokine panel yielded AUROC = 0.98 and F1 = 0.93, on Group 3 it yielded AUROC = 0.89 and F1 = 0.89, and on Group 4 AUROC = 0.99 and F1 = 0.97. Adding standard laboratory markers to the cytokine/chemokine panel did not improve performance. A top-10 subset of these 16 cytokines achieves equivalent performance on the validation data sets. Our findings demonstrate that a sixteen-cytokine/chemokine panel, as well as the top-ten subset, provides a highly sensitive and specific method to identify MIS-C in patients infected with SARS-CoV-2 of all the major variants identified to date.
  • Item
    A deep learning solution for crystallographic structure determination
    (International Union of Crystallography, 2023) Pan, T.; Jin, S.; Miller, M. D.; Kyrillidis, A.; Phillips, G. N.
    The general de novo solution of the crystallographic phase problem is difficult and only possible under certain conditions. This paper develops an initial pathway to a deep learning neural network approach for the phase problem in protein crystallography, based on a synthetic dataset of small fragments derived from a large, well-curated subset of solved structures in the Protein Data Bank (PDB). In particular, electron-density estimates of simple artificial systems are produced directly from corresponding Patterson maps using a convolutional neural network architecture as a proof of concept.
  • Item
    PME: pruning-based multi-size embedding for recommender systems
    (Frontiers Media S.A., 2023) Liu, Zirui; Song, Qingquan; Li, Li; Choi, Soo-Hyun; Chen, Rui; Hu, Xia
    Embedding is widely used in recommendation models to learn feature representations. However, the traditional embedding technique that assigns a fixed size to all categorical features may be suboptimal for the following reasons. In the recommendation domain, the majority of categorical features' embeddings can be trained with less capacity without impacting model performance, so storing embeddings of equal length may incur unnecessary memory usage. Existing work that tries to allocate customized sizes for each feature usually either simply scales the embedding size with the feature's popularity or formulates this size allocation problem as an architecture selection problem. Unfortunately, most of these methods either suffer a large performance drop or incur significant extra time cost for searching proper embedding sizes. In this article, instead of formulating the size allocation problem as an architecture selection problem, we approach the problem from a pruning perspective and propose the Pruning-based Multi-size Embedding (PME) framework. During the search phase, we prune the dimensions that have the least impact on model performance in the embedding to reduce its capacity. Then, we show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding with significantly less search cost. Experimental results validate that PME can efficiently find proper sizes and hence achieve strong performance while significantly reducing the number of parameters in the embedding layer.
  • Item
    EnGens: a computational framework for generation and analysis of representative protein conformational ensembles
    (Oxford University Press, 2023) Conev, Anja; Rigo, Mauricio Menegatti; Devaurs, Didier; Fonseca, André Faustino; Kalavadwala, Hussain; de Freitas, Martiela Vaz; Clementi, Cecilia; Zanatta, Geancarlo; Antunes, Dinler Amaral; Kavraki, Lydia E
    Proteins are dynamic macromolecules that perform vital functions in cells. A protein's structure determines its function, but this structure is not static, as proteins change their conformation to achieve various functions. Understanding the conformational landscapes of proteins is essential to understanding their mechanism of action. Sets of carefully chosen conformations can summarize such complex landscapes and provide better insights into protein function than single conformations. We refer to these sets as representative conformational ensembles. Recent advances in computational methods have led to an increase in the number of available structural datasets spanning conformational landscapes. However, extracting representative conformational ensembles from such datasets is not an easy task and many methods have been developed to tackle it. Our new approach, EnGens (short for ensemble generation), collects these methods into a unified framework for generating and analyzing representative protein conformational ensembles. In this work, we: (1) provide an overview of existing methods and tools for representative protein structural ensemble generation and analysis; (2) unify existing approaches in an open-source Python package, and a portable Docker image, providing interactive visualizations within a Jupyter Notebook pipeline; (3) test our pipeline on a few canonical examples from the literature. Representative ensembles produced by EnGens can be used for many downstream tasks such as protein–ligand ensemble docking, Markov state modeling of protein dynamics and analysis of the effect of single-point mutations.
  • Item
    Enabling accurate and early detection of recently emerged SARS-CoV-2 variants of concern in wastewater
    (Springer Nature, 2023) Sapoval, Nicolae; Liu, Yunxi; Lou, Esther G.; Hopkins, Loren; Ensor, Katherine B.; Schneider, Rebecca; Stadler, Lauren B.; Treangen, Todd J.
    As clinical testing declines, wastewater monitoring can provide crucial surveillance on the emergence of SARS-CoV-2 variants of concern (VoCs) in communities. In this paper we present QuaID, a novel bioinformatics tool for VoC detection based on quasi-unique mutations. The benefits of QuaID are three-fold: (i) it provides VoC detection up to 3 weeks earlier, (ii) it detects VoCs accurately (>95% precision on simulated benchmarks), and (iii) it leverages all mutational signatures (including insertions and deletions).
  • Item
    PepSim: T-cell cross-reactivity prediction via comparison of peptide sequence and peptide-HLA structure
    (Frontiers Media S.A., 2023) Hall-Swan, Sarah; Slone, Jared; Rigo, Mauricio M.; Antunes, Dinler A.; Lizée, Gregory; Kavraki, Lydia E.
    Introduction: Peptide-HLA class I (pHLA) complexes on the surface of tumor cells can be targeted by cytotoxic T-cells to eliminate tumors, and this is one of the bases for T-cell-based immunotherapies. However, there exist cases where therapeutic T-cells directed towards tumor pHLA complexes may also recognize pHLAs from healthy normal cells. The process where the same T-cell clone recognizes more than one pHLA is referred to as T-cell cross-reactivity and this process is driven mainly by features that make pHLAs similar to each other. T-cell cross-reactivity prediction is critical for designing T-cell-based cancer immunotherapies that are both effective and safe. Methods: Here we present PepSim, a novel score to predict T-cell cross-reactivity based on the structural and biochemical similarity of pHLAs. Results and discussion: We show our method can accurately separate cross-reactive from non-cross-reactive pHLAs in a diverse set of datasets including cancer, viral, and self-peptides. PepSim can be generalized to work on any dataset of class I peptide-HLAs and is freely available as a web server at pepsim.kavrakilab.org.