Computer Science
Browsing Computer Science by Title
Now showing 1 - 20 of 435
Item: A best-match approach for gene set analyses in embedding spaces (Cold Spring Harbor Laboratory Press, 2024)
Li, Lechuan; Dannenfelser, Ruth; Cruz, Charlie; Yao, Vicky
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein–protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.

Item: A Characterization of Compound Documents on the Web (1999-11-29)
Lara, Eyal de; Wallach, Dan S.; Zwaenepoel, Willy
Recent developments in office productivity suites make it easier for users to publish rich compound documents on the Web.
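A minimal sketch of the symmetric best-match comparison at the core of the ANDES entry above (a sketch only: the published method additionally applies a null-model correction, and the gene sets and 2-D embedding vectors here are hypothetical):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def best_match_similarity(set_a, set_b):
    # For each gene in one set, take its best-matching gene in the other
    # set, then average both directions so the score is symmetric.
    a_to_b = [max(cosine(a, b) for b in set_b) for a in set_a]
    b_to_a = [max(cosine(b, a) for a in set_a) for b in set_b]
    return 0.5 * (sum(a_to_b) / len(a_to_b) + sum(b_to_a) / len(b_to_a))

# Hypothetical 2-D gene embeddings for two gene sets.
pathway = [(1.0, 0.1), (0.2, 1.0)]
query = [(0.9, 0.2), (0.1, 0.8)]
score = best_match_similarity(pathway, query)  # near 1 for similar sets
```

Because each element is matched only to its closest counterpart, a diverse set is not penalized for containing several distinct functional groups, which is the intuition behind the best-match design.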
Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935 different Web sites. Our main conclusions are: compound documents are in general much larger than current HTML documents; for large documents, embedded objects and images make up a large part of the documents' size; for small documents, the XML format produces much larger documents than OLE, while for large documents there is little difference; and compression considerably reduces the size of documents in both formats.

Item: A Chromosome-length Assembly of the Black Petaltail (Tanypteryx hageni) Dragonfly (Oxford University Press, 2023)
Tolman, Ethan R; Beatty, Christopher D; Bush, Jonas; Kohli, Manpreet; Moreno, Carlos M; Ware, Jessica L; Weber, K Scott; Khan, Ruqayya; Maheshwari, Chirag; Weisz, David; Dudchenko, Olga; Aiden, Erez Lieberman; Frandsen, Paul B; Center for Theoretical Biological Physics
We present a chromosome-length genome assembly and annotation of the Black Petaltail dragonfly (Tanypteryx hageni). This habitat specialist diverged from its sister species over 70 million years ago, and separated from the most closely related Odonata species with a reference genome 150 million years ago. Using PacBio HiFi reads and Hi-C data for scaffolding, we produce one of the highest-quality Odonata genomes to date. A scaffold N50 of 206.6 Mb and a single-copy BUSCO score of 96.2% indicate high contiguity and completeness.

Item: A Chromosome-Length Reference Genome for the Endangered Pacific Pocket Mouse Reveals Recent Inbreeding in a Historically Large Population (Oxford University Press, 2022)
Wilder, Aryn P; Dudchenko, Olga; Curry, Caitlin; Korody, Marisa; Turbek, Sheela P; Daly, Mark; Misuraca, Ann; Wang, Gaojianyong; Khan, Ruqayya; Weisz, David; Fronczek, Julie; Aiden, Erez Lieberman; Houck, Marlys L; Shier, Debra M; Ryder, Oliver A; Steiner, Cynthia C; Center for Theoretical Biological Physics
High-quality reference genomes are fundamental tools for understanding population history, and can provide estimates of genetic and demographic parameters relevant to the conservation of biodiversity. The federally endangered Pacific pocket mouse (PPM), which persists in three small, isolated populations in southern California, is a promising model for studying how demographic history shapes genetic diversity, and how diversity in turn may influence extinction risk.
To facilitate these studies in PPM, we combined PacBio HiFi long reads with Omni-C and Hi-C data to generate a de novo genome assembly, and annotated the genome using RNAseq. The assembly comprised 28 chromosome-length scaffolds (N50 = 72.6 Mb) and the complete mitochondrial genome, and included a long heterochromatic region on chromosome 18 not represented in the previously available short-read assembly. Heterozygosity was highly variable across the genome of the reference individual, with 18% of windows falling in runs of homozygosity (ROH) >1 Mb, and nearly 9% in tracts spanning >5 Mb. Yet outside of ROH, heterozygosity was relatively high (0.0027), and historical Ne estimates were large. These patterns of genetic variation suggest recent inbreeding in a formerly large population. Currently the most contiguous assembly for a heteromyid rodent, this reference genome provides insight into the past and recent demographic history of the population, and will be a critical tool for management and future studies of outbreeding depression, inbreeding depression, and genetic load.

Item: A Comparison of Software Architectures for E-business Applications (2002-02-20)
Cecchet, Emmanuel; Chanda, Anupam; Elnikety, Sameh; Marguerite, Julie; Zwaenepoel, Willy
As dynamic content has become more prevalent on the Web, a number of standard mechanisms have evolved to generate such dynamic content. We study three specific mechanisms in common use: PHP, Java servlets, and Enterprise Java Beans (EJB). PHP and Java servlets require a direct encoding of the database queries in the application logic. EJB provides a level of indirection, allowing the application logic to call bean methods that then perform database queries. Unlike PHP, which typically executes on the same machine as the Web server, Java servlets and EJB allow the application logic to execute on different machines, including the machine on which the database executes or a completely separate (set of) machine(s).
We present a comparison of the performance of these three systems in different configurations for two application benchmarks: an auction site and an online bookstore. We choose these two applications because they impose vastly different loads on the sub-systems: the auction site stresses the Web server front-end, while the online bookstore stresses the database. We use open-source software in common use in all of our experiments (the Apache Web server, Tomcat servlet server, Jonas EJB server, and MySQL relational database). The computational demands of Java servlets are modestly higher than those of PHP. The ability, however, to locate the servlets on a machine different from the Web server results in better performance for Java servlets than for PHP when the application imposes a significant load on the front-end Web server. The computational demands of EJB are much higher than those of PHP and Java servlets. As with Java servlets, we can alleviate EJB's performance problems by putting the beans on a separate machine, but the resulting overall performance remains inferior to that of the other two systems.

Item: A CRISPR toolbox for generating intersectional genetic mouse models for functional, molecular, and anatomical circuit mapping (Springer Nature, 2022)
Lusk, Savannah J.; McKinney, Andrew; Hunt, Patrick J.; Fahey, Paul G.; Patel, Jay; Chang, Andersen; Sun, Jenny J.; Martinez, Vena K.; Zhu, Ping Jun; Egbert, Jeremy R.; Allen, Genevera; Jiang, Xiaolong; Arenkiel, Benjamin R.; Tolias, Andreas S.; Costa-Mattioli, Mauro; Ray, Russell S.
The functional understanding of genetic interaction networks and cellular mechanisms governing health and disease requires the dissection, and multifaceted study, of discrete cell subtypes in developing and adult animal models. Recombinase-driven expression of transgenic effector alleles represents a significant and powerful approach to delineate cell populations for functional, molecular, and anatomical studies.
In addition to single recombinase systems, the expression of two recombinases in distinct, but partially overlapping, populations allows for more defined target expression. Although the application of this method is becoming increasingly popular, its experimental implementation has been broadly restricted to manipulations of a limited set of common alleles that are often commercially produced at great expense; the costs and technical challenges of producing intersectional mouse lines put customized approaches out of reach for many researchers. Here, we present a simplified CRISPR toolkit for rapid, inexpensive, and facile intersectional allele production.

Item: A deep learning solution for crystallographic structure determination (International Union of Crystallography, 2023)
Pan, T.; Jin, S.; Miller, M. D.; Kyrillidis, A.; Phillips, G. N.
The general de novo solution of the crystallographic phase problem is difficult and only possible under certain conditions. This paper develops an initial pathway to a deep learning neural network approach for the phase problem in protein crystallography, based on a synthetic dataset of small fragments derived from a large, well-curated subset of solved structures in the Protein Data Bank (PDB). In particular, electron-density estimates of simple artificial systems are produced directly from corresponding Patterson maps using a convolutional neural network architecture as a proof of concept.

Item: A Deterministic Model for Parallel Program Performance Evaluation (1998-12-03)
Adve, Vikram S.; Vernon, Mary K.
Analytical models for parallel programs have been successful at providing simple qualitative insights and bounds on scalability, but have been less successful in practice for predicting detailed, quantitative information about program performance.
We develop a conceptually simple model that provides detailed performance prediction for parallel programs with arbitrary task graphs, a wide variety of task scheduling policies, shared-memory communication, and significant resource contention. Unlike many previous models, our model assumes deterministic task execution times, which permits detailed analysis of synchronization, task scheduling, and the order of task execution, as well as mean values of communication costs. The assumption of deterministic task times is supported by a recent study of the influence of non-deterministic delays in parallel programs. We show that the deterministic task graph model is accurate and efficient for five shared-memory programs, including programs with large and/or complex task graphs, sophisticated task scheduling, highly non-uniform task times, and significant communication and resource contention. We also use three example programs to illustrate the predictive capabilities of the model. In two cases, broad insights and detailed metrics from the model are used to suggest improvements in load-balancing, and the model quickly and accurately predicts the impact of these changes. In the third case, further novel metrics are used to obtain insight into the impact of program design changes that improve communication locality as well as load-balancing. Finally, we briefly present results of a comparison between our model and representative models based on stochastic task execution times.

Item: A divide-and-conquer method for scalable phylogenetic network inference from multilocus data (Oxford University Press, 2019)
Zhu, Jiafan; Liu, Xinhao; Ogilvie, Huw A.; Nakhleh, Luay K.
Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting.
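The deterministic task-graph analysis described in the entry above can be illustrated with a minimal critical-path calculation (a sketch only: the task names and times are hypothetical, and the paper's model also accounts for scheduling policies and resource contention):

```python
def critical_path_makespan(times, preds):
    # times: task -> deterministic execution time
    # preds: task -> list of predecessor tasks in the task graph
    finish = {}

    def finish_time(task):
        # Earliest finish time: own time plus the latest predecessor finish.
        if task not in finish:
            finish[task] = times[task] + max(
                (finish_time(p) for p in preds.get(task, [])), default=0)
        return finish[task]

    return max(finish_time(t) for t in times)

# Hypothetical diamond-shaped task graph: a fans out to b and c, which join at d.
times = {"a": 1, "b": 2, "c": 3, "d": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(critical_path_makespan(times, preds))  # 5 (longest path a -> c -> d)
```

Because task times are deterministic, the makespan follows from a single pass over the graph; stochastic models would instead need distributions over these finish times.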
However, these methods can only handle a small number of loci from a handful of genomes. Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets (three-taxon subnetworks) to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied the method's performance, in terms of both running time and accuracy, on simulated as well as biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.

Item: A Graphical Multistage Calculus (2005-07-22)
Ellner, Stephan; Taha, Walid
While visual programming languages continue to gain popularity in domains ranging from scientific computing to real-time systems, the wealth of abstraction mechanisms, reasoning principles, and type systems developed over the last thirty years is currently available mainly for textual languages. With the goal of understanding how results in the textual languages can be mapped to the graphical setting, we develop the visual calculus PreVIEW. While this calculus visualizes computations in dataflow style similar to languages like LabVIEW and Simulink, its formal model is based on Ariola and Blom's work on cyclic lambda calculi.
We extend this model with staging constructs, establish a precise connection between textual and graphical program representations, and show how a reduction semantics for a multi-stage language can be lifted from the textual to the graphical setting.

Item: A Hierarchical Region-Based Static Single Assignment Form (2009-12-14)
Sarkar, Vivek; Zhao, Jisheng
Modern compilation systems face the challenge of incrementally reanalyzing a program's intermediate representation each time a code transformation is performed. Current approaches typically either re-analyze the entire program after an individual transformation or limit the analysis information that is available after a transformation. To address both efficiency and precision goals in an optimizing compiler, we introduce a hierarchical static single-assignment form called Region Static Single-Assignment (Region-SSA) form. Static single assignment (SSA) form is an efficient intermediate representation that is well suited for solving many data flow analysis and optimization problems. By partitioning the program into hierarchical regions, Region-SSA form maintains a local SSA form for each region. Region-SSA supports demand-driven re-computation of SSA form after a transformation is performed, since only the updated region's SSA form needs to be reconstructed, along with a potential propagation of exposed defs and uses. In this paper, we introduce the Region-SSA data structure, and present algorithms for construction and incremental reconstruction of Region-SSA form. The Region-SSA data structure includes a tree-based region hierarchy, a region-based control flow graph, and region-based SSA forms. We have implemented Region-SSA form in the Habanero-Java (HJ) research compiler. Our experimental results show significant improvements in compile time compared to traditional approaches that recompute the entire procedure's SSA form exhaustively after transformation.
For loop unrolling transformations, compile-time speedups of up to 35.8× were observed using Region-SSA form relative to standard SSA form. For loop interchange transformations, compile-time speedups of up to 205.6× were observed. We believe that Region-SSA form is an attractive foundation for future compiler frameworks that need to incorporate sophisticated incremental program analyses and transformations.

Item: A Linear Transform Scheme for Combining Weights into Scores (1998-10-09)
Sung, Sam
Ranking has been widely used in many applications. A ranking scheme usually employs a "scoring rule" that assigns a final numerical value to each and every object to be ranked. A scoring rule normally involves the use of one or many scores, and it gives more weight to the scores that are more important. In this paper, we give a scheme that can combine weights into scores in a natural way. We compare our scheme to the formula given by Fagin. We give additional desirable properties that a weighted scoring rule should possess. Some interesting issues concerning weighted scoring rules are also discussed.

Item: A MAC protocol for Multi Frequency Physical Layer (2003-01-23)
Kumar, Rajnish; PalChaudhuri, Santashil; Saha, Amit
Existing MAC protocols for wireless LAN systems assume that a particular node can operate on only one frequency and that most or all of the nodes operate on the same frequency. We propose a MAC protocol for use in an ad hoc network of mobile nodes using a wireless LAN system that defines multiple independent frequency channels. Each node can switch quickly from one channel to another but can operate on only one channel at a time. We simulate the proposed protocol by modifying the wireless extension. Our simulations show that the proposed protocol, though simple, is capable of much better performance in the presence of multiple independent channels than IEEE 802.11, which assumes a single frequency channel for all nodes.
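In the spirit of the Linear Transform entry above, one simple way to fold weights into scores is a normalized weighted average (an illustrative scheme only, not necessarily the paper's exact formula), which satisfies two properties commonly asked of weighted scoring rules: equal weights reduce to the unweighted mean, and a zero-weight score has no influence:

```python
def weighted_score(scores, weights):
    # Combine per-attribute scores in [0, 1] using importance weights.
    # A zero-weight score is ignored; equal weights give the plain mean.
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# Hypothetical object with three attribute scores.
scores = [0.9, 0.5, 0.2]
print(weighted_score(scores, [1, 1, 1]))  # equals the unweighted mean
print(weighted_score(scores, [2, 1, 0]))  # third score has no influence
```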
As expected, the proposed protocol works as well as IEEE 802.11 in the presence of a single channel.

Item: A Machine Learning Model for Risk Stratification of Postdiagnosis Diabetic Ketoacidosis Hospitalization in Pediatric Type 1 Diabetes: Retrospective Study (JMIR, 2024)
Subramanian, Devika; Sonabend, Rona; Singh, Ila
Background: Diabetic ketoacidosis (DKA) is the leading cause of morbidity and mortality in pediatric type 1 diabetes (T1D), occurring in approximately 20% of patients, with an economic cost of $5.1 billion/year in the United States. Despite multiple risk factors for postdiagnosis DKA, there is still a need for explainable, clinic-ready models that accurately predict DKA hospitalization in established patients with pediatric T1D. Objective: We aimed to develop an interpretable machine learning model to predict the risk of postdiagnosis DKA hospitalization in children with T1D using routinely collected time series of electronic health record (EHR) data. Methods: We conducted a retrospective case-control study using EHR data from 1787 patients from among 3794 patients with T1D treated at a large tertiary care US pediatric health system from January 2010 to June 2018. We trained a state-of-the-art, explainable, gradient-boosted ensemble (XGBoost) of decision trees with 44 regularly collected EHR features to predict postdiagnosis DKA. We measured the model's predictive performance using the area under the receiver operating characteristic curve, weighted F1-score, weighted precision, and recall, in a 5-fold cross-validation setting. We analyzed Shapley values to interpret the learned model and gain insight into its predictions. Results: Our model distinguished the cohort that develops DKA postdiagnosis from the one that does not (P<.001).
It predicted postdiagnosis DKA risk with an area under the receiver operating characteristic curve of 0.80 (SD 0.04), a weighted F1-score of 0.78 (SD 0.04), and a weighted precision and recall of 0.83 (SD 0.03) and 0.76 (SD 0.05), respectively, using a relatively short history of data from routine clinic follow-ups postdiagnosis. On analyzing Shapley values of the model output, we identified key risk factors predicting postdiagnosis DKA at both the cohort and individual levels. We observed sharp changes in postdiagnosis DKA risk with respect to 2 key features (diabetes age and glycated hemoglobin at 12 months), yielding time intervals and glycated hemoglobin cutoffs for potential intervention. By clustering model-generated Shapley values, we automatically stratified the cohort into 3 groups with 5%, 20%, and 48% risk of postdiagnosis DKA. Conclusions: We have built an explainable, predictive, machine learning model with potential for integration into clinical workflow. The model risk-stratifies patients with pediatric T1D and identifies patients with the highest postdiagnosis DKA risk using limited follow-up data starting from the time of diagnosis. The model identifies key time points and risk factors to direct clinical interventions at both the individual and cohort levels. Further research with data from multiple hospital systems can help us assess how well our model generalizes to other populations. The clinical importance of our work is that the model can predict the patients most at risk for postdiagnosis DKA and identify preventive interventions based on mitigation of individualized risk factors.

Item: A maximum pseudo-likelihood approach for phylogenetic networks (BioMed Central, 2015)
Yu, Yun; Nakhleh, Luay K.
Background: Several phylogenomic analyses have recently demonstrated the need to account simultaneously for incomplete lineage sorting (ILS) and hybridization when inferring a species phylogeny.
A maximum likelihood approach was introduced recently for inferring species phylogenies in the presence of both processes, and showed very good results. However, computing the likelihood of a model in this case is computationally infeasible except for very small data sets. Results: Inspired by recent work on the pseudo-likelihood of species trees based on rooted triples, we introduce the pseudo-likelihood of a phylogenetic network, which, when combined with a search heuristic, provides a statistical method for phylogenetic network inference in the presence of ILS. Unlike trees, networks are not always uniquely encoded by a set of rooted triples. Therefore, even when given sufficient data, the method might converge to a network that is equivalent under rooted triples to the true one, but not the true one itself. The method is computationally efficient and has produced very good results on the data sets we analyzed. The method is implemented in PhyloNet, which is publicly available in open source. Conclusions: Maximum pseudo-likelihood allows for inferring species phylogenies in the presence of hybridization and ILS, while scaling to much larger data sets than is currently feasible under full maximum likelihood. The nonuniqueness of phylogenetic networks encoded by a system of rooted triples notwithstanding, the proposed method infers the correct network under certain scenarios, and provides candidates for further exploration under other criteria and/or data in other scenarios.

Item: A New Approach to Routing With Dynamic Metrics (1998-11-18)
Chen, Johnny; Druschel, Peter; Subramanian, Devika
We present a new routing algorithm to compute paths within a network using dynamic link metrics. Dynamic link metrics are cost metrics that depend on a link's dynamic characteristics, e.g., the congestion on the link. Our algorithm is destination-initiated: the destination initiates a global path computation to itself using dynamic link metrics.
All other destinations that do not initiate this dynamic metric computation use paths that are calculated and maintained by a traditional routing algorithm using static link metrics. Analysis of Internet packet traces shows that a high percentage of network traffic is destined for a small number of networks. Because our algorithm is destination-initiated, it achieves maximum performance at minimum cost when it only recomputes dynamic metric paths to these selected "hot" destination networks. This selective approach to route recomputation reduces many of the problems (principally route oscillations) associated with calculating all routes simultaneously. We compare the routing efficiency and end-to-end performance of our algorithm against those of traditional algorithms using dynamic link metrics. The results of our experiments show that our algorithm can provide higher network performance at a significantly lower routing cost under conditions that arise in real networks. The effectiveness of the algorithm stems from the independent, time-staggered recomputation of important paths using dynamic metrics, allowing for splits in congested traffic that cannot be made by traditional routing algorithms.

Item: A Polynomial Blossom for the Askey–Wilson Operator (Springer, 2018)
Simeonov, Plamen; Goldman, Ron
We introduce a blossoming procedure for polynomials related to the Askey–Wilson operator. This new blossom is symmetric, multiaffine, and reduces to the complex representation of the polynomial on a certain diagonal. This Askey–Wilson blossom can be used to find the Askey–Wilson derivative of a polynomial of any order. We also introduce a corresponding Askey–Wilson Bernstein basis for which this new blossom provides the dual functionals. We derive a partition-of-unity property and a Marsden identity for this Askey–Wilson Bernstein basis, which turn out to be the terminating versions of Rogers' 6ϕ5 summation formula and a very-well-poised 8ϕ7 summation formula.
Recurrence and symmetry relations and differentiation and degree elevation formulas for the Askey–Wilson Bernstein bases, as well as degree elevation formulas for Askey–Wilson Bézier curves, are also given.

Item: A Practical Soft Type System for Scheme (1993-12-06)
Cartwright, Robert; Wright, Andrew
Soft type systems provide the benefits of static type checking for dynamically typed languages without rejecting untypable programs. A soft type checker infers types for variables and expressions and inserts explicit run-time checks to transform untypable programs to typable form. We describe a practical soft type system for R4RS Scheme. Our type checker uses a representation for types that is expressive, easy to interpret, and supports efficient type inference. Soft Scheme supports all of R4RS Scheme, including procedures of fixed and variable arity, assignment, continuations, and top-level definitions. Our implementation is available by anonymous FTP.

Item: A Related-Key Cryptanalysis of RC4 (2000-06-08)
Grosul, Alexander; Wallach, Dan S.
In this paper we present an analysis of the RC4 stream cipher and show that for each 2048-bit key there exists a family of related keys, differing in one of the byte positions. The keystreams generated by RC4 for a key and its related keys are substantially similar in the initial hundred bytes before diverging. RC4 is most commonly used with a 128-bit key repeated 16 times; this variant does not suffer from the weaknesses we describe. We recommend that applications of RC4 with keys longer than 128 bits (and particularly those using the full 2048-bit keys) discard the initial 256 bytes of the keystream output.

Item: A Resource Management Framework for Predictable Quality of Service in Web Servers (2003-07-07)
Aron, Mohit; Druschel, Peter; Iyer, Sitaram
This paper presents a resource management framework for providing predictable quality of service (QoS) in Web servers.
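The cipher discussed in the RC4 entry above is small enough to sketch in full; the related-key comparison below uses hypothetical 256-byte (2048-bit) keys differing in a single byte position:

```python
def rc4_keystream(key, n):
    # Key-scheduling algorithm (KSA): permute S under the key bytes.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA): emit n keystream bytes.
    i = j = 0
    out = []
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)

# Two hypothetical full-length keys differing only in the final byte.
key_a = bytes(range(256))
key_b = key_a[:-1] + bytes([0])
ks_a = rc4_keystream(key_a, 512)
ks_b = rc4_keystream(key_b, 512)
matches = sum(a == b for a, b in zip(ks_a[:100], ks_b[:100]))
# Early keystream bytes of such related keys tend to agree far more often
# than the 100/256 (about 0.4) matching positions expected for independent
# random byte strings, which is why discarding initial output is advised.
```

Changing only a late key byte perturbs the KSA permutation in just a few entries, so the early PRGA outputs of the two keys largely coincide before drifting apart.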
The framework allows Web server and proxy operators to ensure a probabilistic minimal QoS level, expressed as an average request rate, for a certain class of requests (called a service), irrespective of the load imposed by other requests. A measurement-based admission control framework determines whether a service can be hosted on a given server or proxy, based on the measured statistics of the resource consumption and the desired QoS levels of all the co-located services. In addition, we present a feedback-based resource scheduling framework that ensures that QoS levels are maintained among admitted, co-located services. Experimental results obtained with a prototype implementation of our framework on trace-based workloads show its effectiveness in providing desired QoS levels with high confidence, while achieving high average utilization of the hardware.
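A minimal sketch of the measurement-based admission test described in the entry above (the function name, the headroom factor k, and the single-resource view are illustrative assumptions, not the paper's exact formulation):

```python
import math

def admit(existing, candidate, capacity, k=2.0):
    # existing: list of (mean, stddev) measured resource demands of
    # already-admitted services; candidate: (mean, stddev) of the new one.
    # Admit only if total mean demand plus k standard deviations of the
    # combined demand still fits within capacity, giving a probabilistic
    # (not absolute) guarantee that admitted services keep their QoS.
    mean_total = sum(m for m, _ in existing) + candidate[0]
    var_total = sum(s * s for _, s in existing) + candidate[1] ** 2
    return mean_total + k * math.sqrt(var_total) <= capacity

services = [(30.0, 5.0)]  # one admitted service's measured demand
print(admit(services, (20.0, 5.0), capacity=100.0))   # True: ample headroom
print(admit(services, (60.0, 10.0), capacity=100.0))  # False: would risk violations
```

Larger k trades utilization for confidence: the server admits fewer services but is less likely to violate the QoS levels of those already admitted.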