Computer Science

Permanent URI for this community

https://hdl.handle.net/1911/8313

Browse

Now showing 1 - 20 of 393

A best-match approach for gene set analyses in embedding spaces
(Cold Spring Harbor Laboratory Press, 2024) Li, Lechuan; Dannenfelser, Ruth; Cruz, Charlie; Yao, Vicky
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein–protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
A Characterization of Compound Documents on the Web
(1999-11-29) Lara, Eyal de; Wallach, Dan S.; Zwaenepoel, Willy
Recent developments in office productivity suites make it easier for users to publish rich {\em compound documents\/} on the Web. Compound documents appear as a single unit of information but may contain data generated by different applications, such as text, images, and spreadsheets. Given the popularity enjoyed by these office suites and the pervasiveness of the Web as a publication medium, we expect that in the near future these compound documents will become an increasing proportion of the Web's content. As a result, the content handled by servers, proxies, and browsers may change considerably from what is currently observed. Furthermore, these compound documents are currently treated as opaque byte streams, but future Web infrastructure may wish to understand their internal structure to provide higher-quality service. In order to guide the design of this future Web infrastructure, we characterize compound documents currently found on the Web. Previous studies of Web content either ignored these document types altogether or did not consider their internal structure. We study compound documents originated by the three most popular applications from the Microsoft Office suite: Word, Excel, and PowerPoint. Our study encompasses over 12,500 documents retrieved from 935different Web sites. Our main conclusions are: Compound documents are in general much larger than current HTML documents. For large documents, embedded objects and images make up a large part of the documents' size. For small documents, XML format produces much larger documents than OLE. For large documents, there is little difference. Compression considerably reduces the size of documents in both formats.
A Chromosome-length Assembly of the Black Petaltail (Tanypteryx hageni) Dragonfly
(Oxford University Press, 2023) Tolman, Ethan R; Beatty, Christopher D; Bush, Jonas; Kohli, Manpreet; Moreno, Carlos M; Ware, Jessica L; Weber, K Scott; Khan, Ruqayya; Maheshwari, Chirag; Weisz, David; Dudchenko, Olga; Aiden, Erez Lieberman; Frandsen, Paul B; Center for Theoretical Biological Physics
We present a chromosome-length genome assembly and annotation of the Black Petaltail dragonfly (Tanypteryx hageni). This habitat specialist diverged from its sister species over 70 million years ago, and separated from the most closely related Odonata with a reference genome 150 million years ago. Using PacBio HiFi reads and Hi-C data for scaffolding we produce one of the most high-quality Odonata genomes to date. A scaffold N50 of 206.6 Mb and a single copy BUSCO score of 96.2% indicate high contiguity and completeness.
A Comparison of Software Architectures for E-business Applications
(2002-02-20) Cecchet, Emmanuel; Chanda, Anupam; Elnikety, Sameh; Marguerite, Julie; Zwaenepoel, Willy
As dynamic content has become more prevalent on the Web, a number of standard mechanisms have evolved to generate such dynamic content. We study three specific mechanisms in common use: PHP, Java servlets, and Enterprise Java Beans (EJB). PHP and Java servlets require a direct encoding of the database queries in the application logic. EJB provides a level of indirection, allowing the application logic to call bean methods that then perform database queries. Unlike PHP, which typically executes on the same machine as the Web server, Java servlets and EJB allow the application logic to execute on different machines, including the machine on which the database executes or a completely separate (set of) machine(s). We present a comparison of the performance of these three systems in different configurations for two application benchmarks: an auction site and an online bookstore. We choose these two applications because they impose vastly different loads on the sub-systems: the auction site stresses the Web server front-end while the online bookstore stresses the database. We use open-source software in common use in all of our experiments (the Apache Web server, Tomcat servlet server, Jonas EJB server, and MySQL relational database). The computational demands ofJava servlets are modestly higher than those of PHP. The ability, however, of locating the servlets on a machine different from the Web server results in better performance for Java servlets than for PHP in the case that the application imposes a significant load on the front-end Web server. The computational demands of EJB are much higher than those of PHP and Java servlets. As with Java servlets, we can alleviate EJB's performance problems by putting them on a separate machine, but the resulting overall performance remains inferior to that of the other two systems.
A Deterministic Model for Parallel Program Performance Evaluation
(1998-12-03) Adve, Vikram S.; Vernon, Mary K.
Analytical models for parallel programs have been successful at providing simple qualitative insights and bounds on scalability, but have been less successful in practice for predicting detailed, quantitative information about program performance. We develop a conceptually simple model that provides detailed performance prediction for parallel programs with arbitrary task graphs, a wide variety of task scheduling policies, shared-memory communication, and significant resource contention. Unlike many previous models, our model assumes deterministic task execution times which permits detailed analysis of synchronization, task scheduling, the order of task execution as well as mean values of communication costs. The assumption of deterministic task times is supported by a recent study of the influence of non-deterministic delays in parallel programs. We show that the deterministic task graph model is accurate and efficient for five shared-memory programs, including programs with large and/or complex task graphs, sophisticated task scheduling, highly non-uniform task times, and significant communication and resource contention. We also use three example programs to illustrate the predictive capabilities of the model. In two cases, broad insights and detailed metrics from the model are used to suggest improvements in load-balancing and the model quickly and accurately predicts the impact of these changes. In the third case, further novel metrics are used to obtain insight into the impact of program design changes that improve communication locality as well as load-balancing. Finally, we briefly present results of a comparison between our model and representative models based on stochastic task execution times.
A Graphical Multistage Calculus
(2005-07-22) Ellner, Stephan; Taha, Walid
While visual programming languages continue to gain popularity in domains ranging from scientific computing to real-time systems, the wealth of abstraction mechanisms, reasoning principles, and type systems developed over the last thirty years is currently available mainly for textual languages. With the goal of understanding how results in the textual languages can be mapped to the graphical setting, we develop the visual calculus PreVIEW. While this calculus visualizes computations in dataflow-style similar to languages like LabVIEW and Simulink, its formal model is based on Ariola and Blom's work on cyclic lambda calculi. We extend this model with staging constructs, establish a precise connection between textual and graphical program representations, and show how a reduction semantics for a multi-stage language can be lifted from the textual to the graphical setting.
A Hierarchical Region-Based Static Single Assignment Form
(2009-12-14) Sarkar, Vivek; Zhao, Jisheng
Modern compilation systems face the challenge of incrementally reanalyzing a program’s intermediate representation each time a code transformation is performed. Current approaches typically either re-analyze the entire program after an individual transformation or limit the analysis information that is available after a transformation. To address both efficiency and precision goals in an optimizing compiler, we introduce a hierarchical static single-assignment form called Region Static Single-Assignment (Region-SSA) form. Static single assignment (SSA) form is an efficient intermediate representation that is well suited for solving many data flow analysis and optimization problems. By partitioning the program into hierarchical regions, Region-SSA form maintains a local SSA form for each region. Region-SSA supports a demand-driven re-computation of SSA form after a transformation is performed, since only the updated region’s SSA form needs to be reconstructed along with a potential propagation of exposed defs and uses. In this paper, we introduce the Region-SSA data structure, and present algorithms for construction and incremental reconstruction of Region-SSA form. The Region-SSA data structure includes a tree based region hierarchy, a region based control flow graph, and region-based SSA forms. We have implemented in Region SSA form in the Habanero-Java (HJ) research compiler. Our experimental results show significant improvements in compile-time compared to traditional approaches that recomputed the entire procedure’s SSA form exhaustively after transformation. For loop unrolling transformations, compile-time speedups up to 35.8× were observed using Region-SSA form relative to standard SSA form. For loop interchange transformations, compile-time speedups up to 205.6× were observed. We believe that Region-SSA form is an attractive foundation for future compiler frameworks that need to incorporate sophisticated incremental program analyses and transformations.
A Linear Transform Scheme for Combining Weights into Scores
(1998-10-09) Sung, Sam
Ranking has been widely used in many applications. A ranking scheme usually employs a "scoring rule" that assigns a final numerical value to each and every object to be ranked. A scoring rule normally involves the use of one or many scores, and it gives more weight to the scores that is more important. In this paper, we give a scheme that can combine weights into scores in a natural way. We compare our scheme to the formula given by Fagin. We give additional desirable properties that weighted "scoring rule" are desirable to possess. Some interesting issues on weighted scoring rule are also discussed.
A MAC protocol for Multi Frequency Physical Layer
(2003-01-23) Kumar, Rajnish; PalChaudhuri, Santashil; Saha, Amit
Existing MAC protocols for wireless LAN systems assume that a particular node can operate on only one frequency and that most/all of the nodes operate on the same frequency. We propose a MAC protocol for use in an ad hoc network of mobile nodes using a wireless LAN system that defines multiple independent frequency channels. Each node can switch quickly from one channel to another but can operate on one channel at a time. We simulate the proposed protocol by modifying the wireless extension. Our simulations show that the proposed protocol, though simple, is capable of much better performance in the presence of multiple independent channels than IEEE 802.11which assumes a single frequency channel for all nodes. As expected, the proposed protocol works as well as IEEE 802.11 in the presence of a single channel.
A New Approach to Routing With Dynamic Metrics
(1998-11-18) Chen, Johnny; Druschel, Peter; Subramanian, Devika
We present a new routing algorithm to compute paths within a network using dynamic link metrics. Dynamic link metrics are cost metrics that depend on a link's dynamic characteristics, e.g., the congestion on the link. Our algorithm is destination-initiated: the destination initiates a global path computation to itself using dynamic link metrics. All other destinations that do not initiate this dynamic metric computation use paths that are calculated and maintained by a traditional routing algorithm using static link metrics. Analysis of Internet packet traces show that a high percentage of network traffic is destined for a small number of networks. Because our algorithm is destination-initiated, it achieves maximum performance at minimum cost when it only recomputes dynamic metric paths to these selected "hot" destination networks. This selective approach to route recomputation reduces many of the problems (principally route oscillations) associated with calculating all routes simultaneously. We compare the routing efficiency and end-to-end performance of our algorithm against those of traditional algorithms using dynamic link metrics. The results of our experiments show that our algorithm can provide higher network performance at a significantly lower routing cost under conditions that arise in real networks. The effectiveness of the algorithm stems from the independent, time-staggered recomputation of important paths using dynamic metrics, allowing for splits in congested traffic that cannot be made by traditional routing algorithms.
A Polynomial Blossom for the Askey–Wilson Operator
(Springer, 2018) Simeonov, Plamen; Goldman, Ron
We introduce a blossoming procedure for polynomials related to the Askey–Wilson operator. This new blossom is symmetric, multiaffine, and reduces to the complex representation of the polynomial on a certain diagonal. This Askey–Wilson blossom can be used to find the Askey–Wilson derivative of a polynomial of any order. We also introduce a corresponding Askey–Wilson Bernstein basis for which this new blossom provides the dual functionals. We derive a partition of unity property and a Marsden identity for this Askey–Wilson Bernstein basis, which turn out to be the terminating versions of Rogers’ 6ϕ5 summation formula and a very-well-poised 8ϕ7 summation formula. Recurrence and symmetry relations and differentiation and degree elevation formulas for the Askey–Wilson Bernstein bases, as well as degree elevation formulas for Askey–Wilson Bézier curves, are also given.
A Practical Soft Type System for Scheme
(1993-12-06) Cartwright, Robert; Wright, Andrew
Soft type systems provide the benefits of static type checking for dynamically typed languages without rejecting untypable programs. A soft type checker infers types for variables and expressions and inserts explicit run-time checks to transform untypable programs to typable form. We describe a practical soft type system for R4RS Scheme. Our type checker uses a representation for types that is expressive, easy to interpret, and supports efficient type inference. Soft Scheme supports all of R4RS Scheme, including procedures of fixed and variable arity, assignment, continuations, and top-level definitions. Our implementation is available by anonymous FTP.
A Related-Key Cryptanalysis of RC4
(2000-06-08) Grosul, Alexander; Wallach, Dan S.
In this paper we present analysis of the RC4 stream cipher and show that for each 2048-bit key there exists a family of related keys, differing in one of the byte positions. The keystreams generated by RC4 for a key and its related keys are substantially similar in the initial hundred bytes before diverging. RC4 is most commonly used with a 128-bit key repeated 16 times;this variant does not suffer from the weaknesses we describe. We recommend that applications of RC4 with keys longer than 128 bits (and particularly those using the full 2048-bit keys) discard the initial 256 bytes of the keystream output.
A Resource Management Framework for Predictable Quality of Service in Web Servers
(2003-07-07) Aron, Mohit; Druschel, Peter; Iyer, Sitaram
This paper presents a resource management framework for providing predictable quality of service (QoS) in Web servers. The framework allows Web server and proxy operators to ensure a probabilistic minimal QoS level, expressed as an average request rate, for a certain class of requests (called a service), irrespective of the load imposed by other requests. A measurement-based admission control framework determines whether a service can be hosted on a given server or proxy, based on the measured statistics of the resource consumptions and the desired QoS levels of all the co-located services. In addition, we present a feedback-based resource scheduling framework that ensures that QoS levels are maintained among admitted, co-located services. Experimental results obtained with a prototype implementation of our framework on trace-based workloads show its effectiveness in providing desired QoS levels with high confidence, while achieving high average utilization of the hardware.
A review of parameters and heuristics for guiding metabolic pathfinding
(Springer International Publishing, 2017-09-15) Kim, Sarah M.; Peña, Matthew I.; Moll, Mark; Bennett, George N.; Kavraki, Lydia E.
Abstract Recent developments in metabolic engineering have led to the successful biosynthesis of valuable products, such as the precursor of the antimalarial compound, artemisinin, and opioid precursor, thebaine. Synthesizing these traditionally plant-derived compounds in genetically modified yeast cells introduces the possibility of significantly reducing the total time and resources required for their production, and in turn, allows these valuable compounds to become cheaper and more readily available. Most biosynthesis pathways used in metabolic engineering applications have been discovered manually, requiring a tedious search of existing literature and metabolic databases. However, the recent rapid development of available metabolic information has enabled the development of automated approaches for identifying novel pathways. Computer-assisted pathfinding has the potential to save biochemists time in the initial discovery steps of metabolic engineering. In this paper, we review the parameters and heuristics used to guide the search in recent pathfinding algorithms. These parameters and heuristics capture information on the metabolic network structure, compound structures, reaction features, and organism-specificity of pathways. No one metabolic pathfinding algorithm or search parameter stands out as the best to use broadly for solving the pathfinding problem, as each method and parameter has its own strengths and shortcomings. As assisted pathfinding approaches continue to become more sophisticated, the development of better methods for visualizing pathway results and integrating these results into existing metabolic engineering practices is also important for encouraging wider use of these pathfinding methods.
A Sample-Driven Call Stack Profiler
(2004-07-15) Fowler, Rob; Froyd, Nathan; Mellor-Crummey, John
Call graph profiling reports measurements of resource utilization along with information about the calling context in which the resources were consumed. We present the design of a novel profiler that measures resource utilization and its associated calling context using a stack sampling technique. Our scheme has a novel combination of features and mechanisms. First, it requires no compiler support or instrumentation, either of source or binary code. Second, it works on heavily optimized code and on complex, multi-module applications. Third, it uses sampling rather than tracing to build a context tree, collect histogram data, and to characterize calling patterns. Fourth, the data structures and algorithms are efficient enough to construct the complete tree exposed in the sampling process. We describe an implementation for the Alpha/Tru64 platform and present experimental measurements that compare this implementation with the standard call graph profiler provided on Tru64, hi prof. We show results from a variety of programs in several languages indicating that our profiler operates with modest overhead. Our experiments show that the profiling overhead of our technique is nearly a factor of 55 lower than that of hi prof when profiling a call-intensive recursive program.
A Security Analysis of My.MP3.com and the Beam-it Protocol
(2000-03-08) Stubblefield, Adam; Wallach, Dan S.
My.MP3.com is a service that streams audio in the MP3 format to its users. In order to resolve copyright concerns, the service first requires that a user prove he or she owns the right to listen to a particular CD. The mechanism used for the verification is a program called Beam-it which reads a random subset of an audio CD and interacts with the My.MP3.com servers using a proprietary protocol. This paper presents a reverse-engineering of the protocol and the client-side code which implements it. An analysis of Beam-it's security implications and speculations as to the Beam-it server architecture are also presented. We found the protocol to provide strong protection against a user pretending to have a music CD without actually possessing it, however we found the protocol to be unnecessarily verbose and includes information that some users may prefer to keep private.
A Set of Convolution Identities Relating the Blocks of Two Dixon Resultant Matrices
(1999-06-16) Chionh, Eng-Wee; Goldman, Ronald; Zhang, Ming
Resultants for bivariate polynomials are often represented by the determinants of very big matrices. Properly grouping the entries of these matrices into blocks is a very effective tool for studying the properties of these resultants. Here we derive a set of convolution identities relating the blocks of two Dixon bivariate resultant representations.
A Simple and Effective Caching Scheme for Dynamic Content
(2000-11-28) Cox, Alan; Rajamani, Karthick
As web sites increasingly deliver dynamic content, the process of content generation at request time is becoming a severe limitation to web site throughput. Recent studies have shown that much of the dynamic content is, however, better characterized as pseudo-dynamic, i.e., a dynamic composition of stored or static data. Consequently, caching the generated web pages may increase the web server's throughput if there is some temporal locality in the request stream. In this paper, we perform a quantitative analysis of the benefits of caching for dynamic content using the e-commerce benchmark, TPC-W,as the workload. We implement caching through a simple and efficient Apache extension module, DCache, that can be easily incorporated into the current infrastructure for dynamic content delivery. Our DCache module uses conventional expiration times and our own request-initiated invalidation scheme as the methods for keeping the cache consistent. It also supports site-specific optimization by providing a mechanism to incorporate the priorities of specific web pages into the caching scheme. Our experiments show that we can obtain over 3 times the non-caching throughput with our caching approach.
A simple, fast dominance algorithm
(2006-01-11) Cooper, Keith D.; Harvey, Timothy J.; Kennedy, Ken
The problem of finding the dominators in a control-flow graph has a long history in the literature. The original algorithms suffered from a large asymptotic complexity but were easy to understand. Subsequent work improved the time bound, but generally sacrificed both simplicity and ease of implementation. This paper returns to a simple formulation of dominance as a global data-flow problem. Some insights into the nature of dominance lead to an implementation of an O(N2) algorithm that runs faster, in practice, than the classic Lengauer-Tarjan algorithm, which has a timebound of O(E ∗ log(N)). We compare the algorithm to Lengauer-Tarjan because it is the best known and most widely used of the fast algorithms for dominance. Working from the same implementation insights, we also rederive (from earlier work on control dependence by Ferrante, et al.) a method for calculating dominance frontiers that we show is faster than the original algorithm by Cytron, et al. The aim of this paper is not to present a new algorithm, but, rather, to make an argument based on empirical evidence that algorithms with discouraging asymptotic complexities can be faster in practice than those more commonly employed. We show that, in some cases, careful engineering of simple algorithms can overcome theoretical advantages, even when problems grow beyond realistic sizes. Further, we argue that the algorithms presented herein are intuitive and easily implemented, making them excellent teaching tools.

Browse

Browsing Computer Science by Title

Results Per Page

Sort Options