Browsing by Author "Jermaine, Christopher M."
Now showing 1 - 14 of 14
Item: Adaptive Similarity Measures for Material Identification in Hyperspectral Imagery (2013-09-16)
Bue, Brian; Merenyi, Erzsebet; Jermaine, Christopher M.; Subramanian, Devika; Wagstaff, Kiri

Remotely-sensed hyperspectral imagery has become one of the most advanced tools for analyzing the processes that shape the Earth and other planets. Effective, rapid analysis of high-volume, high-dimensional hyperspectral image data sets demands efficient, automated techniques to identify signatures of known materials in such imagery. In this thesis, we develop a framework for automatic material identification in hyperspectral imagery using adaptive similarity measures. We frame the material identification problem as a multiclass similarity-based classification problem, where our goal is to predict material labels for unlabeled target spectra based upon their similarities to source spectra with known material labels. Because differences in capture conditions affect the spectral representations of materials, we divide the material identification problem into intra-domain (i.e., source and target spectra captured under identical conditions) and inter-domain (i.e., source and target spectra captured under different conditions) settings. The first component of this thesis develops adaptive similarity measures for intra-domain settings that measure the relevance of spectral features to the given classification task using small amounts of labeled data. We propose a technique based on multiclass Linear Discriminant Analysis (LDA) that combines several distinct similarity measures into a single hybrid measure capturing the strengths of each of the individual measures. We also provide a comparative survey of techniques for low-rank Mahalanobis metric learning, and demonstrate that regularized LDA yields results competitive with the state of the art at substantially lower computational cost. The second component of this thesis shifts the focus to inter-domain settings, and proposes a multiclass domain adaptation framework that reconciles systematic differences between spectra captured under similar, but not identical, conditions. Our framework computes a similarity-based mapping that captures structured, relative relationships between classes shared between source and target domains, allowing us to apply a classifier trained on labeled source spectra to classify target spectra. We demonstrate improved domain adaptation accuracy in comparison to recently proposed multitask learning and manifold alignment techniques in several case studies involving state-of-the-art synthetic and real-world hyperspectral imagery.

Item: Computer-Aided Mechanism Design (2015-04-17)
Fang, Ye; Chaudhuri, Swarat; Vardi, Moshe; Nakhleh, Luay K.; Jermaine, Christopher M.

Algorithmic mechanism design, as practiced today, is a manual process; however, manual design and reasoning do not scale well with the complexity of design tasks. In this thesis, we study computer-aided mechanism design as an alternative to the manual construction and analysis of mechanisms. In our approach, a mechanism is a program that receives inputs from agents with private preferences and produces a public output. Rather than programming such a mechanism manually, the human designer writes a high-level partial specification that includes behavioral models of the agents and a set of logical correctness requirements (for example, truth-telling) on the desired mechanism. A program synthesis algorithm is then used to automatically search a large space of candidate mechanisms and find one that satisfies the requirements. The algorithm is based on a reduction to automated first-order theorem proving, specifically to deciding the satisfiability of quantifier-free formulas in the first-order theory of reals. We present an implementation of our synthesis approach on top of a Satisfiability Modulo Theories solver. The system is evaluated through several case studies in which we automatically synthesize a set of classic mechanisms and their variations, including the Vickrey auction, a multistage auction, a position auction, and a voting mechanism.
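The reduction described above turns correctness requirements into satisfiability queries over the reals. As a minimal sketch of that style of reasoning, not code from the thesis, the following uses the Z3 solver through its Python bindings (the encoding, variable names, and tie-breaking rule are my assumptions) to verify truth-telling for a two-bidder second-price (Vickrey) auction by asking for a profitable misreport and expecting unsatisfiability:

```python
# Hypothetical sketch (not the thesis system): checking truth-telling of a
# two-bidder second-price auction as a quantifier-free query over the reals.
from z3 import Real, Solver, If, unsat

v, b, rival = Real("v"), Real("b"), Real("rival")  # true value, misreport, rival bid

def utility(bid):
    # Highest bid wins (ties go to this bidder) and pays the rival's bid.
    return If(bid >= rival, v - rival, 0)

s = Solver()
s.add(utility(b) > utility(v))  # ask the solver for a profitable misreport
print("truth-telling verified" if s.check() == unsat else s.model())
```

Synthesis inverts this check: the search proposes candidate mechanism programs and keeps one for which every such counterexample query comes back unsatisfiable.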
Item: Control Plane Design and Performance Analysis for Optical Multicast-Capable Datacenter Networks (2014-04-18)
Xia, Yiting; Ng, T. S. Eugene; Jermaine, Christopher M.; Cox, Alan L.

This study presents a control plane design for an optical multicast-capable datacenter network and evaluates the system's performance using simulations. The increasing number of datacenter applications with heavy one-to-many communication has raised the need for an efficient group data delivery solution. We propose a clean-slate architecture that uses optical multicast technology to enable ultra-fast, energy-efficient, low-cost, and highly reliable group data delivery in the datacenter. Since the optical components are agnostic to existing communication protocols, I design novel control mechanisms to coordinate datacenter applications with the optical network. Applications send explicit requests for group data delivery through an API exposed by a centralized controller. Based on the collected traffic demands, the controller computes optical resource allocations using a proposed control algorithm that maximizes utilization of the optical network. Finally, the controller changes the optical network topology according to the computed allocation and sets forwarding rules to route traffic onto the correct data paths. I evaluate the optimality and complexity of the control algorithm with real datacenter traffic. It achieves near-optimal solutions in almost all experiment cases and can finish computation almost instantaneously in a large datacenter setting. I also develop a set of simulators to compare the performance of our system against a number of state-of-the-art group data delivery approaches, such as the non-blocking datacenter architecture, datacenter BitTorrent, and datacenter IP multicast. Extensive simulations using synthetic traffic show that our solution can provide an order-of-magnitude performance improvement. Tradeoffs of our system are analyzed quantitatively as well.

Item: Examining the Use of Homology Models in Predicting Kinase Binding Affinity (2013-12-05)
Chyan, Jeffrey; Kavraki, Lydia E.; Nakhleh, Luay K.; Jermaine, Christopher M.; Moll, Mark

Drug design is a difficult and multi-faceted problem that has led to extensive interdisciplinary work in the field of computational biology. In recent years, several computational methods have emerged. The overall goal of these algorithms is to narrow down the number of leads that will be further considered for laboratory experimentation and clinical studies. Much of current drug design focuses on a family of proteins called kinases because they play a pivotal role in many of the cell signaling pathways of the human body. Drugs need to be designed to bind to specific kinases in the human kinome, inhibiting kinase functions that can cause various diseases such as cancer. It is important for drugs to have high specificity, inhibiting only certain kinases while avoiding undesirable effects on the human body. Computational prediction methods can accomplish this complex task through a comparative analysis of kinase binding sites, in both sequence and structure, to predict binding affinity with potential drugs. However, computational methods depend on existing protein data to make predictions, and structural protein data are scarce relative to known proteins and protein sequences. A potential solution to this lack of information is to use computationally generated structural data called homology models. This thesis introduces a framework for integrating homology models with CCORPS, a semi-supervised learning method that identifies structural features in proteins that correlate with protein function. We discuss the effects of using homology models to supplement existing experimental structural data for kinases when predicting the binding affinity of kinases with various drugs in our experiments. While the work in this thesis focuses on predicting kinase binding affinity, the framework can be generalized, showing the potential of using CCORPS with computationally generated data when experimental data are lacking.
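As a loose, hypothetical sketch of how homology models might be pooled with experimental structures in a purity-based, semi-supervised prediction step in the spirit of CCORPS (this is not the CCORPS implementation; the features, cluster count, and purity threshold are all placeholders):

```python
# Hypothetical sketch: pool experimental and homology-model binding-site
# feature vectors, then predict via label-pure clusters.
import numpy as np
from sklearn.cluster import KMeans

def predict_by_cluster_purity(X_labeled, y, X_unlabeled, k=20, purity=0.9):
    """Cluster all binding-site feature vectors together (experimental
    structures plus homology models), then predict labels for unlabeled
    sites that fall into clusters dominated by a single known label."""
    y = np.asarray(y)
    X = np.vstack([X_labeled, X_unlabeled])
    assign = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    lab, unlab = assign[:len(X_labeled)], assign[len(X_labeled):]
    preds = {}
    for c in np.unique(unlab):
        members = y[lab == c]
        if len(members) == 0:
            continue  # no labeled evidence in this cluster
        vals, counts = np.unique(members, return_counts=True)
        if counts.max() / counts.sum() >= purity:  # label-pure cluster
            for i in np.where(unlab == c)[0]:
                preds[i] = vals[counts.argmax()]
    return preds  # index into X_unlabeled -> predicted affinity label
```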
Item: Learning to Highlight Relevant Text in Binary Classified Documents (2013-12-16)
Kumar, Rahul; Jermaine, Christopher M.; Kavraki, Lydia E.; Nakhleh, Luay K.

Answering questions like "has this person ever been treated for breast cancer?" is critical for the success of tasks like clinical trial design, association analysis, and documentation of mandatory discharge summaries. In this thesis, I argue that traditional machine learning approaches have had limited success in addressing this problem, and I present a better approach to answering these questions. To address the problem, I take a different approach that annotates key textual passages, which are then used in answering the questions. This approach is superior in that it does not involve going through the whole electronic medical record (EMR). This thesis is an attempt to understand how to model such annotations for an EMR; the annotations help answer questions that would otherwise require reading the whole text. I present an efficient inference algorithm for the existing "Word Label Regression" (WLR) model and extend it to extract more accurate key textual passages. The extended version of the algorithm explores how language features such as punctuation can be used to model annotations effectively.

Item: Models and Methods for Evolutionary Histories Involving Hybridization and Incomplete Lineage Sorting (2014-04-09)
Yu, Yun; Nakhleh, Luay K.; Jermaine, Christopher M.; Kohn, Michael H.; Kavraki, Lydia E.

Hybridization plays an important evolutionary role in several groups of organisms. A phylogenetic approach to detecting hybridization entails sequencing multiple loci across the genomes of a group of species of interest, reconstructing their gene trees, and exploiting their differences as a signal of hybridization. However, methods that follow this approach mostly ignore population effects, such as incomplete lineage sorting (ILS). Given that hybridization occurs between closely related organisms, ILS may very well be at play and, hence, must be accounted for in the analysis framework. Methods that account for both hybridization and ILS currently exist for only very limited cases. The contributions of my work are two-fold:

• I devised the first parsimony criterion for the inference of phylogenetic networks (topologies alone) in the presence of ILS, along with new algorithms for the inference.
• I devised the first likelihood criterion for the inference of phylogenetic networks (topologies, branch lengths, and inheritance probabilities) in the presence of ILS, along with new algorithms for the inference.

I have implemented all the algorithms in our open-source, publicly available PhyloNet software package, and studied their performance in extensive simulation studies. Both the parsimony and likelihood approaches show very good performance in identifying the location of hybridization events, as well as in estimating the proportions of genes that underwent hybridization. The parsimony approach also handles the large data sets in our experiments efficiently. Further, I analyzed two biological data sets (one of yeast genomes and another of house mouse genomes) and found support for hybridization in both. My work will allow, for the first time, systematic phylogenomic analyses of data sets where hybridization is suspected; biologists will now be able to revisit existing analyses and conduct new ones with richer evolutionary models and inference methods. Further, the computational techniques presented here can be extended to other reticulate evolutionary events, such as horizontal gene transfer, which are believed to be ubiquitous in bacteria.
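As a hedged sketch of the general shape of such a likelihood criterion (the notation here is mine, not necessarily the thesis's): given gene trees $g_1, \ldots, g_m$ estimated from $m$ loci, maximum-likelihood inference seeks

```latex
\hat{\Psi} = \operatorname*{arg\,max}_{\Psi,\,\lambda,\,\gamma}
  \prod_{i=1}^{m} P(g_i \mid \Psi, \lambda, \gamma)
```

where $\Psi$ is the network topology, $\lambda$ its branch lengths, and $\gamma$ the inheritance probabilities on reticulation edges; each term $P(g_i \mid \cdot)$ sums over the possible coalescent histories of gene tree $g_i$ inside the network, which is how ILS is accounted for alongside hybridization.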
Item: Ordinal Regression over Time Series Data, with an Application to Classifying Patient Vital Sign Quality (2013-06-05)
Myers, Risa B.; Jermaine, Christopher M.; Kavraki, Lydia E.; Nakhleh, Luay K.

This research is concerned with modeling the vital sign data that are present in many electronic health records. Such vital sign data are typically represented as a (possibly multidimensional) time series. My goal is to label these time series with one of a set of ordered quality labels.

Item: Performance Analysis and Configuration Selection for Applications in the Cloud (2015-05-29)
Liu, Ruiqi; Ng, T. S. Eugene; Cox, Alan L.; Jermaine, Christopher M.

Cloud computing is becoming increasingly popular and widely used in both industry and academia. Making the best use of cloud computing resources is critically important. Default resource configurations provided by cloud platforms are often not tailored to applications. Hardware heterogeneity in clouds such as Amazon EC2 leads to wide variation in performance, which provides an avenue for saving cost and improving performance by exploiting that heterogeneity. In this thesis, I conduct exhaustive measurement studies on the Amazon EC2 cloud platform. I characterize the heterogeneity of its resources and analyze the suitability of different resource configurations for various applications. Measurement results show significant performance diversity across resource configurations with different virtual machine sizes and processor types. Diversity in resource capacity is not the only reason for this performance diversity; diagnostic measurements reveal that the cloud provider's scheduling policy is also an important factor. Furthermore, I propose a nearest-neighbor shortlisting algorithm that selects a configuration leading to superior performance for an application by matching the characteristics of the application against those of known benchmark programs. My experimental evaluations show that nearest-neighbor shortlisting greatly reduces testing overhead, since only the shortlisted top configurations, rather than all configurations, need to be tested; the method achieves high accuracy because the target application chooses the configuration for itself via testing. Even without any testing, nearest-neighbor shortlisting obtains a configuration with less than 5% performance loss for 80% of applications.
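A minimal sketch of the shortlisting idea (the data shapes and names are my assumptions, not the thesis implementation): match the application's resource-usage profile to its most similar benchmarks, then test only the configurations that served those benchmarks best.

```python
# Hypothetical sketch of nearest-neighbor configuration shortlisting.
import numpy as np

def shortlist(app_profile, bench_profiles, bench_best_configs, k=3):
    """app_profile: resource-usage feature vector of the target application.
    bench_profiles: (n_benchmarks, n_features) matrix of benchmark profiles.
    bench_best_configs: best-performing configuration name per benchmark.
    Returns candidate configurations to test, instead of testing them all."""
    dists = np.linalg.norm(bench_profiles - app_profile, axis=1)
    neighbors = np.argsort(dists)[:k]          # k most similar benchmarks
    seen, candidates = set(), []
    for i in neighbors:                        # dedupe, keep similarity order
        c = bench_best_configs[i]
        if c not in seen:
            seen.add(c)
            candidates.append(c)
    return candidates  # run the application on these; pick the fastest
```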
Item: Population Regulomics: Applying population genetics to the cis-regulome (2014-02-24)
Ruths, Troy; Nakhleh, Luay K.; Jermaine, Christopher M.; Kavraki, Lydia E.; Kohn, Michael H.

Population genetics provides a mathematical and computational framework for understanding and modeling evolutionary processes, and so it is vital to the investigation of biological systems. In its current state, molecular population genetics is focused exclusively on molecular sequences (DNA, RNA, or amino acid sequences), and all application-ready simulators and analytic measures work only on sequence data. Consequently, in the early 2000s, when technologies became available to sequence entire genomes, population genetic approaches were naturally applied to mine out signatures of selection and conservation, resulting in the subfield of population genomics. Nearly every genome project today applies population genomic techniques to identify functional information and genome structure. Recent technologies have ushered in a similar wave of genetic information, this time focusing on biological mechanisms operating above the genome, most notably gene regulation (regulatory networks). In this work, I develop a molecular population genetics approach for gene regulation, called population regulomics, which includes simulators and analytic measurements that operate on populations of regulatory networks. I conducted extensive data analyses to connect the genome with the cis-regulome, developed computationally efficient simulators, and adapted population genetic measurements defined on sequences to the regulatory network. By connecting genomic information to cis-regulation, we can bring the wealth of knowledge at the genome level to bear on observed patterns at the regulatory level whose evolutionary origin is unknown. I demonstrate that by applying population regulomics to the E. coli cis-regulatory network, we are able for the first time to quantify the evolutionary origins of topological patterns and reveal a surprising amount of neutral signal in the bacterial cis-regulome. Since regulatory networks play a central role in cellular functioning and, consequently, organismal fitness, this new subfield of population regulomics promises to shed the light of evolution on regulatory mechanisms and, more broadly, on the genetic mechanisms underlying various phenotypes.

Item: Reward Scheduling for QoS in Cloud Applications (2012-09-05)
Elnably, Ahmed; Varman, Peter J.; Cavallaro, Joseph R.; Jermaine, Christopher M.

The growing popularity of multi-tenant, cloud-based computing platforms is increasing interest in resource allocation models that permit flexible sharing of the underlying infrastructure. This thesis introduces a novel IO resource allocation model that better captures the requirements of paying tenants sharing a physical infrastructure. The model addresses a major concern regarding application performance stability when clients migrate from a dedicated to a shared platform. Specifically, while clients would like their applications to behave similarly in both situations, traditional models of fairness, like proportional share allocation, do not exhibit this behavior in the context of modern multi-tiered storage architectures. We also present a scheduling algorithm, the Reward Scheduler, that implements the new allocation policy by rewarding clients with better runtime characteristics, resulting in benefits to both the clients and the service provider. Moreover, the Reward Scheduler supports weight-based capacity allocation subject to a minimum reservation and a maximum limit on the IO allocation of each task. Experimental results indicate that the proposed algorithm allocates system capacity in proportion to clients' entitlements.
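The abstract does not spell out the allocation computation, but one standard way to honor weights, reservations, and limits is iterative water-filling; the sketch below is mine, not the Reward Scheduler's published algorithm.

```python
# Hypothetical sketch: weight-proportional allocation with per-client
# minimum reservations and maximum limits (iterative water-filling).
def allocate(capacity, weights, reservations, limits):
    n = len(weights)
    alloc = list(reservations)                  # every client starts at its floor
    remaining = capacity - sum(alloc)
    assert remaining >= 0, "capacity cannot cover all reservations"
    active = set(range(n))
    while remaining > 1e-9 and active:
        total_w = sum(weights[i] for i in active)
        # Tentative proportional top-up for each active client.
        grants = {i: remaining * weights[i] / total_w for i in active}
        capped = {i for i in active if alloc[i] + grants[i] >= limits[i]}
        if not capped:
            for i in active:
                alloc[i] += grants[i]
            remaining = 0
        else:
            # Saturate capped clients at their limits, redistribute the rest.
            for i in capped:
                remaining -= limits[i] - alloc[i]
                alloc[i] = limits[i]
            active -= capped
    return alloc
```

For example, allocate(100, [1, 1, 2], [10, 0, 0], [30, 100, 100]) caps the first client at its limit of 30 and splits the remainder 1:2, yielding roughly [30, 23.3, 46.7].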
Item: Shepherding Distributions for Parallel Markov Chain Monte Carlo (2017-04-07)
Chowdhury, Arkabandhu; Jermaine, Christopher M.

One of the major concerns with Markov Chain Monte Carlo (MCMC) algorithms is that they can take a long time to converge to the desired stationary distribution. In practice, MCMC algorithms may take millions of iterations to converge to the target distribution, requiring a wall-clock time measured in months. This thesis presents a general algorithmic framework for running MCMC algorithms in a parallel/distributed environment that can result in faster burn-in and convergence to the target distribution. Our framework, which we call the method of "shepherding distributions", relies on the introduction of an auxiliary distribution, called a shepherding distribution (SD), used by several MCMC chains running in parallel. These chains collectively explore the sample space, communicating via the shepherding distribution, to reach high-likelihood regions faster. We consider various scenarios where shepherding distributions can be used, including the case where several machines or CPU cores work on the same data in parallel (the so-called transition-parallel application of the framework) and the case where a large data set is itself partitioned across several machines or CPU cores and the chains work on subsets of the data (the so-called data-parallel application of the framework). The latter application is particularly useful for solving "big data" machine learning problems. Experiments under both scenarios illustrate the effectiveness of our shepherding approach to MCMC parallelization.

Item: Statistical Machine Learning for Text Mining with Markov Chain Monte Carlo Inference (2014-04-25)
Drummond, Anna; Jermaine, Christopher M.; Nakhleh, Luay K.; Chaudhuri, Swarat; Allen, Genevera

This work concentrates on mining textual data. In particular, I apply statistical machine learning to document clustering, predictive modeling, and document classification tasks undertaken in three different application domains. I have designed novel statistical Bayesian models for each application domain, as well as derived Markov Chain Monte Carlo (MCMC) algorithms for model inference. First, I investigate the usefulness of topic models, such as the popular Latent Dirichlet Allocation (LDA) and its extensions, as a pre-processing feature selection step for unsupervised document clustering. Documents are clustered using the proportions of the various topics present in each document; the topic proportion vectors are then used as input to an unsupervised clustering algorithm. I analyze two approaches to topic model design in this pre-processing step: (1) a traditional topic model, such as LDA; (2) a novel topic model that integrates a discrete mixture to simultaneously learn the clustering structure and a topic model conducive to the learned structure. I propose two variants of the second approach, one of which is experimentally found to be the best option. Given that clustering is one of the most common data mining tasks, it is a natural application for topic modeling. Second, I focus on automatically evaluating the quality of programming assignments produced by students in a Massive Open Online Course (MOOC), specifically an interactive game programming course where automated test-based grading is not applicable due to the character of the assignments (i.e., interactive computer games). Automatically evaluating interactive computer games is difficult because such programs lack any sort of well-defined logical specification, so it is hard to devise a testing platform that can play a student-coded game to determine whether it is correct. I propose a stochastic model that, given a set of user-defined metrics and graded example programs, can learn, without running the programs and without a grading rubric, to assign scores that are predictive of what a human (i.e., a peer grader) would give to ungraded assignments. The main goal of the third problem I consider is email/document classification. I concentrate on incorporating information about the senders/receivers/authors of a document into a supervised classification problem. I propose a novel vectorized representation for the people associated with a document. People are placed in a latent space of chosen dimensionality and have a set of weights specific to the roles they can play (e.g., in the email case, the categories would be TO, FROM, CC, and BCC). The latent-space positions, together with the weights, map a set of people to a vector by taking a weighted average. In particular, a multi-labeled email classification problem is considered, where an email can be relevant to all, some, or none of the desired categories. I develop three stochastic models that can be used to learn to predict multiple labels, taking correlations into account.
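As a hedged illustration of the pre-processing pipeline in the first task above (a generic scikit-learn sketch, not the thesis's code or its novel models; the corpus loader is hypothetical): documents are reduced to topic-proportion vectors, which then feed an off-the-shelf clustering algorithm.

```python
# Hypothetical sketch: topic proportions as features for document clustering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = load_corpus()  # hypothetical loader returning a list of raw documents

# Bag-of-words counts -> per-document topic proportions -> cluster labels.
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)
theta = LatentDirichletAllocation(n_components=20).fit_transform(counts)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(theta)
```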
Item: Towards Accurate Reconstruction of Phylogenetic Networks (2012-09-05)
Park, HyunJung; Nakhleh, Luay K.; Kohn, Michael H.; Jermaine, Christopher M.

Since Darwin proposed that all species on Earth have evolved from a common ancestor, evolution has played an important role in understanding biology. While the evolutionary relationships/histories of genes are represented using trees, the genomic evolutionary history may not be adequately captured by a tree, as some evolutionary events, such as horizontal gene transfer (HGT), do not fit within the branches of a tree. In this case, phylogenetic networks are more appropriate for modeling evolutionary histories. In this dissertation, we present computational algorithms to reconstruct phylogenetic networks from different types of data. Under the assumption that species have single copies of genes, and that HGT and speciation are the only events through the course of evolution, gene sequences can be sampled one copy per species for HGT detection. Given the alignments of the sequences, we propose systematic methods that estimate the significance of detected HGT events under maximum parsimony (MP) and maximum likelihood (ML). The estimated significance addresses the tendency of both optimization criteria to overestimate the number of HGT events during the search for phylogenetic networks, and helps the search identify networks with the "right" number of HGT edges. We study the methods' performance on both synthetic and biological data sets. While the studies show very promising results in identifying HGT edges, they also highlight the issues that remain challenging for each criterion. We also develop algorithms that estimate the amount of HGT and reconstruct phylogenetic networks from a collection of trees by utilizing the pairwise Subtree Prune and Regraft (SPR) operation. These methods generally produce good results, quickly estimating the minimum number of HGT events required to reconcile a set of trees. Further, we identify conditions under which the methods do not work well, to guide the development of new methods in this area. Finally, we extend the assumed model of genetic evolution to allow for duplication and loss. Under this assumption, we analyze gene family trees of proteobacterial strains using a parsimony-based approach to detect evolutionary events. We also discuss the current issues of parsimony-based approaches in biological data analysis and propose a way to retrieve significant estimates. The evolutionary history of species is complex, with various evolutionary events; as HGT contributes largely to this complexity, accurately identifying HGT will help untangle evolutionary histories and answer important questions. Since our algorithms identify significant HGT events in the data and reconstruct accurate phylogenetic networks from them, they can be used to address questions arising in large-scale biological data analyses.
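One common way to score a candidate network under parsimony, shown here as a hedged sketch of the general shape rather than the dissertation's exact criterion, lets each alignment site pick its best tree among those the network displays:

```latex
PS(N, A) \;=\; \sum_{i=1}^{|A|} \; \min_{T \,\in\, \mathcal{D}(N)} PS(T, a_i)
```

Here $\mathcal{D}(N)$ is the set of trees displayed by network $N$ and $a_i$ is the $i$-th site of alignment $A$. Adding an HGT edge enlarges $\mathcal{D}(N)$ and so can never increase the score, which is exactly why an unchecked MP (and, analogously, ML) search overestimates the number of HGT events and why significance estimates are needed.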
Item: Virtual Machine Live Migration in Cloud Computing (2013-11-01)
Zheng, Jie; Ng, T. S. Eugene; Cox, Alan L.; Jermaine, Christopher M.; Knightly, Edward W.; Sripanidkulchai, Kunwadee

Hybrid cloud computing, where private and public cloud resources are combined and applications can migrate freely, ushers in unprecedented flexibility for businesses. To unleash these benefits, commercial products already enable the live migration of full virtual machines between distant cloud datacenters. Unfortunately, two problems exist. First, no live migration progress management system exists, leading to (1) guesswork over how long a migration might take, and the inability to schedule dependent tasks accordingly; (2) unacceptable application degradation, since application components could remain split across distant cloud datacenters for an arbitrary period during migration; and (3) an inability to balance application performance against migration time, e.g., finishing a migration later in exchange for less performance interference. Second, multi-tier application architectures are widely employed in today's virtualized cloud computing environments. Although existing solutions can migrate a single VM efficiently, little attention has been devoted to migrating the related VMs of a multi-tier application, and ignoring the relatedness of VMs during migration can lead to serious application performance degradation. In this thesis, we design the first migration progress management system, called Pacer. Pacer's techniques are based on robust and lightweight run-time measurements of system and workload characteristics, efficient and accurate analytic models for progress prediction, and online adaptation that maintains user-defined migration objectives for coordinated and timely migrations. We also formulate the multi-tier application migration problem and present a new communication-cost-driven coordinated approach, along with a system called COMMA that realizes it. We show experimentally that using COMMA to migrate a 3-tier application reduces the amount of inter-component communication impacted by migration by up to 475 times compared to naive parallel migration.
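Progress prediction of the kind Pacer performs rests on models of iterative pre-copy migration. A minimal sketch of such a model (my simplification, not Pacer's actual predictor): each round retransmits the pages dirtied during the previous round, so the residual shrinks geometrically whenever the dirty rate stays below the available bandwidth.

```python
# Hypothetical sketch: predicting pre-copy live-migration time.
def predict_migration_time(mem_bytes, dirty_rate, bandwidth, stop_copy_bytes):
    """Iterative pre-copy model: round i sends what was dirtied during round
    i-1; migration ends when the residual fits the stop-and-copy threshold."""
    assert dirty_rate < bandwidth, "migration never converges otherwise"
    total_time, to_send = 0.0, float(mem_bytes)
    while to_send > stop_copy_bytes:
        round_time = to_send / bandwidth       # time to push current residual
        total_time += round_time
        to_send = dirty_rate * round_time      # pages dirtied meanwhile
    return total_time + to_send / bandwidth    # final stop-and-copy round

# e.g., an 8 GiB VM, 200 MB/s dirty rate, 1 GB/s link, 100 MB downtime budget:
# predict_migration_time(8 * 2**30, 200e6, 1e9, 100e6)
```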