Browsing by Author "Jermaine, Christopher M"
Now showing 5 of 5 results
Item: Data Mining of Chinese Social Media (2014-10-31)
Authors: Shu, Anhei; Wallach, Daniel S.; Jermaine, Christopher M; Bronk, Chris

We present measurements and analysis of censorship on Weibo, a popular microblogging site in China. Since we were limited in the rate at which we could download posts, we identified users likely to participate in sensitive topics and recursively followed their social contacts, biasing our search toward a subset of Weibo where we hoped to be more likely to observe censorship. Our architecture enables us to detect post deletions within one minute of the deletion event, giving us a high-fidelity view of what is being deleted by the censors and when. We found that deletions happen most heavily in the first hour after a post has been submitted. Focusing on original posts, not reposts/retweets, we observed that nearly 30% of the total deletion events occur within 5-30 minutes. Nearly 90% of the deletions happen within the first 24 hours. Leveraging our data, we also consider a variety of hypotheses about the mechanisms used by Weibo for censorship, such as the extent to which they use retrospective keyword-based censorship, and how repost/retweet popularity interacts with censorship. By leveraging natural language processing techniques we also perform a topical analysis of the deleted posts, overcoming the usage of neologisms, named entities, and informal language that typifies Chinese social media. Using Independent Component Analysis, we find that the topics where mass removal happens the fastest are those that combine events that are hot topics in Weibo as a whole (e.g., the Beijing rainstorms or a sex scandal) with themes common to sensitive posts (e.g., Beijing, government, China, and policeman).

Air pollution is a pressing concern for industrialized countries. Air quality measurements and their interpretations often take on political overtones.
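The deletion-timing measurement above (the fraction of deletions falling within 5-30 minutes, the first hour, and the first day) amounts to bucketing deletion delays against a set of cutoffs. A minimal sketch follows; the function name and the toy events are illustrative stand-ins, not the paper's crawler data:

```python
from datetime import datetime, timedelta

def deletion_delay_fractions(events, cutoffs_minutes=(5, 30, 60, 1440)):
    """Bucket post deletions by the delay between posting and deletion.

    `events` is a list of (posted_at, deleted_at) datetime pairs -- a
    hypothetical stand-in for observed deletion events.  Returns, for each
    cutoff in minutes, the fraction of deletions occurring within it.
    """
    delays = [(deleted - posted).total_seconds() / 60.0
              for posted, deleted in events]
    total = len(delays)
    return {m: sum(1 for d in delays if d <= m) / total
            for m in cutoffs_minutes}

t0 = datetime(2012, 7, 21, 12, 0)
events = [
    (t0, t0 + timedelta(minutes=4)),    # deleted within 5 minutes
    (t0, t0 + timedelta(minutes=20)),   # within 30 minutes
    (t0, t0 + timedelta(hours=3)),      # within the first day
    (t0, t0 + timedelta(days=2)),       # survived past 24 hours
]
fractions = deletion_delay_fractions(events)
```

With the toy events above, a quarter of the deletions fall within 5 minutes and three quarters within 24 hours, mirroring the shape (though not the values) of the distribution the paper reports.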
Similar concerns affect our understanding of what levels of measured pollution correspond to different levels of human nuisance, impairment, or injury. In this paper, we consider air pollution metrics from four large Chinese cities (U.S. embassy/consulate data, and Chinese domestic measurements) and compare them to a large volume of discussions on Weibo (a popular Chinese microblogging system). In the city with the worst PM2.5, Beijing, we found a strong correlation (R=0.82) between Chinese use of pollution-related terms and the ambient pollution. In other Chinese cities with lower pollution, the correlation was weaker. Nonetheless, our results show that social media may be a valuable proxy measurement for pollution, which may be quite valuable when traditional measurement stations are unavailable (or their output is censored or misreported).

Item: Leaky Buffer: A Novel Abstraction for Relieving Memory Pressure from Cluster Data Processing Frameworks (2015-07-13)
Authors: Liu, Zhaolei; Ng, T. S. Eugene; Cox, Alan L; Jermaine, Christopher M

The shift to the in-memory data processing paradigm has had a major influence on the development of cluster data processing frameworks. Numerous frameworks from industry, the open source community, and academia are adopting the in-memory paradigm to achieve functionality and performance breakthroughs. However, despite the advantages of these in-memory frameworks, in practice they are susceptible to memory-pressure-related performance collapse and failures. The contributions of this thesis are two-fold. First, we conduct a detailed diagnosis of the memory pressure problem and identify three preconditions for the performance collapse. These preconditions not only explain the problem but also shed light on possible solution strategies. Second, we propose a novel programming abstraction called the leaky buffer that eliminates one of the preconditions, thereby addressing the underlying problem.
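The leaky-buffer idea can be sketched as a buffer that accumulates records only up to a fixed in-memory budget and "leaks" overflow to a spill sink instead of growing the heap. This is a minimal illustration under assumed names and a simple count-based spill policy, not the thesis's Spark implementation:

```python
class LeakyBuffer:
    """Sketch of a leaky buffer: records accumulate in memory up to a
    fixed capacity; once the budget is reached, the batch "leaks" to a
    spill sink (standing in for disk) and memory is reclaimed."""

    def __init__(self, capacity, spill_sink):
        self.capacity = capacity
        self.spill_sink = spill_sink  # callable receiving leaked batches
        self.records = []

    def add(self, record):
        self.records.append(record)
        if len(self.records) >= self.capacity:
            self.leak()

    def leak(self):
        """Flush all buffered records to the sink, freeing memory."""
        if self.records:
            self.spill_sink(list(self.records))
            self.records.clear()

spilled = []
buf = LeakyBuffer(capacity=3, spill_sink=spilled.append)
for r in range(7):
    buf.add(r)
buf.leak()  # flush the remaining tail
```

The key property is that in-memory state never exceeds `capacity` records, bounding the heap footprint regardless of input size.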
We have implemented the leaky buffer abstraction in Spark for two distinct use cases. Experiments on a range of memory-intensive aggregation operations show that the leaky buffer abstraction can drastically reduce the occurrence of memory-related failures, improve performance by up to 507%, and reduce memory usage by up to 87.5%.

Item: Predicting Liver Segmentation Model Failure with Feature-Based Out-of-Distribution Detection and Generative Adversarial Networks (2024-08-07)
Authors: Woodland, McKell; Patel, Ankit B; Jermaine, Christopher M; Brock, Kristy K

Advanced liver cancer is often treated with radiotherapy, which requires precise liver segmentation. Deep learning models excel at segmentation but struggle on unseen data, a problem exacerbated by the difficulty of amassing large datasets in medical imaging. Clinicians manually correct these errors, but as models improve, the risk of clinicians overlooking mistakes due to automation bias increases. To ensure quality care for all patients, this thesis aims to offer automated, scalable, and interpretable solutions for detecting liver segmentation model failures. My first approach prioritized performance and scalability. It applied the Mahalanobis distance (MD) to the features of four Swin UNETR and nnU-Net liver segmentation models. I proposed reducing the dimensionality of these features with either principal component analysis (PCA) or uniform manifold approximation and projection (UMAP), resulting in improved performance and efficiency. Additionally, I proposed the k-th nearest neighbors distance (KNN) as a non-parametric alternative to the MD for medical imaging. KNN drastically improved scalability and performance on raw and average-pooled bottleneck features. My second approach emphasized interpretability by introducing generative modeling for the localization of novel information that a model will fail on. It employed a StyleGAN2 network to model a distribution of 3,234 abdominal computed tomography exams (CTs).
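The k-th nearest neighbors distance described in the abstract above scores a test sample by its distance to the k-th closest training sample: far-away samples are flagged as likely out-of-distribution. A dependency-free sketch follows, using hypothetical 2-D points where the thesis uses segmentation-network bottleneck features:

```python
import math

def knn_ood_scores(train_feats, test_feats, k=2):
    """k-th nearest neighbor distance as a non-parametric OOD score.

    For each test feature vector, compute its Euclidean distance to every
    training feature and return the k-th smallest; large values indicate
    the sample lies far from the training distribution.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    scores = []
    for t in test_feats:
        ds = sorted(dist(t, tr) for tr in train_feats)
        scores.append(ds[k - 1])  # distance to the k-th nearest neighbor
    return scores

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
test = [(0.5, 0.5), (5.0, 5.0)]  # in-distribution vs. clearly OOD
scores = knn_ood_scores(train, test, k=2)
```

Thresholding these scores (the threshold choice is an assumption outside this sketch) turns them into a failure-prediction flag.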
It then localized metal artifacts and abnormal fluid buildup, two prevalent causes of liver segmentation model failure, in 55 CTs by reconstructing the scans with backpropagation on the StyleGAN's input space and focusing on the regions with the highest reconstruction errors. The computational cost, data requirements, and training complexity of generative adversarial networks, along with a lack of reliable evaluation measures, have impeded their application to medical imaging. Accordingly, a significant portion of this thesis is dedicated to evaluating the applications of StyleGAN2 and the Fréchet Inception Distance (FID), a common measure of synthetic image quality, to medical imaging. The principal contributions of this thesis are integrating PCA and UMAP with MD, utilizing KNN for out-of-distribution detection in medical imaging, leveraging generative modeling to localize novel information at inference, providing a comprehensive application study of StyleGAN2 to medical imaging, and challenging prevailing assumptions about the FID in medical imaging.

Item: Query Processing and Optimization for Database Stochastic Analytics (2014-12-03)
Authors: Perez, Luis Leopoldo; Jermaine, Christopher M; Ng, T.S. Eugene; Varman, Peter J

The application of relational database systems to analytical processing has been an active area of research for about two decades, motivated by constant surges in the scale of the data and in the complexity of the analysis tasks. Simultaneously, stochastic techniques have become commonplace in large-scale data analytics. This work is concerned with the application of relational database systems to support stochastic analytical tasks, particularly with the query evaluation and optimization phases. In this work, three problems are addressed in the context of MCDB/SimSQL, a relational database system for uncertain data management and analytics.
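The FID mentioned in the abstract above is the Fréchet distance between two Gaussians fitted to image features: d² = |μ₁−μ₂|² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). The full form needs a matrix square root; restricting to diagonal covariances (an assumption of this sketch, not of the thesis) gives a closed form:

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances.

    With diagonal covariances the trace term Tr(C1 + C2 - 2*(C1*C2)^(1/2))
    reduces per-dimension to (sqrt(v1) - sqrt(v2))^2, so:
        d^2 = sum (mu1 - mu2)^2  +  sum (sqrt(v1) - sqrt(v2))^2
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(v1) - math.sqrt(v2)) ** 2
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

# Identical distributions -> distance 0; a shifted mean -> positive distance.
d_same = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
d_shift = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

In practice the Gaussians are fitted to Inception-network activations of real and synthetic images, which is where questions about the FID's suitability for medical imaging arise.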
The first contribution is a set of efficient techniques for evaluating queries that require satisfying a probability threshold, such as "Which pending orders are estimated to be processed and shipped by the end of the month, with a probability of at least 95%?", where the processing and shipment times of each order are generated by an arbitrary stochastic process. Results show that these techniques make sensible use of resources, weeding out data elements that require relatively few samples during the early stages of query evaluation. The second problem is concerned with recycling the materialized intermediate results of a query to optimize other queries in the future. Under the assumption that a history of past queries provides an accurate picture of the workload, I describe techniques for query optimization that evaluate the costs and benefits of materializing intermediate results, with the objective of minimizing the hypothetical costs of future queries, subject to constraints on disk space. Results show a substantial improvement over conventional query caching techniques in workload and average query execution time. Finally, this work addresses the problem of evaluating queries for stochastic generative models, specified in a high-level notation that treats random variables as first-class objects and allows operations with structured objects such as vectors and matrices. I describe a notation that, relying on the syntax of comprehensions, provides a language for denoting generative models that guarantees correspondence with relational algebra expressions, and techniques for translating a model into a database schema and a set of relational queries.

Item: Synthesis of Patient Data to Predict Outcomes (2016-04-21)
Authors: Myers, Risa B; Jermaine, Christopher M

Healthcare data is increasingly collected and stored in electronic format, providing access to previously untapped information. At the same time, healthcare costs continue to escalate.
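The probability-threshold queries in the stochastic analytics abstract above can be sketched as Monte Carlo estimation with early pruning: sample the stochastic process repeatedly, and discard a tuple as soon as even an all-successes remainder could not reach the threshold. The sampler and pruning rule below are illustrative assumptions, not the thesis's algorithm:

```python
import random

def satisfies_threshold(sampler, threshold=0.95, n_samples=200, seed=0):
    """Monte Carlo check of a probability-threshold predicate.

    `sampler(rng)` returns True/False for one draw of the stochastic
    process (e.g. "this order ships by the deadline").  We estimate the
    success probability from n_samples draws, pruning early once the
    threshold is provably unreachable.
    """
    rng = random.Random(seed)
    hits = 0
    for i in range(n_samples):
        hits += sampler(rng)
        remaining = n_samples - (i + 1)
        if (hits + remaining) / n_samples < threshold:
            return False  # even all successes from here cannot reach it
    return hits / n_samples >= threshold

# An order that ships on time with probability ~0.99 vs. one at ~0.50.
likely = satisfies_threshold(lambda r: r.random() < 0.99)
unlikely = satisfies_threshold(lambda r: r.random() < 0.50)
```

The pruning test is what lets low-probability tuples be weeded out after relatively few samples, as the abstract describes.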
Predicting important outcomes such as 30-day mortality or complications from healthcare data, including patient monitoring data and perioperative clinical information, may provide advance warning of issues or identification of non-ideal care. This has the potential to lead to improved outcomes or reduced cost. In this research I describe statistical machine learning models that predict outcomes from clinical data. In particular, I focus on data with a temporal component. First, I describe an autoregressive-ordinal regression (AR-OR) model that reduces time series data to a small set of representative numbers, based on time spent in the states of a hidden Markov model. The AR-OR model is a generative model using Bayesian techniques. This model is used to mimic expert anesthesiologist assessment of surgical vital signs. I correlate the quality labels with key 30-day outcomes and demonstrate a high correlation of poor surgical vital sign quality with increased post-operative complications. Next, I describe enhancements to the AR-OR model that enable it to predict short-term outcomes from short-duration time series. These improvements adaptively weight values in the time series by recency and allow for concurrent, independent series. I validate these additions by predicting elevated intracranial pressure crises and periods of depressed brain tissue oxygen in traumatic brain injury patients. After this, I focus on perioperative clinical data and describe the Cumulative Perioperative Model. This model illustrates how including time-dependent patient data, such as initial hospital location and post-operative surgical destination, improves the ability to predict 30-day mortality and identify patients at risk. I implement this model using Markov random fields, conditional random fields, and logistic regression. All of these models and approaches demonstrate the ability to predict key outcomes from temporal healthcare data with a high level of accuracy.
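The reduction the AR-OR model performs, summarizing a time series by time spent in each hidden state, can be sketched as a state-occupancy feature extractor. Here the state labels are given directly; in the thesis they would come from HMM inference, which this sketch assumes away:

```python
def state_occupancy_features(state_sequence, n_states):
    """Reduce a time series to the fraction of time spent in each state.

    `state_sequence` is a list of integer state labels over time (a
    hypothetical stand-in for HMM-inferred states of a vital-sign trace);
    the returned vector is the compact summary fed to a downstream
    outcome predictor.
    """
    counts = [0] * n_states
    for s in state_sequence:
        counts[s] += 1
    total = len(state_sequence)
    return [c / total for c in counts]

# A vitals trace mostly in a "stable" state 0, briefly in "unstable" state 2.
features = state_occupancy_features([0, 0, 0, 2, 2, 0, 1, 0], n_states=3)
```

A vector like this, rather than the raw series, is what a classifier (e.g. logistic regression, as in the Cumulative Perioperative Model) would consume to predict 30-day outcomes.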