Browsing by Author "Hu, Xia"
Now showing 1 - 10 of 10
Item: Association of education attainment, smoking status, and alcohol use disorder with dementia risk in older adults: a longitudinal observational study (Springer Nature, 2024)
Tang, Huilin; Shaaban, C. Elizabeth; DeKosky, Steven T.; Smith, Glenn E.; Hu, Xia; Jaffee, Michael; Salloum, Ramzi G.; Bian, Jiang; Guo, Jingchuan
Previous research on the risk of dementia associated with education attainment, smoking status, and alcohol use disorder (AUD) has yielded inconsistent results, indicating potential heterogeneous treatment effects (HTEs) of these factors on dementia risk. This study therefore aimed to identify the important variables that may contribute to HTEs of these factors in older adults.

Item: Auto-GNN: Neural architecture search of graph neural networks (Frontiers Media S.A., 2022)
Zhou, Kaixiong; Huang, Xiao; Song, Qingquan; Chen, Rui; Hu, Xia; DATA Lab
Graph neural networks (GNNs) have been widely used in various graph analysis tasks. Because graph characteristics vary significantly across real-world systems, the architecture parameters must be tuned carefully for a given scenario to identify a suitable GNN. Neural architecture search (NAS) has shown its potential in discovering effective architectures for learning tasks in image and language modeling. However, existing NAS algorithms cannot be applied efficiently to the GNN search problem for two reasons. First, the large-step exploration in the traditional controller fails to learn the sensitive performance variations caused by slight architecture modifications in GNNs. Second, the search space is composed of heterogeneous GNNs, which prevents the direct adoption of parameter sharing among them to accelerate the search. To tackle these challenges, we propose an automated graph neural network (AGNN) framework, which aims to find the optimal GNN architecture efficiently. Specifically, a reinforced conservative controller is designed to explore the architecture space with small steps. To accelerate validation, a novel constrained parameter-sharing strategy is presented to regularize weight transfer among GNNs; it avoids training from scratch and saves computation time. Experimental results on benchmark datasets demonstrate that the architecture identified by AGNN achieves the best performance and search efficiency compared with existing human-invented models and traditional search methods.

Item: Backdoor in AI: Algorithms, Attacks, and Defenses (2024-08-05)
Tang, Ruixiang; Hu, Xia
As deep neural network (DNN) models become increasingly integral to critical domains such as healthcare, finance, and autonomous systems, ensuring their safety and reliability is of utmost importance. Among the various threats to these systems, backdoor attacks pose a particularly insidious challenge. These attacks compromise the model by embedding a hidden backdoor function, which can be triggered by specific inputs to manipulate the model's behavior. My research first explores the potential backdoor attack surface within the deep learning pipeline; with a more comprehensive understanding of the backdoor attack mechanism, we can then develop advanced defense algorithms. First, to explore a new backdoor attack surface, we propose a training-free backdoor attack that differs from the traditional insertion method, in which backdoor behaviors are injected by training the model on a poisoned dataset.
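As background, the sketch below illustrates this traditional poisoned-dataset injection, in which a fraction of the training images is stamped with a preset trigger and relabeled to a target class. The image shapes, patch trigger, and poison rate are assumptions chosen only for the example; this is not the training-free attack proposed in the dissertation.

```python
import numpy as np

def poison_dataset(images, labels, target_label=0, poison_rate=0.05, seed=0):
    """Stamp a small trigger patch onto a fraction of images and relabel them."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = 1.0          # 3x3 white corner patch acts as the preset trigger
    labels[idx] = target_label           # flipped labels teach the trigger-to-target mapping
    return images, labels

# Toy usage: 100 random 28x28 "images" with 10 classes.
x = np.random.rand(100, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=100)
x_poisoned, y_poisoned = poison_dataset(x, y)
```

A model trained on such a poisoned set behaves normally on clean inputs but predicts the target label whenever the trigger patch appears.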
Specifically, the proposed attack embeds the backdoor into the target model by inserting a tiny malicious module, TrojanNet. The infected model misclassifies inputs into a target label when the inputs are stamped with preset triggers. The proposed TrojanNet has several new properties: (1) it is model-agnostic and can be injected into most DNNs, dramatically expanding its attack scenarios, and (2) its training-free mechanism saves massive training effort. Second, to defend against backdoor attacks, we propose a honeypot defense method. Our objective is a backdoor-resistant tuning procedure that yields a backdoor-free model regardless of whether the fine-tuning dataset contains poisoned samples. To this end, we propose and integrate a honeypot module into the original DNN, specifically designed to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in DNNs carry sufficient backdoor features while carrying minimal information about the original task. Consequently, we can penalize the information acquired by the honeypot module to inhibit backdoor creation during fine-tuning of the stem network. Comprehensive experiments on benchmark datasets substantiate the effectiveness and robustness of our defensive strategy. Third, we explore leveraging backdoors for socially beneficial applications, demonstrating that backdoors can serve as watermarks to protect valuable assets in the deep learning pipeline, including data, models, and APIs. To monitor the unauthorized use of datasets, we introduce a clean-label backdoor watermarking framework; our findings indicate that incorporating just 1% of watermarking samples is sufficient to embed a traceable backdoor function into unauthorized models. To counteract model theft or unauthorized redistribution, we introduce a novel product-key-based security layer for deep learning models, which restricts access to the model's functionality until a verified key is entered.

Item: Counterfactuals for Interpretable Machine Learning: Model Reasoning from "What" to "How" (2023-05-23)
Yang, Fan; Hu, Xia
With the extensive use of machine learning (ML) in real-world applications, effectively explaining the behavior of ML models is becoming increasingly important. Many interpretation techniques have been proposed to help end users better understand the model's working mechanism. Existing techniques for interpretable machine learning mainly focus on feature attribution methods, in which highly contributing features are reported as evidence for model predictions. However, these feature contribution scores are not discriminative in nature, which limits their usefulness for reasoning about decisions and understanding "how". Counterfactual explanation, one of the emerging types of ML interpretation, has attracted attention from both researchers and practitioners in recent years. A counterfactual explanation is essentially a set of hypothetical data samples, categorized under example-based reasoning and explored under "what-if" circumstances. The overall goal of counterfactual interpretation is to indicate how the model decision changes with input perturbations.
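As a toy illustration of this "what-if" idea, the sketch below searches for a counterfactual of a single instance under an assumed logistic model by jointly minimizing the prediction gap to a desired outcome and the distance to the original input. It is illustrative only, not the Attribute-Informed Perturbation or model-based synthesizer frameworks described in the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def counterfactual(x, w, b, target=1.0, lam=0.1, lr=0.5, steps=200):
    """Gradient descent on (f(x_cf) - target)^2 + lam * ||x_cf - x||^2."""
    x_cf = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ x_cf + b)
        grad = 2 * (p - target) * p * (1 - p) * w + 2 * lam * (x_cf - x)
        x_cf -= lr * grad
    return x_cf

w, b = np.array([1.5, -2.0, 0.5]), -0.2   # assumed toy model weights
x = np.array([0.2, 0.8, 0.1])             # original instance, predicted negative
x_cf = counterfactual(x, w, b)            # perturbed instance pushed toward the positive class
print(sigmoid(w @ x + b), sigmoid(w @ x_cf + b))
```

The distance penalty `lam` keeps the counterfactual close to the original instance, which is what makes the suggested change actionable.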
With valid counterfactual explanations, end users can learn how to flip a model decision to a preferred outcome and thereby get a better sense of the decision boundaries. In this thesis, I cover my research on counterfactual explanations from three perspectives. First, for counterfactual derivation, I designed a framework that generates counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation. By utilizing generative models conditioned on different attributes, counterfactuals with desired labels can be obtained effectively. Instead of directly modifying instances in the data space, I iteratively optimize in the constructed attribute-informed latent space, where features are more robust and semantic. Second, for counterfactual explainer deployment, I proposed a Model-based Counterfactual Synthesizer framework for efficient interpretation. I analyzed the model-based counterfactual process and constructed a base synthesizer by adopting the conditional generative adversarial network structure. To better approximate the counterfactual universe for minor queries, I employed the umbrella sampling technique to conduct the synthesizer training. I also enhanced the synthesizer by incorporating the causal dependence among attributes and further validated its correctness through a causality identification approach. Third, for counterfactual delivery to stakeholders, I proposed a novel framework that generates differentially private counterfactuals, in which noise is injected for protection while the explanation role is maintained. I trained an autoencoder with the functional mechanism to construct noisy class prototypes and then derived the counterfactual explanation from the latent prototypes, relying on the post-processing immunity of differential privacy. Beyond general stakeholders, I also proposed two explanation delivery frameworks specifically for end users and model developers. Further research focuses on sequential counterfactuals, which are more actionable for end users, and global counterfactuals, which are more insightful for model developers. At the end of the thesis, I list several promising directions for future exploration.

Item: Efficient Methods for Deep Reinforcement Learning: Algorithms and Applications (2023-03-14)
Zha, Daochen; Hu, Xia
Deep reinforcement learning (deep RL) has recently achieved remarkable success in various domains, from simulated games to real-world applications. However, deep RL agents are notoriously sample-inefficient; they often need to collect a large number of samples from the environment to reach reasonable performance. This sample efficiency issue becomes more pronounced in sparse reward environments, where rewards are zero in most states, so the agents can barely learn. Unfortunately, collecting samples can be extremely expensive in many real-world applications; we may only be able to collect a very limited number of samples for training. The sample efficiency issue significantly hinders the application of deep RL in the real world. To bridge this gap, this thesis makes several contributions to efficient deep RL. First, we propose a learning-based experience replay algorithm that improves sample efficiency through better sample reuse. Second, we present an episode-level exploration strategy for efficient exploration in sparse reward environments.
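To make the idea of non-uniform sample reuse concrete, the sketch below shows a toy replay buffer that samples transitions in proportion to a priority score. The hand-set priorities are an assumption for illustration; the thesis's learning-based replay instead learns which experiences to replay.

```python
import numpy as np

class WeightedReplayBuffer:
    """Replay buffer that reuses some transitions more often than others."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.storage, self.priorities = [], []

    def add(self, transition, priority=1.0):
        if len(self.storage) >= self.capacity:   # evict the oldest transition
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size=32):
        p = np.asarray(self.priorities, dtype=np.float64)
        p /= p.sum()                              # sampling probability proportional to priority
        idx = np.random.choice(len(self.storage), size=batch_size, p=p)
        return [self.storage[i] for i in idx]

buffer = WeightedReplayBuffer()
for t in range(100):
    # (state, action, reward, next_state) placeholders; priority mimics an importance score
    buffer.add((t, 0, 0.0, t + 1), priority=np.random.rand())
batch = buffer.sample(8)
```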
Third, we investigate a real-world application, embedding table sharding, and design an efficient training algorithm based on an estimated environment. Finally, we devise a more general framework that leverages pre-trained models to improve efficiency and apply it to embedding table sharding. Putting all of this together, our research could help build more efficient deep RL systems and facilitate their real-world deployment.

Item: Exploring the Relation between Contextual Social Determinants of Health and COVID-19 Occurrence and Hospitalization (MDPI, 2024)
Chen, Aokun; Zhao, Yunpeng; Zheng, Yi; Hu, Hui; Hu, Xia; Fishe, Jennifer N.; Hogan, William R.; Shenkman, Elizabeth A.; Guo, Yi; Bian, Jiang
It is prudent to take a unified approach to exploring how contextual social determinants of health (SDoH) relate to COVID-19 occurrence and outcomes. Poor geographic representation of data and the small number of contextual SDoH examined in most previous studies have left a knowledge gap in the relationship between contextual SDoH and COVID-19 outcomes. In this study, we linked 199 contextual SDoH factors covering 11 domains of social and built environments with electronic health records (EHRs) from a large clinical research network (CRN) in the National Patient-Centered Clinical Research Network (PCORnet) to explore the relation between contextual SDoH and COVID-19 occurrence and hospitalization. We identified 15,890 COVID-19 patients and 63,560 matched non-COVID-19 patients in Florida between January 2020 and May 2021. We adopted a two-phase multiple linear regression approach modified from that used in exposome-wide association (ExWAS) studies. After removing highly correlated SDoH variables, 86 contextual SDoH variables were included in the analysis. Adjusting for race, ethnicity, and comorbidities, we found six contextual SDoH variables (i.e., hospital available beds and utilization, percent of vacant property, number of golf courses, and percent of minority) related to the occurrence of COVID-19, and three variables (i.e., farmers market, low access, and religion) related to hospitalization for COVID-19. To the best of our knowledge, this is the first study to explore the relationship between contextual SDoH and COVID-19 occurrence and hospitalization using EHRs in a major PCORnet CRN. As an exploratory study, the causal effect of SDoH on COVID-19 outcomes will be evaluated in future work.

Item: Lossy Computation For Large-Scale Machine Learning (2024-08-05)
Liu, Zirui; Hu, Xia
In recent years, machine learning (ML), particularly deep learning, has made significant strides in areas like image recognition and language processing. It has been shown that more parameters and more data can greatly boost ML model performance. However, the growth in model and data size is outpacing hardware capabilities, leading to a gap between ML needs and hardware development. My research aims to create scalable ML algorithms and systems that meet current and future ML demands, exploring methods like randomized and low-precision computation to handle larger data and model sizes without changing the hardware. First, for large datasets in which the data are interconnected, such as molecular structures or social networks, graph neural networks (GNNs) have recently emerged as one of the de facto standard tools for analyzing graph data.
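For readers unfamiliar with the message-passing computation the next paragraph refers to, here is a minimal, framework-free sketch of one mean-aggregation step; the toy graph, features, and weight matrix are made up for illustration.

```python
import numpy as np

def message_passing_step(features, neighbors, weight):
    """new_h[v] = ReLU(W @ mean(h[u] for u in {v} + N(v)))"""
    new_features = np.zeros((features.shape[0], weight.shape[0]), dtype=features.dtype)
    for v, nbrs in neighbors.items():
        agg = features[[v] + nbrs].mean(axis=0)   # aggregate self + neighbor features
        new_features[v] = np.maximum(weight @ agg, 0.0)
    return new_features

h = np.random.rand(4, 8).astype(np.float32)       # 4 nodes, 8-dim features
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}      # toy graph as adjacency lists
W = np.random.rand(8, 8).astype(np.float32)
h_next = message_passing_step(h, adj, W)          # one round of neighborhood aggregation
```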
Leveraging the message-passing mechanism, GNNs learn the representation of each node by iteratively aggregating information from its neighbors to capture graph structure and relationships. However, a key challenge in graph representation learning is scalability: real-world graphs may contain billions of nodes, resulting in significant memory and speed inefficiencies when training GNNs on huge graphs. To address memory and time inefficiency in large-scale graph learning, we introduce two lossy computation paradigms: a memory-efficient framework for training GNNs with significantly compressed activations, and a time-efficient GNN training method based on degree-based graph sparsification. Second, regarding the challenge of handling large models: as model size grows, large language models (LLMs) have exhibited human-like conversational ability. This advance opens the door to a wave of new applications, such as custom AI agents. Achieving this involves two essential steps: fine-tuning and serving. Fine-tuning is the process of adapting the LLM to a specific task, such as understanding and responding to domain-specific inquiries. The second step, serving, generates responses to questions in real time. However, both steps are hard and expensive due to the large model scale, limiting their accessibility for most users. To improve the efficiency of fine-tuning and serving LLMs, we again employ lossy computation approaches. Our first method enhances memory efficiency in LLM fine-tuning through randomized matrix multiplication. Our second approach introduces a prompt tuning framework that optimizes the accuracy-efficiency trade-off for compressed LLMs. Lastly, we implement an extremely low-bit quantization technique for the KV cache to further enhance performance.

Item: PME: pruning-based multi-size embedding for recommender systems (Frontiers Media S.A., 2023)
Liu, Zirui; Song, Qingquan; Li, Li; Choi, Soo-Hyun; Chen, Rui; Hu, Xia
Embedding is widely used in recommendation models to learn feature representations. However, the traditional embedding technique, which assigns a fixed size to all categorical features, may be suboptimal for the following reasons. In the recommendation domain, the embeddings of most categorical features can be trained with less capacity without impacting model performance, so storing embeddings of equal length may incur unnecessary memory usage. Existing work that allocates customized sizes to each feature usually either simply scales the embedding size with the feature's popularity or formulates size allocation as an architecture selection problem. Unfortunately, most of these methods either suffer a large performance drop or incur significant extra time cost when searching for proper embedding sizes. In this article, instead of formulating size allocation as an architecture selection problem, we approach it from a pruning perspective and propose the Pruning-based Multi-size Embedding (PME) framework. During the search phase, we prune the embedding dimensions that have the least impact on model performance to reduce its capacity. We then show that the customized size of each token can be obtained by transferring the capacity of its pruned embedding, at significantly less search cost.
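The sketch below gives a toy flavor of deriving per-feature embedding sizes by pruning low-importance dimensions. It uses mean absolute magnitude as the importance proxy and a simple threshold, both assumptions chosen for illustration; PME's actual pruning criterion and capacity-transfer step differ.

```python
import numpy as np

def per_feature_sizes(embedding_tables, threshold=0.5):
    """For each feature, keep the dimensions whose importance exceeds
    `threshold` x that table's maximum importance; the kept count becomes
    the feature's customized embedding size."""
    sizes = {}
    for name, table in embedding_tables.items():
        importance = np.abs(table).mean(axis=0)   # one importance score per dimension
        sizes[name] = int((importance >= threshold * importance.max()).sum())
    return sizes

tables = {
    "user_id": np.random.randn(10_000, 32),                             # most dimensions carry signal
    "zip_code": np.random.randn(500, 32) * np.linspace(1.0, 0.01, 32),  # many near-zero dimensions
}
print(per_feature_sizes(tables))   # zip_code ends up with a smaller size than user_id
```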
Experimental results validate that PME can efficiently find proper sizes and hence achieve strong performance while significantly reducing the number of parameters in the embedding layer.

Item: Randomized Algorithms for Mega-AI Models (2023-08-08)
Xu, Zhaozhuo; Shrivastava, Anshumali; Baraniuk, Richard; Hu, Xia
Over the past few years, we have witnessed remarkable accomplishments in machine learning (ML) models due to increases in their size. However, the growth in model size has outpaced upgrades to hardware and network bandwidth, making it difficult to train these Mega-AI models within current system infrastructures. Additionally, the shift toward training ML models on user devices, in light of global data privacy protection trends, has constrained hardware resources, exacerbating the tension between effectiveness and efficiency. Moreover, current ML algorithms and systems exhibit an accuracy-efficiency trade-off, in which reducing computation and memory usage causes accuracy losses during both training and inference. This thesis demonstrates algorithmic advances that improve this trade-off when training Mega-AI models. Rather than relying on big data, we propose focusing on good data and sparse models, i.e., models with many parameters that activate only a subset of them during training for efficiency. We also frame the pursuit of good data and activated parameters as an information retrieval problem and develop hashing algorithms and data structures that maintain training accuracy while improving efficiency. The thesis begins with work on data sparsity and presents a hash-based sampling algorithm for Mega-AI models that adaptively selects data samples during training. We also demonstrate how this approach improves a machine teaching algorithm, with 425.12x speedups and 99.76% energy savings on edge devices. We then discuss our recent success in model sparsity and present a provably efficient hashing algorithm that adaptively selects and updates a subset of parameters during training, along with methods to bridge the accuracy decline of sparse Mega-AI models in post-training. Finally, we present DRAGONN, a system that utilizes hashing algorithms to achieve near-optimal communication for sparse and distributed ML. To demonstrate the utility of these scalable and sustainable ML algorithms, we apply them to personalized education, seismic imaging, and bioinformatics; for example, we show how modifying the ML algorithm can reduce seismic processing time from 10 months to 10 minutes.

Item: Toward Data-centric Automated Machine Learning (2023-04-14)
Lai, Henry; Hu, Xia
Machine learning has become increasingly popular and has shown significant success in many fields. Developing a machine learning solution involves four main processes: data preparation, model selection, hyper-parameter tuning, and deployment for feedback collection. While automated machine learning (AutoML) has been proposed to streamline the middle two processes and deliver efficient solutions without laborious trial-and-error effort, the framework requires a well-prepared dataset and a precisely defined setting, which may limit its capability in more challenging real-world applications. Recent studies suggest that data preparation is often the key to optimal solutions in many challenging real-world applications.
To bridge the gap between model selection and data preparation, we propose a complementary AutoML framework that focuses on data-centric operations, performing automated data preparation at different stages of a machine learning pipeline. Our framework includes a data-centric model customization framework that generates sample-specific learning strategies based on the attributes of individual data samples, a data-centric knowledge acquisition framework that effectively collects expert knowledge based on the data distribution while considering its long-term effects on model training, and a model-aware data preparation framework that takes the data distribution and attributes into consideration to further improve datasets for challenging problem settings. Our goal is to develop an end-to-end data-centric AutoML system for real-world applications. To this end, we propose building an end-to-end AutoML system for anomaly detection on time series data as a prototype of the proposed framework. With these efforts, our research could further extend the capability of AutoML toward real-world applications.
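As a toy illustration of the prototype task mentioned above (anomaly detection on time series), the sketch below flags points that deviate strongly from a trailing window; the window size and threshold are arbitrary assumptions, and this is not the proposed end-to-end AutoML system.

```python
import numpy as np

def rolling_zscore_anomalies(series, window=30, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the trailing window mean."""
    flags = np.zeros(len(series), dtype=bool)
    for t in range(window, len(series)):
        hist = series[t - window:t]
        std = hist.std() or 1e-8                    # guard against a flat window
        flags[t] = abs(series[t] - hist.mean()) / std > threshold
    return flags

ts = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
ts[300] += 2.5                                      # inject an obvious spike
print(np.nonzero(rolling_zscore_anomalies(ts))[0])  # indices of flagged points
```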