Browsing by Author "Kyrillidis, Anastasios"
Now showing 1 - 13 of 13
Item
A General Method for Efficient Distributed Training and Federated Learning in Synchronous and Asynchronous Scenarios
(2023-11-30) Dun, Chen; Kyrillidis, Anastasios
Throughout the past decades of machine learning systems development, there has been a persistent conflict: model performance versus model scale versus computation resources. The never-ending desire to improve model performance significantly increases the size of machine learning models, the size of training datasets, and the training time, while the available computation resources remain limited, both by the memory and compute power of available devices and by restrictions on data usage (due to data storage or user privacy). Broadly, two main research directions attempt to resolve this conflict. The first focuses on decreasing the required computation resources. Accordingly, synchronous distributed training systems (such as data parallelism and model parallelism) and asynchronous distributed training systems have been widely studied. Further, federated learning systems have been developed to address the additional restrictions on data usage imposed by data privacy or data storage. The second direction instead focuses on improving model performance at a fixed model scale, as in Mixture of Experts (MoE) systems. Finding a hidden shared essence between these two directions, we aim to create a general methodology that solves the problems arising in both. We propose a novel methodology that partitions, randomly or by a controlled method, a large neural network into smaller subnetworks, each of which is distributed to a local worker, trained independently, and synchronized periodically. For the first direction, we demonstrate, with theoretical guarantees and empirical experiments, that this methodology applies to both synchronous and asynchronous systems, to different model architectures, and to both distributed training and federated learning, in most cases significantly reducing communication, memory, and computation costs. For the second direction, we demonstrate that the methodology can significantly improve model performance in MoE systems without increasing model scale, by guiding the training of specialized experts. We also demonstrate that our methodology applies to MoE systems built on both traditional deep learning models and recent Large Language Models (LLMs).
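The partition-train-synchronize loop described in this abstract can be sketched in a few lines; everything below (the two-layer MLP, the random neuron partition, the sequentially simulated workers, the stand-in data and loss) is an illustrative assumption, not the thesis's actual system:

```python
# Minimal sketch, assuming a 2-layer MLP whose hidden units are randomly
# partitioned across workers; workers are simulated sequentially here.
import torch

d_in, d_hidden, d_out, n_workers = 8, 16, 4, 4
W1 = torch.randn(d_hidden, d_in) * 0.1   # global first-layer weights
W2 = torch.randn(d_out, d_hidden) * 0.1  # global second-layer weights

for sync_round in range(10):
    # Randomly partition hidden units: worker k trains only its subnetwork.
    perm = torch.randperm(d_hidden).reshape(n_workers, -1)
    new_W1, new_W2 = W1.clone(), W2.clone()
    for k in range(n_workers):
        idx = perm[k]
        w1 = W1[idx].clone().requires_grad_(True)    # worker's slice of layer 1
        w2 = W2[:, idx].clone().requires_grad_(True)  # matching slice of layer 2
        opt = torch.optim.SGD([w1, w2], lr=1e-2)
        for _ in range(5):                            # independent local steps
            x = torch.randn(32, d_in)                 # stand-in local batch
            y = torch.randn(32, d_out)
            loss = ((torch.relu(x @ w1.T) @ w2.T - y) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        new_W1[idx] = w1.detach()                     # synchronize: merge the
        new_W2[:, idx] = w2.detach()                  # disjoint pieces back
    W1, W2 = new_W1, new_W2
```

Because the partition is disjoint, the merge step needs no averaging or conflict resolution, which is what keeps communication to one exchange per synchronization round.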
Item
Contrastive Learning in Deep Learning
(2023-10-12) Chen, John; Kyrillidis, Anastasios
Contrastive learning is a popular method for training modern deep neural networks. In this thesis, we explore several methods in the supervised and semi-supervised learning settings. First, we propose a technique called Negative Sampling in Semi-Supervised Learning (NS3L). NS3L exploits implicit negative evidence to improve the top-line performance of deep neural networks in semi-supervised learning; it requires almost no additional computation or overhead and is shown to improve existing state-of-the-art methods. Second, we take the view of implicit contrastive learning and propose the data augmentation method StackMix. Following the "Mix" line of work, StackMix takes pairs of samples, concatenating the inputs and averaging the outputs; the network must therefore learn to differentiate between the two samples within the concatenated sample. Improved performance is demonstrated in a variety of settings. Lastly, we tackle the computational requirements of FixMatch, a semi-supervised learning method, and propose Fast FixMatch, based on a curriculum batch size. Curriculum batch size exploits natural training dynamics by starting with a small batch size and ending with a large one. Coupled with two other complementary methods that together perform better than the sum of their parts, Fast FixMatch demonstrates substantially decreased training computation compared with FixMatch.
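The StackMix augmentation described above can be sketched directly from its definition (concatenate inputs, average labels); the concatenation axis, toy shapes, and the name `stackmix` are assumptions for illustration, not the thesis's implementation:

```python
# Minimal StackMix sketch, assuming concatenation along the image width.
import torch
import torch.nn.functional as F

def stackmix(x, y, num_classes):
    """x: (B, C, H, W) images, y: (B,) integer labels."""
    perm = torch.randperm(x.size(0))                 # random pairing of samples
    x_cat = torch.cat([x, x[perm]], dim=3)           # concatenate the inputs
    y_soft = 0.5 * (F.one_hot(y, num_classes).float()
                    + F.one_hot(y[perm], num_classes).float())  # average outputs
    return x_cat, y_soft

x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))
x_mix, y_mix = stackmix(x, y, num_classes=10)        # x_mix: (16, 3, 32, 64)
```

Since the averaged soft label assigns half its mass to each constituent class, the network is pushed to recognize both samples inside the concatenated input.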
Item
CrysFormer: Protein structure determination via Patterson maps, deep learning, and partial structure attention
(AIP Publishing LLC, 2024) Pan, Tom; Dun, Chen; Jin, Shikai; Miller, Mitchell D.; Kyrillidis, Anastasios; Phillips, George N., Jr.
Determining the atomic-level structure of a protein has been a decades-long challenge. However, recent advances in transformers and related neural network architectures have enabled researchers to significantly improve solutions to this problem. These methods use large datasets of sequence information and corresponding known protein template structures, if available. Yet, such methods focus only on sequence information. Other available prior knowledge could also be utilized, such as constructs derived from x-ray crystallography experiments and the known structures of the most common conformations of amino acid residues, which we refer to as partial structures. To the best of our knowledge, we propose the first transformer-based model that directly utilizes experimental protein crystallographic data and partial structure information to calculate electron density maps of proteins. In particular, we use Patterson maps, which can be obtained directly from x-ray crystallography experimental data, thus bypassing the well-known crystallographic phase problem. We demonstrate that our method, CrysFormer, achieves precise predictions on two synthetic datasets of peptide fragments in crystalline forms, one with two residues per unit cell and the other with fifteen. These predictions can then be used to generate accurate atomic models using established crystallographic refinement programs.
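For context on why Patterson maps bypass the phase problem: a Patterson map is the inverse Fourier transform of the squared structure-factor amplitudes, so it requires only measured intensities. A minimal sketch on a toy density grid (illustrative of the synthetic-data setting, not the paper's pipeline):

```python
# Minimal sketch, assuming a toy 3D electron-density grid with two point
# "atoms". The Patterson map peaks at interatomic difference vectors.
import numpy as np

rho = np.zeros((16, 16, 16))
rho[3, 4, 5] = 1.0
rho[7, 9, 11] = 1.0

F = np.fft.fftn(rho)                  # structure factors (complex, with phases)
intensities = np.abs(F) ** 2          # what a diffraction experiment measures
patterson = np.fft.ifftn(intensities).real   # phase-free: uses |F|^2 only
```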
Item
Current progress and open challenges for applying deep learning across the biosciences
(Springer Nature, 2022) Sapoval, Nicolae; Aghazadeh, Amirali; Nute, Michael G.; Antunes, Dinler A.; Balaji, Advait; Baraniuk, Richard; Barberan, C.J.; Dannenfelser, Ruth; Dun, Chen; Edrisi, Mohammadamin; Elworth, R.A. Leo; Kille, Bryce; Kyrillidis, Anastasios; Nakhleh, Luay; Wolfe, Cameron R.; Yan, Zhi; Yao, Vicky; Treangen, Todd J.
Deep Learning (DL) has recently enabled unprecedented advances in one of the grand challenges in computational biology: the half-century-old problem of protein structure prediction. In this paper we discuss recent advances, limitations, and future perspectives of DL in five broad areas: protein structure prediction, protein function prediction, genome engineering, systems biology and data integration, and phylogenetic inference. We discuss each application area and cover the main bottlenecks of DL approaches, such as training data, problem scope, and the ability to leverage existing DL architectures in new contexts. To conclude, we provide a summary of the subject-specific and general challenges for DL across the biosciences.

Item
Declarative Relational Machine Learning Systems
(2023-02-22) Jankov, Dimitrije; Jermaine, Christopher; Kyrillidis, Anastasios; Uribe, Cesar
Several systems, most notably TensorFlow and PyTorch, have revolutionized how we practice machine learning (ML). They allow an ML practitioner to create complex models with great ease. In recent years there has been an explosion in the size of ML models, and it has become apparent that the systems we use today limit the data scientist to a few standard implementations, such as data parallelism (DP). In an ideal scenario, the ML practitioner would specify their model, and the system would take care of managing the specifics of the computation. My research explores how we can design and implement such systems. Specifically, it seeks the right set of changes to a declarative relational system so that it can accommodate the needs of ML workloads. The results of my research show that one can create scalable distributed machine learning systems that do not constrain the abilities of data scientists and that enable greater productivity.

Item
Exploring Spatial Resolution in Image Processing
(2021-04-30) Yu, Lantao; Orchard, Michael T.; Baraniuk, Richard G.; Pitkow, Xaq; Kyrillidis, Anastasios; Guleryuz, Onur G.
Motivated by the human visual system's instinct to explore details, image processing algorithms designed to facilitate the viewer's interpretation of details in an image are ubiquitous. Such algorithms seek to extract the highest spatial frequency information that an original image has to offer, and to render that information clearly to the viewer, often in the form of an image with an increased number of pixels. This thesis focuses on methods for extracting the highest possible spatial frequency information from digital imagery. Classical sampling theory provides a full understanding of the highest possible spatial frequency information that can be represented by sampled images that have been spatially band-limited to the Nyquist rate. However, natural digital images are rarely band-limited and often carry substantial energy (and information) at frequencies well beyond the Nyquist rate. My research investigates approaches for extracting information from this out-of-band (beyond the Nyquist frequency limit) energy and proposes algorithms that use this information to generate images with higher spatial resolution. The thesis pursues three approaches, based on frequency, spatial, and cross-channel perspectives on the problem.
a) Coefficients representing out-of-band, high-frequency content are closely related to co-located coefficients representing in-band, low-frequency content. The frequency perspective seeks to exploit those relationships to estimate both the uncorrupted out-of-band and in-band coefficients of an image with higher spatial resolution.
b) Spatial patches (blocks of pixels) of an image are known to be similar to other spatial patches elsewhere in the image, so a patch with high-resolution details but too few samples to represent those details accurately can benefit from its similarity to other patches. Although each individual patch may be insufficiently sampled to retain its details, the ensemble of samples from the collection of similar patches provides a richer sampling pattern, which the spatial perspective seeks to exploit.
c) In some imaging settings, multiple electro-magnetic channels are available for the same scene, with different imaging modalities offering different sensor information, each at its own spatial resolution. The cross-channel perspective seeks to exploit cross-channel proximity to produce high-resolution versions of multiple channels.

Item
Fast Quantum State Reconstruction via Accelerated Non-Convex Programming
(MDPI, 2023) Kim, Junhyung Lyle; Kollias, George; Kalev, Amir; Wei, Ken X.; Kyrillidis, Anastasios
We propose a new quantum state reconstruction method that combines ideas from compressed sensing, non-convex optimization, and acceleration methods. The algorithm, called Momentum-Inspired Factored Gradient Descent (MiFGD), extends the applicability of quantum tomography to larger systems. Despite being a non-convex method, MiFGD converges provably close to the true density matrix at an accelerated linear rate, asymptotically in the absence of experimental and statistical noise, under common assumptions. In this manuscript, we present the method, prove its convergence properties, and provide Frobenius-norm bound guarantees with respect to the true density matrix. From a practical point of view, we benchmark the algorithm's performance against other existing methods, in both synthetic and real (noisy) experiments performed on IBM's quantum processing units. We find that the proposed algorithm performs orders of magnitude faster than state-of-the-art approaches, with similar or better accuracy. In both synthetic and real experiments, we observe accurate and robust reconstruction despite the presence of experimental and statistical noise in the tomographic data. Finally, we provide ready-to-use code for state tomography of multi-qubit systems.
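The core idea behind MiFGD, gradient descent on a low-rank factor with a momentum step, can be sketched on a toy real-valued low-rank recovery problem; the random sensing model, step size, and momentum value below are illustrative assumptions, not the paper's experimental setup:

```python
# Minimal sketch, assuming real symmetric random sensing matrices and a
# rank-r PSD target; hyperparameters are illustrative and may need tuning.
import numpy as np

d, r, m = 16, 2, 400
rng = np.random.default_rng(0)
U_true = rng.normal(size=(d, r))
rho_true = U_true @ U_true.T                     # rank-r PSD "state"
A = rng.normal(size=(m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2               # symmetric sensing matrices
y = np.einsum('mij,ji->m', A, rho_true)          # linear measurements tr(A rho)

U = 0.1 * rng.normal(size=(d, r))
Z = U.copy()
eta, mu = 1e-3, 0.8                              # step size and momentum
for _ in range(500):
    resid = np.einsum('mij,ji->m', A, Z @ Z.T) - y
    grad = (4 / m) * np.einsum('m,mij->ij', resid, A) @ Z  # grad of avg. loss
    U_next = Z - eta * grad                      # gradient step on the factor
    Z = U_next + mu * (U_next - U)               # momentum extrapolation
    U = U_next
print(np.linalg.norm(U @ U.T - rho_true) / np.linalg.norm(rho_true))
```

Parameterizing the estimate as a product of factors keeps every iterate PSD and low-rank by construction, which is what makes the factored (rather than full-matrix) iteration cheap.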
Item
LoFT: Finding Lottery Tickets through Filter-wise Training
(2022-05-04) Wang, Qihan; Kyrillidis, Anastasios
Recent work on pruning techniques and the Lottery Ticket Hypothesis (LTH) shows that there exist "winning tickets" in large neural networks. These tickets represent versions of the full model that can be trained separately to achieve accuracy comparable to that of the full model. In practice, however, finding these tickets can be a burdensome task, especially as the original neural network grows larger: often one must pretrain the large model for at least several epochs. In this paper, we explore how to empirically identify when such winning tickets emerge, and use this heuristic to design efficient pretraining algorithms. Our focus in this work is on convolutional neural networks (CNNs). To identify good filters within winning tickets, we propose a novel filter distance metric that represents model convergence well, without requiring knowledge of the true winning ticket or full training of the model. Our filter analysis is consistent with recent findings on neural network learning dynamics. Motivated by this metric, we present the LOttery ticket through Filter-wise Training algorithm, dubbed LoFT. LoFT is a model-parallel pretraining algorithm that partitions the convolutional layers of a CNN by filters to train them independently on different distributed workers, leading to reduced memory and communication costs during pretraining. Experiments show that LoFT achieves non-trivial savings in communication while maintaining comparable or even better accuracy than other model-parallel training methods.
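The filter-wise partitioning that LoFT builds on can be sketched as follows; the thin-copy construction, stand-in data and objective, and sequential simulation of workers are assumptions for illustration, and the paper's filter distance metric is omitted entirely:

```python
# Minimal sketch, assuming one conv layer whose filters (output channels)
# are split across workers, trained locally, and merged back.
import torch
import torch.nn as nn

full = nn.Conv2d(3, 32, kernel_size=3, padding=1)
n_workers = 4
splits = torch.randperm(32).reshape(n_workers, -1)   # disjoint filter groups

merged_w = full.weight.detach().clone()
merged_b = full.bias.detach().clone()
for k in range(n_workers):
    idx = splits[k]
    thin = nn.Conv2d(3, len(idx), kernel_size=3, padding=1)  # worker's subnet
    thin.weight.data.copy_(full.weight.data[idx])
    thin.bias.data.copy_(full.bias.data[idx])
    opt = torch.optim.SGD(thin.parameters(), lr=1e-2)
    for _ in range(3):                        # a few independent local steps
        x = torch.randn(8, 3, 16, 16)         # stand-in batch
        loss = thin(x).pow(2).mean()          # stand-in objective
        opt.zero_grad(); loss.backward(); opt.step()
    merged_w[idx] = thin.weight.detach()      # synchronize filters back
    merged_b[idx] = thin.bias.detach()
full.weight.data.copy_(merged_w)
full.bias.data.copy_(merged_b)
```

Each worker holds and trains only its own filter slice, which is where the memory and communication savings during pretraining come from.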
Item
Perspectives on Algorithmic, Structural, and Pragmatic Acceleration Techniques in Machine Learning and Quantum Computing
(2024-07-24) Kim, Junhyung; Kyrillidis, Anastasios
In the modern era of big data and emerging quantum computing, the need to process information at an extreme scale has become inevitable. For instance, GPT-3, a popular large language model, was trained on 45 terabytes of text and has 175 billion model parameters \cite{brown2020language}. Similarly, in quantum computing, the number of free parameters that define quantum states and processes scales exponentially with the number of subsystems (i.e., qubits), often rendering naive methods, such as convex programming, inapplicable even for a moderate number of qubits \cite{ladd2010quantum}. In this two-part thesis, we explore various perspectives on achieving "acceleration" in such computationally challenging scenarios, both theoretically and empirically, with applications in quantum computing and machine learning. Specifically, we consider (i) algorithmic acceleration via momentum techniques (e.g., Polyak's momentum or Nesterov's accelerated method); (ii) structural acceleration of the model (i.e., decreasing the degrees of freedom to be inferred) via low-rank approximation; and (iii) acceleration for practitioners via principled hyperparameter recipes and distributed/federated protocols.
Part one explores the problem of quantum state tomography (QST), formulated as a non-convex optimization problem. QST is the canonical procedure for identifying the nature of imperfections in implementations of quantum processing units (QPUs) and, eventually, for building a fault-tolerant quantum computer. The main computational bottleneck in QST is that the parameter space (i.e., optimization complexity) and the number of measurements required (i.e., sample complexity) both increase exponentially. Two novel QST methods are proposed: (i) a centralized non-convex method that combines ideas from matrix factorization, compressive sensing, and Nesterov's acceleration, and that can drastically decrease the optimization and sample complexities; and (ii) an extension of this method to a distributed setting that utilizes a set of classical local machines communicating with a central quantum server, which is suitable for the noisy intermediate-scale quantum (NISQ) era.
Part two explores pragmatic acceleration in modern machine learning systems, which nowadays involve billions of parameters and increasingly convoluted objective functions, such as distributed objectives and game-theoretic formulations. Optimizing the model parameters has thus become a time- and compute-intensive task, for which practitioners often rely on expensive grid searches with numerous rounds of retraining. Within part two, the first contribution studies the stability and acceleration of the stochastic proximal point method with momentum; the modified proximal operator provides remarkable robustness to hyperparameter misspecification while enjoying accelerated convergence. The second contribution proposes an adaptive step-size scheme for stochastic gradient descent, in the context of federated learning, based on approximating the local smoothness of the individual function each client optimizes. Finally, the third contribution explores acceleration in smooth games and identifies three cases of game-Jacobian eigenvalue distributions in which the momentum extragradient method exhibits accelerated convergence rates, along with the optimal hyperparameters for each case.
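As a worked example of the algorithmic acceleration studied in part one of the thesis above, here is a minimal sketch of Nesterov's accelerated method on a least-squares problem; all problem data are synthetic and the step-size rule is the standard 1/L choice, not code from the thesis:

```python
# Minimal sketch of Nesterov's accelerated gradient on 0.5*||Ax - b||^2.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)
L = np.linalg.norm(A, 2) ** 2        # smoothness constant of the objective

def grad(x):
    return A.T @ (A @ x - b)

x = y = np.zeros(20)
t = 1.0
for _ in range(100):
    x_next = y - grad(y) / L         # gradient step at the extrapolated point
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    y = x_next + ((t - 1) / t_next) * (x_next - x)   # Nesterov extrapolation
    x, t = x_next, t_next
```

The only change from plain gradient descent is the extrapolation step, yet it improves the worst-case convergence rate on smooth convex problems from O(1/k) to O(1/k^2).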
Item
Provable compressed sensing quantum state tomography via non-convex methods
(Springer Nature, 2018) Kyrillidis, Anastasios; Kalev, Amir; Park, Dohyung; Bhojanapalli, Srinadh; Caramanis, Constantine; Sanghavi, Sujay
With quantum processors steadily growing, it is necessary to develop new quantum tomography tools tailored for high-dimensional systems. In this work, we describe such a computational tool, based on recent ideas from non-convex optimization. The algorithm excels in the compressed sensing setting, where only a few data points are measured from a low-rank or highly pure quantum state of a high-dimensional system. We show that the algorithm can practically be used in quantum tomography problems that are beyond the reach of convex solvers and, moreover, is faster and more accurate than other state-of-the-art non-convex approaches. Crucially, we prove that, despite being a non-convex program, under mild conditions the algorithm is guaranteed to converge to the global minimum of the quantum state tomography problem; thus, it constitutes a provable quantum state tomography protocol.

Item
Randomized Algorithms for Training Deep Models with Large Outputs
(2022-03-01) Medini, Tharun; Shrivastava, Anshumali; Baraniuk, Richard; Kyrillidis, Anastasios
In the last decade, it has been shown that many hard AI tasks, especially in NLP, can be naturally modeled as extreme classification problems, leading to improved precision. However, such models are prohibitively expensive to train due to the memory blow-up in the last layer. As an example, we delve into a real Amazon Search dataset, for which a simple fully connected neural network with a reasonably sized hidden layer can easily reach well beyond 100 billion parameters (> 400 GB of memory). This memory requirement is too big to fit even on a very expensive NVIDIA DGX box equipped with 8 V100 GPUs, each with 32 GB of RAM. To cater to problems of this scale, my work presents several principled solutions, building on a fundamental algorithm called Merged-Average Classifiers via Hashing (MACH). MACH is a generic K-classification algorithm whose memory provably scales as O(log K), without any strong assumptions on the classes. This thesis is divided into three main chapters.
The first chapter is "Extreme Classification in Log Memory", in which we rethink the problem of extreme classification (or Extreme Multi-label Learning, XML) as a sketching problem. MACH is subtly a count-min sketch structure in disguise: it uses universal hashing to reduce classification with a large number of classes to a few embarrassingly parallel and independent classification tasks with a small (constant) number of classes. MACH naturally provides a technique for zero-communication model parallelism. In experiments on six datasets, some multiclass and some multilabel, MACH shows consistent improvement over the respective state-of-the-art baselines. In particular, we train an end-to-end deep classifier on a private product search dataset sampled from the Amazon search engine, with 70 million queries and 49.46 million products. MACH outperforms, by a significant margin, the state-of-the-art extreme classification models deployed on commercial search engines: Parabel and DSSM (Deep Semantic Search Model). The largest model we trained has 6.4 billion parameters and takes less than 35 hours on a single p3.16x machine. Our training times are 7-10 times faster, and our memory footprints are 2-4 times smaller, than those of the best baselines. This training time is also significantly lower than that reported for Google's mixture-of-experts (MoE) language model at a comparable model size and hardware.
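The hashing construction at the heart of MACH can be sketched as follows; the bucket probabilities below are stand-ins for the R trained small classifiers, and all sizes are illustrative:

```python
# Minimal sketch, assuming R universal hashes that map K classes into B
# buckets; only the count-min-style decode is shown, with random stand-in
# outputs in place of trained B-class models.
import numpy as np

K, B, R = 100_000, 32, 16                 # classes, buckets, repetitions
rng = np.random.default_rng(0)
hashes = rng.integers(0, B, size=(R, K))  # class -> bucket, one row per hash

# Each of the R small models outputs a distribution over its B buckets.
bucket_probs = rng.dirichlet(np.ones(B), size=R)   # stand-in model outputs

# Decode: a class's score is the average probability of its buckets.
scores = bucket_probs[np.arange(R)[:, None], hashes].mean(axis=0)  # (K,)
predicted_class = int(np.argmax(scores))
```

The R small models never need to exchange parameters during training, which is the zero-communication model parallelism the abstract refers to; memory grows with R*B rather than with K.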
In the second chapter, we observe that MACH is effectively a variant of an embedding model, with the critical difference that it trains high-dimensional sparse embeddings (contrary to the usual low-dimensional dense embedding models). Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing NNS hurts both the query time and the accuracy of these models. In this chapter, we argue that high-dimensional, ultra-sparse embedding is a significantly superior alternative to dense low-dimensional embedding for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing it with simple lookups, while high dimensionality ensures that the embeddings remain informative even when sparse. However, learning extremely high-dimensional embeddings leads to a blow-up in model size. To make training feasible, we use MACH's partitioning approach to learn such high-dimensional embeddings across multiple GPUs without any communication. This yields a novel asymmetric mixture of Sparse, Orthogonal, Learned And Random (SOLAR) embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that SOLAR's one-sided learning is equivalent to learning both query and label embeddings. Thanks to these unique properties, we successfully train 500K-dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public XML datasets, achieving superior precision and recall compared to the respective state-of-the-art baselines, with up to 10 times faster training and 2 times faster inference.
In the third and final chapter, we discuss the challenges that representation learning has brought to Information Retrieval (IR). Learning-to-Index (LTI) has emerged as a key technique for solving most IR problems. Since query time is the most critical aspect affecting the scalability of a search system, there is an inherent tension between accuracy, scalability, and the ability to load-balance in distributed settings. We discuss an algorithm called Iterative Repartitioning for Learning to Index (IRLI), which retains the best features of SOLAR (sparsity and load balance) while additionally having the Locality Sensitive Hashing (LSH) property. IRLI iteratively refines partitions of items by learning the relevant buckets directly from query-item relevance data. To ensure that the buckets are balanced, IRLI uses the power-of-k-choices strategy. By design, IRLI can be used for both extreme classification and near-neighbor retrieval. In practice, IRLI surpasses the best baseline's precision for multi-label classification while being 5 times faster at inference.

Item
Theories and Perspectives on Practical Deep Learning
(2023-07-05) Wolfe, Cameron Ronald; Kyrillidis, Anastasios
Deep neural networks (DNNs) have proven adept at accurately automating many tasks (e.g., image and text classification, object detection, text generation, and more). Across most domains, DNNs tend to achieve better performance with increasing scale, in terms of both dataset and model size. As such, the benefit of DNNs comes at a steep computational (and monetary) cost, which can limit their applicability. This document aims to identify novel and intuitive techniques that can make deep learning more usable across domains and communities.

Item
Towards Robust Planning for High-DoF Robots in Human Environments: The Role of Optimization
(2024-08-09) Quintero Pena, Carlos; Kavraki, Lydia E; Kyrillidis, Anastasios
Robot motion planning has been a key component in the race to achieve true robot autonomy. It encompasses methods to generate robot motion that meets kinematic constraints and robot dynamics and that is safe (avoids colliding with the environment). It has been particularly successful in efficiently finding motions for high degree-of-freedom robots such as manipulators, but despite tremendous advances, motion planning methods are not ready for human environments. The uncertainty, diversity, and clutter of the human world challenge the assumptions of motion planning methods, breaking their guarantees and rendering them useless or dramatically worsening their performance. In this thesis, we propose methods to address three important challenges in augmenting motion planning and long-horizon manipulation for human environments. First, we present a framework that enables human-guided motion planning and demonstrate how it can be used for safe planning in partially observed environments. Second, we present two methods for safe motion planning in the presence of sensing uncertainty: one that requires the poses of segmented objects, and another that acts directly on distance information from a noisy sensor. Finally, we present a framework that dramatically improves the performance of long-horizon manipulation tasks in the presence of clutter for an important class of manipulation problems. All of our contributions have mathematical optimization as a connecting thread, whether to synthesize high-dimensional trajectories from low-dimensional information or as a layer between high-level and low-level planners. Our results demonstrate how these formulations can be used to augment motion planning and planning for manipulation in novel ways, attaining more robust, efficient, and reliable methods.