Browsing by Author "Hu, Xia (Ben)"
Item: Deep Graph Representation Learning: Scalability and Applications (2023-08-10) Zhou, Kaixiong; Hu, Xia (Ben)

The ubiquity of graphs in science and industry has motivated the development of graph representation learning algorithms, among which graph neural networks (GNNs) have emerged as one of the predominant computational tools. In general, GNNs apply a recursive message passing mechanism to learn the representation of each node by combining its own representation with those of its neighbors (a minimal sketch of this mechanism follows the abstract). Despite the promising results GNNs have achieved in many fields, their scalability and applicability remain too limited for complex, large-scale graph data. The scalability of GNNs can be viewed from two perspectives: model depth scalability and data processing scalability. First, GNNs are typically less than three layers deep, which prevents them from effectively modeling high-order neighborhood dependencies. Second, GNNs are notorious for memory and computation bottlenecks on large graphs, which contain huge numbers of nodes and edges. Third, although many GNN prototypes have been proposed on benchmark datasets, it is not straightforward to apply GNNs to a new application at hand that requires specific domain knowledge. To address these challenges, I have pursued a series of works that advance the optimization of deep GNNs, efficient training on large graphs, and well-performing applications.

Part I aims at scaling up the model depth of graph neural architectures to learn complex neighborhood structure. At the theory level, we analyze the over-smoothing issue in deep models, where the node representation vectors across the graph converge to similar embeddings. At the algorithm level, we develop a set of novel tricks, including normalization, skip connections, and weight regularization, to tackle over-smoothing. At the benchmark level, we build the first platform to comprehensively incorporate the existing tricks and evaluate them fairly, and we propose a new deep GNN model with superior generalization performance across tens of benchmark datasets.

In Part II, we present algorithms to enhance GNNs' scalability on large-scale graph datasets. A novel training paradigm, graph isolated training, is proposed to decouple a large graph into many small clusters and train an expert GNN for each of them. By cutting down the inter-cluster communication, our solution significantly accelerates training while maintaining node classification accuracy. We also analyze the label bias issue within small batches, which can lead to overfitting of GNNs. An adaptive label smoothing is then designed to address the label bias and improve the model's generalization performance.

In Part III, we further explore the wide applications of GNNs. Based on the transfer learning paradigm of "pre-train, prompt, fine-tune", we design the first graph prompting function. The graph prompt reformulates the downstream task to look the same as the pretext task, so that the pre-trained model transfers easily to the downstream problem. In bioinformatics, we extend GNNs to hierarchically learn the different levels of abstraction in molecular graph structure. In tabular data mining, we use GNNs to explicitly learn the feature interactions between columns and make recommendations for each sample. Finally, I discuss future work in graph machine learning.
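The following is a minimal illustrative sketch of the recursive message passing mechanism described in the abstract above, not the thesis's implementation. The mean aggregation, ReLU nonlinearity, weight matrices, and toy path graph are all assumptions chosen for brevity.

```python
# Hedged sketch of one round of GNN message passing: each node combines its own
# representation with an aggregate (here, the mean) of its neighbors' representations.
import numpy as np

def gnn_layer(H, adj, W_self, W_neigh):
    """One message-passing round over node features H given adjacency matrix adj."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)   # node degrees (avoid divide-by-zero)
    neigh_mean = (adj @ H) / deg                        # aggregate neighbor messages
    return np.maximum(H @ W_self + neigh_mean @ W_neigh, 0.0)  # combine and apply ReLU

# Toy example: 4 nodes on a path graph, 8-dimensional features.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 8))
W_self, W_neigh = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

# Stacking several such rounds (weights reused here for brevity) repeatedly averages
# neighborhoods, which is the effect behind the over-smoothing analyzed in Part I:
# node embeddings are pushed toward one another as depth grows.
for _ in range(3):
    H = gnn_layer(H, adj, W_self, W_neigh)
print(H.shape)  # (4, 8)
```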
Item: Parameter and Data Sparsity for Efficient Training of Large Neural Networks (2023-12-01) Daghaghi, Shabnam; Shrivastava, Anshumali; Baraniuk, Richard; Hu, Xia (Ben)

The size of deep neural networks has increased tremendously due to recent advances in generative AI and the development of Large Language Models (LLMs). Simultaneously, the exponential growth in data volume has compounded the challenges of training these expansive models. This thesis uses randomized algorithms to leverage the notion of dynamic sparsity, where only a small subset of parameters is crucial for each input, and to mitigate some of the bottlenecks in training large deep neural networks.

In the first chapter, we utilize parameter sparsity to address the training bottleneck in neural networks with a large output layer, which arises in many applications such as recommendation systems and information retrieval. Computing the full softmax is costly in both computation and energy. Various sampling approaches, popularly known as negative sampling (NS), have been proposed to overcome this challenge, but existing NS approaches trade off either efficiency or adaptivity. We propose two classes of distributions where the sampling scheme is truly adaptive and provably generates negative samples in near-constant time. Our proposal is based on the Locality Sensitive Hashing (LSH) data structure and significantly outperforms the baselines in terms of accuracy and wall-clock time (an illustrative LSH-bucket sampling sketch follows the abstract).

In the second chapter, we propose an efficient data sampling scheme, also based on an LSH data structure. Data sampling is an effective way to improve the training speed of large deep neural networks, since not all data points are equally important. We propose a novel dynamic sampling distribution based on nonparametric kernel regression. To make this approach computationally feasible, we develop an efficient sketch-based approximation to the Nadaraya-Watson estimator and provide exponential convergence guarantees. We demonstrate that our sampling algorithm outperforms the baselines in terms of wall-clock time and accuracy across multiple datasets.

Finally, in the third chapter, we study the system design and implementation of networks with dynamic parameter sparsity on CPUs. A notable work in this direction is the SLIDE system, a C++ implementation of sparse, hash-table-based feed-forward and back-propagation that was shown to train neural models with hundreds of millions of parameters significantly faster than GPUs. We argue that SLIDE's current implementation is sub-optimal and does not exploit several features available in modern CPUs. In particular, we show that by exploiting opportunities such as memory optimization, quantization, and vectorization, we obtain up to a 7x speedup in the computations on the same hardware.
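Below is a hedged sketch of LSH-bucket negative sampling in the spirit of the first chapter: output-layer weight vectors are pre-hashed with one common LSH family (signed random projections, i.e., SimHash), and for a given input embedding, candidate negatives are drawn from the colliding bucket in near-constant time. The hash width, bucket layout, and uniform fallback are illustrative assumptions, not the thesis's exact distributions.

```python
# Illustrative adaptive negative sampling via an LSH bucket lookup (not the thesis code).
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, num_classes, num_bits = 16, 1000, 8

planes = rng.normal(size=(num_bits, dim))            # random hyperplanes for SimHash

def simhash(v):
    """Signed-projection hash: one bit per hyperplane."""
    return tuple((planes @ v > 0).astype(int))

# Pre-hash the output-layer weight rows into buckets (rebuilt/updated as weights change).
class_weights = rng.normal(size=(num_classes, dim))
buckets = defaultdict(list)
for c in range(num_classes):
    buckets[simhash(class_weights[c])].append(c)

def sample_negatives(embedding, true_label, k=5):
    """Draw up to k negatives whose weight vectors collide with the input's hash bucket;
    fall back to uniform sampling when the bucket is too small (toy fallback only)."""
    candidates = [c for c in buckets[simhash(embedding)] if c != true_label]
    if len(candidates) < k:
        candidates += list(rng.integers(0, num_classes, size=k))
    return rng.choice(candidates, size=k, replace=False).tolist()

print(sample_negatives(rng.normal(size=dim), true_label=3))
```

Because the buckets group classes whose weight vectors point in similar directions to the input embedding, the sampled negatives adapt to the current model state without scanning all classes, which is the efficiency/adaptivity trade-off the abstract highlights.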