Hardware-Software Co-Design for Optimizing MPI Programs in Data Center Network
High Performance Computing (HPC) systems are critical because a single server or processor cannot meet the computational demands of today’s applications. HPC systems are therefore built from ever larger numbers of processors to solve these computation-intensive problems, and communication between machines is essential. An application may consist of thousands of processes spread across machines that coordinate to solve a single large-scale problem. The critical component of such a system is the network that connects the servers and makes this collaboration possible, and its performance has a significant impact on application performance. In this thesis, to better understand the main issues and improve communication performance, we investigate data center networks and provide a general overview and analysis of the literature across several research areas, including data center network architectures, network protocols for data center networks, and state-of-the-art communication frameworks. We argue that many of the challenges HPC applications face in the communication phase can be addressed by augmenting the existing physical network architecture with low-cost optical technologies. However, we observe that physical-network (hardware-based) solutions would not, on their own, be adoptable by users of HPC applications: an application needs some software-level adaptation to the physical network before it can benefit from the network’s new characteristics. Without proper interaction between the application and the network, the network cannot automatically adapt to the application’s needs, and vice versa. Our goal is to explore co-designed hardware and software solutions that optimize the data center network for MPI-based HPC programs.
We propose a static source code analysis solution that identifies the communication patterns and requirements of applications, and we design algorithms that find the optimal network placement of tasks so as to minimize the number of cross-rack communications. We implement a prototype that automates learning the application’s communication characteristics, application-to-network interaction, and network-to-application adaptation (reconfiguring the network). We evaluate our tool and demonstrate the high potential of hardware-software co-design for optimizing HPC programs in the data center network.
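To make the placement objective concrete, the following is a minimal illustrative sketch and not the algorithm developed in the thesis. It assumes the communication pattern has already been extracted as a table of pairwise traffic volumes between MPI ranks, and it swaps in a simple greedy heuristic (which does not guarantee an optimal placement) to show how co-locating heavily communicating ranks reduces cross-rack traffic. The function names, the traffic values, and the rack capacity are all hypothetical.

```python
def cross_rack_volume(traffic, placement):
    """Sum the traffic between rank pairs that sit on different racks."""
    return sum(
        volume
        for (src, dst), volume in traffic.items()
        if placement[src] != placement[dst]
    )


def greedy_placement(traffic, num_procs, rack_capacity):
    """Hypothetical greedy heuristic: fill one rack at a time, always adding
    the unplaced rank that exchanges the most traffic with the ranks already
    in the current rack. Ties go to the lowest rank number."""
    # Fold directed traffic into symmetric pair weights.
    weight = {}
    for (a, b), volume in traffic.items():
        key = (min(a, b), max(a, b))
        weight[key] = weight.get(key, 0) + volume

    def w(a, b):
        return weight.get((min(a, b), max(a, b)), 0)

    unplaced = set(range(num_procs))
    placement = {}
    rack = 0
    while unplaced:
        # Seed the rack with the rank that talks most to other unplaced ranks.
        seed = max(
            sorted(unplaced),
            key=lambda p: sum(w(p, q) for q in unplaced if q != p),
        )
        members = [seed]
        unplaced.remove(seed)
        # Grow the rack until it is full or no ranks remain.
        while len(members) < rack_capacity and unplaced:
            best = max(
                sorted(unplaced),
                key=lambda p: sum(w(p, m) for m in members),
            )
            members.append(best)
            unplaced.remove(best)
        for m in members:
            placement[m] = rack
        rack += 1
    return placement


# Assumed example: four ranks, two racks with two slots each.
traffic = {(0, 2): 10, (1, 3): 10, (0, 1): 1, (2, 3): 1}
block = {0: 0, 1: 0, 2: 1, 3: 1}  # default block placement by rank order
greedy = greedy_placement(traffic, num_procs=4, rack_capacity=2)
```

In this assumed example the block placement sends 20 units of traffic across racks, while the greedy placement groups ranks 0 and 2 together and ranks 1 and 3 together, leaving 2 units of cross-rack traffic.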
Rahbar, Afsaneh. "Hardware-Software Co-Design for Optimizing MPI Programs in Data Center Network." (2021) Diss., Rice University. https://hdl.handle.net/1911/111760.