Towards Network-aware Sharing for Performance Interference Mitigation in Data Center Networks

Date
2024-12-06
Abstract

In today’s data centers, cloud network resources are shared among different applications, services, and tenants. This sharing is subject to increasingly stringent performance requirements: high throughput, ultra-low latency, and large-scale deployment. However, traffic from different sources can interfere with one another and make network performance unpredictable. Traffic interference causes two significant kinds of network problems: part or all of the network can be blocked, and the bottleneck link can be shared unfairly. How susceptible a network is to interference, and which of these problems arise, depends on several factors, such as the design of the network, the protocols in use, and how traffic shares the network resources.

Network blocking, often caused by deadlocks, can severely degrade performance. Lossless Ethernet deployments, which aim to eliminate packet loss by using flow control protocols that pause data transmission to preserve buffer space, are particularly prone to deadlocks. To address this problem seamlessly, this thesis presents ITSY, a data plane system designed to detect and resolve deadlocks efficiently using initial triggers. It can detect deadlocks instantaneously with minimal overhead and prevent the same deadlock from recurring.

Ensuring fair network sharing among different traffic in data centers is also challenging. Different traffic contributes to network congestion to varying degrees, but the network cannot differentiate these contributions or tailor rate control to each kind of traffic. As a result, malicious or selfish traffic can monopolize bandwidth, and well-behaved traffic suffers. This thesis addresses network unfairness with a two-pronged approach: providing bandwidth isolation among different traffic and mitigating bursts to reduce latency interference. Specifically, we design Augmented Queue (AQ), a scalable in-network abstraction that provides precise bandwidth guarantees at the application, transport, and link layers. In addition, we propose Sentinel, a proactive and agile management mechanism that mitigates the adverse effects of microbursts in multi-tenant networks, thereby improving application-level latency.

Degree
Doctor of Philosophy
Type
Thesis
Keywords
Network Sharing, Performance Interference Mitigation