Network Architectures for Machine Learning Training Workloads in Cloud Data Centers
Abstract
Cloud data center networks were designed around loosely coupled, request-response workloads that generate predominantly north-south traffic, yet distributed machine learning training demands sustained, tightly synchronized east-west communication across large accelerator clusters. While prior work has addressed individual components of this problem (collective communication algorithms, transport protocols, and specific topology deployments), no unified framework has synthesized these threads into a structured architectural comparison spanning the spectrum from general-purpose cloud fabrics to purpose-built ML training infrastructure. This paper fills that gap with a systematization of knowledge on the network architectural requirements of distributed ML training, synthesizing peer-reviewed literature from 2013 to 2024. Sources were selected through a keyword-based search of the ACM Digital Library, IEEE Xplore, and USENIX proceedings, filtered by deployment scale and relevance to network performance. The analysis characterizes communication patterns across data-parallel, tensor-parallel, and pipeline-parallel training strategies; reviews leaf-spine, multi-rail, and optical circuit-switched topologies; and examines congestion dynamics arising from collective communication operations at scale. Evidence from the synthesized literature suggests that purpose-built ML training networks achieve substantially higher cluster efficiency than general-purpose cloud fabrics, and that elevated per-operation latency introduces measurable overhead that accumulates over multi-day training runs. The paper concludes with a production deployment scenario and practical design recommendations organized by cluster scale tier.