Network Architectures for Machine Learning Training Workloads in Cloud Data Centers
Abstract
Cloud data center networks were designed around loosely coupled, request-response workloads that generate predominantly north-south traffic, yet distributed machine learning training demands sustained, tightly synchronized east-west communication across large accelerator clusters. While prior work has addressed individual components of this problem (collective communication algorithms, transport protocols, and specific topology deployments), no unified framework has synthesized these threads into a structured architectural comparison spanning the spectrum from general-purpose cloud fabrics to purpose-built ML training infrastructure. This paper fills that gap with a systematization of knowledge on the network architectural requirements of distributed ML training, synthesizing peer-reviewed literature from 2013 to 2024. Sources were selected through a keyword-based search of the ACM Digital Library, IEEE Xplore, and USENIX proceedings, filtered by deployment scale and relevance to network performance. The analysis characterizes communication patterns across data-parallel, tensor-parallel, and pipeline-parallel training strategies; reviews leaf-spine, multi-rail, and optical circuit-switched topologies; and examines congestion dynamics arising from collective communication operations at scale. Evidence from the synthesized literature suggests that purpose-built ML training networks achieve substantially higher cluster efficiency than general-purpose cloud fabrics, and that elevated per-operation latency introduces measurable overhead that accumulates over multi-day training runs. The paper concludes with a production deployment scenario and practical design recommendations organized by cluster scale tier.