Training large-scale AI models necessitates transferring terabits of data per second for needs such as checkpointing, data ingestion, and hot sparing. However, current network configurations, together with applications' lack of awareness of the underlying hardware resources, often result in suboptimal resource utilization, leading to delayed checkpoint flushes, increased GPU idle time, and higher failover latency.
We present a transparent multi-NIC routing solution that eliminates these bottlenecks for both egress and ingress traffic, improving NIC utilization for large-scale AI training.
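One common building block for multi-NIC egress routing is pinning each flow's source address to a particular NIC so traffic spreads across interfaces. The sketch below illustrates that idea only; it is not the paper's mechanism, and the round-robin policy and per-NIC address list are illustrative assumptions.

```python
import socket

def make_flow_socket(flow_id: int, nic_addrs: list[str]) -> socket.socket:
    """Pin a TCP flow to one NIC by binding its source address,
    spreading flows round-robin across the available NICs.

    nic_addrs: hypothetical list of per-NIC source IPs; a real system
    would discover these from the host's interface configuration.
    """
    addr = nic_addrs[flow_id % len(nic_addrs)]
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((addr, 0))  # port 0: let the OS pick an ephemeral port
    return s
```

Binding the source address steers egress traffic through the chosen interface under typical source-based routing; a transparent solution would apply an equivalent policy without requiring the application to manage addresses itself.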