Transparent multi-NIC routing for large AI models

Training large-scale AI models necessitates transferring terabits of data per second for various needs, e.g., checkpointing, data ingestion, and hot sparing. However, current network configurations and a lack of application awareness of the underlying hardware resources often result in suboptimal resource utilization, leading to delayed checkpoint flushes, increased GPU idle time, and higher failover latency.

We present a transparent multi-NIC routing solution that eliminates these bottlenecks for both egress and ingress traffic, improving NIC utilization for large-scale AI models.
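The mechanism itself is not spelled out in this summary, but one common way to achieve application-transparent NIC striping on Linux is to pair source-address-based policy routing (one routing table per NIC) with per-connection source binding, so each new flow egresses through a different NIC without protocol changes. The sketch below illustrates only that general idea under those assumptions; the interface addresses and the helper name open_striped_connection are hypothetical and not taken from the solution described here.

```python
# Minimal illustrative sketch (not the production implementation):
# round-robin new egress connections across per-NIC source addresses.
import itertools
import socket

# Hypothetical per-NIC source addresses on the training host.
NIC_SOURCE_IPS = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "10.0.3.1"]
_nic_cycle = itertools.cycle(NIC_SOURCE_IPS)

def open_striped_connection(remote_host: str, remote_port: int) -> socket.socket:
    """Open a TCP connection whose source address is pinned to the next NIC.

    Assumes source-based policy routing is configured on the host, so that
    binding to a NIC's address steers the flow out through that NIC.
    """
    src_ip = next(_nic_cycle)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((src_ip, 0))            # select the NIC via its source address
    sock.connect((remote_host, remote_port))
    return sock

if __name__ == "__main__":
    # Example: stripe four checkpoint-upload connections across four NICs.
    # "checkpoint-store.example" is a placeholder endpoint.
    conns = [open_striped_connection("checkpoint-store.example", 443) for _ in range(4)]
    for conn in conns:
        conn.close()
```

In practice a transparent solution would apply this kind of striping below the application (e.g., in the network stack or routing layer) rather than requiring applications to call a helper, which is what keeps the approach invisible to training jobs.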
