Training large-scale AI models necessitates transferring terabits of data per second for needs such as checkpointing, data ingestion, and hot sparing. However, current network configurations, together with applications' lack of awareness of the underlying hardware resources, often result in suboptimal resource utilization, leading to delayed checkpoint flushes, increased GPU idle time, and higher failover latency.
We present a transparent multi-NIC routing solution that eliminates these bottlenecks for both egress and ingress traffic, improving NIC utilization for large-scale AI training.
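One common building block for multi-NIC egress routing is pinning each flow's source address to a particular NIC so traffic spreads across interfaces. The sketch below illustrates that idea only; it is not the paper's mechanism, and the round-robin policy and per-NIC address list are illustrative assumptions.

```python
import socket

def make_flow_socket(flow_id: int, nic_addrs: list[str]) -> socket.socket:
    """Pin a TCP flow to one NIC by binding its source address,
    spreading flows round-robin across the available NICs.

    nic_addrs: hypothetical list of per-NIC source IPs; a real system
    would discover these from the host's interface configuration.
    """
    addr = nic_addrs[flow_id % len(nic_addrs)]
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((addr, 0))  # port 0: let the OS pick an ephemeral port
    return s
```

Binding the source address steers egress traffic through the chosen interface under typical source-based routing; a transparent solution would apply an equivalent policy without requiring the application to manage addresses itself.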