Meta faces a challenging and exciting future as it expands beyond its current capabilities in the social space. Keeping our platform open to as many diverse cultures, languages, and perspectives as possible is a significant challenge that requires intensive, large-scale AI models. The complexities of adding a virtual reality Metaverse further increase the challenge space, requiring much larger models with more modalities and parameters.
Meta anticipated these challenges and has built a dedicated, high-performance, state-of-the-art cluster, the AI Research SuperCluster (RSC), to accelerate AI research. We present the architectural choices that went into building the cluster, which comprises 16K GPUs, high-performance storage, and a non-blocking InfiniBand network. We will also discuss some of the lessons learned and how they have been applied across Meta more broadly.
Finally, we reflect on the impact the RSC has had on our research projects and offer some insight into future directions.