This talk discusses the diversity, volume, and freshness of data required for GenAI, as well as the need to extract and prepare data differently depending on its type, including interleaved data and multi-step trajectories for learning agentic behaviors. The talk also presents some of the investments we have made to improve researcher productivity.