Improving Cloud Reliability at Scale Using Gen AI

Building and operating reliable hyper-scale cloud services requires significant domain knowledge and human effort. Generative AI has proven effective in specialized domains, including software engineering tasks such as code authoring. However, applying vanilla LLMs to specialized tasks like incident management is not feasible because they lack domain knowledge and relevant context. In this talk, I will present our research and findings from designing and deploying a multi-tiered LLM-based framework for end-to-end diagnosis of production incidents across Microsoft. I will also present AIOpsLab, our framework for developing and evaluating Cloud Ops agents in a principled manner, with the goal of improving the resiliency of cloud services.

