Improving Cloud Reliability at Scale Using Gen AI

Chetan Bansal

Microsoft

TOPIC: Data, Systems and Networking

@SCALE SERIES: Reliability @Scale

TYPE: video

YEAR: 2024

Building and operating reliable hyper-scale cloud services requires significant domain knowledge and human effort. Generative AI has proven effective in specialized domains, including software engineering tasks such as code authoring. However, applying vanilla LLMs to specialized tasks like incident management is not feasible because they lack the necessary domain knowledge and relevant context. In this talk, I will present our research and findings from designing and deploying a multi-tiered, LLM-based framework for end-to-end diagnosis of production incidents across Microsoft. I will also present AIOpsLab, our framework for developing and evaluating Cloud Ops agents that improve the resiliency of cloud services in a principled manner.