Building reliable hyper-scale cloud services can be challenging. We need to quickly detect, analyze and mitigate incidents, which largely rely on human effort today. Recent breakthroughs in Large-Language Models (LLMs) have motivated us to explore their potential for automated incident diagnosis. By leveraging LLMs, we aim to accelerate the incident resolution process, leading to improved service reliability and better customer experience. For the first time, we have demonstrated the effectiveness of LLMs in improving cloud reliability. In this talk, we will share our findings, research innovations, and visions in this space.
- WATCH NOW
- 2024 EVENTS
- PAST EVENTS
- 2023
- 2022
- February
- RTC @Scale 2022
- March
- Systems @Scale Spring 2022
- April
- Product @Scale Spring 2022
- May
- Data @Scale Spring 2022
- June
- Systems @Scale Summer 2022
- Networking @Scale Summer 2022
- August
- Reliability @Scale Summer 2022
- September
- AI @Scale 2022
- November
- Networking @Scale Fall 2022
- Video @Scale Fall 2022
- December
- Systems @Scale Winter 2022
- 2021
- 2020
- 2019
- 2018
- 2017
- 2016
- 2015
- Blog & Video Archive
- Speaker Submissions