EVENT AGENDA
Event times below are displayed in PT.
Reliability @Scale is a technical conference for engineers who are passionate about building and understanding highly resilient and reliable systems and products at massive scale. Whether it’s novel design decisions or outages that impact billions of people, providing reliable experiences for systems at this scale presents unique technical challenges.
Register today and check back for upcoming speaker and agenda announcements!
Reliability @Scale 2023 will be hosted virtually. Joining us are speakers from AWS, Google, Meta, Microsoft and Netflix. The event will feature talks themed around reactive and proactive reliability strategies for incident management, reliability monitoring, and building strong reliability cultures.
To truly build resilient systems and products, reliability needs to be incorporated at each layer of the stack. This panel will discuss how to incorporate a “reliability mindset” across all layers of the stack, how this mindset shift has helped Meta grow and scale new products and AI, and how to build a strong culture of reliability within an organization.
Disaster Recovery (DR) is our program to prepare Meta’s infrastructure to handle capacity outages. In this talk, we present the story of how the DR program has evolved, from handling single-region failures to new risks on the horizon, like power and multi-region outages. We will also discuss DR strategies for readiness and remediation, such as Power Storms and Site Degradation.
Building reliable hyper-scale cloud services can be challenging. We need to quickly detect, analyze, and mitigate incidents, tasks that largely rely on human effort today. Recent breakthroughs in large language models (LLMs) have motivated us to explore their potential for automated incident diagnosis. By leveraging LLMs, we aim to accelerate the incident resolution process, leading to improved service reliability and better customer experience. For the first time, we have demonstrated the effectiveness of LLMs in improving cloud reliability. In this talk, we will share our findings, research innovations, and visions in this space.
Meta’s monitoring infrastructure is responsible for monitoring the health of thousands of systems deployed on millions of heterogeneous, geographically distributed hosts. Monitoring the health of Meta’s infrastructure is crucial to both our users and our business, and it is especially important during widespread failures. This talk explains the journey of hardening Meta’s monitoring systems to be among our most resilient infrastructure, available when most other systems are degraded (and when we most need monitoring). This session will touch on both the technical strategies (e.g., workload isolation, graceful degradation) and the cultural and process strategies (e.g., meetings, tracking) that we leveraged to improve the resiliency of monitoring systems at Meta.
Outline:
Overview of Monitoring @ Meta
When Monitoring Fails: What we learned from the Facebook 2021 outage
Building Culture: Resiliency as a core value
Building Resilient Systems: Know thy enemy
Iterative Improvement: Measuring resilience
Zero-downtime migrations have become an indispensable part of modern software engineering, enabling organizations to smoothly transition complex systems while ensuring uninterrupted operations. In this session, we will dive deep into zero-downtime migrations, covering the "what, why, and how" behind these seamless transitions at scale.
Scribe is at the heart of data transport at Meta. Everything from revenue-critical data to important monitoring datasets flows through the system, which sets an extremely high reliability bar. In this talk, we will take you through our journey of adding the 5th 9 to our SLAs. We’ll talk about how we detect incidents before our customers do without overloading our oncalls, and how we hardened the system against regional and dependency outages.
In the past 2 years, we have leveraged a data-driven strategy to improve the state of system reliability across Monetization. The first step was to quantify the impact of reliability work by defining a longitudinal metric that measures the impact of reliability failures on the business. For the Advertising business at Meta, the negative impact from SEVs (system outages) can be characterized in terms of advertiser value lost (when our systems cannot deliver optimal value for advertisers), which translates to short- and long-term revenue lost for the company. Teams across the company then took clear data-driven goals to reduce the negative impact on advertisers of SEVs affecting their systems. These goals were based on data-mining of past SEVs, which helped identify hot spots actionable in terms of the four levers of reliability (prevention, detection, mitigation, culture).
Operational excellence is hard, but it's impossible without a strong culture. One of Amazon's founding ideals is operational excellence; our store wouldn't work without it, and AWS especially demands deep operational excellence. Learn about how Amazon's culture of obsessive, rigorous ownership and our unrelenting focus on building mechanisms and truth-seeking undergird everything we do at AWS, including and especially handling events and resiliency.
Building reliability engineering programs requires not only technological approaches, but also attention to the underlying culture of an organization and how it can help or hinder efforts to enhance reliability. This talk explores how and why cultural values, as part of a broader systems analysis, can help drive reliability. Specifically, this talk presents an anthropologist’s perspective on Meta’s journey from a mantra of “move fast and break things” to one where reliability is increasingly becoming an important value for our engineering community. The audience should take away actionable practices that can be used to evaluate their underlying reliability culture, take data-driven approaches to measuring reliability sentiment, and identify practices that align with existing cultural values and solidify actions to bring reliability to the fore.
In this presentation, we focus on the organization and the ethos that contribute to the reliability that the organization is capable of sustaining for its products. We describe the stages of reliability maturity that an engineering organization can traverse, show how to determine where in the continuum of maturity your organization currently falls, and provide some ideas that can be used to improve the reliability maturity of the organization. Most importantly, we will provide some thoughts on how to determine the reliability maturity phase the organization requires and what attributes contribute to it. Not all teams need to achieve or maintain the same level of maturity, and knowing what best suits your organization is critically important.
We will add interesting case studies of some products at Google as they traverse this model to find a reliability equilibrium based on what fits by addressing organizational and cultural norms.
Francois Richard is Engineering Director responsible for the Reliability Infra at Meta. Reliability Infra...
Syamla is currently the Senior Director of Production Engineering for Products at Meta which...
Surupa Biswas is the Vice President of Engineering responsible for Core Systems at Meta....
Chris is part of the Infra Data Center organization where he helps define Meta’s...
Jason leads the Production Engineering teams globally at Meta as well as the Security...
Ahmed is a software engineer working on teams across Meta’s Infrastructure for the past...
Raghavendra (Raghu) D. Prabhu is a Software Engineer in the Reliability Infra org at...
Rujia Wang is a principal research PM at Microsoft, leading the research and product...
Adam Phillabaum is the Production Engineering Manager for Meta's Monitoring Products, ensuring our monitoring...
David Pariag is a software engineer focusing on Meta’s Monitoring Products. He’s spent the...
Kishore Banala is a Software Engineer at Netflix, and one of the core contributors...
Anoop Panicker is a Netflix engineer, specializing in the development of large scale distributed...
Mohamed has been a software engineer at Meta for the past 6 years. His...
Tiziano is a software engineer with 15 years work experience. He is a versatile...
Yuri is one of the lead engineers in the Data Infrastructure team at Meta...
Laine is the Production Engineering Director supporting Core Data, Meta's online data stack. She...
David Amsallem is a Data Science Manager at Meta. He has been working in...
Gursharan works on Ads Data Infrastructure at Meta, and also leads the Ads Reliability...
Peter M. O’Donnell is an AWS Principal Solutions Architect, specializing in security, risk, and...
Bouskill is a senior researcher at Meta where she works on Reliability Engineering. An...
A 10 year Google SRE, now serves as a Product Area SRE lead, has...
A senior Technical Program Manager, now serving as lead in Cloud Security, as a...
Marina is a Production Engineering director in Meta Enterprise Infra and Security group. With...