EVENT AGENDA
Event times below are displayed in PT.
In the digital age, where systems operate at unprecedented scales, the importance of robust configuration management cannot be overstated. This year’s Reliability @Scale will focus on a central theme of "Move Safely", emphasizing the critical role of configuration and code safety in achieving and maintaining system reliability at scale.
This event will be hosted virtually on October 9th. Joining us will be speakers from Amazon, Bloomberg, Fauna, Honeycomb, Meta and Microsoft, who will share innovative strategies, best practices, and cutting-edge tools and processes designed to enhance configuration safety and reliability in large-scale systems.
Join us for a fascinating fireside chat as we explore the evolution of Production Engineering and Reliability over the past decade. Our panelists, including the current and previous VPs of Production Engineering and the Head of Reliability Infrastructure, will share their insights on how the role of Production Engineering has adapted to meet the challenges of scale, complexity, and emerging technologies.
From the early days of ensuring reliability to the present focus on configuration safety and automated checks, our panelists will discuss the key lessons learned and how the role of PEs has helped drive the success Meta has seen over the past decade. They'll also delve into the importance of balancing speed and safety in deployment processes, and how this balance is crucial for maintaining reliability at scale.
As we look to the future, our panelists will share their thoughts on the emerging trends and technologies that will require new approaches to Reliability and Production Engineering, and how we can better collaborate across teams and organizations to improve Reliability.
Join us for an engaging and informative discussion on the evolution of Production Engineering and Reliability at Meta!
The end-to-end journey of every user request originating from our apps traverses hundreds of hops across Meta’s distributed architecture, including client libraries, privacy and security frameworks, microservices, data stores, and hardware. These dependencies are interconnected, making our infra a massive, tightly coupled graph. We aim to minimize the impact and duration of dependency outages by making dependency health a core part of reliability at Meta. Dependency safety will enable us to protect the business, improve the user experience on our apps, and enhance incident response.
Building and operating reliable hyper-scale cloud services requires a significant amount of domain knowledge and human effort. Generative AI has proven effective in specialized domains, including software engineering tasks like code authoring. However, leveraging vanilla LLMs for specialized tasks like incident management is not feasible due to the lack of domain knowledge and relevant context. In this talk, I will present our research and findings from designing and deploying a multi-tiered framework using LLMs for end-to-end diagnosis of production incidents across Microsoft. I will also present our framework, AIOpsLab, which is aimed at developing and evaluating agents for Cloud Ops in order to improve the resiliency of cloud services in a principled manner.
On a daily basis, engineers at hyper-scale companies build cutting-edge products that are quickly shipped to massive user bases. Does operating this way always require compromises in reliability? In this talk, Justin will discuss how lessons he learned from his experience working in aviation — where reliability is not optional — are applied at Meta to achieve high reliability along with innovation and short time to market.
After years of working with and coaching teams to implement SLOs, it’s becoming incredibly clear to me that the greatest challenge engineering and product teams face is finding the right SLIs. SLOs are hard to get right, and it generally takes time and multiple iterations to tweak, tune, and adjust them so that they tell us when we need to take action to defend the reliability of our systems. Collectively, all our teams want to release when it’s safe to do so and to know when it’s necessary to roll back or restore data. However, there is an underlying assumption that the SLI itself is, and has been, providing value.
As hard as SLOs are to get right, coming up with a good SLI is also difficult. This is especially complicated for engineering teams that don’t have a product person; they often struggle to identify their key user and customer journeys. This talk aims to give attendees additional guidance to help them think more clearly about SLIs and create better ones.
Outages at Meta prevent billions of people around the world from communicating with each other. We are constantly striving to improve the reliability of our products and systems to ensure they are functioning as expected. We’ll dive into the critical role of deployment health checks in enhancing the reliability of thousands of systems. We’ll share strategies around keeping a high bar for change safety while minimizing noise and raising trust in the deployment process. Gain insights into our vision and ongoing efforts to bolster infrastructure reliability.
While Continuous Delivery (CD) has revolutionized application deployment, database schema changes often remain a manual, high-risk process. This talk explores how to extend CD practices to schema management in databases, reducing risk and accelerating delivery. Tyson Trautmann, VP of Engineering at Fauna, illustrates the challenges of traditional schema change processes and presents strategies for implementing Continuous Schema Delivery.
Attendees will learn about the unique challenges that data and associated schema present for CD, essential requirements for successful implementation, and practical techniques for integrating schema changes into CI/CD pipelines. The talk covers version control for schemas, zero-downtime migration techniques, and automated testing strategies. Tyson also demonstrates how Fauna's approach to schema management, which includes progressive schema enforcement, schema-as-code capabilities, and zero-downtime migrations, supports implementing CD best practices as the database schema evolves.
This talk provides valuable insights for engineers looking to reduce risk and increase delivery speed by bringing their database schema into their CD workflow. Learn how to pipeline all the things — including your schema changes — for a more robust and efficient development process.
Configuration changes can easily be catastrophic, with the potential to create broad, instantaneous system outages. We use datacenter-scale health metrics to validate configuration changes before they deploy to all of production. By adopting this validation step broadly at Meta, we have been able to prevent several major incidents. In addition, we use the signature generated by single-datacenter deployments to quickly root cause many other incidents. This talk will delve into the technique of region-scale health checks, the successes achieved, and our ongoing work in the space to prevent future incidents.
Charity, Christine, Ben, and Ian began using Scuba at Facebook in 2012 in order to diagnose complex problems with the multi-tenant systems of the Parse acquisition. The columnar, in-memory data store, despite being fronted by a user-hostile UI, was lightning-fast and completely unlike the traditional log analytics or metrics TSDB systems that they'd used before. Upon leaving in 2016, they created Honeycomb to enable teams at non-FAANG companies to benefit from the modern approach to analytics and observability they'd seen at Facebook. In this talk, you'll learn about how Honeycomb's columnar datastore, named Retriever, uses commodity blob storage and serverless functions to achieve the same kind of fast iteration speed, and is coupled with an intuitive user interface.
In our fast-paced network environment, velocity and safety are often at odds with each other. However, the devastating consequences of the access_denied SEV0 highlighted the need for a new approach to change management. In response, Netinfra has adopted a revolutionary Safe Deployment strategy that prioritizes both speed and reliability. Join us as we explore the evolution of change safety management for 300+ network devices in Meta datacenters and delve into the methodologies and tools that underpin Safe Deployment. We'll discuss how automated testing, network simulation, and controlled rollouts come together to minimize the risk of network disruptions and downtime. We'll also examine the benefits and technical challenges of Netinfra's centralized safety service, which streamlines safety checks and ensures consistency across all network operations and changes. Take away valuable insights on balancing velocity and safety in your network environment and learn from Netinfra's experiences in creating a more reliable and secure network infrastructure.
Resilience is important, but it's not enough. Even the most robust systems may face failures and outages at some point. In this talk, Joe will explore the critical importance of building recoverable systems - ones that don't just withstand disruptions, but can be recovered quickly and predictably, even in the face of the most complex failures.
In this talk, we will explore the key components that make Meta's WWW release process sustainable, effective, and robust in the face of rapid growth. Meta's competitive advantage lies in its reliable and frequent releases, a process that has been continuous since 2017. This process has not only stood the test of time but has also evolved to meet the demands of an expanding organization.
Michael Chang works on Fault Tolerance within Meta Infrastructure. For the last four years,...
Francois Richard is Engineering Director responsible for the Reliability Infra at Meta. Reliability Infra...
Pedro Canahuati is the chief technology officer (CTO) of 1Password. Prior to 1Password, Pedro...
Peter Hoose is the head of Production Engineering at Meta. PE is a unique...
Ankita Vimal is a software engineer in the Reliability Infra org at Meta. She...
Antonio has been a production engineer at Meta for the past 5 and a...
Chetan Bansal is a Senior Principal Research Manager at Microsoft. He works on building...
Justin has spent the last 9 years working on distributed infrastructure systems at Meta....
Sal Furino is a Customer Reliability Engineer at Bloomberg. During his career he’s worked...
Joe is currently a production engineer on Meta’s Reliability Engineering initiative. Over the last...
Christopher Hegre joined Meta in 2017 as a front end software engineer. Before joining...
Anton is a software engineer working on Monitoring & Observability at Meta. Currently, he...
Tyson Trautmann, Fauna's VP of Engineering, is a seasoned technology leader with a passion...
Zach is a Production Engineer at Meta who works on platform reliability and incident...
Joe Romano is a software engineer working on deployment products for code, config, and...
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE)...
Ben Hartshorne’s journey has been from racking servers and aggregating metrics in RRDs to...
Abhinav Sharma is a production engineer with over 5 years of experience at Meta,...
Francisco is currently a Production Engineer at Meta where he works with the team...
Joe is a Distinguished Engineer at AWS. He is a builder, who enjoys building...
Vladimirs is a passionate software engineer with broad experience in areas ranging from middleware...
Casey McGinty has been a Senior Software Engineer on the Release Engineering team at...