Networking @Scale California 2019

SEPTEMBER 09, 2019 @ 8:30 AM PDT - 5:30 PM PDT

Designed for engineers that build and manage large-scale networks. Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often present complex engineering challenges to solve.

ABOUT EVENT

Networking @Scale is an invitation-only technical conference for engineers that build and manage large-scale networks.

Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often present complex engineering challenges to solve. The Networking @Scale community focuses on bringing people together to discuss these challenges. This year’s conference will feature technical speakers from leading network operators with a focus on how we maintain high network availability and deal with outages.

Networking @Scale will be held at the Computer History Museum in Mountain View, California, on Monday, September 9th beginning at 9:30 AM. Be sure to also stick around for Happy Hour in the evening.

EVENT AGENDA

Event times below are displayed in PT.

September 9

08:30 AM - 09:30 AM

Registration and Breakfast

09:30 AM - 09:45 AM

Welcome Remarks

Speaker Omar Baldonado, Meta

09:45 AM - 10:15 AM

Secure Reliability: Tales from Mysterious Platforms

Infrastructure is more critical than ever; we push for standards and automation to build repeatable designs with precise handling of frequent events, maintaining the high availability of our systems. But what about issues that are not easily solved by software alone? What about problems and functions that most system designers are not considering? How does security fit into the big picture? These are some of the questions we want to start exploring in this presentation by examining how security is an element in the overall reliability of the infrastructure we depend on every day.

To do this, we will describe, at a high level and from our point of view, the concept of reliability design: the usual goals when architecting these systems and the tradeoffs made. To ground the discussion, we will cover incident response and analysis with a few examples.

We will describe an exploration we conducted on some of our optical platforms, looking at crucial areas that could negatively affect the overall reliability of our design and the service provided. To conclude, we will summarize takeaways and recommendations for the audience to take back to their respective organizations.

Speaker Jade Auer, Facebook

Speaker Jose Leitao, Meta

10:15 AM - 10:45 AM

Failing Last and Least: Design Principles for Network Availability

The network is among the most critical components of any computing infrastructure. It is an enabler for modern distributed systems architecture, with a trend toward ever-increasing functionality and offloads moving into the network. It must constantly be expanded and reconfigured as a prerequisite to deploying compute and storage infrastructure that are themselves growing exponentially. Most importantly, however, the network must deliver the highest levels of reliability. A failure in the network often has an outsized blast radius and is one of the leading causes of correlated and cascading failures, which only exacerbates the impact of a network failure.

In this talk, we draw upon our experience at Google building some of the world's largest networks to discuss: i) the importance of network reliability, ii) some of the leading causes of failure, and iii) the principles we draw upon to deliver the necessary levels of network reliability. We chart a course toward the potential for common community infrastructure that can fundamentally move the reliability needle.

Speaker Amin Vahdat, Google

10:45 AM - 11:15 AM

BGP++ Deployment and Outages

With Facebook's increasing user base (around 2.5B monthly active users!), our DC network is growing fast and FBOSS switches are getting provisioned daily. We are in urgent need of a scalable and reliable routing solution. With existing open source BGP solutions, users have to deal with many unused features and the complexity of managing them. As a result, we decided to build our own routing agent: BGP++.

In this talk, we’ll first discuss the motivations and tradeoffs of building a new in-house BGP agent. Then we’ll describe how we built and scaled BGP++ within Facebook’s DC network, and what we did to catch up with FBOSS deployment over the last year. Specifically, we’ll zoom in on the challenges we faced during deployment and how we responded by putting a large amount of effort into testing and push tooling. Finally, we’ll share our current push performance and the key takeaways from the lessons learned on our thorny road.

11:15 AM - 11:45 AM

Preventing Network Changes from Becoming Network Outages

“The network never sleeps…” Complex network changes must be carried out without any impact to production traffic, which makes it critical for every change to be thoroughly tested before being implemented on the network. However, at cloud scale there is no way to have a separate full-scale network on which to test.

In this talk, I will describe how Microsoft built the Open Network Emulator (ONE) system to be bug-compatible with the software running on our production switches, and how we have integrated the emulator into our processes to validate and deploy all changes.

Speaker Dave Maltz, Microsoft

11:45 AM - 12:45 PM

Lunch

12:45 PM - 01:15 PM

What Have We Learned from Bootstrapping 1.1.1.1

When we launched the 1.1.1.1 public recursive DNS service in April 2018, we had no idea if or how useful it would be. Prior to launch, we hit roadblocks with the IP being used internally and with client-side support for encrypted DNS. After launch day, we fought with scale and with the broken DNS infrastructure out there. This talk covers the problems we encountered bootstrapping the service, the evolution of the service architecture, and the things we would have done differently knowing what we know after more than a year of operating a public recursive DNS service.

Speaker Marek Vavrusa, Cloudflare

01:15 PM - 01:45 PM

Operating Facebook’s SD-WAN Network

Facebook’s centrally controlled wide-area network connects data centers serving a few billion users. The backbone design faced several challenges: growing demand, quality of service, provisioning, operation at scale, and slow convergence. These challenges led to a holistic, centrally controlled network solution called Express Backbone (EBB). This talk will focus on the operational challenges we faced on EBB and how we met them.

We will describe the high-level design of EBB and recent updates to it, and then dive into specific reliability challenges and how we achieved operational reliability by keeping software sharding and replication in sync with the network design.

Furthermore, we will talk about how we solved some of the challenges specific to software-defined networks that involve traditional network activities, e.g. network maintenance and outages. To conclude, we will summarize takeaways and realizations, and describe how a multi-disciplinary team (network production engineers and software engineers) together operates Facebook’s software-defined network.

Speaker Palak Mehta, Facebook

01:45 PM - 02:15 PM

Diversity Panel

02:15 PM - 02:45 PM

Getting a Taste of Your Network

Sergey Fedorov, Senior Software Engineer at Netflix, describes a client-side network measurement system called "Probnik", and how it can be used to improve performance, reliability and control of client-server network interactions.

Speaker Sergey Fedorov, Netflix

02:45 PM - 03:15 PM

Break

03:15 PM - 03:45 PM

Enforcing Encryption @Scale

At Facebook, we run a global infrastructure that supports thousands of services, with many new ones spinning up daily. We take protecting our network traffic very seriously, so we must have a sustainable way to enforce our security policies transparently and globally. One of the requirements is that all traffic that crosses "unsafe" network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management.

This talk describes the infrastructure we built for enforcing the "encrypt all" policy on the end-hosts. We discuss alternatives and tradeoffs and how we use BPF programs. We also go over some of the numerous challenges we faced when realizing this plan. Additionally, we talk about one of our solutions, Transparent TLS (TTLS), that we've built for services that either could not enable TLS natively or could not upgrade to a newer version of TLS easily.

Speaker Mingtao Yang, Facebook

Speaker Ajanthan Asogamoorthy, Facebook

03:45 PM - 04:15 PM

Safe: How AWS Prevents and Recovers from Operational Events

At AWS, nothing comes before our top priorities of security and operational excellence. With over 25 years of experience operating Amazon.com and AWS, we have refined procedures and techniques for reducing the incidence, duration, and severity of operational events. In this session, we’ll share many of our lessons that apply before, during, and after events occur. We will also dive into our ‘SAFE’ protocol for incident response and our ‘Correction of Error’ process for rigorously analyzing events and ensuring that lessons are learned and painful problems are not repeated.

04:15 PM - 04:30 PM

Closing Remarks

Speaker Najam Ahmad, Facebook

04:30 PM - 05:30 PM

Happy Hour

SPEAKERS AND MODERATORS

Omar Baldonado leads the groups that develop/operate Meta's global data center networks. These networks...

Omar Baldonado

Jade Auer

Facebook

Jose Leitao is a production network engineer in the Network organization at Meta. His...

Jose Leitao

Amin Vahdat

Google

Dave Maltz

Microsoft

Marek Vavrusa

Cloudflare

Palak Mehta

Facebook

Sergey Fedorov

Netflix

Mingtao Yang

Facebook

Ajanthan Asogamoorthy

Facebook

Najam Ahmad

Facebook
