TOPIC: Data, Systems and Networking

Networking @Scale California 2019

SEPTEMBER 09, 2019 @ 8:30 AM PDT - 5:30 PM PDT
Designed for engineers that build and manage large-scale networks. Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often present complex engineering challenges to solve.

ABOUT EVENT

Networking @Scale is an invitation-only technical conference for engineers that build and manage large-scale networks.

Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often present complex engineering challenges to solve. The Networking @Scale community focuses on bringing people together to discuss these challenges. This year’s conference will feature technical speakers from leading network operators with a focus on how we maintain high network availability and deal with outages.

Networking @Scale will be held at the Computer History Museum in Mountain View, California, on Monday, September 9th beginning at 9:30 AM. Be sure to also stick around for Happy Hour in the evening.

EVENT AGENDA

Event times below are displayed in PT.

September 9

08:30 AM - 09:30 AM
Registration and Breakfast
09:30 AM - 09:45 AM
Welcome Remarks
Speaker Omar Baldonado, Meta
09:45 AM - 10:15 AM
Secure Reliability: Tales from Mysterious Platforms

Infrastructure is more critical than ever; we push for standards and automation to build repeatable designs with precise handling of frequent events, maintaining the high availability of our systems. But what about issues that are not easily solved by software alone? What about problems and functions that most system designers are not considering? How does security fit into the big picture? These are some of the questions we want to start exploring in this presentation by examining how security is an element in the overall reliability of the infrastructure we depend on every day.

For this, we will describe, from our point of view and at a high level, the concept of reliability design, the usual goals when architecting these systems, and the tradeoffs made. To ground the discussion, we will cover incident response and analysis with a few examples.

We will describe an exploration we carried out on some of our optical platforms, focusing on crucial areas that could negatively affect the overall reliability of our design and the service provided. To conclude, we will summarize takeaways and recommendations for the audience to take back to their respective organizations.

Speaker Jade Auer, Facebook
Speaker Jose Leitao, Facebook
10:15 AM - 10:45 AM
Failing Last and Least: Design Principles for Network Availability

The network is among the most critical components of any computing infrastructure. It is an enabler for modern distributed systems architecture, with a trend toward ever-increasing functionality and offloads moving into the network. It must constantly be expanded and reconfigured as a prerequisite to deploying compute and storage infrastructure that are themselves growing exponentially. Most importantly, however, the network must deliver the highest levels of reliability. A failure in the network often has an outsized blast radius and is one of the leading causes of correlated and cascading failures, which only exacerbates the impact of a network failure.

In this talk, we draw upon our experience at Google building some of the world's largest networks to discuss: i) the importance of network reliability, ii) some of the leading causes of failure, and iii) the principles we draw upon to deliver the necessary levels of network reliability. We chart a course toward the potential for common community infrastructure that can fundamentally move the reliability needle.

Speaker Amin Vahdat, Google
10:45 AM - 11:15 AM
BGP++ Deployment and Outages

With Facebook's increasing user base (around 2.5B monthly active users!), our DC network is growing fast and FBOSS switches are getting provisioned daily. We are in urgent need of a scalable and reliable routing solution. With existing open source BGP solutions, users have to deal with many unused features and the complexity of managing them. As a result, we decided to build our own routing agent - BGP++.

In this talk, we’ll first discuss the motivations and tradeoffs of building a new in-house BGP agent. Then we’ll describe how we built and scaled BGP++ within Facebook's DC, and what we did to keep pace with FBOSS deployment over the last year. Specifically, we’ll zoom in on the challenges we faced during deployment and how we responded by investing heavily in testing and push tooling. Finally, we’ll share our current push performance and key takeaways from all the lessons learnt on our thorny road.

11:15 AM - 11:45 AM
Preventing Network Changes from Becoming Network Outages

“The network never sleeps…” Complex network changes must be carried out without any impact to production traffic, which makes it critical for every change to be thoroughly tested before being implemented on the network. However, at cloud scale there is no way to have a separate full-scale network on which to test.

In this talk, I will describe how Microsoft built the Open Network Emulator (ONE) system to be bug-compatible with the software running on our production switches, and how we have integrated the emulator into our processes to validate and deploy all changes.

Speaker Dave Maltz, Microsoft
11:45 AM - 12:45 PM
Lunch
12:45 PM - 01:15 PM
What Have We Learned from Bootstrapping 1.1.1.1

When we launched the public recursive DNS service in April 2018, we had no idea whether it would be useful, or how useful. Prior to launch, we hit roadblocks with the IP being used internally and with client-side support for encrypted DNS. After launch day, we fought with scale and with broken DNS infrastructure out there. This talk will cover both the problems we encountered bootstrapping the service and the evolution of the service architecture, along with the things we would have done differently knowing what we know after more than a year of operating a public recursive DNS service.

Speaker Marek Vavrusa, Cloudflare
01:15 PM - 01:45 PM
Operating Facebook’s SD-WAN Network

Facebook’s centrally controlled wide-area network connects data centers serving a few billion users. The backbone design faced some challenges: growing demand, quality of service, provisioning, operation at scale, and slow convergence. These challenges led to a holistic, centrally controlled network solution called Express Backbone (EBB). This talk will focus on the operational challenges we faced on EBB and how we met them.

We will describe the high-level design of EBB and recent updates to it, and then dive into specific reliability challenges and how we achieved operational reliability through software sharding and replication kept in sync with the network design.

Furthermore, we will talk about how we solved some of the challenges specific to software-defined networks that involve traditional network activities, e.g., network maintenance and outages. To conclude, we will summarize takeaways and realizations, and describe how a multi-disciplinary team (network production engineers and software engineers) operates Facebook’s software-defined network together.

Speaker Palak Mehta, Facebook
01:45 PM - 02:15 PM
Diversity Panel
02:15 PM - 02:45 PM
Getting a Taste of Your Network

Sergey Fedorov, Senior Software Engineer at Netflix, describes a client-side network measurement system called "Probnik", and how it can be used to improve performance, reliability and control of client-server network interactions.

Speaker Sergey Fedorov, Netflix
02:45 PM - 03:15 PM
Break
03:15 PM - 03:45 PM
Enforcing Encryption @Scale

At Facebook, we run a global infrastructure that supports thousands of services, with many new ones spinning up daily. We take protecting our network traffic very seriously, so we must have a sustainable way to enforce our security policies transparently and globally. One of the requirements is that all traffic that crosses "unsafe" network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management.

This talk describes the infrastructure we built for enforcing the "encrypt all" policy on the end-hosts. We discuss alternatives and tradeoffs and how we use BPF programs. We also go over some of the numerous challenges we faced when realizing this plan. Additionally, we talk about one of our solutions, Transparent TLS (TTLS), that we've built for services that either could not enable TLS natively or could not upgrade to a newer version of TLS easily.

Speaker Mingtao Yang, Facebook
Speaker Ajanthan Asogamoorthy, Facebook
03:45 PM - 04:15 PM
Safe: How AWS Prevents and Recovers from Operational Events

At AWS nothing comes before our top priorities of security and operational excellence. With over 25 years of experience operating Amazon.com and AWS, we have refined procedures and techniques for reducing the incidence, duration, and severity of operational events. In this session we’ll share many of our lessons that apply before, during, and after events occur and we will dive into our ‘SAFE’ protocol for incident response, and our ‘Correction of Error’ process for rigorously analyzing events and ensuring that lessons are learned and painful problems are not repeated.

04:15 PM - 04:30 PM
Closing Remarks
Speaker Najam Ahmad, Facebook
04:30 PM - 05:30 PM
Happy Hour

SPEAKERS AND MODERATORS

Omar Baldonado

Meta

Omar supports the teams developing, deploying, and operating Meta's global data center networks.

Jade Auer

Facebook

Jose Leitao

Facebook

Amin Vahdat

Google

Dave Maltz

Microsoft

Marek Vavrusa

Cloudflare

Palak Mehta

Facebook

Sergey Fedorov

Netflix

Mingtao Yang

Facebook

Ajanthan Asogamoorthy

Facebook

Najam Ahmad

Facebook
