Event times below are displayed in PT.
Networking @Scale is an invitation-only technical conference for engineers who build and manage large-scale networks.
Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often present complex engineering challenges to solve. The Networking @Scale community focuses on bringing people together to discuss these challenges. This year’s conference will feature technical speakers from leading network operators with a focus on how we maintain high network availability and deal with outages.
Networking @Scale will be held at the Computer History Museum in Mountain View, California, on Monday, September 9th beginning at 9:30 AM. Be sure to also stick around for Happy Hour in the evening.
Infrastructure is more critical than ever; we push for standards and automation to build repeatable designs with precise handling of frequent events, maintaining the high availability of our systems. But what about issues that are not easily solved by software alone? What about problems and functions that most system designers are not considering? How does security fit into the big picture? These are some of the questions we want to start exploring in this presentation by examining how security is an element in the overall reliability of the infrastructure we depend on every day.
To do this, we will describe, at a high level and from our point of view, the concept of reliability design, the usual goals when architecting these systems, and the tradeoffs made. To ground the discussion, we will cover incident response and analysis with a few examples.
We will describe an exploration we carried out on some of our optical platforms, looking at crucial areas that could negatively affect the overall reliability of our design and the service provided. To conclude, we will summarize takeaways and recommendations for the audience to take back to their respective organizations.
The network is among the most critical components of any computing infrastructure. It is an enabler for modern distributed systems architecture, with a trend toward ever-increasing functionality and offloads moving into the network. It must constantly be expanded and reconfigured as a prerequisite to deploying compute and storage infrastructure that are themselves growing exponentially. Most importantly, however, the network must deliver the highest levels of reliability. A failure in the network often has an outsized blast radius and is one of the leading causes of correlated and cascading failures, which only exacerbates its impact.
In this talk, we draw upon our experience at Google building some of the world's largest networks to discuss: i) the importance of network reliability, ii) some of the leading causes of failure, and iii) the principles we draw upon to deliver the necessary levels of network reliability. We chart a course toward the potential for common community infrastructure that can fundamentally move the reliability needle.
With Facebook's increasing user base (around 2.5B monthly active users!), our DC network is growing fast and FBOSS switches are being provisioned daily. We are in urgent need of a scalable and reliable routing solution. With existing open source BGP solutions, users have to deal with many unused features and the complexity of managing them. As a result, we decided to build our own routing agent: BGP++.
In this talk we’ll first discuss the motivations for and tradeoffs of building a new in-house BGP agent. Then we’ll describe how we built and scaled BGP++ within the Facebook DC, and what we did to catch up with FBOSS deployment over the last year. Specifically, we’ll zoom in on the challenges we faced during deployment and how we responded by investing heavily in testing and push tooling. Finally, we’ll share our current push performance and key takeaways from the lessons learnt on our thorny road.
“The network never sleeps…” Complex network changes must be carried out without any impact to production traffic, which makes it critical for every change to be thoroughly tested before being implemented on the network. However, at cloud scale there is no way to have a separate full-scale network on which to test.
In this talk, I will describe how Microsoft built the Open Network Emulator (ONE) system to be bug-compatible with the software running on our production switches, and how we have integrated the emulator into our processes to validate and deploy all changes.
What have we learned from bootstrapping 1.1.1.1? When we launched the public recursive DNS service in April 2018, we had no idea whether or how useful it would be. Prior to launch, we hit roadblocks with the IP being used internally and with client-side support for encrypted DNS. After launch day, we fought with scale and with broken DNS infrastructure out there. This talk covers the problems we encountered bootstrapping the service, the evolution of the service architecture, and the things we would have done differently knowing what we know after more than a year of operating a public recursive DNS service.
Facebook’s centrally controlled wide-area network connects data centers serving a few billion users. The backbone design faced several challenges: growing demand, quality of service, provisioning, operation at scale, and slow convergence. These challenges led to a holistic, centrally controlled network solution called Express Backbone (EBB). This talk will focus on the operational challenges we faced on EBB, and how we met them.
We will describe the high-level design of EBB and recent updates to it, and then dive into specific reliability challenges and how we achieved operational reliability with software sharding and replication kept in sync with the network design.
Furthermore, we will talk about how we solved some of the challenges specific to software-defined networks that involve traditional network activities, e.g. network maintenance and outages. To conclude, we will summarize takeaways and realizations, and describe how a multi-disciplinary team of network production engineers and software engineers together operates Facebook’s software-defined network.
Sergey Fedorov, Senior Software Engineer at Netflix, describes a client-side network measurement system called "Probnik", and how it can be used to improve performance, reliability and control of client-server network interactions.
At Facebook, we run a global infrastructure that supports thousands of services, with many new ones spinning up daily. We take protecting our network traffic very seriously, so we must have a sustainable way to enforce our security policies transparently and globally. One of the requirements is that all traffic that crosses "unsafe" network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management.
This talk describes the infrastructure we built for enforcing the "encrypt all" policy on the end-hosts. We discuss alternatives and tradeoffs and how we use BPF programs. We also go over some of the numerous challenges we faced when realizing this plan. Additionally, we talk about one of our solutions, Transparent TLS (TTLS), that we've built for services that either could not enable TLS natively or could not upgrade to a newer version of TLS easily.
At AWS nothing comes before our top priorities of security and operational excellence. With over 25 years of experience operating Amazon.com and AWS, we have refined procedures and techniques for reducing the incidence, duration, and severity of operational events. In this session we’ll share many of our lessons that apply before, during, and after events occur and we will dive into our ‘SAFE’ protocol for incident response, and our ‘Correction of Error’ process for rigorously analyzing events and ensuring that lessons are learned and painful problems are not repeated.
Omar supports the teams developing, deploying, and operating Meta's global data center networks.