Networking @Scale

Computer History Museum 9:30am - 5:30pm

Event Completed

Networking @Scale is an invitation-only technical conference for engineers that build and manage large-scale networks.

Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often present complex engineering challenges to solve. The Networking @Scale community focuses on bringing people together to discuss these challenges. This year’s conference will feature technical speakers from leading network operators with a focus on how we maintain high network availability and deal with outages.

Networking @Scale will be held at the Computer History Museum in Mountain View, California, on Monday, September 9th beginning at 9:30 AM. Be sure to also stick around for Happy Hour in the evening.

Read More Read Less

@Scale brings thousands of engineers together throughout the year to discuss complex engineering challenges and to work on the development of new solutions. We're committed to providing a safe and welcoming environment — one that encourages collaboration and sparks innovation.

Every @Scale event participant has the right to enjoy his or her experience without fear of harassment, discrimination, or condescension. The @Scale code of conduct outlines the behavior that we support and don't support at @Scale events and conferences. We expect participants to follow these rules at all @Scale event venues, online communities, and event-related social activities. These guidelines will keep the @Scale community a safe and enjoyable one for everyone.

Be welcoming. Everyone is welcome at @Scale events, inclusive of (but not limited to) gender, gender identity or expression, sexual orientation, body size, differing abilities, ethnicity, national origin, language, religion, political beliefs, socioeconomic status, age, color and neurodiversity. We have a zero-tolerance policy for discrimination.

Choose your words carefully. Treat one another with respect and in a professional manner. We're here to collaborate. Conflict is not part of the equation.

Know where the line is, and don't cross it. Harassment, threats, or intimidation of any kind will not be tolerated. This includes verbal, physical, sexual (such as sexualized imagery on clothing, presentations, in print, or onscreen), written, or any other form of aggression (whether outright, subtle, or micro). Behavior that is offensive, as determined by @Scale organizers, security staff, or conference management, will not be tolerated. Participants who are asked to stop a behavior or an action are expected to comply immediately or will be asked to leave.

Don't be afraid to call out bad behavior. If you're the target of harmful or offensive behavior, or if you witness someone else being harassed, threatened, or intimidated, don't look away. Tell an @Scale staff member, a security staff member, or a conference organizer immediately. Please notify our event staff, security staff, or conference organizers of any harmful or offensive behavior that you've experienced or witnessed in any form, whether in person or online.

We at @Scale want our events to be safe for everyone, and we have a zero-tolerance policy for violations of our code of conduct. @Scale conference organizers will investigate any allegation of problematic behavior, and we will respond accordingly. We reserve the right to take any follow-up actions we determine are needed. These include being warned, being refused admittance, being ejected from the conference with no refund, and being banned from future @Scale events.

Event Completed
Agenda
8:30am - 9:30am

Registration and Breakfast

9:30am - 9:45am

Welcome Remarks

9:45am - 10:15am

Secure Reliability: Tales from Mysterious Platforms

Infrastructure is more critical than ever; we push for standards/automation to build repeatable designs with precise handling of frequent events, maintaining the high availability of our systems. But what about issues that are not easily solved by just software? What about problems and functions that most system designers are not considering? How does security fit into the big picture?. These are some of the questions we want to start exploring in this presentation by examining how security is an element in the overall reliability of infrastructure we depend on every day. For this, we will describe from our POV, at a high level, the concept of reliability design, what are usually some goals when architecting these systems and the tradeoffs made. To be able to talk about these subjects, we will cover topics around incident response and analysis with a few examples. We will describe an exploration that we did in some of our optical platforms on crucial areas that could negatively affect the overall reliability of our design and service provided. To conclude, we will summarize takeaways and recommendations for the audience to take back to their respective organizations.
10:15am - 10:45am

Failing Last and Least: Design Principles for Network Availability

The network is among the most critical components of any computing infrastructure. It is an enabler for modern distributed systems architecture, with a trend toward ever increasing functionality and offloads moving into the network. It must constantly be expanded and reconfigured as a prerequisite to deploying compute and storage infrastructure that are themselves growing exponentially. Most importantly however, the network must deliver the highest levels of reliability. A failure in the network often has an outsized blast radius and one of the leading causes of correlated and cascading failures, which only exacerbates the impact of a network failure. In this talk, we draw upon our experience at Google building some of the world's largest networks to discuss: i) the importance of network reliability, ii) some of the leading causes of failure, and iii) the principles we draw upon to deliver the necessary levels of network reliability. We chart a course toward the potential for common community infrastructure that can fundamentally move the reliability needle.
10:45am - 11:15am

BGP++ Deployment and Outages

With Facebook's increasing user base (around 2.5B monthly active users!), our DC network is growing fast and FBOSS switches are getting provisioned daily. We are in urgent need of a scalable and reliable routing solution. Looking at existing open source BGP solutions, users have to deal with a lot of unused features, and complexity in managing these excess networking features. At a result, we decided to build our own routing agent - BGP++. In this talk we’ll first discuss motivations and tradeoffs of building a new in-house bgp agent. Then we’ll describe how we build and scale BGP++ within Facebook DC, also what we did to catch up with FBOSS deployment during last year. Specifically, we’ll zoom in to look at challenges we faced during deployment, our reaction of putting a large amount of effort into testing and push tooling. Finally we’ll share our current push performance and key takeaways from all lessons learnt on our thorny road.
11:15am - 11:45am

Preventing Network Changes from Becoming Network Outages

“The network never sleeps…” Complex network changes must be carried out without any impact to production traffic, which makes it critical for every change to be thoroughly tested before being implemented on the network. However, at cloud scale there is no way to have a separate full-scale network on which to test. In this talk, I will describe how Microsoft built the Open Network Emulator (ONE) system to be bug-compatible with the software running on our production switches, and how we have integrated the emulator into our processes to validate and deploy all changes.
11:45am - 12:45pm

Lunch

12:45pm - 1:15pm

What Have We Learned from Bootstrapping 1.1.1.1

What have we learned from bootstrapping 1.1.1.1. When we launched the public recursive DNS service in April 2018, we had no idea if or how useful would it be. Prior to launch, we hit the roadblocks with the IP being used internally, and the client-side support for encrypted DNS. After the launch day, we fought with scale and broken DNS infrastructure out there. This talk is going to cover both the problems we encountered bootstrapping the service, but also the evolution of the service architecture, and things we would have done differently knowing what we know after more than a year operating a public recursive DNS service.
1:15pm - 1:45pm

Operating Facebook’s SD-Wan Network

Facebook’s centrally controlled wide-area network connects data centers serving a few billion users. The backbone design faced some challenges: growing demand, quality of service, provisioning, operation in scale, slow convergence. These challenges led to a holistic centrally controlled network solution called Express Backbone (EBB). This talk will focus on the operations challenges we faced on EBB, and how we met them. We will describe high level designs and recent updates on EBB, and then dive into specific reliability challenges and how we achieved operational reliability with software sharding and replication to be in sync with the network design. Furthermore, we will talk about how we solved some of the challenges specific to software defined network involving traditional network activities, e.g. network maintenance and outages. To conclude, we will summarize takeaways, realizations and how a multi-disciplinary team (network production engineers and software engineers) together operates facebook’s software defined network.
1:45pm - 2:15pm

Diversity Panel

2:15pm - 2:45pm

Getting a Taste of Your Network

Sergey Fedorov, Senior Software Engineer at Netflix, describes a client-side network measurement system called "Probnik", and how it can be used to improve performance, reliability and control of client-server network interactions.
2:45pm - 3:15pm

Break

3:15pm - 3:45pm

Enforcing Encryption @Scale

At Facebook, we run a global infrastructure that supports thousands of services, with many new ones spinning up daily. We take protecting our network traffic very seriously, so we must have a sustainable way to enforce our security policies transparently and globally. One of the requirements is that all traffic that crosses "unsafe" network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management. This talk describes the infrastructure we built for enforcing the "encrypt all" policy on the end-hosts. We discuss alternatives and tradeoffs and how we use BPF programs. We also go over some of the numerous challenges we faced when realizing this plan. Additionally, we talk about one of our solutions, Transparent TLS (TTLS), that we've built for services that either could not enable TLS natively or could not upgrade to a newer version of TLS easily.
3:45pm - 4:15pm

Safe: How AWS Prevents and Recovers from Operational Events

At AWS nothing comes before our top priorities of security and operational excellence. With over 25 years of experience operating Amazon.com and AWS, we have refined procedures and techniques for reducing the incidence, duration, and severity of operational events. In this session we’ll share many of our lessons that apply before, during, and after events occur and we will dive into our ‘SAFE’ protocol for incident response, and our ‘Correction of Error’ process for rigorously analyzing events and ensuring that lessons are learned and painful problems are not repeated.
4:15pm - 4:30pm

Closing Remarks

4:30pm - 5:30pm

Happy Hour

Join the @Scale Mailing List and Get the Latest News & Event Info

Code of Conduct

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy