Event times below are displayed in PT.
Building and operating large-scale networks that host applications serving billions of people worldwide presents complex engineering challenges. At the Networking@Scale 2022 virtual conference, hosted by Meta on June 1 and 2, 2022, engineers from Cloudflare, Fastly, Google, Microsoft Azure, Netflix, and Meta presented talks and joined live panel discussions with the audience on these challenges.
The conference saw a great turnout of attendees from industry and academia alike. This summer edition of Networking@Scale was themed around Transport Innovation: how to move data across the network efficiently and quickly, addressing congestion, performance, reliability, and extensibility through innovations in the transport layer. The two-day program focused on transport protocols such as QUIC, TCP, and RDMA.
Day 1 of the conference focused on the value proposition and innovations in using the QUIC protocol in the Internet architecture and specific use-case studies demonstrating high performance and lower latencies achieved with QUIC at the CDN, Edge, and Backbone layers.
Day 2 pivoted to networking challenges in datacenters (DCs) and WANs, and how innovations in TCP and other protocols (e.g., RoCE) help tackle them.
The Q&A sessions saw great engagement between the audience and presenters. On the QUIC side, they discussed topics such as QUIC's agility and QUIC/HTTP/3 adoption on the web, in both browsers and servers. On the TCP side, there were discussions around BPF tuning versus in-kernel changes, deploying changes at scale, and RoCE security and congestion management.
Recordings of the presentations are below. If you are interested in future events, please visit the @Scale website, follow the @Scale Facebook page, or join the Networking@Scale attendees Facebook group.
We've all heard much about QUIC in the past few years, and a lot has been made of its performance benefits for HTTP/3. For some of us, however, HTTP/3 was always just the beginning, just the vehicle for us to get QUIC out into the world. This talk will go beyond these immediate benefits of QUIC and present my view on our somewhat anticipated sleight of hand. The talk will discuss QUIC's long-term value proposition for the Internet's architecture, including some recent projects and a broad sketch of where it can go.
In a typical CDN architecture, the caching tier is fronted by a load-balancing tier; response content flows from the cache to the requester through the load balancer. With this architecture, extra I/O, CPU cycles, and intra-cluster network bandwidth are spent streaming the content through multiple hops. We present a solution that uses QUIC's properties to implement a form of Direct Server Return (DSR) from the caching layer directly to the client. This form of DSR obviates the need for most intra-cluster communication when serving cached content. In this talk, we go over the technical challenges in implementing QUIC cache DSR, its security properties, the expected performance improvements, and future applications.
Transfers over high-BDP links incur a startup delay while congestion control probes the bandwidth of the underlying link. The impact of this delay is inversely proportional to the size of the transfer, since small transfers may spend all their transfer time probing for the available bandwidth and never reach or utilize it. While this probing is necessary for links with rapidly changing capacity, it can be avoided on more predictable links such as backbone links. Existing TCP approaches are either limited to specific pairs of endpoints or require intermediate proxies. In this presentation, we share the approach we've developed for use with QUIC deployments in Meta's backbone network. We use a modified congestion controller that tracks the average congestion control state for connections using each backbone path. This state is then used to "jumpstart" new connections across the same path, significantly reducing the startup delay. This, coupled with QUIC 0-RTT, offers significant savings compared to existing TCP-based approaches for transfers of size close to the path BDP.
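The jumpstart idea in the abstract above can be illustrated with a minimal sketch. This is not Meta's implementation: the class names, the path identifier format, the use of an exponentially weighted moving average, the smoothing factor, and the default initial window are all assumptions made for illustration. It only shows the general shape of the technique: remember a smoothed congestion window per path, and seed new connections on that path with it instead of starting from a small default.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    """Smoothed congestion state for one backbone path (hypothetical structure)."""
    avg_cwnd_bytes: float
    samples: int = 1


class JumpstartCache:
    """Tracks an average congestion window per path and seeds new connections.

    All names and defaults here are illustrative assumptions, not a real API.
    """

    def __init__(self, default_cwnd_bytes: int = 10 * 1200, alpha: float = 0.125):
        self.default_cwnd_bytes = default_cwnd_bytes  # assumed fallback initial window
        self.alpha = alpha                            # assumed EWMA smoothing factor
        self._paths = {}                              # path_id -> PathState

    def record(self, path_id: str, cwnd_bytes: int) -> None:
        """Fold the cwnd observed on a connection into the path's running average."""
        state = self._paths.get(path_id)
        if state is None:
            self._paths[path_id] = PathState(float(cwnd_bytes))
            return
        state.avg_cwnd_bytes += self.alpha * (cwnd_bytes - state.avg_cwnd_bytes)
        state.samples += 1

    def initial_cwnd(self, path_id: str) -> int:
        """Return the window a new connection on this path would start with."""
        state = self._paths.get(path_id)
        if state is None:
            return self.default_cwnd_bytes  # unknown path: fall back to normal startup
        return max(self.default_cwnd_bytes, int(state.avg_cwnd_bytes))


cache = JumpstartCache()
print(cache.initial_cwnd("dc1->dc2"))   # 12000: no history yet, default window
cache.record("dc1->dc2", 1_000_000)
cache.record("dc1->dc2", 2_000_000)
print(cache.initial_cwnd("dc1->dc2"))   # 1125000: seeded from the smoothed average
```

A real controller would also need to age out stale state and to back off quickly if the seeded window turns out to be too aggressive; the sketch leaves both out.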
LIVE Q&A featuring Jana Iyengar, Matt Joras, Yair Gottdenker & Joseph Beshay
Nestled between transport protocols (TCP, UDP, QUIC) and application protocols (HTTP, etc.) is a layer few are familiar with. Layer 4¾ sits hiding in plain sight, often only being glimpsed during curious events that raise its prominence, such as edge cases under scale of deployment or diverse usage. In this talk, we'll take a look at the Cloudflare Protocols team's view of the Internet edge and explore some of the fantastic cases we've seen, and what that might mean for future developments of Layer 4 and Layer 7 and the eponymous in-between.
A key feature of HTTP/3 over QUIC is the ability to send a request in the first flight with the ClientHello. 0-RTT in IETF QUIC is notably more complex than in gQUIC, with multiple packet number spaces and a limit on the amplification factor. We walk through some issues we hit and the tooling we used to identify and debug them before 0-RTT became a performance win for applications.
LIVE Q&A featuring Lucas Pardue & Ian Swett
A talk about two specific DC transport tuning initiatives: (a) handling sustained congestion in the network and (b) tackling bursts in the network. It covers the motivation, implementation overview, wins, and lessons learned for both initiatives.
We will share the design, implementation, and production experience of a BPF-based platform used to tune the network transport across millions of servers at Meta.
LIVE Q&A featuring Prashanth Kannan, Balasubramanian Madhavan, Abhishek Dhamija, Prankur Gupta & Kumar Saurabh Arora
We will demonstrate a performant and novel approach to performing NAT. It uses a unique transition mechanism, built on a new flag introduced to the seccomp() system call, that intercepts egress connect calls and opportunistically uses a transition IPv4 address when possible. This saves applications the pain of dealing with an unreachable end host while still living in an IPv6-only environment.
The Wide Area Network (WAN) connects many of Meta's datacenter (DC) regions and hundreds of Points of Presence (POPs). The WAN is shared by several services at Meta with high network demand. The network must be built for peak demand and must also account for failure scenarios to reduce the impact on Meta products. However, building a resilient network that is over-provisioned for all service peak demands at our current growth rates is practically infeasible due to fiber sourcing, deployment constraints, and the costs involved. This talk presents Meta's production traffic classification and WAN Entitlement solution, which our services currently use to share the network safely and efficiently. The Network Entitlement framework aims to provide a simple, stable, and operations-friendly abstraction of the network for sharing the backbone. Our framework includes two key parts: (1) a hose-based entitlement granting system that establishes an agile contract while achieving network efficiency and meeting long-term SLO guarantees, and (2) a flexible, large-scale, distributed, host-based traffic admission system that enforces the contract on production traffic.
LIVE Q&A featuring Keerti Lakshminarayan, Alok Tiagi, Guanqing Yan, Manikandan Somasundaram & Jitu Padhye
Omar supports the teams developing, deploying, and operating Meta's global data center networks. This...
Jana Iyengar is the Product Lead for Infrastructure Services at Fastly, where he is...
Matt Joras is a Software Engineer at Meta where he primarily works on their...
Yair Gottdenker is a Production Engineer at Meta. He has over 14 years of...
Joseph is a Research Scientist at Meta. He is part of the Traffic Protocols...
Bharat is a Software Engineering Manager in the Traffic Infrastructure group at Meta. He...
Lucas is a Senior Software Engineer on the Protocols teams at Cloudflare, and co-chair...
Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams....
Luca is a Software Engineer working on network protocols, improving applications performance at scale....
I am Bala Madhavan. I am a networking enthusiast. In the past, I have...
I am a Production Engineer working on Host Networking team at Meta. I work...
Prashanth Kannan is a software engineer working in the host network team at Meta...
Prankur Gupta is a Software Engineer for Meta Platforms, Inc. working towards unifying all...
Neil is a Research Scientist at Meta, working on tools to measure and improve...
Keerti is a software engineer currently working in the Network Platform at Netflix.
Alok is a software engineer currently working in the Network Platform at Netflix.
I joined Meta as a software engineer in 2017. My focus is on building...
I am a Software Engineer at Meta. Over the many years at Meta, I...
Jitendra Padhye received his PhD from UMass Amherst in 2000. He has been at...