Networking solutions are important for building applications and services that serve billions of people around the world. At this year’s Networking @Scale conference in Boston, attendees gathered to hear engineers from Akamai, Boston University, Facebook, Google, and others discuss this year’s theme of reliable networking at scale. Speakers shared challenges and solutions related to improving the reliability, security, and performance of large-scale networks.
If you missed the event, you can view recordings of the presentations below. If you are interested in future events, visit the @Scale website or join the @Scale community.
Keynote — Network Reliability: Where we have been and where we are going
Najam Ahmad, Vice President at Facebook
Networking grew up as a mostly best-effort service but has evolved into one of the foundational elements of modern cloud-based computer systems. Today’s networks are expected to be highly reliable. Najam takes a retrospective look at how networks, networking technologies, and network professionals have evolved over the past several years. More importantly, he touches on the areas we need to focus on to move network reliability an inch closer to the mythical 100 percent reliable system.
All the Bits, Everywhere, All of the Time: Challenges in Building and Automating a Reliable Global Network
Ashok Narayanan, Tech Lead Manager at Google
Large global WAN networks have unique reliability and capacity delivery requirements. They typically connect to the internet, meaning they use distributed routing protocols. They are often much sparser and more irregular than large cluster networks, and can have poorer reachability depending on where in the world they are. However, we depend on these networks to reach our customers, so we need to build and maintain them at an extremely high level of reliability while growing their capacity at heretofore unseen speeds and at lower costs than ever before. These needs are often in conflict with one another. Ashok covers some of his experiences in building and automating Google’s network backbone. He touches on the perceived and real reliability differences between SDN and on-box routed networks, the importance of network automation and programmatic network management, the risks introduced by these management paradigms, and how they can be mitigated. He further dives into the importance of defining and measuring network SLOs and tracking network health and capacity availability against them over time, as well as some of the hard problems in global WAN availability today, such as global routes, BGP, and MPLS. He concludes by suggesting where we could go from here in search of a truly six-nines network.
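To put the six-nines target in perspective, a quick back-of-the-envelope calculation shows how little annual downtime each extra nine allows (the function below is a generic availability calculation, not anything from the talk):

```python
# Annual downtime budget for an availability target of N nines.
# E.g., "six nines" = 99.9999% availability.

SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years for simplicity

def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year at an availability of `nines` nines."""
    unavailability = 10 ** (-nines)
    return SECONDS_PER_YEAR * unavailability

# Three nines permits ~8.76 hours of downtime per year;
# six nines permits only ~31.5 seconds.
assert round(downtime_budget_seconds(3)) == 31536
assert round(downtime_budget_seconds(6), 1) == 31.5
```

The jump from three to six nines shrinks the budget a thousandfold, which is why manual operations alone cannot get there and automation becomes essential.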
Detecting Unusually Routed ASes
Evimaria Terzi, Associate Professor and Associate Chair of Academics, Computer Science, at Boston University
The routes used in the internet’s interdomain routing system are a rich information source that could be exploited to answer a wide range of questions. However, analyzing routes is difficult because the fundamental object of study is a set of paths. Evimaria presents new analytical tools (metrics and methods) for analyzing AS paths and applies them to study interdomain routing in the internet over a recent 13-year period. She offers a quantitative understanding of changes in internet routing at the micro level (individual ASes) as well as at the macro level (the set of all ASes). At the micro level, she shows that these tools can identify clusters of ASes with the most unusual routing at any given time; interestingly, such clusters often correspond to sets of jointly owned ASes. She also demonstrates that analysis of individual ASes can expose business and engineering strategies of the organizations that own them, strategies often related to content delivery or service replication. At the macro level, she shows that the ASes with the most unusual routing define discernible and interpretable phases of the internet’s evolution, and she concludes with a discussion of how they can be used to provide a quantitative measure of the “flattening” of the internet.
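To make the idea of “unusual routing” concrete, here is one toy way to score it: flag an AS whose adjacent AS-level edges rarely appear across the observed path set. This metric is an illustrative stand-in invented for this sketch, not the method from the talk:

```python
# Toy "unusualness" score for an AS: the fraction of AS-level edges
# adjacent to it that appear on very few observed paths.
from collections import Counter

def unusualness(paths, asn, rare_threshold=1):
    """paths: list of AS paths (lists of AS numbers).
    Returns the fraction of asn's adjacent edges seen on at most
    rare_threshold paths; higher means more unusual routing."""
    edge_count = Counter()
    for p in paths:
        for a, b in zip(p, p[1:]):
            edge_count[(a, b)] += 1
    edges = [e for e in edge_count if asn in e]
    if not edges:
        return 0.0
    rare = sum(1 for e in edges if edge_count[e] <= rare_threshold)
    return rare / len(edges)

paths = [[1, 2, 3], [1, 2, 4], [1, 2, 3], [5, 2, 3]]
assert unusualness(paths, 4) == 1.0  # AS 4 is reached via a rarely seen edge
assert unusualness(paths, 1) == 0.0  # AS 1 sits on a heavily used edge
```

Real analyses must also account for path-set structure over time, which is precisely what makes this problem hard.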
Anycast Content Delivery at Akamai
Igor Lubashev, Engineering Manager at Akamai
Akamai is well known as a DNS-based CDN. Instead of building a few dozen very large POPs, Akamai tries to serve content from a few thousand small POPs very close to the end users, using DNS to direct end users to the POP that is best for them. This generally performs and scales better. However, there are some unique cases for which Anycast-based content delivery is a better alternative. Igor presents Akamai’s “hybrid Anycast” architecture, which allows Akamai to serve traffic from thousands of edge deployments but over anycast addresses announced from dozens of POPs. He discusses the advantages of this architecture, along with the hurdles encountered and lessons learned.
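The core trade-off can be sketched in a few lines. In this toy model (the POP names, distances, and selection logic are illustrative assumptions, not Akamai’s implementation), DNS-based mapping can choose among all edge POPs, while anycast lands traffic on whichever announcing POP BGP deems closest:

```python
# Toy contrast of DNS-based mapping vs. anycast landing.

def dns_mapped_pop(distance_to_client):
    """DNS-based mapping can pick the best of *all* edge POPs,
    because the mapping system controls the answer per resolver."""
    return min(distance_to_client, key=distance_to_client.get)

def anycast_landing_pop(distance_to_client, anycast_pops):
    """With anycast, BGP delivers packets to the nearest POP announcing
    the anycast prefix; in a hybrid design, that POP then hands the
    connection to a nearby edge deployment."""
    return min(anycast_pops, key=distance_to_client.get)

distance = {"edge-bos": 5, "edge-nyc": 40, "core-iad": 120, "core-ord": 200}
anycast_pops = ["core-iad", "core-ord"]  # only large POPs announce anycast

assert dns_mapped_pop(distance) == "edge-bos"
assert anycast_landing_pop(distance, anycast_pops) == "core-iad"
```

The hybrid architecture aims to keep the fine-grained edge serving of the DNS approach while offering the stable, resolver-independent addressing of anycast.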
Self-Organizing Mesh Access (SOMA)
Derek Schuster, Software Engineer at Facebook
SOMA is an enterprise-grade Wi-Fi mesh network optimized for providing connectivity in unconnected and underserved markets. By lowering the total cost of ownership, simplifying connectivity installations, and reducing operational overhead, Facebook aims to help ISPs all over the world expand their footprint. Derek shares several successful mesh deployments in Africa, comprising over 200 mesh APs, that demonstrate the technology’s effectiveness for public Wi-Fi use cases.
Security Performance Management
Marc Light, VP of Data & Research at BitSight Technologies
Dan Dahlberg, Director of Security Research at BitSight Technologies
Ethan Geil, Technical Director at BitSight Technologies
Security threats arising from supply chains pose a serious and growing danger, but traditional risk management techniques are largely subjective, often ambiguous, and scale poorly. We have developed an objective set of metrics that describe the security performance of organizations, using a variety of external observations (including compromised systems, endpoint telemetry, file sharing activity, and server configurations); we compute daily updates to these metrics for hundreds of thousands of organizations worldwide. Marc, Dan, and Ethan discuss some of the key challenges in collecting, storing, and processing cybersecurity observations on a global scale. This data also provides a unique perspective on widespread security events and trends; as an example, they present an analysis of the attack surface introduced by recent vulnerabilities and use it to gain insight into the effectiveness of security controls across various industries and localities.
Enforcing Encryption @Scale
Kyle Nekritz, Software Engineer at Facebook
Facebook runs a global infrastructure that supports thousands of services, with many new ones spinning up daily. Facebook takes protecting network traffic very seriously and prioritizes a sustainable way to enforce security policies transparently and globally. One requirement is that all traffic crossing “unsafe” network links must be encrypted with TLS 1.2 or above, using secure modern ciphers and robust key management. Kyle describes the infrastructure Facebook has built for enforcing this “encrypt all” policy on end hosts. He discusses alternatives and trade-offs, as well as how they use BPF programs. He also covers some of the challenges encountered in realizing this plan and presents one of their solutions, Transparent TLS (TTLS), built for services that could not easily enable TLS or upgrade to a newer version of it.
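The policy itself reduces to a simple predicate per connection. The following minimal sketch assumes an illustrative cipher allowlist and function names (it is not Facebook’s actual enforcement code, which runs on end hosts with BPF):

```python
# Sketch of an "encrypt all" policy check: plaintext is allowed only on
# safe links; everything else needs TLS 1.2+ with an approved cipher.
# The allowlist below is a hypothetical example.

APPROVED_CIPHERS = {
    "TLS_AES_128_GCM_SHA256",
    "ECDHE-ECDSA-AES128-GCM-SHA256",
}

def policy_allows(link_is_safe: bool, tls_version: tuple, cipher: str) -> bool:
    """tls_version is a (major, minor) tuple, e.g. (1, 2) for TLS 1.2."""
    if link_is_safe:
        return True
    return tls_version >= (1, 2) and cipher in APPROVED_CIPHERS

assert policy_allows(True, (0, 0), "")   # safe link: no TLS required
assert policy_allows(False, (1, 3), "TLS_AES_128_GCM_SHA256")
assert not policy_allows(False, (1, 1), "ECDHE-ECDSA-AES128-GCM-SHA256")
```

The hard part at scale is not the predicate but applying it transparently to every flow, which is where in-kernel mechanisms and solutions like TTLS come in.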
Improving QUIC CPU Performance
Ian Swett, QUIC Tech Lead at Google
QUIC is a new internet transport that forms the foundation of HTTP/3 at the IETF. The 2017 SIGCOMM paper on QUIC estimated it constituted 7 percent of public internet traffic, making the CPU efficiency of QUIC extremely important. However, as of 2017, QUIC consumed over 2x the CPU of HTTPS over TCP. Ian explains how the QUIC and YouTube teams massively reduced QUIC CPU consumption, reaching parity with TCP in some cases.
Using SmartNICs to Offload Connection Encryption in the Data Center
Nick Viljoen, Director, Software Engineering at Netronome
In an age when ensuring data privacy is more essential than ever, encryption within the data center is becoming a reality. However, it incurs a significant CPU cost. Nick explains how SmartNICs can be used to offload TLS encryption while ensuring that the host TCP stack is not compromised, and how the NIC can maintain all the state a socket-based mechanism needs, dealing along the way with myriad exception cases such as packet drops, out-of-order packets, and host-side packet mangling. He demonstrates the benefits of this type of offload in a variety of cases. Finally, he evaluates the possibilities of applying this technology to emerging protocols such as QUIC, and the benefits of integrating encryption and congestion control mechanisms to ensure optimal performance.
Performance Tools and Techniques to Improve Envoy Scalability
Josh Marantz, Tech Lead Manager at Google Envoy Cloud Proxy
As Envoy scales with traffic growth, service complexity, and processor count, engineers need an increasing array of tools to achieve performance goals, specifically tools that help visualize latency, throughput, memory, CPU load, and thread contention. Some of these tools already exist, such as kcachegrind and Google’s performance benchmarking library. Others needed to be built, such as a new open source L7 load tester based on the Envoy networking stack and capable of driving HTTP/2 traffic through proxies. Josh discusses these tools and how Google applied them to find and fix bottlenecks in Envoy, ultimately factoring them into decisions about how to improve the system and its usage.
Adaptive Cache Networks with Optimality Guarantees
Stratis Ioannidis, Assistant Professor, Electrical and Computer Engineering, at Northeastern University
Given the content demand, described by content requests and the paths they follow, engineers wish to determine the content placement that maximizes the expected caching gain, i.e., the reduction in routing costs due to intermediate caching. The offline version of this problem is NP-hard. To make matters worse, in most cases both the demand and the network topology may be a priori unknown; hence, distributed, adaptive content-placement algorithms that yield constant approximation guarantees are desired. Stratis demonstrates that path replication, an algorithm often encountered in both the networking literature and in practice, can be arbitrarily suboptimal when combined with traditional cache eviction policies such as LRU, LFU, or FIFO. He proposes a distributed, adaptive algorithm that provably constructs a probabilistic content placement within a 1 − 1/e factor of the optimal, in expectation.
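The caching-gain objective can be illustrated with a toy computation: a request is served by the first cache on its path holding the item, and the gain is the routing cost thereby avoided. The topology, costs, and demand below are toy assumptions, not from the talk:

```python
# Illustrative computation of caching gain: hops saved when requests
# hit a cache before reaching the designated server.

def caching_gain(requests, placement, edge_cost=1.0):
    """requests: list of (item, path), where path is the ordered list of
    nodes a request traverses toward the server (server last).
    placement: dict node -> set of items cached at that node."""
    gain = 0.0
    for item, path in requests:
        full_cost = len(path) * edge_cost  # cost of going all the way
        for hops, node in enumerate(path, start=1):
            if item in placement.get(node, set()):
                gain += full_cost - hops * edge_cost
                break  # served by the first cache holding the item
    return gain

requests = [("a", ["v1", "v2", "server"]), ("b", ["v1", "v2", "server"])]
placement = {"v1": {"a"}, "v2": {"b"}}
# "a" hits at hop 1 (saves 2 hops); "b" hits at hop 2 (saves 1 hop)
assert caching_gain(requests, placement) == 3.0
```

Maximizing the expectation of this quantity over random demand, subject to per-node cache capacities, is the NP-hard placement problem the talk addresses.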
Building Stadia’s Edge Compute Platform
Andrew Oates, Senior Staff Software Engineer at Google
Building an edge platform to support Stadia (Google’s gaming platform) has presented a number of challenges. To ensure the best performance for users on a product of Stadia’s scope, engineers have had to scale Google’s edge platform and build new networking, compute, and storage services. Andrew explores some of the challenges Google has faced scaling its stack both up and down to support the reach and performance requirements of the new gaming platform.