TOPIC: Data, Systems and Networking

Networking @Scale Boston 2019

NOVEMBER 12, 2019 @ 08:30 AM - NOVEMBER 12, 2019 @ 06:30 PM PT
Designed for engineers that build and manage large-scale networks. Networking solutions are critical for building applications and services that serve billions of people around the world. Building and operating such large-scale networks often presents complex engineering challenges.

ABOUT EVENT

Networking @Scale is an invitation-only technical conference for engineers that build and manage large-scale networks.

Networking solutions are critical for building applications and services that serve millions, and sometimes billions, of people around the world. At this scale, there are always complex engineering challenges to solve. We’ll spend the day sharing experiences in improving reliability, security and performance in large-scale networks and collaborating on the development of new solutions.

Networking @Scale will be held at Hotel Commonwealth in Boston, Massachusetts on Tuesday, November 12th beginning at 8:30 AM ET. Be sure to also stick around for Happy Hour in the evening.

EVENT AGENDA

Event times below are displayed in PT.

November 12

08:30 AM - 10:00 AM
Registration & Breakfast
08:30 AM - 10:00 AM
Women in Tech Breakfast & Panel Discussion
10:00 AM - 10:05 AM
Welcome
SPEAKER Leonidas Kontothanassis, Facebook
10:05 AM - 10:30 AM
Keynote - Network Reliability: Where we have been and where we are going

Networking grew up as a mostly best-effort service but has evolved into one of the foundational elements of modern cloud-based computer systems. Networking solutions are critical for building applications and services that serve billions of people around the world. Today’s networks are expected to be highly reliable. This talk is a retrospective look at how networks, networking technologies and network professionals have evolved over the last several years. More importantly, the talk touches on areas we need to focus on in order to advance network reliability an inch closer to the mythical 100% reliable system.

SPEAKER Najam Ahmad, Facebook
10:30 AM - 11:00 AM
All the Bits, Everywhere, All of the Time: Challenges in Building and Automating a Reliable Global Network

Large global WANs have unique reliability and capacity delivery requirements. They typically connect to the Internet, which means they use distributed routing protocols. They are typically much more sparse and irregular than large cluster networks, and can have significantly poorer reachability depending on where in the world they are. Yet, we depend on these networks to reach our customers. We need to build and maintain these networks at an extremely high level of reliability while, at the same time, growing the capacity of these networks at hitherto unseen speeds and doing it more cheaply than ever before. These needs are often directly in conflict.

In this talk, Ashok will go over some of his experiences in building and automating Google’s network backbone. He will cover:

-- The perceived and real reliability differences between SDN and on-box routed networks.

-- The importance of network automation and programmatic network management to capacity delivery as well as reliability.

-- The risks introduced by these management paradigms, and how they can be mitigated.

-- The importance of defining and measuring network SLOs, and tracking network health and capacity availability over time against these SLOs.

-- Some of the hard problems in global WAN availability today, such as global routes, BGP and MPLS, and where we could go from here in the search for a truly 6-nines network.
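As a rough illustration of what an availability SLO of this kind implies, the sketch below converts a number of "nines" into a yearly downtime budget. It assumes that "6-nines" means 99.9999% availability and that the SLO is measured over a 365-day year; the function name and measurement window are illustrative choices, not taken from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year as the SLO window


def allowed_downtime_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget for an availability target of `nines` nines.

    Example: nines=3 means 99.9% availability, i.e. 0.1% allowed unavailability.
    """
    unavailability = 10.0 ** (-nines)
    return period_seconds * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(f"{n} nines: {allowed_downtime_seconds(n):.1f} seconds of downtime per year")
```

Under these assumptions, a six-nines target leaves a budget of roughly half a minute of total unavailability per year.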

SPEAKER Ashok Narayanan, Google
11:00 AM - 11:30 AM
Detecting Unusually-Routed ASes

The routes used in the Internet's interdomain routing system are a rich information source that could be exploited to answer a wide range of questions. However, analyzing routes is difficult, because the fundamental object of study is a set of paths. In this talk we will present new analysis tools -- metrics and methods -- for analyzing AS paths, and apply them to study interdomain routing in the Internet over a recent 13-year period. Using these tools we will try to present a quantitative understanding of changes in Internet routing at the micro level (of individual ASes) as well as at the macro level (of the set of all ASes). More specifically, we will show that at the micro level, our tools can identify clusters of ASes that have the most unusual routing at each time (interestingly, such clusters often correspond to sets of jointly-owned ASes). We will also show that analysis of individual ASes can expose business and engineering strategies of the organizations owning the ASes. These strategies are often related to content delivery or service replication. At the macro level, we will show that ASes with the most unusual routing define discernible and interpretable phases of the Internet's evolution. Furthermore, we will discuss how our tools can be used to provide a quantitative measure of the "flattening" of the Internet.

SPEAKER Evimaria Terzi, Boston University
11:30 AM - 12:00 PM
Anycast Content Delivery at Akamai

Akamai is well known as a DNS-based CDN. Instead of building a few dozen very large POPs, Akamai tries to serve content from a few thousand small POPs very close to the end users and uses DNS to direct end users to the POP that is best for them. This generally gives better performance and scale. However, there are some unique cases where the alternative, anycast-based content delivery, is a better option. Igor will present Akamai's "hybrid anycast" architecture that allows Akamai to serve traffic from thousands of edge deployments but over anycast addresses announced from dozens of POPs. He'll discuss advantages of this architecture as well as hurdles and experiences.

SPEAKER Igor Lubashev, Akamai
12:00 PM - 12:45 PM
Lunch
12:45 PM - 01:15 PM
Self Organizing Mesh Access (SOMA)

SOMA focuses on an enterprise-level Wi-Fi mesh network optimized for providing connectivity in unconnected and underserved markets. By lowering the total cost of ownership (TCO), simplifying connectivity installations and reducing operational overhead, Facebook’s goal is to help ISPs all over the world in expanding their footprint. We currently have several successful mesh deployments in Africa with over 200 mesh APs that demonstrate this Facebook technology very effectively for public Wi-Fi use cases.

SPEAKER Derek Schuster, Facebook
01:15 PM - 01:45 PM
Security Performance Management

Security threats arising from supply chains pose a serious and growing danger, but traditional risk management techniques are largely subjective, often ambiguous, and scale poorly. We have developed an objective set of metrics that describe security performance of organizations, using a variety of external observations (including compromised systems, endpoint telemetry, file sharing activity, and server configurations, among others); we compute daily updates to these metrics for hundreds of thousands of organizations worldwide. In this session, we will discuss some of the key challenges in collecting, storing, and processing cybersecurity observations on a global scale. These data also provide a unique perspective into widespread security events and trends; as an example, we will present an analysis of the attack surface introduced by recent vulnerabilities and use this to gain insight into the effectiveness of security controls across various industries and localities.

SPEAKER Marc Light, BitSight Technologies
SPEAKER Dan Dahlberg, BitSight Technologies
SPEAKER Ethan Geil, BitSight Technologies
01:45 PM - 02:15 PM
Enforcing Encryption @Scale

At Facebook, we run a global infrastructure that supports thousands of services, with many new ones spinning up daily. We take protecting our network traffic very seriously, so we must have a sustainable way to enforce our security policies transparently and globally. One of the requirements is that all traffic that crosses "unsafe" network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management. This talk describes the infrastructure we built for enforcing the "encrypt all" policy on the end-hosts. We discuss alternatives and tradeoffs and how we use BPF programs. We also go over some of the numerous challenges we faced when realizing this plan. Additionally, we talk about one of our solutions, Transparent TLS (TTLS), that we've built for services that either could not enable TLS natively or could not upgrade to a newer version of TLS easily.
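To make the "TLS 1.2 or above" requirement concrete, here is a minimal application-level sketch using Python's standard ssl module. It only shows how a single client can refuse older protocol versions; it is not the host-level BPF enforcement or the Transparent TLS system described in the talk, and the helper name make_client_context is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # create_default_context() enables certificate verification and hostname checks.
    ctx = ssl.create_default_context()
    # Reject TLS 1.0 and 1.1 handshakes; TLS 1.2 and newer remain allowed.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print("minimum TLS version:", ctx.minimum_version.name)
```

A context like this would be passed to whatever opens the connection, for example via ctx.wrap_socket(sock, server_hostname=...). Enforcing the policy fleet-wide, as the talk describes, requires mechanisms outside individual applications.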

SPEAKER Kyle Nekritz, Facebook
02:15 PM - 02:35 PM
Break
02:35 PM - 03:05 PM
Improving QUIC CPU Performance

QUIC is a new internet transport that forms the foundation of HTTP/3 at the IETF. The 2017 SIGCOMM paper on QUIC estimated it constituted 7% of public internet traffic, making the CPU efficiency of QUIC extremely important. However, as of 2017, QUIC consumed over 2x the CPU of HTTPS over TCP. Learn how the QUIC and YouTube teams massively reduced QUIC CPU consumption, reaching parity with TCP in some cases.

SPEAKER Ian Swett, Google
03:05 PM - 03:35 PM
Using SmartNICs to Offload Connection Encryption in the Data Center

In an age where ensuring data privacy is becoming more essential than ever, encryption within the datacenter is becoming a reality. However, this incurs a significant CPU cost. This talk will explain how SmartNICs can be used to offload TLS encryption while ensuring that the host TCP stack is not compromised, and how the NIC can keep all the necessary state of a socket-based mechanism while dealing with the myriad exception cases, such as packet drops, out-of-order packets and host-side packet mangling. We will then demonstrate the benefits to be gained from this type of offload in a variety of cases. Finally, we will look at the possibilities of applying this type of technology to emerging protocols such as QUIC and the benefits of integrating encryption and congestion control mechanisms to ensure optimal performance.

SPEAKER Nick Viljoen, Netronome
03:35 PM - 04:05 PM
Performance Tools and Techniques to Improve Envoy Scalability

As Envoy scales with traffic growth, service complexity, and processor count, we need an increasing array of tools to achieve our performance goals. We need tools to help visualize latency, throughput, memory, CPU load, and thread contention. Some of these tools already exist, such as kcachegrind and Google’s performance benchmarking library. Others needed to be built, such as a new OSS L7 load-tester based on the Envoy networking stack that is capable of driving HTTP/2 traffic through proxies. In this talk, we’ll discuss these tools and how we’ve applied them to find and fix bottlenecks in Envoy, and to help us make decisions about how to improve the system and its usage.

SPEAKER Josh Marantz, Google Envoy Cloud Proxy
04:05 PM - 04:25 PM
Break
04:25 PM - 04:55 PM
Adaptive Cache Networks with Optimality Guarantees

The problem of optimally placing content over a network of caches arises in many networking applications. Given the content demand, described by content requests and the paths they follow, we wish to determine the content placement that maximizes the expected caching gain, i.e., the reduction of routing costs due to intermediate caching. The offline version of this problem is NP-hard. To make matters worse, in most cases, both the demand and the network topology may be a priori unknown; hence, distributed, adaptive content placement algorithms that yield constant approximation guarantees are desired. We show that path replication, an algorithm encountered often in both networking literature and in practice, can be arbitrarily suboptimal when combined with traditional cache eviction policies, like LRU, LFU, or FIFO. We propose a distributed, adaptive algorithm that provably constructs a probabilistic content placement within a 1−1/e factor of the optimal, in expectation.
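The notion of caching gain can be illustrated with a small sketch. It assumes a simplified model that is not spelled out in the abstract: each request names an item, a path from the requester to a server that always holds the item, and a request rate; each edge on the path has a cost; and an edge is avoided whenever some node at or before its near end caches the item. The function name, data layout and example values are all hypothetical. For reference, 1−1/e is approximately 0.632.

```python
from typing import Dict, List, Set, Tuple

# (item, path from requester towards the designated server, request rate)
Request = Tuple[str, List[str], float]


def caching_gain(
    requests: List[Request],
    edge_cost: Dict[Tuple[str, str], float],
    placement: Dict[str, Set[str]],
) -> float:
    """Routing cost saved by `placement` compared with having no caches at all."""
    gain = 0.0
    for item, path, rate in requests:
        for k in range(len(path) - 1):
            # The edge (path[k], path[k+1]) is not traversed if the item is
            # already cached at any of the nodes path[0] .. path[k].
            hit = any(item in placement.get(node, set()) for node in path[: k + 1])
            if hit:
                gain += rate * edge_cost[(path[k], path[k + 1])]
    return gain


# Toy example: requester "a", intermediate cache "b", server "s".
requests = [("x", ["a", "b", "s"], 2.0)]
edge_cost = {("a", "b"): 1.0, ("b", "s"): 5.0}

print(caching_gain(requests, edge_cost, {}))            # no caching anywhere
print(caching_gain(requests, edge_cost, {"b": {"x"}}))  # item cached at "b"
print(caching_gain(requests, edge_cost, {"a": {"x"}}))  # item cached at "a"
```

In this toy model, caching closer to the requester saves more edges, which is why the placement decision matters once cache capacity is limited.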

SPEAKER Stratis Ioannidis, Northeastern University
04:55 PM - 05:25 PM
Building Stadia's Edge Compute Platform

Building an edge platform to support Stadia (Google's gaming platform) has presented a number of challenges. To ensure the best performance for users on a product of Stadia's scope, we've had to scale Google's edge platform and build new networking, compute, and storage services. This talk will explore some of the challenges we've faced scaling Google's stack both up and down to support the reach and performance requirements of a new gaming platform.

SPEAKER Andrew Oates, Google
05:25 PM - 05:30 PM
Closing Remarks
SPEAKER Leonidas Kontothanassis, Facebook
05:30 PM - 06:30 PM
Happy Hour

SPEAKERS AND MODERATORS

Leonidas Kontothanassis

Facebook

Najam Ahmad

Facebook

Ashok Narayanan

Google

Evimaria Terzi

Boston University

Igor Lubashev

Akamai

Derek Schuster

Facebook

Marc Light

BitSight Technologies

Dan Dahlberg

BitSight Technologies

Ethan Geil

BitSight Technologies

Kyle Nekritz

Facebook

Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams. Ian was heavily involved in the...

Ian Swett

Google

Nick Viljoen

Netronome

Josh Marantz

Google Envoy Cloud Proxy

Stratis Ioannidis

Northeastern University

Andrew Oates

Google
