Event times below are displayed in PT.
Networking @Scale is an invitation-only technical conference for engineers who build and manage large-scale networks.
Networking solutions are critical for building applications and services that serve millions, and sometimes billions, of people around the world. At this scale, there are always complex engineering challenges to solve. We’ll spend the day sharing experiences in improving reliability, security, and performance in large-scale networks and collaborating on the development of new solutions.
Networking @Scale will be held at Hotel Commonwealth in Boston, Massachusetts, on Tuesday, November 12, beginning at 8:30 AM ET. Be sure to stick around for Happy Hour in the evening.
Networking grew up as a mostly best-effort service but has evolved into one of the foundational elements of modern cloud-based computing systems. Networking solutions are critical for building applications and services that serve billions of people around the world. Today’s networks are expected to be highly reliable. This talk is a retrospective look at how networks, networking technologies, and network professionals have evolved over the last several years. More importantly, the talk touches on areas we need to focus on in order to advance network reliability an inch closer to the mythical 100% reliable system.
Large global WAN networks have unique reliability and capacity delivery requirements. They typically connect to the Internet, which means they use distributed routing protocols. They are typically much more sparse and irregular than large cluster networks, and can have significantly poorer reachability depending on where in the world they are. Yet, we depend on these networks to reach our customers. We need to build and maintain these networks at an extremely high level of reliability, while at the same time growing the capacity on these networks at hitherto unseen speeds, while doing it cheaper than ever before. These needs are often directly in conflict.
In this talk, Ashok will go over some of his experiences in building and automating Google’s network backbone. He will cover:
-- The perceived and real reliability differences between SDN and on-box routed networks.
-- The importance of network automation and programmatic network management to capacity delivery as well as reliability.
-- The risks introduced by these management paradigms, and how they can be mitigated.
-- The importance of defining and measuring network SLOs, and tracking network health and capacity availability over time against these SLOs.
-- Some of the hard problems in global WAN availability today, such as global routes, BGP and MPLS, and where we could go from here in the search for a truly 6-nines network.
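The SLO tracking mentioned above can be illustrated with a small sketch. This is not Google's tooling; the probe data, target, and helper names are invented, and it only shows the arithmetic of expressing measured availability in "nines" against a 6-nines objective.

```python
import math

SLO_TARGET = 0.999999  # a hypothetical "6-nines" availability objective

def availability(probe_results):
    """Fraction of successful reachability probes (True = reachable)."""
    return sum(probe_results) / len(probe_results)

def nines(avail):
    """Express availability as a number of nines (0.999 -> 3.0)."""
    return -math.log10(1 - avail) if avail < 1 else float("inf")

# Toy window: two failed probes out of a million.
probes = [True] * 999_998 + [False] * 2
avail = availability(probes)
print(f"availability={avail:.6f}, nines={nines(avail):.1f}, "
      f"meets SLO: {avail >= SLO_TARGET}")
```

Tracking this number over rolling windows, rather than as a single point, is what makes the SLO useful for spotting reliability regressions.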
The routes used in the Internet's interdomain routing system are a rich information source that could be exploited to answer a wide range of questions. However, analyzing routes is difficult, because the fundamental object of study is a set of paths. In this talk we will present new analysis tools -- metrics and methods -- for analyzing AS paths, and apply them to study interdomain routing in the Internet over a recent 13-year period. Using these tools we will try to present a quantitative understanding of changes in Internet routing at the micro level (of individual ASes) as well as at the macro level (of the set of all ASes). More specifically, we will show that at the micro level, our tools can identify clusters of ASes that have the most unusual routing at each time (interestingly, such clusters often correspond to sets of jointly-owned ASes). We will also show that analysis of individual ASes can expose business and engineering strategies of the organizations owning the ASes. These strategies are often related to content delivery or service replication. At the macro level, we will show that ASes with the most unusual routing define discernible and interpretable phases of the Internet's evolution. Furthermore, we will discuss how our tools can be used to provide a quantitative measure of the "flattening" of the Internet.
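As a flavor of what "metrics over sets of AS paths" can mean, here is a toy sketch. It does not reproduce the speakers' methods; the AS paths are invented, and the two metrics shown (mean path length, transit-position frequency) are just simple examples of micro- and macro-level path statistics.

```python
from collections import Counter

# Invented AS paths: each is a sequence of AS numbers from source to destination.
paths = [
    (3356, 1299, 15169),
    (174, 1299, 15169),
    (3356, 2914, 32934),
    (174, 2914, 32934),
]

# Macro-level metric: mean AS path length across the set.
mean_len = sum(len(p) for p in paths) / len(paths)

# Micro-level metric: how often each AS appears in a transit (non-edge) position.
transit = Counter(asn for p in paths for asn in p[1:-1])

print("mean path length:", mean_len)
print("most common transit ASes:", transit.most_common(2))
```

Real analyses operate on millions of observed paths over years, but the shape is the same: reduce a set of paths to per-AS and whole-set statistics, then study how those statistics move over time.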
Akamai is well-known as a DNS-based CDN. Instead of building a few dozen very large POPs, Akamai tries to serve content from a few thousand small POPs very close to the end users and uses DNS to direct end users to the POP that is best for them. This generally gets better performance and scale. However, there are some unique cases where the alternative, anycast-based content delivery, is a better option. Igor will present Akamai's "hybrid anycast" architecture that allows Akamai to serve traffic from thousands of edge deployments but over anycast addresses announced from dozens of POPs. He'll discuss the advantages of this architecture as well as hurdles and experiences.
SOMA focuses on an enterprise-level Wi-Fi mesh network optimized for providing connectivity in unconnected and underserved markets. By lowering the total cost of ownership (TCO), simplifying connectivity installations and reducing operational overhead, Facebook’s goal is to help ISPs all over the world in expanding their footprint. We currently have several successful mesh deployments in Africa with over 200 mesh APs that demonstrate this Facebook technology very effectively for public Wi-Fi use cases.
Security threats arising from supply chains pose a serious and growing danger, but traditional risk management techniques are largely subjective, often ambiguous, and scale poorly. We have developed an objective set of metrics that describe the security performance of organizations, using a variety of external observations (including compromised systems, endpoint telemetry, file sharing activity, and server configurations, among others); we compute daily updates to these metrics for hundreds of thousands of organizations worldwide. In this session, we will discuss some of the key challenges in collecting, storing, and processing cybersecurity observations on a global scale. These data also provide a unique perspective into widespread security events and trends; as an example, we will present an analysis of the attack surface introduced by recent vulnerabilities and use this to gain insight into the effectiveness of security controls across various industries and localities.
At Facebook, we run a global infrastructure that supports thousands of services, with many new ones spinning up daily. We take protecting our network traffic very seriously, so we must have a sustainable way to enforce our security policies transparently and globally. One of the requirements is that all traffic that crosses "unsafe" network links must be encrypted with TLS 1.2 or above using secure modern ciphers and robust key management. This talk describes the infrastructure we built for enforcing the "encrypt all" policy on the end-hosts. We discuss alternatives and tradeoffs and how we use BPF programs. We also go over some of the numerous challenges we faced when realizing this plan. Additionally, we talk about one of our solutions, Transparent TLS (TTLS), that we've built for services that either could not enable TLS natively or could not upgrade to a newer version of TLS easily.
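The policy decision at the heart of "encrypt all" can be sketched in a few lines. This is an illustration only, not Facebook's enforcement code: the link classification, version encoding, and function names are all assumptions, and the real system makes this decision in-kernel via BPF rather than in application code.

```python
# Hypothetical policy: plaintext is tolerated only on links deemed safe;
# everywhere else, TLS 1.2 or newer is required.
SAFE_LINKS = {"intra_rack"}   # assumed set of physically secure link types
MIN_TLS = (1, 2)              # TLS 1.2, as a (major, minor) tuple

def allowed(link_type, tls_version):
    """Return True if a flow complies with the encryption policy.

    tls_version is a (major, minor) tuple, or None for plaintext."""
    if link_type in SAFE_LINKS:
        return True
    return tls_version is not None and tls_version >= MIN_TLS

print(allowed("intra_rack", None))   # plaintext on a safe link: allowed
print(allowed("backbone", (1, 2)))   # TLS 1.2 across an unsafe link: allowed
print(allowed("backbone", (1, 0)))   # TLS 1.0 across an unsafe link: rejected
```

The hard part in practice is not this check but applying it transparently to every flow from thousands of heterogeneous services, which is where Transparent TLS and BPF-based enforcement come in.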
QUIC is a new internet transport that forms the foundation of HTTP/3 at the IETF. The 2017 SIGCOMM paper on QUIC estimated it constituted 7% of public internet traffic, making the CPU efficiency of QUIC extremely important. However, as of 2017, QUIC consumed over 2x the CPU of HTTPS over TCP. Learn how the QUIC and YouTube teams massively reduced QUIC CPU consumption, reaching parity with TCP in some cases.
In an age where ensuring data privacy is becoming more essential than ever, encryption within the datacenter is becoming a reality. However, this incurs a significant CPU cost. This talk will explain how SmartNICs can be used to offload TLS encryption, both ensuring that the host TCP stack is not compromised and how the NIC can keep all the necessary state of a socket based mechanism, dealing with the myriad of exception cases such as packet drops, out of order packets and host side packet mangling. We will then demonstrate the benefits to be gained from this type of offload in a variety of cases. Finally, we will look at the possibilities of applying this type of technology to emerging protocols such as QUIC and the benefits of integrating encryption and congestion control mechanisms to ensure optimal performance.
As Envoy scales with traffic growth, service complexity, and processor count, we need an increasing array of tools to achieve our performance goals. We need tools to help visualize latency, throughput, memory, CPU load, and thread contention. Some of these tools already exist, such as kcachegrind and Google’s performance benchmarking library. Others needed to be built, such as a new OSS L7 load-tester based on the Envoy networking stack that is capable of driving HTTP/2 traffic through proxies. In this talk, we’ll discuss these tools and how we’ve applied them to find and fix bottlenecks in Envoy, and how they help us make decisions about improving the system and its usage.
Optimally placing content over a network of caches arises in many networking applications. Given the content demand, described by content requests and the paths they follow, we wish to determine the content placement that maximizes the expected caching gain, i.e., the reduction of routing costs due to intermediate caching. The offline version of this problem is NP-hard. To make matters worse, in most cases, both the demand and the network topology may be a priori unknown; hence, distributed, adaptive content placement algorithms that yield constant approximation guarantees are desired. We show that path replication, an algorithm encountered often in both the networking literature and in practice, can be arbitrarily suboptimal when combined with traditional cache eviction policies, like LRU, LFU, or FIFO. We propose a distributed, adaptive algorithm that provably constructs a probabilistic content placement within a 1 − 1/e factor of the optimal, in expectation.
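To make "caching gain" concrete, here is a toy computation. The topology, demand, and uniform hop costs are invented for illustration, and the 1 − 1/e approximation algorithm from the talk is not reproduced; this only evaluates the objective for a fixed placement.

```python
# Each request: (item, path of nodes from requester toward the origin server).
requests = [
    ("a", ["u", "v", "w", "origin"]),
    ("b", ["u", "v", "w", "origin"]),
    ("a", ["x", "w", "origin"]),
]
HOP_COST = 1  # assumed uniform routing cost per hop

def caching_gain(placement):
    """Total routing-cost reduction when each request is served by the first
    node on its path that caches the item (placement: node -> set of items)."""
    gain = 0
    for item, path in requests:
        full_cost = (len(path) - 1) * HOP_COST  # cost of going to the origin
        for i, node in enumerate(path):
            if item in placement.get(node, set()):
                gain += full_cost - i * HOP_COST  # hops saved by the cache hit
                break
    return gain

# Example placement: node v caches item "a", node w caches item "b".
placement = {"v": {"a"}, "w": {"b"}}
print(caching_gain(placement))
```

The optimization problem in the talk is to choose the placement (subject to per-node cache capacities) that maximizes this gain in expectation over random demand, which is what makes it NP-hard offline.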
Building an edge platform to support Stadia (Google's gaming platform) has presented a number of challenges. To ensure the best performance for users on a product of Stadia's scope, we've had to scale Google's edge platform and build new networking, compute, and storage services. This talk will explore some of the challenges we've faced scaling Google's stack both up and down to support the reach and performance requirements of a new gaming platform.
Ian Swett is the Manager of Google Cloud Networking's Protocols and Web Performance teams.