EVENT AGENDA
Event times below are displayed in PT.
Systems @Scale is a technical conference for engineers who manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges. The @Scale community focuses on bringing people together to discuss these challenges and collaborate on the development of new solutions.
The 2021 fall series will be hosted virtually. Joining us are speakers from AWS, BigSpring, Confluent, Databricks, Lyft, Twitter, and Facebook. The event spans three weeks, with talks themed around capacity/efficiency, reliability/testing, and challenges at scale in the context of large-scale distributed systems.
Starting October 13th, we will livestream a recorded session followed by a live panel discussion each Wednesday for three weeks.
The success of Amazon Redshift has inspired a lot of innovation in the analytics industry, which in turn has benefited consumers. In the last few years, the use cases for Amazon Redshift have evolved, and in response, Amazon Redshift has delivered a series of innovations that continue to delight customers. In this talk, we take a peek under the hood of Amazon Redshift and give an overview of its architecture. We focus on the core of the system and explain how Amazon Redshift maintains its differentiating, industry-leading performance and scalability. We then talk about Amazon Redshift's autonomics. In particular, we present how Redshift continuously monitors the system and uses machine learning to improve its performance and operational health without the need for dedicated administration resources. Finally, we discuss how Amazon Redshift extends beyond traditional data warehousing workloads by integrating with the broader AWS ecosystem, making Amazon Redshift a one-stop solution for analytics.
Facebook is undergoing a massive design shift in capacity management and service placement to scale the efficiency of our datacenter resources. At the core of this shift is the Resource Allowance System (RAS), which continuously optimizes the assignment of service demand to capacity supply. RAS ensures available capacity for all services despite the challenges of random failures, correlated failures, maintenance events, and overloaded shared resources. Additionally, the quality of assignment determines how efficiently datacenter resources can be used, which is critical at our scale. Attend this talk to learn about the challenges we face and the solution we have already deployed to 80% of all servers at Facebook.
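At its core, this is an assignment problem: match each service's demand to capacity supply without overloading any one failure domain. The sketch below is a deliberately simplified, hypothetical greedy allocator (none of the types or names come from RAS itself); a real allocator must also reason about correlated failures, maintenance, and shared-resource limits.

```typescript
// Toy greedy allocator (hypothetical types and logic, not RAS itself): assign each
// service's server demand to whichever fault domain currently has the most spare
// capacity. A real allocator must also reason about correlated failures,
// maintenance events, and shared-resource limits.

interface FaultDomain { name: string; capacity: number; used: number; }
interface Service { name: string; demand: number; }

function assign(services: Service[], domains: FaultDomain[]): Map<string, Map<string, number>> {
  const placement = new Map<string, Map<string, number>>();
  for (const svc of services) {
    const perDomain = new Map<string, number>();
    let remaining = svc.demand;
    while (remaining > 0) {
      // Pick the domain with the most free capacity, which tends to spread load.
      const target = [...domains].sort(
        (a, b) => (b.capacity - b.used) - (a.capacity - a.used),
      )[0];
      const free = target.capacity - target.used;
      if (free <= 0) throw new Error(`not enough capacity for ${svc.name}`);
      const take = Math.min(free, remaining);
      target.used += take;
      perDomain.set(target.name, (perDomain.get(target.name) ?? 0) + take);
      remaining -= take;
    }
    placement.set(svc.name, perDomain);
  }
  return placement;
}

// Two services placed across three fault domains of 100 servers each.
console.log(assign(
  [{ name: "web", demand: 120 }, { name: "cache", demand: 60 }],
  [
    { name: "dcA", capacity: 100, used: 0 },
    { name: "dcB", capacity: 100, used: 0 },
    { name: "dcC", capacity: 100, used: 0 },
  ],
));
```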
At Twitter, hundreds of thousands of microservices emit important events triggered by user interactions on the platform. The Data Platform team needs to aggregate these events by service type and generate consolidated datasets, which are made available at different storage destinations for data processing jobs or analytical queries. In this presentation, we discuss the architecture behind event log pipelines that handle billions of events per minute and tens of petabytes of data every day. We discuss our challenges at scale and lay out our solution using both open source and in-house software. We also describe our resource utilization and the optimizations we had to make at scale. Toward the end, we introduce our work to move from event log pipelines to event stream pipelines, and we show a use case that uses these event streams for real-time analytics.
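As a rough illustration of the aggregation step described above, here is a toy TypeScript roll-up that groups raw events by emitting service and by minute. The event shape and function names are invented for illustration; Twitter's actual pipelines run on streaming infrastructure at vastly larger scale.

```typescript
// Toy roll-up (hypothetical event shape, not Twitter's pipeline): group raw
// interaction events by the emitting service and by minute, producing the kind
// of consolidated dataset that downstream batch jobs or analytical queries read.

interface LogEvent { service: string; timestampMs: number; payload: unknown; }

function aggregateByServiceAndMinute(events: LogEvent[]): Map<string, Map<number, number>> {
  const rollup = new Map<string, Map<number, number>>();
  for (const ev of events) {
    const minute = Math.floor(ev.timestampMs / 60_000);
    const perMinute = rollup.get(ev.service) ?? new Map<number, number>();
    perMinute.set(minute, (perMinute.get(minute) ?? 0) + 1);
    rollup.set(ev.service, perMinute);
  }
  return rollup; // service -> minute bucket -> event count
}

console.log(aggregateByServiceAndMinute([
  { service: "timeline", timestampMs: 1_000, payload: {} },
  { service: "timeline", timestampMs: 2_000, payload: {} },
  { service: "ads", timestampMs: 61_000, payload: {} },
]));
```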
Transitive Resource Accounting (TRA) is a system built on top of Facebook's distributed tracing platform, Canopy, with the goal of capturing end-to-end request cost metrics and attributing them back to the originating caller. This data is pivotal in attributing capacity usage and understanding the efficiency of our infrastructure as a whole. Say a product is going to grow by 20% in the coming year. By how much do its backend service dependencies need to grow to support this added demand? To determine this, we must understand what percentage of the demand on those dependent services is driven by the product. Attributing this demand becomes increasingly difficult as a distributed system grows and becomes ever more complex. Service owners can collect service-specific data about who is calling them and whom they are calling, but this is a very narrow view that provides little insight into capacity and efficiency for the system as a whole. With the power of distributed tracing, TRA measures the cost of a request from start to finish, through every hop in the request path. We can then attribute the source of demand at any depth in the system.
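To make the attribution idea concrete, the following hypothetical sketch walks a list of trace spans and charges each hop's cost back to the service at the trace root. The span shape and field names are assumptions for illustration, not Canopy's or TRA's actual data model.

```typescript
// Hypothetical attribution walk (the span shape is invented, not Canopy's data
// model): charge the cost of every hop in a trace back to the service at the
// trace root, so each backend's demand can be broken down by originating caller.

interface Span { id: string; parentId: string | null; service: string; costCpuMs: number; }

function attributeToOrigin(spans: Span[]): Map<string, Map<string, number>> {
  const root = spans.find(s => s.parentId === null);
  if (!root) throw new Error("trace has no root span");
  const origin = root.service; // the product-facing entry point of the request
  const breakdown = new Map<string, Map<string, number>>();
  for (const span of spans) {
    const perOrigin = breakdown.get(span.service) ?? new Map<string, number>();
    perOrigin.set(origin, (perOrigin.get(origin) ?? 0) + span.costCpuMs);
    breakdown.set(span.service, perOrigin);
  }
  return breakdown; // service -> originating caller -> attributed cost
}

console.log(attributeToOrigin([
  { id: "a", parentId: null, service: "newsfeed", costCpuMs: 5 },
  { id: "b", parentId: "a", service: "ranking", costCpuMs: 12 },
  { id: "c", parentId: "b", service: "storage", costCpuMs: 3 },
]));
```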
Attribution of reliability in a microservice architecture can be solved, and has been solved, in very different ways due to how services are cataloged across the industry. Our hypothesis at Lyft was that service catalogs can become stale, but ownership derived from an on-call rotation will be significantly more reliable for attribution. We'd like to share our journey through combining Envoy, PagerDuty, and an organizational hierarchy to identify reliability concerns across Lyft through standardized SLOs and Director-level rollups.
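A hedged sketch of that attribution idea: resolve ownership from an on-call rotation instead of a static catalog entry, then roll SLO compliance up an org chart to the director level. All names, data shapes, and the rotation-to-owner lookup below are hypothetical stand-ins for PagerDuty and Lyft's internal data.

```typescript
// Toy SLO rollup (hypothetical shapes, not Lyft's tooling): derive each service's
// owner from its on-call rotation, then aggregate SLO compliance per director.

interface OwnedService { name: string; oncallRotation: string; availability: number; sloTarget: number; }
interface OrgNode { person: string; reportsTo: string | null; }

// Stand-in for a PagerDuty lookup of the rotation's current responder.
const rotationToOwner: Record<string, string> = { "payments-oncall": "alice", "rides-oncall": "bob" };

const org: OrgNode[] = [
  { person: "alice", reportsTo: "dana" },
  { person: "bob", reportsTo: "dana" },
  { person: "dana", reportsTo: null }, // director
];

function directorOf(person: string): string {
  let current = person;
  for (;;) {
    const node = org.find(n => n.person === current);
    if (!node || node.reportsTo === null) return current;
    current = node.reportsTo;
  }
}

function rollup(services: OwnedService[]): Map<string, { total: number; violating: number }> {
  const byDirector = new Map<string, { total: number; violating: number }>();
  for (const svc of services) {
    const director = directorOf(rotationToOwner[svc.oncallRotation]);
    const entry = byDirector.get(director) ?? { total: 0, violating: 0 };
    entry.total += 1;
    if (svc.availability < svc.sloTarget) entry.violating += 1;
    byDirector.set(director, entry);
  }
  return byDirector;
}

console.log(rollup([
  { name: "payments", oncallRotation: "payments-oncall", availability: 0.9991, sloTarget: 0.999 },
  { name: "rides", oncallRotation: "rides-oncall", availability: 0.995, sloTarget: 0.999 },
]));
```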
Developing at speed and scale across Facebook’s many services requires testing frameworks that help developers iterate on features quickly and with minimal friction, while helping to catch bugs early. Learn why we’ve built our own integration testing framework for services and how we combine it with ideas from fuzzing to enable fully autonomous testing.
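As a rough illustration of combining integration testing with ideas from fuzzing, the sketch below feeds a stubbed service randomly generated but seed-reproducible requests. Everything here, including the request shape, the service stub, and the choice of PRNG, is a hypothetical stand-in rather than Facebook's framework.

```typescript
// Hypothetical harness, not Facebook's framework: an integration-style test that
// feeds a stubbed service randomly generated requests from a seeded PRNG, so any
// failure can be replayed from its seed.

interface FuzzRequest { userId: number; action: "follow" | "unfollow" | "post"; }

// mulberry32: a tiny deterministic PRNG returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function randomRequest(rand: () => number): FuzzRequest {
  const actions: FuzzRequest["action"][] = ["follow", "unfollow", "post"];
  return { userId: Math.floor(rand() * 1000), action: actions[Math.floor(rand() * actions.length)] };
}

// Stand-in for a call into the service under test.
async function callService(req: FuzzRequest): Promise<{ ok: boolean }> {
  return { ok: req.userId >= 0 };
}

async function fuzzIntegrationTest(seed: number, iterations: number): Promise<void> {
  const rand = mulberry32(seed);
  for (let i = 0; i < iterations; i++) {
    const req = randomRequest(rand);
    const res = await callService(req);
    if (!res.ok) throw new Error(`failure at seed=${seed} iteration=${i}: ${JSON.stringify(req)}`);
  }
}

fuzzIntegrationTest(42, 100)
  .then(() => console.log("no failures found"))
  .catch(err => console.error(err.message));
```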
BigSpring is a mobile-first platform for lifelong skilling with measurable ROI. We use GraphQL to power our services. We would love to talk about how we use Jest to integration-test our resolvers and other business logic in our TypeScript Node.js code base. We'll cover how we set up test data, mock connections, and collect code coverage as part of our CI/CD process on GitHub Actions. For a global, business-critical app, reliability and development velocity are key, and we hope to share some hard-earned lessons from introducing comprehensive testing to an existing GraphQL repo.
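A minimal, hypothetical example of the kind of Jest resolver test the talk describes; the resolver, data shapes, and mocked data layer below are invented for illustration and are not BigSpring's actual code.

```typescript
// Hypothetical Jest sketch (invented resolver and data shapes, not BigSpring's
// code): exercise a GraphQL resolver against seeded test data and a mocked
// downstream connection, as an integration test in a TypeScript Node.js repo.

interface Course { id: string; title: string; published: boolean; }
interface Context { db: { findCourses(publishedOnly: boolean): Promise<Course[]> } }

// The resolver under test; in a real repo this would be imported from the server code.
const resolvers = {
  Query: {
    courses: (_parent: unknown, args: { publishedOnly: boolean }, ctx: Context) =>
      ctx.db.findCourses(args.publishedOnly),
  },
};

describe("Query.courses resolver", () => {
  it("returns only published courses from the seeded test data", async () => {
    const seeded: Course[] = [
      { id: "c1", title: "Forklift safety", published: true },
      { id: "c2", title: "Draft course", published: false },
    ];
    const db = {
      findCourses: jest.fn(async (publishedOnly: boolean) =>
        publishedOnly ? seeded.filter(c => c.published) : seeded),
    };
    const result = await resolvers.Query.courses(null, { publishedOnly: true }, { db });
    expect(result).toHaveLength(1);
    expect(result[0].title).toBe("Forklift safety");
  });

  it("propagates failures from the data layer", async () => {
    const db = { findCourses: jest.fn().mockRejectedValue(new Error("connection lost")) };
    await expect(
      resolvers.Query.courses(null, { publishedOnly: false }, { db }),
    ).rejects.toThrow("connection lost");
  });
});
```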
Power outages cause the majority of unplanned server downtime in Facebook data centers. During a power outage, thousands of servers can go offline simultaneously for several hours, which can lead to service degradations. At Facebook, all data center racks are equipped with batteries that can provide backup power for a few minutes after power outages. Power Loss Siren (PLS) is a rack-level, low-latency, distributed power outage detection and alerting system. PLS leverages existing in-rack batteries to notify services about impending power outages and helps mitigate the impact of power outages on services. Typical mitigations include promoting remote database secondaries when primaries are experiencing power outages, routing requests away from hosts experiencing power outages, flushing memory to disk, etc. PLS also helps simplify physical infrastructure management by not requiring additional power source redundancy for critical services.
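To illustrate the consuming side of such an alerting system, here is a toy subscriber that runs registered mitigations when a rack reports loss of primary power. The event shape and class are hypothetical, not the PLS API.

```typescript
// Toy subscriber (hypothetical event shape and class, not the PLS API): a service
// registers mitigations that run when a rack reports loss of primary power, inside
// the few minutes of battery runway the rack provides.

interface PowerLossAlert { rackId: string; batterySecondsRemaining: number; }

type Mitigation = (alert: PowerLossAlert) => Promise<void>;

class PowerLossSubscriber {
  private mitigations: Mitigation[] = [];

  onPowerLoss(mitigation: Mitigation): void {
    this.mitigations.push(mitigation);
  }

  // Called by the alerting path when an outage is detected; runs all mitigations concurrently.
  async handle(alert: PowerLossAlert): Promise<void> {
    console.log(`power loss on ${alert.rackId}, ~${alert.batterySecondsRemaining}s of battery left`);
    await Promise.all(this.mitigations.map(m => m(alert)));
  }
}

// Example wiring: drain traffic away from the rack and flush in-memory state to disk.
const subscriber = new PowerLossSubscriber();
subscriber.onPowerLoss(async ({ rackId }) => console.log(`routing requests away from ${rackId}`));
subscriber.onPowerLoss(async ({ rackId }) => console.log(`flushing in-memory state on ${rackId}`));
subscriber.handle({ rackId: "rack-42", batterySecondsRemaining: 180 });
```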
Confluent Inc. provides a cloud-based data streaming platform built on Apache Kafka. Running an open source product like Kafka on the public cloud offerings of Amazon, Google, and Microsoft presents an interesting array of challenges. This presentation gives an overview of these challenges and discusses some of the solutions that Confluent uses to meet them.
Facebook Ordered Queue Service (FOQS) is a distributed priority queue service that powers hundreds of services and products across the Facebook stack. Facebook users have come to rely on its services to remain connected to their friends and families. As such, it is absolutely crucial for Facebook to continue operating with high availability during events which may impact its data centers. Being a core building block of Facebook infrastructure, FOQS is expected to handle loss of a data center gracefully and transparently to its clients. Dillon and Jasmit will talk about how the architecture of FOQS has evolved to be resilient to disasters, the technical challenges of hosting a globally available system, and the operational challenges that came with migrating existing tenants of FOQS to disaster ready installations with zero downtime at Facebook scale.
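For readers unfamiliar with priority queue services, the toy in-memory model below shows the enqueue/dequeue/ack/nack style of interaction a client has with such a system. It is a deliberately simplified sketch, not FOQS's actual API, storage model, or the disaster-recovery machinery the talk covers.

```typescript
// Toy in-memory model (not FOQS's actual API or storage): a priority queue whose
// consumers lease the highest-priority item and must ack it when done, or nack it
// to make it deliverable again.

interface Item { id: number; priority: number; payload: string; leased: boolean; }

class ToyPriorityQueue {
  private items: Item[] = [];
  private nextId = 1;

  enqueue(payload: string, priority: number): number {
    const id = this.nextId++;
    this.items.push({ id, priority, payload, leased: false });
    return id;
  }

  // Lease the most urgent (lowest priority number) item that is not already leased.
  dequeue(): Item | undefined {
    const candidate = this.items
      .filter(i => !i.leased)
      .sort((a, b) => a.priority - b.priority)[0];
    if (candidate) candidate.leased = true;
    return candidate;
  }

  ack(id: number): void {   // processing succeeded: remove the item
    this.items = this.items.filter(i => i.id !== id);
  }

  nack(id: number): void {  // processing failed: release the lease for redelivery
    const item = this.items.find(i => i.id === id);
    if (item) item.leased = false;
  }
}

const q = new ToyPriorityQueue();
q.enqueue("send-notification", 2);
const urgent = q.enqueue("rebuild-index", 1);
console.log(q.dequeue()?.payload); // "rebuild-index" dequeues first
q.ack(urgent);
```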
The cloud is becoming one of the most attractive ways for enterprises to store, analyze, and get value from their data, but building and operating a data platform in the cloud has a number of new challenges compared to traditional on-premises data systems. I will explain some of these challenges based on my experience at Databricks, a startup that provides a data analytics platform as a service on AWS, Azure, and Google Cloud. Databricks manages millions of VMs per day to run data engineering and machine learning workloads using Apache Spark, TensorFlow, Python and other software for thousands of customers.
We present how Facebook's unified Continuous Deployment (CD) system, Conveyor, powers safe and flexible service deployment across all services at Facebook. Conveyor enables services owners to build highly customized deployment pipelines from easy-to-use components and supports building, testing, and deploying a diverse range of services. To go further, we explore how Conveyor's deployment system, Push, provides users with tools to deploy new binary versions safely and gradually across the Facebook fleet.
@Scale engineers pen blogs, articles, and academic papers to further inform and inspire the engineering community.