TOPIC: Data, Systems and Networking

Systems @Scale Tel Aviv Fall 2019

NOVEMBER 19, 2019 @ 9:00 AM PST - 6:00 PM PST
Designed for engineers that manage large-scale information systems serving millions of people. The operation of large-scale systems often introduces complex, unprecedented engineering challenges.

ABOUT EVENT

Systems @Scale Tel Aviv: Building Distributed Systems is an invitation-only technical conference for engineers that manage large-scale information systems serving millions of people.

As our systems continue to scale, the problem of understanding whether they’re behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches that can be applied to observing the state of these complex distributed systems with the goal of understanding system availability, reliability, performance, and efficiency.

We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.

EVENT AGENDA

Event times below are displayed in PT.

November 19

09:00 AM - 10:00 AM
Registration & Breakfast
10:00 AM - 10:05 AM
Welcome
Speaker Tamar Bar Lev, Facebook
10:05 AM - 10:30 AM
Scaling Facebook’s Data Center Infrastructure

To run services such as Facebook requires a highly reliable, scalable and efficient Data Center infrastructure.

Learn more about the constant innovation of technology pushing the boundaries of physical infrastructure, allowing Facebook to scale to serve and connect billions of people around the planet.

Speaker Joel Kjellgren, Facebook
10:30 AM - 11:00 AM
Kill the mutants – cause it is about time to test your tests

Unit tests are part of our day-to-day. Some of us even practice TDD. But we don't have a good measure of the quality of the tests. Tests are supposed to prove the correctness of the code, and together with CI, you also get regression testing for free. But the big question we don't address is: what is the quality of the tests?

If you are using Chaos Monkey then you are already familiar with the concept: inject failures into your system and check the system's robustness as well as the quality of your monitoring and alerts. Mutation Testing applies the ‘Chaos Monkey’ methodology to the world of unit tests: inject bugs into your code to see whether the test suite catches them. In other words, create mutations of the tested code and validate that your tests can identify the mutations and kill them.
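The idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `is_adult` function under test, the single mutation operator (`>=` becomes `>`), and the two toy test suites. Real mutation testing tools apply many operators across a whole codebase; this only shows what "killing a mutant" means.

```python
import ast

# Hypothetical code under test, kept as source text so it can be mutated.
SOURCE = '''
def is_adult(age):
    return age >= 18
'''


class GtEToGt(ast.NodeTransformer):
    """Mutation operator: replace every `>=` comparison with `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return the is_adult function it defines."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    # Never probes the boundary value 18, so the mutant slips through.
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    # Adds the boundary case that distinguishes `>=` from `>`.
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite):
    """A mutant is killed when the suite passes on the original but fails on the mutant."""
    original = load(ast.parse(SOURCE))
    mutant_tree = ast.fix_missing_locations(GtEToGt().visit(ast.parse(SOURCE)))
    mutant = load(mutant_tree)
    assert suite(original), "suite must pass on the unmutated code"
    return not suite(mutant)


if __name__ == "__main__":
    print("weak suite kills mutant:", mutant_killed(weak_suite))
    print("strong suite kills mutant:", mutant_killed(strong_suite))
```

A surviving mutant, as with the weak suite here, points at a behavior the tests never check.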

Mutation Testing is not a new idea, but it was long considered too theoretical and purely academic. Nowadays, with faster CPUs and better tools, it is resurfacing as a practical quality technique.

Speaker Yonatan Maman, Outbrain
11:00 AM - 11:30 AM
Managing Tradeoffs for Data Prefetching

The ability to prefetch data is a key lever in improving FBLite responsiveness. It gives the perception of instant data availability served from local cache.

However, excessive prefetching can lead to performance regressions and to downloading data that the user never views. This talk will explore the technical challenges we face when serving cached content to FBLite users, and how we balance data usage and resources while maximizing prefetching.

Speaker Michal Trudler, Facebook
11:30 AM - 12:00 PM
Break
12:00 PM - 12:30 PM
The world changed. Did our designs?

When we build systems, our designs and tradeoffs reflect the scales the system operates at, such as the speed of disks and the latency of the network; they reflect the constraints and abilities of the underlying technologies.

But as technology advances some of these assumptions have become invalid. We are no longer running on physical machines for which RDBMS systems were designed; SSD changed pretty much everything in the storage world, but our software was designed for magnetic disks; NVRAM? O/S design is way off.

This talk will show how changes in hardware technologies impact the design rationale of various systems, highlighting the importance of understanding and rethinking that rationale, and will explore new designs that arise from the new rationale.

Speaker Avishai Ish-Shalom, Aleph VC
12:30 PM - 01:00 PM
The journey for a new ORM in Go

Over the course of the last year, Go became the main programming language for developing services in Facebook Connectivity. Some of them have a complicated data model with tens of types and relations.

At Facebook we like to think about our data model in graph concepts. We've had a good experience with this model internally. The lack of a proper graph-based ORM for Go led us to write one and open-source it.

In this talk I’ll share the journey of taking this concept from idea to implementation, and will deep dive into some of the challenges and the technical decisions.

Speaker Ariel Mashraki, Facebook
01:00 PM - 02:00 PM
Lunch
02:00 PM - 02:30 PM
Memory Analysis @Scale

At Facebook we run huge Java services. This applies both to the size of a single process and to the scale of our server fleet.
Facebook Lite is one of these dominant Java services within Facebook, serving hundreds of millions of users every month. The architecture of Facebook Lite is unique, as it offloads the client’s typical work (data retrieval, business logic, layout calculation, etc.) to the server, causing it to evolve into a memory-bound service.

This architecture provides clear advantages to Facebook Lite users and developers; however, it also makes it harder for service owners to keep the service healthy and safe from memory regressions. For instance, even a memory regression of 1% has high stability and cost implications for our production system. Such regressions should therefore be detected and blocked as soon as possible.

In this session we will go through the evolution of the Facebook Lite service from a point in time in which it was occasionally suffering from massive memory regressions that put it at risk, through building a scalable and advanced memory analysis infrastructure, to providing high granularity memory visibility to developers and enabling them to push our service to its efficiency limits with massive memory wins.

Speaker Erez Alon, Facebook
02:30 PM - 03:00 PM
The Challenge to Align Data Points @Scale

At Singular, we combine data pulled periodically from 2500+ sources with streamlined data that we receive in real time. In joining these data sets, we encountered a few unique challenges: the periodic data pulled from our different sources changes frequently, which affects our real-time data retroactively; and periodic and real-time data arrive at different times yet should always be aligned and matched.

In this session, we’ll share some of the tricks we use to keep the data aligned @ scale, including separating frequently and infrequently changed data to streamline alignment, detecting changes in the data using consistent hashing, and storing data so that changes can be applied efficiently with our bz2 inline-block edit optimization.
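As a rough illustration of hash-based change detection between periodic pulls, here is a minimal sketch. It is not Singular's implementation: it assumes "consistent" means a deterministic content fingerprint (not a consistent-hash ring), and the record layout, the `fingerprint` and `diff_pulls` helpers, and the keys are all hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Stable digest of a record: the same content always yields the same hash."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_pulls(previous_hashes, current_records):
    """Compare a fresh pull against the fingerprints stored from the last pull.

    previous_hashes: dict mapping record key -> digest from the previous pull
    current_records: dict mapping record key -> record from the current pull
    Returns (sorted list of new or changed keys, fingerprints of the current pull).
    """
    new_hashes = {key: fingerprint(rec) for key, rec in current_records.items()}
    changed = sorted(
        key for key, digest in new_hashes.items() if previous_hashes.get(key) != digest
    )
    return changed, new_hashes


if __name__ == "__main__":
    first_pull = {
        "campaign-1": {"spend": 10.0, "clicks": 120},
        "campaign-2": {"spend": 4.5, "clicks": 40},
    }
    _, stored = diff_pulls({}, first_pull)

    # The source retroactively corrects campaign-2; campaign-1 is untouched.
    second_pull = {
        "campaign-1": {"clicks": 120, "spend": 10.0},
        "campaign-2": {"spend": 5.0, "clicks": 40},
    }
    changed, stored = diff_pulls(stored, second_pull)
    print(changed)
```

Only the keys reported as changed would need to be re-aligned with the real-time data. Records that disappear from a pull are not handled in this sketch.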

Speaker Ron Konigsberg, Singular
03:00 PM - 03:30 PM
Monorepos: Moving Fast in a Huge Repository

Keeping all of your code in a single repository has huge benefits, but comes with equally huge obstacles. In this session I’ll talk about the challenges Facebook has faced with its massive codebase, and how we’re radically extending our source control system to enable our entire ecosystem of developer tools to remain fast in the face of tremendous growth.

I’ll briefly introduce the concept of a monorepo, give a rough sense of our repository scale, talk about the problems it causes in development (slow source control, slow builds, complex test infrastructure, difficulties maintaining release quality, etc), then talk about a few source control innovations we’ve made to tackle these challenges.

Speaker Durham Goode, Facebook
03:30 PM - 04:00 PM
Operating low-latency fraud prevention systems at scale

At Forter, we’re on a mission to build the foundations for a more credible internet by blocking fraudsters and abusers on e-commerce platforms. To achieve that, we need to make millions of high-risk, low-latency decisions per day while processing billions of events.

We’re doing all of this with a very lean and mean R&D team. We had to invent many solutions from the ground up, and we’ll share some of our insights with you.

Speaker Re'em Bensimhon, Forter
04:00 PM - 04:30 PM
Detection & Alerting at FB: Detecting significant metric movements @ Scale

Monitoring metrics for any significant movements is key to detecting problems with systems and products. This talk provides an overview of our detection and alerting framework: the scale in the number of timeseries we monitor, the different detection algorithms we offer (rule-based and ML-based) and the ability to auto-slice data along multiple dimensions to identify deeper issues.
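To make "rule-based detection" concrete, here is a minimal sketch of one common rule: flag a point when it deviates from a trailing window by more than a set number of standard deviations. This is a generic z-score rule, not the framework described in the talk; the function name, window size, and threshold are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def significant_moves(series, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to judge against; skip it.
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    requests_per_minute = [100, 101, 99, 100, 102, 100, 150, 101]
    print(significant_moves(requests_per_minute))
```

A fixed threshold like this is easy to reason about but tends to be noisy on seasonal metrics.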

Deriving signal without being inundated with noise is crucial at our scale, and we have built tools to empower teams to maintain high signal-to-noise ratio.

To cater to our future scale needs, we are currently focused on automatic monitoring: proactively logging and monitoring the right metrics for different artifacts, proactively analyzing any flagged events and hopefully predicting potential critical incidents.

Speaker Ben Southgate, Facebook
04:30 PM - 04:45 PM
Closing Remarks
04:45 PM - 06:00 PM
Networking Happy Hour

SPEAKERS AND MODERATORS

Tamar Bar Lev

Facebook

Joel Kjellgren

Facebook

Yonatan Maman

Outbrain

Michal Trudler

Facebook

Avishai Ish-Shalom

Aleph VC

Ariel Mashraki

Facebook

Erez Alon

Facebook

Ron Konigsberg

Singular

Durham Goode

Facebook

Re'em Bensimhon

Forter

Ben Southgate

Facebook