Event times below are displayed in PT.
Systems @Scale Tel Aviv: Building Distributed Systems is an invitation-only technical conference for engineers who manage large-scale information systems serving millions of people.
As our systems continue to scale, the problem of understanding whether they’re behaving as desired gets progressively harder. As a community, we have developed tools, techniques, and approaches that can be applied to observing the state of these complex distributed systems with the goal of understanding system availability, reliability, performance, and efficiency.
We’ll spend the day covering a wide range of topics exploring these challenges and collaborating on the development of new solutions.
Running services such as Facebook requires a highly reliable, scalable, and efficient data center infrastructure.
Learn more about the constant innovation of technology pushing the boundaries of physical infrastructure, allowing Facebook to scale to serve and connect billions of people around the planet.
Unit tests are part of our day-to-day work; some of us even practice TDD. But we don't have a good measure of the quality of those tests. Tests are supposed to prove the correctness of the code, and together with CI you also get regression protection for free. But the big question we don't address is: what is the quality of the tests themselves?
If you are using Chaos Monkey then you are already familiar with the concept: inject failures into your system and check the system's robustness, as well as the quality of your monitoring and alerts. Mutation Testing adapts the 'Chaos Monkey' methodology to the world of unit tests: inject bugs into your code to see whether the test suite catches them. In other words, create mutations of the tested code and validate that your tests can identify the mutations and kill them.
Mutation Testing is not a new idea, but it was long considered too theoretical and remained an academic exercise. Nowadays, with faster CPUs and better tools, it is resurfacing as a practical quality technique.
The ability to prefetch data is a key lever in improving FBLite responsiveness. It gives the perception of instant data availability served from local cache.
However, excessive prefetching can waste data on content the user never consumes, and can cause performance regressions. This talk will explore the technical challenges we face when serving cached content to FBLite users, and how we balance data usage and resources while maximizing prefetching.
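One way to picture this balance is as a budgeted selection problem. The following sketch is purely illustrative (the candidate scores, sizes, and byte budget are assumptions, not FBLite's actual policy): rank prefetch candidates by predicted usefulness per byte and stop when the data budget is spent.

```python
def plan_prefetch(candidates, byte_budget):
    """candidates: list of (item_id, p_view, size_bytes) tuples.
    Greedily pick items with the best expected-view-per-byte ratio
    until the per-session data budget is exhausted."""
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    plan, used = [], 0
    for item_id, p_view, size in ranked:
        if used + size <= byte_budget:
            plan.append(item_id)
            used += size
    return plan

candidates = [
    ("story_a", 0.9, 40_000),   # very likely to be viewed, cheap
    ("video_b", 0.6, 900_000),  # likely, but expensive to fetch
    ("story_c", 0.1, 35_000),   # unlikely to be viewed
]
# The expensive video is skipped even though it is likely to be viewed,
# because it would blow the data budget on its own.
print(plan_prefetch(candidates, byte_budget=100_000))
```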
When we build systems, our designs and tradeoffs reflect the different scales of the system: the speed of disks, the latency of the network. They reflect the constraints and abilities of the underlying technologies.
But as technology advances, some of these assumptions become invalid. We are no longer running on the physical machines that RDBMS systems were designed for; SSDs changed pretty much everything in the storage world, yet our software was designed for magnetic disks; and with NVRAM, O/S design is way off.
This talk will show how changes in hardware technologies affect the design rationale of various systems, highlighting the importance of understanding and rethinking that rationale, and will explore new designs that arise from it.
Over the course of the last year, Go became the main programming language for developing services in Facebook Connectivity. Some of these services have a complicated data model with tens of types and relations.
At Facebook we like to think about our data model in graph concepts, and we've had a good experience with this model internally. The lack of a proper graph-based ORM for Go led us to write one and open-source it.
In this talk I’ll share the journey of taking this concept from idea to implementation, and will deep dive into some of the challenges and the technical decisions.
At Facebook we run huge Java services; this applies both to the size of a single process and to the scale of our server fleet.
Facebook Lite is one of the dominant Java services within Facebook, serving hundreds of millions of users every month. The architecture of Facebook Lite is unique: it offloads the client's typical work (data retrieval, business logic, layout calculation, etc.) to the server, which has caused it to evolve into a memory-bound service.
This architecture provides clear advantages to Facebook Lite users and developers, but it also makes it harder for service owners to keep the service healthy and safe from memory regressions. For instance, even a memory regression of 1% has significant stability and cost implications for our production systems, and therefore should be detected and blocked as soon as possible.
In this session we will go through the evolution of the Facebook Lite service from a point in time in which it was occasionally suffering from massive memory regressions that put it at risk, through building a scalable and advanced memory analysis infrastructure, to providing high granularity memory visibility to developers and enabling them to push our service to its efficiency limits with massive memory wins.
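To make the 1% sensitivity concrete, a regression gate of this kind boils down to comparing a candidate build's memory footprint against a baseline. This is an illustrative sketch under assumed inputs, not Facebook's actual tooling; the function name and byte figures are hypothetical.

```python
REGRESSION_THRESHOLD = 0.01  # 1% -- the sensitivity mentioned above

def check_memory_regression(baseline_bytes, candidate_bytes,
                            threshold=REGRESSION_THRESHOLD):
    """Return (regressed, relative_change) for a candidate build's
    memory footprint versus the previous (baseline) build."""
    change = (candidate_bytes - baseline_bytes) / baseline_bytes
    return change > threshold, change

# A build that grows per-process memory from 200 MB to 204 MB (+2%)
# would be flagged and blocked before reaching production.
regressed, change = check_memory_regression(200_000_000, 204_000_000)
print(regressed, f"{change:.1%}")
```

In practice the hard part is not the comparison itself but attributing the growth with high granularity, which is what the memory analysis infrastructure described above provides.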
At Singular, we combine data pulled periodically from 2500+ sources with streamed data that we receive in real time. Joining these data sets, we encountered a few unique challenges: the periodic data pulled from our different sources changes frequently, affecting our real-time data retroactively; and periodic and real-time data arrive at different times yet must always stay aligned and matched.
In this session, we’ll share some of the tricks we use to keep the data aligned at scale, including separating frequently and infrequently changed data to streamline alignment, detecting changes in the data using consistent hashing, and storing data so that changes can be applied efficiently with our bz2 inline-block edit optimization.
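The change-detection idea can be sketched simply: store a content digest per record, and on each periodic pull recompute digests so that only records whose digest changed need re-alignment. This is a hedged illustration of the general technique; the record layout and function names are assumptions, not Singular's implementation.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable digest of a record's content (keys sorted for determinism)."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def detect_changes(stored_hashes: dict, fresh_records: dict) -> list:
    """Compare each freshly pulled record's digest with the stored one;
    only records whose digest differs need to be realigned."""
    return [key for key, rec in fresh_records.items()
            if stored_hashes.get(key) != record_hash(rec)]

# A source retroactively revises yesterday's install count:
stored = {"row1": record_hash({"installs": 10, "cost": 5.0})}
fresh = {"row1": {"installs": 12, "cost": 5.0}}
print(detect_changes(stored, fresh))  # only the changed row is reprocessed
```

Skipping unchanged records keeps the expensive realignment work proportional to what actually changed rather than to the full pull.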
Keeping all of your code in a single repository has huge benefits, but comes with equally huge obstacles. In this session I’ll talk about the challenges Facebook has faced with its massive codebase, and how we’re radically extending our source control system to enable our entire ecosystem of developer tools to remain fast in the face of tremendous growth.
I’ll briefly introduce the concept of a monorepo, give a rough sense of our repository's scale, talk about the problems it causes in development (slow source control, slow builds, complex test infrastructure, difficulties maintaining release quality, etc.), and then cover a few source control innovations we’ve made to tackle these challenges.
At Forter, we’re on a mission to build the foundations for a more credible internet by blocking fraudsters and abusers on e-commerce platforms. To achieve that, we need to make millions of high-risk, low-latency decisions per day while processing billions of events.
We’re doing all of this with a very lean and mean R&D team. We had to invent many solutions from the ground up, and we’ll share some of our insights with you.
Monitoring metrics for any significant movements is key to detecting problems with systems and products. This talk provides an overview of our detection and alerting framework: the scale at which we monitor timeseries, the different detection algorithms we offer (rule-based and ML-based), and the ability to auto-slice data along multiple dimensions to identify deeper issues.
Deriving signal without being inundated with noise is crucial at our scale, and we have built tools to empower teams to maintain high signal-to-noise ratio.
To cater to our future scale needs, we are currently focused on automatic monitoring: proactively logging and monitoring the right metrics for different artifacts, proactively analyzing any flagged events and hopefully predicting potential critical incidents.