Better Resilience Through Self Tuning
William previously worked at Netflix, and this presentation highlights some of the strategies he used while working there. He has Netflix's permission to discuss them at this conference. As companies grow, and the number of services and the traffic to those services increase, we often rely on hand tuning and developer experience to improve reliability and resilience. At times it feels like an endless cycle of adding functionality, noticing new system behavior, tuning services, and closely monitoring the impact.
Even so, there’s always a new way our systems can fail, which forces us to revisit our carefully tuned system parameters. This constant battle slows progress. At Netflix, however, we rely on a low-barrier-to-entry “paved road” that lets developers build new services in minutes, and those services adapt to runtime behavior. They leverage a suite of dynamic algorithms that automatically learn system behavior and react to outages, slow responses, and sudden changes in upstream and downstream traffic patterns. In this talk, we’ll dive into some of those dynamic approaches and show the results for various common failure scenarios that Netflix engineers no longer need to think about. These approaches are so good that we’ll even see how, in some situations, they make calling remote services more reliable than in-process code.
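To make the idea of a self-tuning parameter concrete, here is a minimal sketch of one well-known adaptive technique, an additive-increase/multiplicative-decrease (AIMD) concurrency limit. The abstract does not say which algorithms the talk covers, so this is only an illustration of the general approach, not a description of Netflix's implementation. The class name `AimdLimit`, its parameters, and the default values are all hypothetical.

```python
class AimdLimit:
    """Illustrative AIMD concurrency limit (hypothetical names and defaults).

    The limit grows by one after each fast, successful call and shrinks by a
    fixed ratio after a failure or a slow call, so nobody has to hand-tune a
    static "max concurrent requests" value.
    """

    def __init__(self, initial=10, min_limit=1, max_limit=200,
                 backoff_ratio=0.9, latency_threshold_s=0.5):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.backoff_ratio = backoff_ratio              # assumed shrink factor
        self.latency_threshold_s = latency_threshold_s  # assumed "slow" cutoff
        self.in_flight = 0

    def try_acquire(self):
        """Admit a request only if we are below the current limit."""
        if self.in_flight >= self.limit:
            return False  # shed load instead of queueing behind a slow dependency
        self.in_flight += 1
        return True

    def release(self, latency_s, ok=True):
        """Record the outcome of a finished request and adjust the limit."""
        self.in_flight -= 1
        if not ok or latency_s > self.latency_threshold_s:
            # Multiplicative decrease: back off quickly when the dependency struggles.
            self.limit = max(self.min_limit, int(self.limit * self.backoff_ratio))
        else:
            # Additive increase: probe for more capacity slowly.
            self.limit = min(self.max_limit, self.limit + 1)


limiter = AimdLimit(initial=10)
if limiter.try_acquire():
    # ... call the remote service here and measure how long it took ...
    limiter.release(latency_s=0.1, ok=True)
```

A caller that gets `False` from `try_acquire` would typically fail fast or fall back, which is what keeps one slow dependency from exhausting the caller's own resources.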