Transcript
[Rob] Good afternoon, everyone. How you doing? (audience members whoop) That's what I like to hear. It is my honor to welcome you to NFX 303, better known as The Evolution of Chaos Engineering at Netflix. My name is Rob Hilton, and I'm a Principal Solutions Architect supporting Netflix as a customer. Basically what that means is that I've spent roughly the last four years of my life learning from, and collaborating with, Netflix's best and brightest engineers, including my friend right over here to my right, Ales Plsek, who's gonna tell you a little bit more about resilience. Netflix is well known as an innovator in a wide array of tech spaces. Arguably one of the most compelling of those is their contribution to application resilience. Now, that's a huge and widely varying topic, but one of the terms in that space that was coined by Netflix, and the one that brings us all together today, is that magic word: chaos. But technology spaces evolve constantly. As Ales knows, one year is roughly 10 years in tech speak. So given that chaos engineering was coined by Netflix over a decade ago, one wonders how they've continued to evolve in that space today. And with that, give it up for Ales, so we can all find out. (audience applauds)
[Ales] Hi, thank you. Thank you for joining us for The Evolution of Chaos Engineering at Netflix. My name is Ales Plsek. I'm a Senior Software Engineer at Netflix. I'm part of the Resilience Team, and I lead our chaos engineering efforts in this space. Today, I will hopefully be your favorite historian, because we'll be talking for the next hour about the history and evolution of chaos engineering. And there is a lot to talk about. Chaos engineering has been alive for more than a decade. In software engineering years, that's more than 100. So let's get into it; we have a century to cover. First, I would like to talk about the history of chaos engineering at Netflix. We'll talk about some of the turning points and technologies that we've built over the years, which will hopefully illustrate how our thinking about this discipline has evolved. Then I would like to show you what chaos experimentation looks like at Netflix today. And finally, we'll talk about how chaos engineering has grown beyond the scope of its original goal, how it has transformed many phases of our software development lifecycle, and how it impacts the lives of not only SREs, chaos engineers, or resilience engineers, but any software engineer at our company. And it would not be a Netflix presentation without screenshots of some of my colleagues, software engineers who are hard at work. So let's start with this brief history of chaos. Your Netflix experience, as you probably know, is largely powered by two large technologies. One is the cloud, made out of these little instances, and then there are the Open Connect CDNs that we own. Those are running and storing all the video bits that are served to your device when you click play. Once you click play, we call that a stream start. We keep a count of those, and we call it SPS, Stream Starts Per Second. That is the key metric that we monitor. Here is a beautiful vintage graph from our Atlas monitoring system showing how SPS changes day by day. And focusing on the cloud itself, it is made of these tiny instances, and as we all know, they can be quite finicky.
They can come and go as they want, they can be impacted by failures, and that's just something that we need to learn to accept. That's kind of the circle of life, or something like that. So when Netflix moved its services into the cloud, there was a need to make sure that the streaming experience is not impacted by these little instances going down or failing. That's why we originally built Chaos Monkey. Chaos Monkey was used to test the resilience of our streaming infrastructure to the failures of individual nodes. The tool would randomly pick a couple of instances in the data center and start shutting them down on a regular schedule. Early on in this chaos journey, it was a very simple yet very effective approach. It helped our services build in resiliency to nodes failing. And although it's a very crude tool, it became synonymous with chaos engineering for many years. In chaos engineering terminology, we can categorize the treatment that Chaos Monkey serves as a failure, and we say that failure is scoped to an individual instance. As you will see, we'll be using this terminology throughout this talk. Today, Chaos Monkey is still used at Netflix, but the value it provides is minimal. The services have simply moved on; they are resilient to failures of individual nodes. So we don't really see much benefit in running Chaos Monkey in the data center today. The use of that tool has declined, but that's also because our backend architecture is a lot more complex than just a set of instances. Speaking of which, here you can see a visualization of our cloud architecture. (audience members chuckle) And for those of you taking notes, I will pause this for a minute. Here's another, kind of a bird's-eye view of that same diagram. But if we really simplify it, what we have here is our members and their devices on the right side, sending requests to our Netflix cloud data center. These requests enter the cloud through the edge gateway, and then they start propagating through these layers of microservices, or applications, or services as we call them, represented here as individual nodes of the graph. So taking this architecture into account, we wanted to be resilient at more than just the level of individual nodes. Moreover, in 2014, Netflix experienced an outage: a service responsible for managing subscriber information was experiencing internal errors. And although this service is very important for the business, because it manages the signup of new users and handles creating or modifying a profile, it's not really critical for the streaming experience. The streaming path doesn't strictly require that service to be up, because as an already authenticated device, you should still be able to stream your favorite TV show. Yet during this outage, our members were not able to stream their favorite shows. What happened was that the service was in trouble and was correctly serving fallbacks, but the callers of the service were not able to process these fallbacks correctly.
Because this outage had never happened before, there had never been an opportunity to test that the fallbacks were being properly processed in the layers above this service. So the streaming architecture was not really resilient to these failures of a non-critical service, and that's why we had a failure. We realized that we needed to simulate these kinds of scenarios, and we needed to be able to inject failures into the services themselves. That is why we built FIT. FIT is our Failure Injection Technology, and it allows us to achieve that precision of failure injection. For any given service in our infrastructure, FIT allows us to define injection points. Those are the points where we will be able to inject these failures. We support many different types of injection points. We have, for example, injection points for the IPC libraries; that's when you talk over gRPC, REST, or GraphQL. For any call to a service, we are able to inject failures on either the server side or the client side of the communication. We have database library injection points, where you can inject failures into a call that is trying to retrieve a certain data point from your Cassandra keyspace, and so on. We have injection points for caches, Kafka, S3 buckets, and so on. FIT also allows us to define the treatment that we want to inject at that injection point. The treatment will be served by that injection point, and it can be of different types: right now we recognize a failure, a delay, or both. Both meaning that for a request impacted at the injection point, we would first delay the request and then fail it as well. And then there's the scope. The scope is used to define which injection points are gonna get triggered. So far we have seen scoping to instances, and now we are also able to scope to clusters. And then a scenario. A scenario is a blueprint of our chaos experiment. It defines a set of injection points, a set of treatments, and the scope. So it defines exactly what is gonna happen during that chaos experiment. And finally, if the scenario is a blueprint, the session is a running instance of that given FIT scenario. Let me show you how that works in practice. We can create a scenario affecting the inter-service communication between service A and service B. And we can be more precise than that: we can say the injection point is gonna be the gRPC client talking to service B, we'll scope this to cluster A, and we'll serve the treatment, which will be the failure. As you can see, in terms of precision, compared to Chaos Monkey, that is much better. And this enabled us to do game days. During such a day, a team owning a service would gather in a room, create a FIT scenario, launch it, and then observe how their service behaves. They would look at the errors for that service, other system metrics, errors in services that are calling their service, and so on. They would also be looking at the SPS graph. That way, they would be able to emulate a certain outage or certain failures. We used the game day technique to validate that the subscriber fallbacks would be properly handled by the clients: we created a FIT scenario where the injection point would be the gRPC client talking to the subscriber service. We would scope that to the playback service, which is the service handling our streaming functionality, and we would inject the failures.
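To make the injection point, treatment, scope, scenario, and session vocabulary above concrete, here is a minimal sketch of how such a data model might look. It is an illustration only; the class and field names are hypothetical, not Netflix's actual FIT API.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical data model sketching the FIT concepts described in the talk. */
public class FitScenarioSketch {

    enum InjectionPointType { GRPC_CLIENT, GRPC_SERVER, REST_CLIENT, CASSANDRA, CACHE, KAFKA, S3 }

    enum TreatmentType { FAILURE, DELAY, DELAY_THEN_FAILURE }

    /** Where the fault is injected, e.g. "the gRPC client talking to service B". */
    record InjectionPoint(InjectionPointType type, String targetService) {}

    /** What is injected at that point. */
    record Treatment(TreatmentType type, Duration delay) {}

    /** Which requests trigger the injection point: a cluster, a specific customer, or both. */
    record Scope(String cluster, String customerId) {}

    /** A scenario is the blueprint: injection points + treatment + scope. */
    record Scenario(String name, List<InjectionPoint> injectionPoints, Treatment treatment, Scope scope) {}

    /** A session is a running instance of a scenario. */
    record Session(Scenario scenario, long startedAtEpochMillis) {}

    public static void main(String[] args) {
        // The example from the talk: fail calls from cluster A's gRPC client to service B.
        Scenario scenario = new Scenario(
                "fail-service-b-from-cluster-a",
                List.of(new InjectionPoint(InjectionPointType.GRPC_CLIENT, "service-b")),
                new Treatment(TreatmentType.FAILURE, Duration.ZERO),
                new Scope("cluster-a", null));

        Session session = new Session(scenario, System.currentTimeMillis());
        System.out.println("Launched session: " + session);
    }
}
```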
Using this FIT session, we were able to trigger the fallbacks and validate whether these fallbacks were properly handled or not. That way we could file tickets, and those issues would be addressed by the engineers. So using that FIT session, we were able to make sure that the 2014 subscriber outage would never occur again. Game days actually represent another phase of chaos engineering, after Chaos Monkey. But as you can imagine, while it is an evolution in the right direction, it requires a lot of effort and time from many people to sit in that room and manually monitor what is happening with the system. And it is still quite a crude approach, because here we are failing all the requests between service A and service B. So we decided to improve our scoping ability even more. To do that, we used a request context technology, or as we call it, CRR. Let me explain how that works. A single request, as you probably know, is composed of headers and a body. CRR allows us to attach custom tags to the headers of every request that we see in our infrastructure. That way we are able to mark a certain request with the FIT failures. As this request propagates through our services and microservices, this request context information is passed along. And as the request goes through every individual injection point, that injection point is always able to inspect the headers and see whether it should get triggered for that particular request. That way we can precisely mark a given request to fail for a certain FIT scenario. So we have dramatically improved our scoping ability, going from instance, to cluster, to a single request. It doesn't get more granular than that. And so we built the FIT filter and incorporated it into our edge gateway. The FIT filter actually does this tagging for us: whenever we see a request we wanna tag, the FIT filter attaches these headers to the request, and then the request propagates through our infrastructure. Here's another example. If we revisit the scenario, but we wanna fail only those requests coming from a single device: we still define the injection point as the gRPC client talking to service B, but in the scope, not only do we say cluster A, we also define that only the requests coming from this particular customer ID, this particular device, will get the treatment, which will be the failure. As these requests travel from the device through the edge gateway, where they are marked with the FIT tags, these FIT tags get propagated through the infrastructure, and only those requests that are tagged and actually reach that injection point of the gRPC client to service B from cluster A will fail. And this is scoped to a single device only; only that one device is impacted. As simple as it is, this was a turning point in our chaos engineering, because it enabled our software developers to just create these FIT sessions and apply chaos testing directly on their own devices. They would be able to see firsthand how every particular FIT scenario impacts the customer. And we built this into a simple product, also called FIT. So we have a UI where any software engineer at the company can go and create, or select, where they want to inject the failure and which scenario they wanna use. And then they would also scope that session to a particular customer ID.
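Here is a minimal sketch of the tagging and triggering idea just described: the edge gateway attaches a fault-injection tag to requests from the device under test, the tag travels with the request context, and each injection point checks the tag before triggering. The header name and classes are illustrative assumptions, not the real FIT filter.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of request tagging: the edge gateway marks a request with a
 * fault-injection tag, the tag travels with the request context, and each
 * injection point inspects the tag before triggering.
 */
public class FitRequestTaggingSketch {

    // Hypothetical header name used to carry the failure scope with the request.
    static final String FIT_HEADER = "x-fit-scenario";

    /** Edge gateway side: tag only requests from the customer/device under test. */
    static Map<String, String> tagAtEdge(Map<String, String> headers,
                                         String requestCustomerId,
                                         String customerUnderTest,
                                         String scenarioId) {
        Map<String, String> tagged = new HashMap<>(headers);
        if (requestCustomerId.equals(customerUnderTest)) {
            tagged.put(FIT_HEADER, scenarioId);
        }
        return tagged;
    }

    /** Injection point side: trigger only if the propagated tag matches this point's scenario. */
    static boolean shouldInjectFailure(Map<String, String> headers, String scenarioId) {
        return scenarioId.equals(headers.get(FIT_HEADER));
    }

    public static void main(String[] args) {
        String scenario = "fail-grpc-client-to-service-b";

        Map<String, String> tagged = tagAtEdge(new HashMap<>(), "device-123", "device-123", scenario);
        Map<String, String> untagged = tagAtEdge(new HashMap<>(), "device-999", "device-123", scenario);

        // Only the tagged request trips the injection point; everyone else is untouched.
        System.out.println(shouldInjectFailure(tagged, scenario));   // true
        System.out.println(shouldInjectFailure(untagged, scenario)); // false
    }
}
```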
Once they launch this FIT session, they can see whether it is actually affecting the user experience. This really proved to be groundbreaking: for the first time ever, a Netflix engineer would be able to go and experiment with chaos without requiring the assistance of a chaos engineer or resilience engineer. They would be able to configure this FIT session, and within seconds they would see whether there is an impact or not. One example where this was used very recently was when we were launching the new double thumbs up feature. We let our users rate any content that they watched with thumbs down, thumbs up, or double thumbs up. And by the way, if you're gonna be rating this session, you can use double thumbs up as well, before I forget. So when we launched this feature, which is served by the ratings service, again, it's not really a feature that is critical to our streaming experience. It doesn't matter whether you're able to retrieve your rating for a given piece of content, or whether you're able to set it correctly; you should still be able to press play and the content should start. So it's really a non-critical feature. To validate that, the developer who was launching this feature was able to create a FIT scenario where they would be injecting failures into the ratings service from the playback session, again for their own device. That way, they were able to validate that there was zero SPS impact to their device, and from that they can extrapolate that there will be no SPS impact to our members. And that's why they were able to launch the feature safely and correctly. Now, if there was an impact: for every FIT session that we run, we are also recording all the traces. This is very useful information, because you'll be able to see where the failure was injected, how the system reacted, which fallbacks were triggered, how the failure was returned to callers, and where the experience actually broke. Because many times the fallback is handled correctly by the immediate caller, but as the fallback is propagated back to subsequent callers, the whole experience gets broken. So here the software engineers can just go and see the traces for a given failure. And the same interaction is available in our prod and test environments. Soon, engineers actually started integrating these FIT headers into their smoke tests and integration tests, and that way FIT became part of the regular release lifecycle. So this was really a turning point. Even now, years later, FIT testing is still a very popular method among our engineers. Especially when something breaks, they can very quickly and easily go create a FIT session and see for themselves, reproducing the issue on their own device. So this also dramatically reduces the lead time to debugging any issue. Around the same time, a different technology was making its debut. We realized, as you can see on this footage of one of our engineers deploying to production without any safeguards, (audience members chuckle) that it can be a relatively risky or stressful experience. So how do we provide safe methods of deployment without adding extra barriers that would slow down innovation, given that Netflix engineers are making hundreds of changes to production every day?
So in the face of these overwhelming odds, we were left with only one option. We decided to use science. That is why we took the randomized controlled experiment and applied it to de-risking our deployments. It is a very popular experiment design. Let's say you are developing a new drug, and you wanna see what the effect of the drug is. You take a population of, let's say, a hundred thousand people, and you split them randomly into two groups: the treatment group and the control group. The treatment group gets the treatment, while the control group gets a placebo or nothing. Then you let the experiment run for a while, and you follow up and collect results. And when you compare the results, since these groups were randomly assigned, there should be no bias, and any effect or difference between the results of these two groups can be safely attributed to the impact of the treatment that you applied. So we took this approach and applied it to our infrastructure, and that's why we created, or started using, the canary strategy. Running a canary for a certain service, service A, means that we set up two additional clusters for that service: a baseline cluster and a canary cluster. These clusters register to take a small, random portion of the production traffic, and then we observe how these two clusters behave against each other. We look at system metrics and we evaluate how the system is doing by comparing the baseline and canary signals. This is a very simple yet very effective technique, and it has become the de facto standard deployment practice in the industry. So it was only natural that we decided to marry these two technologies, the canary strategy and FIT. We would still create a canary experiment, but we would also create a FIT session for that canary experiment. That way we would still have the injection point for the gRPC client talking to service B, but the scope would be updated to only affect the A-canary cluster, the canary cluster that we set up. And then we would still be injecting the failure. As the experiment is running, we would compare how the canary cluster is behaving, in terms of system metrics and error counters, in comparison to the baseline. So this was already our first venture into large-scale, automated chaos experimentation, because here we would be able to automatically measure and compare the signals from the baseline and canary clusters. And this was a very successful approach. Canaries in general provide tremendous value to our company. But if you look at the canary itself, you realize that this approach does not tell the complete story. While it is relatively low risk and easy to set up, it does not explicitly tell us what the member, what the member devices, are experiencing, because we are focusing only on service-level metrics. We look at these two clusters, maybe look at client-side service metrics, but that's all. In reality, servers may be happily serving requests, yet our members may not be able to stream. Let me illustrate this with two examples. The first problem we can see here is that a request that goes to the canary cluster will fail. And since these requests are randomly assigned to clusters, when we do the retry, that request can randomly go to the baseline cluster and succeed.
So that will mask the problem associated with the canary cluster. The service owner may not even realize that something is wrong, because these requests are non-sticky and they kind of bounce from the canary to the baseline and succeed. So we may not spot this problem until we actually deploy to production. The second example is that the request succeeds coming back from the canary cluster and is correctly processed by the caller. But then, as this result becomes part of the responses for the upstream dependencies, it may actually cause a problem on our member's device and fail there. And we would never know, because we are looking only at the canary reports for those two clusters. Therefore, we needed a solution that would give us the ability to directly measure member impact. To solve the retry problem, we extended our FIT filter and added a new type of header, a FIT override header, so that we would be able to explicitly tell every request where it's supposed to go. We say that we make the request sticky. Let me show you how that works. When the requests are assigned to a population in our edge gateway, they get a tag: either they are a canary request or they are a baseline request. Then, as these tags propagate through our infrastructure, once a request reaches the point where it is supposed to go to service A, the VIP override in the tag kicks in and directs the request to either the canary cluster or the baseline cluster. And even if that request fails, comes back to the client call, and retries, the override will again kick in and send the request back to the canary cluster. So this is really locking, or sticking, the request to its population and to its cluster. It creates a strong signal, because the request is locked into this experience for the entire duration of the experiment, and we can clearly see whether it's being impacted. A second improvement we made was related to the way we assign requests and members, or users, to our experiment. We again extended our FIT filter, this time with a user allocation algorithm. The user allocation algorithm is implemented as a consistent hashing function, so each time we see a request from a member device, we are able to hash the request into a range. In this example, we are assigning 1% of the requests to the canary population, 1% to the baseline population, and the rest is just not affected by the experiment. So for each request, we can see which device it came from; we hash the device ID and we get where in the range it belongs, whether it's a canary or a baseline request. That way we can always consistently look at a request and determine whether it was in the experiment or not. This is the first step to actually knowing which devices were in the experiment, the first step that would enable us to compute the effect of the experiment on these two populations of users.
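A minimal sketch of that allocation idea, assuming a CRC32 hash and the 1%/1% split above purely for illustration: the same device ID always lands in the same bucket, so the edge can tag its requests and the monitoring side can later filter its events.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/**
 * Sketch of user allocation: hash each device ID into a fixed range so the
 * same device always lands in the same bucket, then carve out 1% canary and
 * 1% baseline slices. The hash choice and percentages are illustrative.
 */
public class UserAllocationSketch {

    enum Allocation { CANARY, BASELINE, NOT_IN_EXPERIMENT }

    static Allocation allocate(String deviceId, double canaryPct, double baselinePct) {
        CRC32 crc = new CRC32();
        crc.update(deviceId.getBytes(StandardCharsets.UTF_8));
        // Map the hash deterministically into [0, 1).
        double position = crc.getValue() / (double) (1L << 32);

        if (position < canaryPct) return Allocation.CANARY;
        if (position < canaryPct + baselinePct) return Allocation.BASELINE;
        return Allocation.NOT_IN_EXPERIMENT;
    }

    public static void main(String[] args) {
        // The same device ID always hashes to the same bucket, so a device can be
        // identified as "in the experiment" both when tagging requests at the edge
        // and when filtering its events later in the monitoring pipeline.
        for (String device : new String[] {"device-1", "device-2", "device-3"}) {
            System.out.println(device + " -> " + allocate(device, 0.01, 0.01));
        }
    }
}
```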
Speaking of measuring and monitoring, running these experiments may not always go as expected, as illustrated here by our chaos engineers from the Umbrella Academy project. That is why we really needed a good monitoring solution. At Netflix, we have always used our Atlas monitoring system. It is a really good system that monitors the system metrics of all our backend services, nodes, and infrastructure. But we have also built a different monitoring system that is more focused on the member experience. Here, if we zoom in on just this part of our infrastructure, we see the customer devices, the gateway, and the first tier of services. We started collecting logs from the devices, and these logs describe to us exactly what is happening on the device. They are sent along with the regular requests through our infrastructure to the Zuul gateway, where they are redirected to our queue processing infrastructure. These log events represent, for example, events such as stream start, whenever the device started streaming content; stream start error, whenever the device attempted to stream and failed; app crashes; and so on. All these events are collected in real time by our events infrastructure. Similarly, our first-tier services also collect these member-specific events, for example stream starts, again per member. Those stream starts represent the state of every member as we see it from our backend infrastructure. So collectively, these two sources of events capture the full picture of what is happening for any particular member device. And since we are sending these events to our real-time queues, we can further process them in real time. That's why we implemented an experiment monitoring system that looks only at these events. So we have constructed this event-based monitoring system: we collect all the events coming from the devices and from the first-tier services, and we filter them down, because we have that user allocation algorithm that can help us narrow down to only those events coming from the devices in the experiment. We filter them down, and we do two things with them. First, we push them into Elasticsearch, so later on we can debug and see what was happening for any particular device. But we also turn them into counters, and that way we can monitor these signals in real time. For example, here we have been running an experiment, and this graph represents all the events happening on the devices that led to an SPS error on the client side. As these events are pushed to our infrastructure and turned into counters, we can see that the number of events happening for our canary population was larger than for the baseline population. You can see there's always some noise, there are always some errors happening in both populations, but as long as there is no deviation, we let the experiment run. As soon as we detect that there is a deviation, an increase in canary population errors, we terminate the experiment early. What is important to realize here is that the signal we are looking at only represents the errors happening on the devices in the experiment. That is a major difference between this event-based monitoring system and the generic monitoring systems that we've been using at Netflix. And this data has per-second resolution. So as the experiment progresses, we are looking at these baseline and canary population signals along many dimensions. For example, we are looking at SPS error or SPS success events on both the server and client side.
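To illustrate the kind of comparison this enables, here is a deliberately naive sketch that compares per-interval canary and baseline error counters and flags a deviation. The real platform uses more careful statistics; the margins and counts below are made-up numbers.

```java
/**
 * Sketch of the real-time comparison: per-interval SPS error counters for the
 * canary and baseline populations are compared, and the experiment is shut
 * down as soon as the canary errors deviate too far from the baseline.
 */
public class DeviationCheckSketch {

    /** Decide whether the canary error signal has deviated from the baseline. */
    static boolean shouldTerminate(long canaryErrors, long baselineErrors,
                                   double relativeMargin, long absoluteMargin) {
        // Both populations always show some background noise, so require the canary
        // to exceed the baseline by both an absolute and a relative margin.
        long excess = canaryErrors - baselineErrors;
        return excess > absoluteMargin
                && canaryErrors > baselineErrors * (1.0 + relativeMargin);
    }

    public static void main(String[] args) {
        // Simulated per-second error counts {canary, baseline} for a few intervals.
        long[][] perSecondErrors = { {3, 4}, {5, 4}, {4, 5}, {25, 4} };

        for (int second = 0; second < perSecondErrors.length; second++) {
            long canary = perSecondErrors[second][0];
            long baseline = perSecondErrors[second][1];
            if (shouldTerminate(canary, baseline, 0.5, 10)) {
                System.out.println("Deviation at t=" + second + "s, terminating experiment early");
                return;
            }
        }
        System.out.println("No deviation detected, experiment ran to completion");
    }
}
```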
And since the experiment is sticky, not only the requests but also the users are locked into the experiment, for better or worse. So if they're experiencing any kind of pain, they are locked into that experiment. We usually run experiments for 20 to 40 minutes to collect enough data so that we can evaluate, and get the confidence, that the change is not impacting our SPS. But we monitor the data with per-second granularity, because for those 40 minutes that user cannot escape the experiment. That's why, each time there's a deviation, we are able to detect it, and within seconds we are able to shut down that experiment. And that brings us to where we are today: from a simple Chaos Monkey, to game days, to the single-user FIT session, to automated chaos experiments. We got to the point where we can finally execute safe, precise, and autonomous chaos experiments. So let's review all the technologies that we needed to get here. In our FIT filter, we annotate the requests with the proper tags; then our CRR request routing technology lets us route the requests where we need them. Then we use the canary strategy, and FIT scoping and FIT treatments, to scope the treatment to only the canary cluster. And finally, we use the user allocation algorithm to assign members to the experiment, and also to filter the events so that we can monitor how the devices are doing. All of this is orchestrated by our chaos experimentation platform, which we built into a tool called CHAP. Today, any software engineer can go into the tool, set up an experiment, pick a certain FIT scenario, run that experiment in production on a random set of actual production users, and see whether there is an impact. If there is an impact, the experiment is terminated early and the engineer can then investigate. So what would previously take numerous hours and numerous people sitting in a game day room all day can now be done automatically in a couple of minutes. Here we can see we've been running a subscriber experiment for that same outage that happened years ago. Now we are able to run that experiment monthly and see what the impact to our members is. And if there is any impact, the experiment will shut down, as you can see here, within a couple of minutes, and then service owners can go and investigate. So we have built what we believe is an amazing tool that runs chaos experiments. But in doing so, we have discovered something more. We have developed a tool that can precisely measure the impact of a failure in software on our members. And not only that, we can measure the impact of any software change on our members. It allows us to measure the butterfly effect of a software change somewhere deep in the infrastructure, on a certain little service, directly on our members, and we can quantify that butterfly effect. We call this a sticky canary, or a sticky experiment. And our engineers quickly noticed. They realized that running a sticky canary is the only way they can be absolutely sure that there is no negative impact to SPS, to our members, whenever they wanna push a new software change. So over the past decade, in search of this perfect chaos experiment that would only benefit a narrow group of engineers, we have evolved a tool that can redefine how we measure and deploy all the different changes to our production.
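Putting the pieces together, here is a skeleton of what such an orchestration loop might look like: start a fault-injection session scoped to the canary cluster, watch the member-facing signals, and stop early on deviation. The interface and names are hypothetical, not the CHAP API.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical skeleton of an automated experiment run. */
public class ExperimentOrchestrationSketch {

    interface ExperimentBackend {
        void startFitSession(String scenario, String canaryCluster);
        boolean memberSignalsDeviate();   // compares canary vs baseline member events
        void stopFitSession();
    }

    static void run(ExperimentBackend backend, String scenario, String canaryCluster,
                    Duration maxDuration, Duration pollInterval) throws InterruptedException {
        backend.startFitSession(scenario, canaryCluster);
        try {
            long deadline = System.nanoTime() + maxDuration.toNanos();
            while (System.nanoTime() < deadline) {
                if (backend.memberSignalsDeviate()) {
                    System.out.println("Member impact detected, terminating experiment early");
                    return;
                }
                Thread.sleep(pollInterval.toMillis()); // signals have per-second resolution
            }
            System.out.println("Experiment completed with no detected member impact");
        } finally {
            backend.stopFitSession();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake backend that reports a deviation on the third check, just to exercise the loop.
        AtomicInteger checks = new AtomicInteger();
        ExperimentBackend fake = new ExperimentBackend() {
            public void startFitSession(String scenario, String cluster) {
                System.out.println("Starting " + scenario + " scoped to " + cluster);
            }
            public boolean memberSignalsDeviate() { return checks.incrementAndGet() >= 3; }
            public void stopFitSession() { System.out.println("FIT session stopped"); }
        };
        run(fake, "fail-subscriber-from-playback", "playback-canary",
                Duration.ofMinutes(40), Duration.ofMillis(10));
    }
}
```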
And as a result of perfecting chaos, we have developed this precision and safety when manipulating our infrastructure. Chaos has become just one of many tools in our toolbox. So from chaos experiments we have moved on to infrastructure experimentation; that is the term we use when we are running these experiments. Over the years we have identified many use cases where infrastructure experimentation can be applied. Let me walk you through some of them. First, there's the sticky experiment. That's the base experiment that we use to measure the effect of a software change. In this scenario, the treatment is actually a change; the change can be a code change, a property change, or running on a new instance type, and you can still measure how that impacts our members. Then chaos just becomes a different type of sticky experiment, where the change we are validating is that the fault injection in that cluster is not impacting our members. Then we have unscoped chaos. This is an interesting application of an unscoped experiment, where we are not scoping the failures, or the experimentation, to any particular cluster. We are still assigning member devices to the populations, to the experiment, so we have two groups of devices, the canary and baseline members, and we still tag their requests with canary or baseline tags, but we do not scope to a single cluster. Instead, we still wanna inject into our gRPC client injection points talking to service B, which means that as these requests propagate through the infrastructure, whenever they reach such an injection point and wanna talk to service B, the error will be injected and those requests will fail. That way we can really nicely simulate, for example, a scenario where a whole service is experiencing a data-center-wide outage and just becomes unavailable for any call, from wherever it's being called. So this way we can simulate these outages, and we can validate and quickly determine whether a certain service is critical or non-critical to our streaming experience. Last year I was running this with one of our software engineers. We were running unscoped chaos on a simple database, and the owner of the database said, "This is not a critical path, it's gonna be great." A couple of minutes later, a different engineer comes running into the room saying, "Bandersnatch is broken," because we had broken the interactive experience in the Bandersnatch episode. And that's how we realized that this service is actually critical, and we need to treat it differently. Another type of experiment that we have here is the data experiment. Here we have extended the number of treatments that we can serve; in this case, the treatment we are serving is a new data file. What happens here is that when the canary cluster tries to access a certain data object, it will be served, for example, a new version of the data object that we are currently canarying. That is the change that is happening, and we can measure how that new data object is actually impacting our devices. And then there's the squeeze.
The squeeze is another interesting experiment type, because here we are still using the canary strategy, but we also use the ability to precisely control how much traffic we send to each cluster. As the squeeze experiment runs, it is kind of our performance evaluation tool, an experiment where we keep increasing how many requests go into the canary cluster. We do that in well-defined steps: every five minutes we bump up the number of requests that go to the cluster, and then we can still see how that cluster is behaving compared to the baseline. That is a very popular experiment type, because we can look at different performance characteristics of the service. We can tune concurrency limits or thread pools. We can, for example, also tune how your service behaves when you give it different container resources. And you can also determine what your max throughput is for a given service, which is valuable information when you do autoscaling, failovers, and cost resizing. So that is also very popular. Then we have the priority load-shedding experiment, which is another type where we take the experimentation discipline but are not experimenting on anything that is happening inside our data center. Instead, we have extended the injection point family with a new injection point, the edge, and that way we can serve different treatments in that edge layer. Here we've been able to run experiments that validate our load-shedding algorithms. We assign users to the experiment, and for the requests coming from the canary population, the lower-priority requests are throttled, but the higher-priority requests still get through. That way we can see whether that experience is impacting our members. And here's an animation that shows priority load-shedding in action, captured by one of our engineers. Again, they ran a FIT session on their device, and they were able to verify that even though errors are happening for low-priority requests and they're not able to fetch some of the information for the content, when they click play, they're still able to initiate the streaming session. Another new avenue that we opened was the orchestration of chaos experiments outside of our backend infrastructure. If you remember those little Open Connect boxes that store our catalog: we started injecting chaos into those, and we were still able to measure how that affects our member experience. We can, for example, verify that if a single box goes down, the devices can still reconnect through a different box or directly to our data center in AWS. That way we've been able to exercise fallbacks that happen completely outside of our data center. So as you can see, there are many different types of experiments we have designed over the years, and it's a very modular approach. When we create these new experiments, the only things that change are these three parameters: treatment, allocation, and scope. We can mix and match them, and that modularity lets us create many different experiment types and FIT scenarios. For treatments, we usually serve changes, failures, or data changes. For allocation, we usually have sticky or non-sticky. And for scope, we either scope the experiment to a certain cluster, or we don't scope it at all.
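A small sketch of that mix-and-match idea, with the parameter values and experiment names taken from the talk but the types themselves invented for illustration:

```java
import java.util.Map;

/**
 * Sketch of the modular experiment model: an experiment type is just a
 * combination of treatment, allocation, and scope.
 */
public class ExperimentTypesSketch {

    enum Treatment { CODE_OR_CONFIG_CHANGE, FAULT_INJECTION, DATA_CHANGE }
    enum Allocation { STICKY, NON_STICKY }
    enum Scope { SINGLE_CLUSTER, UNSCOPED }

    record ExperimentType(Treatment treatment, Allocation allocation, Scope scope) {}

    public static void main(String[] args) {
        // A few of the combinations mentioned in the talk.
        Map<String, ExperimentType> catalog = Map.of(
                "sticky canary",  new ExperimentType(Treatment.CODE_OR_CONFIG_CHANGE, Allocation.STICKY, Scope.SINGLE_CLUSTER),
                "chaos",          new ExperimentType(Treatment.FAULT_INJECTION,       Allocation.STICKY, Scope.SINGLE_CLUSTER),
                "unscoped chaos", new ExperimentType(Treatment.FAULT_INJECTION,       Allocation.STICKY, Scope.UNSCOPED),
                "data canary",    new ExperimentType(Treatment.DATA_CHANGE,           Allocation.STICKY, Scope.SINGLE_CLUSTER));

        catalog.forEach((name, type) -> System.out.println(name + " -> " + type));
    }
}
```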
And every time our resilience team is approached to design a new experiment, we are usually able to satisfy the request just by designing an experiment type that varies these three parameters. The experiment is then still able to profit from all the real-time monitoring and safety that we have built into the experimentation platform. Now, a few words about adopting chaos engineering at your organization. Introducing chaos engineering, building it into your infrastructure, is a journey. It takes time, but no matter where you and your company are on this journey, there's still that one next step that you can take. For example, if you feel like you could be affected by individual nodes going down, you can try the Chaos Monkey approach. You can either use Chaos Monkey, which is open sourced, or you can shut down instances by yourself. There's also a high chance that, by running on Kubernetes, you are probably already resilient to node failures. And you could also use AWS Fault Injection Simulator to simulate these kinds of failures. Then the next stage is canaries. I think that is a very good technology by itself that you can use in your organization and really quickly benefit from right away, even without chaos. We have open sourced Spinnaker and Kayenta, so there's a canary stage you can use to run a canary, and you can use Kayenta to analyze the system metrics. Moving on to a release cycle where you have canaries in the pipelines, in the development or deployment workflows, is a great evolution for your organization, because you are already measuring signals, comparing baseline and canary signals. You will develop the muscle that is needed when you want to experiment with chaos. Then tracing is also an important and interesting technology that will bring you closer to where, for example, our chaos experimentation is at Netflix. Because if you are using something like Zipkin, there's a high chance that your request headers are already being propagated through the infrastructure, so you would be able to attach custom headers to those requests as well. And finally, fault injection is also something that can be added to the mix. As an industry, we are in a much better place than we were 10 years ago when Netflix was building FIT for the first time, because with the evolution of the service mesh, for example, Envoy already supports some basic fault injection. The interceptors or sidecars are an excellent injection point where you can implement the logic of injecting some kind of failure or treatment. Also, Spring Boot fallbacks are already something that is available, and you can experiment with fallbacks and make sure that you actually have those fallbacks, because if you don't, there's probably no point in running those chaos experiments in the first place.
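For a sense of what the interceptor-style injection point looks like in application code (a service-mesh filter such as Envoy's fault filter gives you the same lever at the proxy, with no code changes), here is a minimal, hypothetical client wrapper; the rule shape and host names are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Minimal sketch of an in-process injection point: a thin wrapper around an
 * HTTP client that delays or aborts calls to a given service when a fault
 * rule is active, so callers can exercise their fallback paths.
 */
public class FaultInjectingClient {

    /** Hypothetical fault rule: which host to affect, how often, and how. */
    record FaultRule(String targetHost, double failureRate, long extraLatencyMillis) {}

    private final HttpClient delegate = HttpClient.newHttpClient();
    private final FaultRule rule;

    FaultInjectingClient(FaultRule rule) { this.rule = rule; }

    HttpResponse<String> send(HttpRequest request) throws IOException, InterruptedException {
        if (request.uri().getHost().equals(rule.targetHost())) {
            if (rule.extraLatencyMillis() > 0) {
                Thread.sleep(rule.extraLatencyMillis());            // injected delay
            }
            if (Math.random() < rule.failureRate()) {
                throw new IOException("Injected failure for " + rule.targetHost());
            }
        }
        return delegate.send(request, HttpResponse.BodyHandlers.ofString());
    }

    public static void main(String[] args) throws Exception {
        // Fail 100% of calls to a made-up host, adding 100 ms of latency first.
        FaultInjectingClient client = new FaultInjectingClient(
                new FaultRule("service-b.internal", 1.0, 100));
        try {
            client.send(HttpRequest.newBuilder(URI.create("http://service-b.internal/ratings")).build());
        } catch (IOException e) {
            System.out.println("Caller sees: " + e.getMessage() + " -> exercise the fallback path");
        }
    }
}
```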
Going back to the overall discipline of chaos engineering and infrastructure experimentation, I think we can summarize and say that the single chaos engineering role that was originally focused on just that chaos has grown and extended our discipline, and we have extended the range of experiments that we can run today. And those can be used by many teams in the company. So from that chaos experiment, we actually started running canaries that are run by any service in our infrastructure; change experiments; squeezes, which usually run monthly to monitor how your performance is doing over time; unscoped chaos experiments, which run quarterly whenever you need to verify whether your service is critical or non-critical; and so on. Data canaries and our priority load-shedding experiments are also popular experiment types. So infrastructure experimentation today exists with that single purpose of enabling our software engineers to innovate as fast as possible, without compromising the safety and stability of Netflix. And at Netflix, it's no secret that we want to entertain the world. As you could see in this talk, accomplishing this involves the coordination of many systems that need to work seamlessly in harmony with each other. But the final product, delivering those moments of joy to our members, makes it all worth it. Thank you very much for your attention.