I have a whole new appreciation for Netflix.
It should be pretty simple, right? I turn on my smart TV. The Netflix app authenticates me & checks I’ve paid my subscription fee. It delivers content suggestions based on my profile & my previous viewing choices. And it does this on an enormous scale, globally. So you assume they have some pretty big, redundant networks and servers.
That’s the problem with consumer technology that’s this good – we assume it’s simple.
The Atlas team of 5 people (1 devops engineer and 4 platform engineers) might even agree with you, to a point. You see, they’ve made the complicated things simple. Or at least I think they’re complicated. And Matthew Johnson has an incredible way of explaining concepts and details in English, without talking down to you. So without understanding the nuances of his technology, I could follow the bouncing ball as he explained their telemetry system. It’s pretty damn cool.
Like all good telemetry systems, Atlas provides an automated process by which measurements are made and other data is collected and transmitted to receiving equipment for monitoring. Thanks, Wikipedia. I had no idea.
Before I start talking about some impressive numbers, it’s important to note that we’re talking about the data collection from one part of the Netflix machine – the cloud control plane. This bit runs the Netflix API and handles the user data, signups, billing, etc. Nothing to do with content delivery. Nope, you haven’t even pushed play yet.
Atlas runs in AWS in real time, storing data in memory. It holds the last 14 days; anything older is persisted out to S3, accessible via Hive jobs. It’s designed for operational intelligence – do we have a problem now? Are we currently in a peak demand time or off-peak time? This is operational monitoring on steroids.
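To make that hot/cold split concrete, here’s a toy Python sketch of the idea – recent points stay in memory, and anything past the retention window is handed off to an archival tier. This is purely illustrative: the class, callback, and retention constant are my own invention, not Atlas’s actual implementation.

```python
from collections import deque

RETENTION_SECONDS = 14 * 24 * 3600  # keep the last 14 days hot in memory


class RollingWindowStore:
    """Toy in-memory store: recent points stay hot, expired points
    are handed to a persist callback (standing in for the S3 tier)."""

    def __init__(self, persist, retention=RETENTION_SECONDS):
        self.points = deque()    # (timestamp, value) pairs, oldest first
        self.persist = persist   # called with each expired point
        self.retention = retention

    def add(self, timestamp, value):
        self.points.append((timestamp, value))
        self._evict(timestamp)

    def _evict(self, now):
        # Push everything older than the window out to cold storage.
        cutoff = now - self.retention
        while self.points and self.points[0][0] < cutoff:
            self.persist(self.points.popleft())


# Usage: with a 60-second window, the point at t=0 is archived
# as soon as a point arrives at t=100.
archived = []
store = RollingWindowStore(archived.append, retention=60)
store.add(0, 1.0)
store.add(100, 2.0)
```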
It’s collecting on average 2 billion metrics per region per minute. Minute. Can you imagine how long it takes to query that amount of data?
Not 16 seconds. That’s not a smudge on your screen. One point six seconds.
Well, in one example anyway. But that example is fairly typical. Running a particular query on 2 weeks’ worth of data, 1.4 billion input points, 1.3 million output points across 2 output lines … renders your results graph in 1.6 seconds.
And I haven’t mentioned yet that it’s available as Open Source on Github: https://github.com/Netflix/atlas/wiki/Overview
So just how do they keep it so speedy?
The SLA for metrics is 5 minutes. Anything that takes longer than that will be terminated.
Output lines are restricted to a maximum of 1024. Smaller number of output lines = faster response.
Maximum query times and max concurrent queries are enforced.
Alerts are limited to the last hour. Anything else is irrelevant. So only the last hour feeds into engineers for investigation.
They keep stateful and historic data clusters separate.
They use rollups and filters on data.
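Rollups are worth a quick illustration: instead of keeping every raw sample, you aggregate fine-grained points into coarser time buckets, which shrinks what a query has to scan. The sketch below is my own simplification, not Atlas’s rollup logic – it keeps count, sum, and max per bucket so averages and peaks survive the aggregation.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Aggregate (timestamp, value) samples into fixed-size time buckets,
    keeping count/sum/max so averages and peaks survive the rollup."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for ts, value in points:
        bucket = buckets[ts - ts % bucket_seconds]  # start of the bucket
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["max"] = max(bucket["max"], value)
    return dict(buckets)


# Usage: three raw samples rolled up into 60-second buckets.
samples = [(0, 1.0), (30, 3.0), (61, 5.0)]
rolled = rollup(samples, 60)
# rolled[0] covers t=0..59 (two samples), rolled[60] covers t=60..119.
```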
This is just the tip of the iceberg regarding the technology in use in this company.
Servo comes into play as a Java library to create & collect metrics; it supports timers, counters and gauges. Spectator is a next-gen API wrapper sitting on top of Servo.
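Servo and Spectator are Java libraries, but the three meter types are easy to picture in any language. Here’s a toy Python sketch of what each one does – these classes are my own illustration, not the Servo or Spectator API.

```python
class Counter:
    """Monotonic count of events, e.g. requests served."""
    def __init__(self):
        self.count = 0

    def increment(self, amount=1):
        self.count += amount


class Timer:
    """Records how long operations take: number of events plus total time."""
    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def record(self, seconds):
        self.count += 1
        self.total_seconds += seconds


class Gauge:
    """Samples a current value on demand, e.g. queue depth."""
    def __init__(self, sample_fn):
        self.sample_fn = sample_fn

    def value(self):
        return self.sample_fn()


# Usage: count a request, time it, and gauge a queue's depth.
requests = Counter()
latency = Timer()
queue = ["job-a", "job-b", "job-c"]
queue_depth = Gauge(lambda: len(queue))

requests.increment()
latency.record(0.025)
```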
The Atlas plugin (not open source) handles automatic tagging and metrics batching, and uses a binary JSON codec that dramatically improves the performance of sending metrics to the Atlas backend. That’s important when some apps send up to 350,000 metrics per instance per minute.
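Batching is the easy half of that win to sketch: at 350,000 metrics per instance per minute, one request per metric would drown the backend, so you buffer and ship in bulk. The publisher below is a hypothetical illustration of that pattern, not the Atlas plugin’s actual code (and it skips the binary codec entirely).

```python
class BatchingPublisher:
    """Buffer metrics and ship them in batches instead of making
    one request per metric -- the idea behind the plugin's batching."""

    def __init__(self, send, batch_size=10000):
        self.send = send            # called with a list of metrics
        self.batch_size = batch_size
        self.buffer = []

    def publish(self, metric):
        self.buffer.append(metric)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Ship whatever is buffered, even a partial batch.
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []


# Usage: 7 metrics with a batch size of 3 -> two full batches,
# then a final partial flush of 1.
sent_batches = []
publisher = BatchingPublisher(sent_batches.append, batch_size=3)
for i in range(7):
    publisher.publish(("cpu.usage", i))
publisher.flush()
```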
Their ‘critical metrics’ run as a separate stack, collected every 10 seconds and dropped after 2 hours.
They’ve also got on-instance alerts and Atlas polling to monitor the monitoring, providing health checks of the Atlas system itself.
The Atlas Publish component now autoscales – but Netflix had to build that themselves after running into manual scaling issues. They use a rolling red-black deployment of instances for high availability, which is normally a 12-step process. So Atlas Deploy was created as a ‘stateless multi-threaded deployment automation service that enforces immutable infrastructure with declarative desired end-state configuration’. Or, as I like to call it, ‘self-spawning’.
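That mouthful of a description boils down to a reconcile loop: you declare the end state you want, and the service drives reality toward it – never mutating instances in place, only launching or terminating them. The function below is my own toy sketch of that idea, not Atlas Deploy’s implementation.

```python
def reconcile(desired_count, running_ids, launch, terminate):
    """Drive actual state toward the declared desired state.
    Instances are immutable: they are only launched or terminated,
    never modified in place."""
    running = list(running_ids)
    while len(running) < desired_count:       # too few -> launch more
        running.append(launch())
    while len(running) > desired_count:       # too many -> retire surplus
        terminate(running.pop())
    return running


# Usage: declare "3 instances"; one already exists, so two are launched.
next_id = iter(range(100, 200))
terminated = []
state = reconcile(
    desired_count=3,
    running_ids=["i-1"],
    launch=lambda: f"i-{next(next_id)}",
    terminate=terminated.append,
)
```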
But let me end this geek fest with some observations.
Nothing out of the box would scale up to what Netflix needed. Most of the technology had to be built in-house, but Netflix have fed Atlas improvements back into the open source code.
Matthew’s advice was: “If you have the opportunity to invest serious engineering time into metrics, it pays off.” He also mentioned reporting on unused metrics – stop collecting them to get more efficient.
The most outstanding thing for me, though, was his understanding of how his role in the company and his code actually help ‘the Business’. Yes, it’s very cool tech stuff. Yes, making it go faster and be more efficient is rewarding. But Matthew gave the best expression of what business Netflix is in and how that business is successful – unscripted, not on the PowerPoint slides – when asked a question:
“Our goal is to win moments of truth. When you go home at the end of the day and sit down with your family & say what do I want to do to relax this evening, if you choose Netflix, we win that moment of truth. We’ve done our job right. We’ve made it easy for you to find something you’re interested in watching.”
I hope that in your company, your techs have the company vision in their hearts as well as Matthew does.
P.S. You can watch Matthew’s full presentation here: https://vimeo.com/146051830
Disclaimer: I attended the Netflix presentation as a guest of Tech Field Day at Data Field Day Roundtable 1. My travel expenses were covered, but I’m not being paid to write this post.