
Managing security for the cloud

By Jake Edge
October 8, 2014

AppSec USA

It seems likely that the folks at Netflix were unaware of the previous incarnation of Project Monterey before choosing to use that name. The first use was an attempt to unify Unix that ran aground and is part of what led SCO to attack Linux back in 2003. That's probably not the model the Netflix Cloud Security team had in mind for its tool to monitor the security state of its applications. Kevin Glisson, who is a member of that team, spoke at AppSec USA in Denver about the tool, which Netflix plans to release as open source sometime relatively soon.

Glisson started with a quick overview of some of the numbers that Netflix deals with. It has 50 million subscribers in 47 countries, and transferring movies to those folks uses 34% of the peak evening bandwidth in the US.

The engineering culture at Netflix is focused on "developer enablement", he said. Developers have the freedom to push code to the live systems at will. It is a "fail fast, learn fast" environment that stresses both freedom and responsibility. Netflix engineering also has a "DevOps" mentality, which means that engineering teams own the "end to end" problem. That includes more than just new deployments and maintenance of existing deployments; both capacity planning and procurement are added into the mix. That gives the teams better context to make decisions.

The entire deployment pipeline has been streamlined and optimized for the developers. The code for an application lives in Git, then gets built into a .deb or RPM package, depending on where it will be running. That package, plus a "base image" of the operating system (OS), gets "baked" into a new image that never changes; new versions of the application are used to create new images instead. This keeps deployments uniform throughout the entire system. The OS images are turned into virtual machine images for use on Amazon Web Services (AWS).
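
The immutable-image idea can be illustrated with a short sketch. It assumes that an image is fully determined by its base OS image and its application package. The `BaseImage`, `AppPackage`, `BakedImage`, and `bake()` names are hypothetical and are not Netflix's tooling. The sketch only shows two things: images are frozen values, and a new application version yields a new image rather than a modified one.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class BaseImage:
    """A hypothetical base OS image; frozen so it cannot be mutated."""
    name: str
    version: str


@dataclass(frozen=True)
class AppPackage:
    """A hypothetical application package built from Git."""
    app: str
    version: str
    fmt: str  # "deb" or "rpm", depending on the target OS


@dataclass(frozen=True)
class BakedImage:
    base: BaseImage
    package: AppPackage
    image_id: str


def bake(base: BaseImage, package: AppPackage) -> BakedImage:
    """Combine a base image and a package into a new immutable image.

    The ID is derived only from the inputs, so baking the same inputs gives
    the same image, and a new application version always gives a new ID.
    """
    key = f"{base.name}:{base.version}|{package.app}:{package.version}.{package.fmt}"
    image_id = "img-" + hashlib.sha256(key.encode()).hexdigest()[:12]
    return BakedImage(base, package, image_id)


base = BaseImage("base-os", "2014.10")
v1 = bake(base, AppPackage("api", "1.0", "deb"))
v2 = bake(base, AppPackage("api", "1.1", "deb"))
print(v1.image_id != v2.image_id)  # True: a new version means a new image
```

Because every dataclass is frozen, "updating" a deployed image is not possible in this model. The only way forward is to bake another one.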

The company has "tens of thousands" of AWS instances running at any given time. The evening peak doubles the number of instances. Netflix relies on the ability of Amazon to provide more capacity based on Netflix's load, Glisson said. There are hundreds of applications running, sometimes with dozens of versions of a single application. Typically, that is for A/B testing, where the developers are looking to see if a new feature makes customers happier or provides some other measurable benefit.

As one might guess, all of that makes it difficult to do security checks on new code. Monterey was built to help with that problem. It is important that there are no checkpoints or security gates in the development process that would slow down developers, but the security team still needs to do its work.

At its core, Monterey is a framework for monitoring and evaluating the "security state" of applications. It makes it easy to integrate existing tools into the system so that fewer wheels need to be reinvented. It is Python-based, which made it "easy for security engineers to write tests", he said.

Monterey has four separate parts: discovery, inventory, scanning, and results. Discovery uses APIs from other tools (e.g. bug trackers) to gather information on the state of the environment. That can include the existence of a new application or things like "how many bugs are open" for an application. Inventory collects information on the security state of the known applications, for example: "how many times have I scanned that application this month?"
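
A minimal sketch of the discovery and inventory parts follows. It makes two assumptions: a discovery source reports an application name plus a count of open bugs, and inventory simply counts scans per calendar month. The `Inventory` class and its methods are hypothetical illustrations, not Monterey's API.

```python
from collections import Counter
from datetime import date
from typing import Dict, Set


class Inventory:
    """Hypothetical record of what is known about each application."""

    def __init__(self) -> None:
        self.known_apps: Set[str] = set()
        self.open_bugs: Dict[str, int] = {}
        self.scans: Counter = Counter()  # keyed by (app, year, month)

    def discover(self, app: str, open_bugs: int) -> bool:
        """Record a report from a discovery source such as a bug tracker.

        Returns True when the application had not been seen before.
        """
        is_new = app not in self.known_apps
        self.known_apps.add(app)
        self.open_bugs[app] = open_bugs
        return is_new

    def record_scan(self, app: str, when: date) -> None:
        self.scans[(app, when.year, when.month)] += 1

    def scans_this_month(self, app: str, today: date) -> int:
        """Count scans of app in the calendar month containing today."""
        return self.scans[(app, today.year, today.month)]


inv = Inventory()
print(inv.discover("api", open_bugs=3))  # True: a new application
print(inv.discover("api", open_bugs=2))  # False: known, bug count updated
for day in (date(2014, 9, 30), date(2014, 10, 1), date(2014, 10, 8)):
    inv.record_scan("api", day)
print(inv.scans_this_month("api", date(2014, 10, 20)))  # 2
```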

Scanning takes the information from the discovery phase and actively scans applications using both open source and proprietary tools. It uses tools like Nmap, Zed Attack Proxy (ZAP), and others to try to detect problems with the applications. Finally, results gathers all of the information and tries to make correlations. It also ranks the severity of problems that were found, which helps the Cloud Security team focus its efforts, Glisson said.
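
The scanning and results parts might be sketched as follows. The sketch assumes that each scanner is a callable returning findings for an application. It also assumes that "correlation" means merging reports of the same issue from different tools while keeping the highest severity. The severity scale, the `Finding` type, and the two fake scanners (standing in for tools like Nmap or ZAP) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical severity scale; higher numbers are more urgent.
SEVERITY = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass(frozen=True)
class Finding:
    app: str
    issue: str
    severity: str
    tool: str


Scanner = Callable[[str], List[Finding]]


def scan(apps: List[str], scanners: List[Scanner]) -> List[Finding]:
    """Run every scanner against every discovered application."""
    return [f for app in apps for s in scanners for f in s(app)]


def results(findings: List[Finding]) -> List[dict]:
    """Merge the same issue reported by several tools, then rank by severity."""
    merged: Dict[Tuple[str, str], dict] = {}
    for f in findings:
        entry = merged.setdefault((f.app, f.issue), {
            "app": f.app, "issue": f.issue, "severity": f.severity, "tools": set()})
        entry["tools"].add(f.tool)
        if SEVERITY[f.severity] > SEVERITY[entry["severity"]]:
            entry["severity"] = f.severity  # keep the worst rating any tool gave
    # Most severe first; among equals, issues confirmed by more tools first.
    return sorted(merged.values(),
                  key=lambda e: (SEVERITY[e["severity"]], len(e["tools"])),
                  reverse=True)


def fake_port_scan(app: str) -> List[Finding]:
    if app != "api":
        return []
    return [Finding(app, "unexpected port 8080", "medium", "port-scan")]


def fake_web_scan(app: str) -> List[Finding]:
    return [Finding(app, "reflected XSS", "high", "web-scan"),
            Finding(app, "unexpected port 8080", "low", "web-scan")]


for row in results(scan(["api", "web"], [fake_port_scan, fake_web_scan])):
    print(row["severity"], row["app"], row["issue"], sorted(row["tools"]))
```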

Traditionally, security teams work with an application for a few months, then make a report on the problems found. His team cannot do it that way, because it would slow down the developers too much. Instead, the team must find and fix problems in already-deployed applications.

A basic unit of work in Monterey is handled by a "monklet". Monklets are simple, stateless workers that just "receive a job and execute it". A monklet is "like a soldier", he said; it just takes orders and reports success or failure. Monklets provide the integration point between Monterey and external tools and resources. They scale well and can be used for all of the different parts of Monterey, not just scanning.
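
The monklet contract described above (receive a job, execute it, report success or failure) can be sketched as a stateless function. `Job`, `Report`, `run_monklet()`, and the fake bug-tracker lookup are hypothetical names used for illustration. How Monterey actually queues and dispatches jobs was not described in the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Job:
    monklet: str  # which monklet should handle the job
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Report:
    job: Job
    ok: bool
    detail: Any


def run_monklet(handler: Callable[..., Any], job: Job) -> Report:
    """Execute one job and report the outcome.

    Nothing is kept between calls, so in this sketch any number of workers
    can run side by side, each handling whatever job it is given.
    """
    try:
        return Report(job, True, handler(**job.params))
    except Exception as exc:  # report the failure instead of crashing
        return Report(job, False, repr(exc))


def count_open_bugs(app: str) -> int:
    fake_tracker = {"api": 3}  # stand-in for a real bug tracker query
    return fake_tracker[app]


print(run_monklet(count_open_bugs, Job("count-open-bugs", {"app": "api"})).ok)
print(run_monklet(count_open_bugs, Job("count-open-bugs", {"app": "nope"})).detail)
```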

Monterey was designed to easily and quickly integrate new third-party tools. That was done to prevent vendor lock-in and to be able to use the best-in-class tools. For example, Monterey was never intended to have its own scanning engine, but to use what is out there instead, Glisson said.
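
One common way to make tools pluggable is a registry that maps a name to a small wrapper function. Adding or replacing a tool then means registering one function. This is a hypothetical sketch of that general pattern, not Monterey's plugin mechanism. The `monklet` decorator, `dispatch()`, and the canned scanner output are invented for illustration.

```python
from typing import Callable, Dict, List

ToolWrapper = Callable[[str], List[str]]
MONKLETS: Dict[str, ToolWrapper] = {}


def monklet(name: str) -> Callable[[ToolWrapper], ToolWrapper]:
    """Register the decorated function as the wrapper for the named tool."""
    def register(func: ToolWrapper) -> ToolWrapper:
        MONKLETS[name] = func
        return func
    return register


def dispatch(name: str, target: str) -> List[str]:
    """Callers only know the tool's name, never the vendor behind it."""
    return MONKLETS[name](target)


@monklet("web-scan")
def vendor_a_scan(target: str) -> List[str]:
    # A real wrapper would invoke a scanner and normalize its output;
    # canned results keep this sketch self-contained.
    return [f"{target}: missing X-Frame-Options"]


print(dispatch("web-scan", "api.example.com"))


@monklet("web-scan")  # swapping vendors re-registers the same name
def vendor_b_scan(target: str) -> List[str]:
    return [f"{target}: missing X-Frame-Options", f"{target}: weak TLS cipher"]


print(dispatch("web-scan", "api.example.com"))
```

Since callers go through `dispatch()` by name, replacing one vendor's tool with another does not touch them. That is one way to get the freedom from lock-in that Glisson described.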

As an example, he described a test that is currently run. It queries AWS using Asgard, which is an open-source cloud-management tool that Netflix released, looking for applications that have internet access. Once found, it fingerprints them with Nmap to see if there are new ports exposed unexpectedly. It also uses the Arachni web application security scanner to scan the HTTP and HTTPS ports. The results are uploaded to ThreadFix, which can provide a high-level view of the results.
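
The "new ports exposed unexpectedly" check reduces to a set difference between the ports a scan finds open and the ports an application is expected to expose. The sketch below assumes the `Ports:` field layout of Nmap's grepable output (`-oG`), where each entry looks like `22/open/tcp//ssh///`. Verify that layout against the Nmap version in use. The expected-port list and the helper names are hypothetical, and the Asgard query and the Arachni/ThreadFix steps are not shown.

```python
from typing import Dict, Set


def open_ports_from_grepable(line: str) -> Set[int]:
    """Extract open TCP ports from one 'Host:' line of Nmap -oG output."""
    ports: Set[int] = set()
    for chunk in line.split("\t"):
        if not chunk.startswith("Ports:"):
            continue
        for entry in chunk[len("Ports:"):].split(","):
            parts = entry.strip().split("/")  # port/state/proto/owner/service/...
            if len(parts) >= 3 and parts[1] == "open" and parts[2] == "tcp":
                ports.add(int(parts[0]))
    return ports


def unexpected_ports(observed: Dict[str, Set[int]],
                     expected: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Return, per host, the open ports that are not on the expected list."""
    surprises = {host: ports - expected.get(host, set())
                 for host, ports in observed.items()}
    return {host: extra for host, extra in surprises.items() if extra}


line = ("Host: 10.0.0.5 ()\tPorts: 22/open/tcp//ssh///, 80/open/tcp//http///, "
        "8080/open/tcp//http-proxy///\tIgnored State: closed (997)")
observed = {"10.0.0.5": open_ports_from_grepable(line)}
print(unexpected_ports(observed, {"10.0.0.5": {22, 80}}))  # {'10.0.0.5': {8080}}
```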

That kind of a scan could be done on a scheduled basis, but that's "not quick enough", he said. Instead, Monterey is notified of newly deployed applications. It then runs a basic web application scan on the application. Those results are sent to ThreadFix and alerts are generated for any regressions. Problems in the deployed applications have to be caught and fixed quickly. Normally, the fix is to roll that application back to a previous, working version. The system keeps many previous versions of each application around to facilitate that.
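
The deploy-triggered flow (scan the new version, alert on regressions, roll back) might be sketched as below. The sketch assumes that scan results have already been collected per version as sets of finding names. It treats a "regression" as any finding the previous version did not have. `on_deploy()`, `regressions()`, and the version labels are hypothetical. A real system would send the alert to a tool such as ThreadFix rather than printing it.

```python
from typing import Dict, List, Optional, Set


def regressions(previous: Set[str], current: Set[str]) -> Set[str]:
    """Findings present in the new version's scan but not the previous one's."""
    return current - previous


def on_deploy(version: str, history: List[str],
              scan_results: Dict[str, Set[str]]) -> Optional[str]:
    """React to a deployment notification.

    history lists retained versions oldest to newest, and scan_results maps
    each version to the findings from its basic web scan. Returns the
    version to roll back to if the new one regressed, otherwise None.
    """
    idx = history.index(version)
    if idx == 0:
        return None  # nothing to compare against or roll back to
    previous = history[idx - 1]
    new_issues = regressions(scan_results[previous], scan_results[version])
    if not new_issues:
        return None
    print(f"ALERT {version}: new findings {sorted(new_issues)}")
    return previous


history = ["v41", "v42", "v43"]
scan_results = {
    "v41": {"missing HSTS"},
    "v42": {"missing HSTS"},
    "v43": {"missing HSTS", "reflected XSS on /search"},
}
print(on_deploy("v43", history, scan_results))  # alerts, then prints v42
```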

Monterey is "fairly full-featured", Glisson said, and is being used internally. There are still some things that could be done to make it more efficient and to scale better. The team is always looking for new workflows, use cases, and ways to add more context. It is also on the lookout for new tools that could be added as monklets. It is "not quite ready yet" for an open-source release, but he looks forward to that happening so that others can get involved by writing new monklets and contributing in other ways.

His group at Netflix needed a different approach to application security. His team is like a consulting organization that "builds guard rails, not roadblocks". Current security tools are not built for the kind of scale that Netflix, like lots of other organizations, is dealing with. They focus on the penetration test (pentest) workflow, but there needs to be more than just that kind of testing, Glisson said.

The Monterey project is not intended to replace pentesting, but to be complementary to it. The idea is for the framework to be a "force multiplier". Netflix has found that it makes sense to invest in tools that help use other tools; Monterey is an example of that. Based on audience response, there are other organizations struggling with similar problems, so it will be interesting to see if Monterey (perhaps re-branded?) can help fill those gaps once it has been released.


Index entries for this article
Security: Cloud
Security: Tools
Conference: AppSec USA/2014



Posted Oct 21, 2014 17:41 UTC (Tue) by nye (guest, #51576)

I keep seeing interesting articles about Netflix that describe a service that's being developed at a breakneck pace, and yet from an end-user perspective it seems like Netflix barely changes, and even then only in trivial ways.

Clearly I'm missing something. Does anyone know of any articles or resources that give any more information on what they're doing behind the scenes that requires such rapid development?

Posted Oct 21, 2014 19:59 UTC (Tue) by mathstuf (subscriber, #69389)

They run a daemon called "Chaos Monkey" which causes failures in random spots (up to and including cutting off access between data centers) to ensure they are always reliable and resilient (disaster fuzzing? fault fuzzing?). I imagine that a non-zero amount of time is spent dealing with new issues it finds. Maybe they're getting ready to serve up 4K video? Though I wonder if ISPs will be ready for that…

Posted Oct 23, 2014 13:09 UTC (Thu) by nye (guest, #51576)

Yeah, I wonder how quickly Netflix is growing in terms of both users and bandwidth (as existing users increase their usage or view more streams in higher definitions).

Maybe it's almost all about constantly keeping up with a need for higher capacity.

Posted Oct 21, 2014 21:57 UTC (Tue) by tialaramex (subscriber, #21167)

Frantic change with no _apparent_ benefits or even consequences for the end customer is the rule rather than the exception for some mature businesses. The companies charged with delivering my gas, electricity, and water are all engaged in colossal projects to replace and repair their networks. In my city they tore out practically every meter of mains gas pipe, digging up roads for weeks at a time. Do you know how long I didn't have gas for? Trick question: of course, cutting off the supply of gas needed for heating would be politically impossible. The entire mains system was replaced without (except where accidents necessitated) cutting off anyone's supply. Ordinary customers only know the pipes were replaced because there were adverts about it, or because they saw trucks with the gas supply network's name on them parked up at all the major roadworks for months.

On the railways, teams of workers slice out and replace sections of rail overnight. Nobody on that first train the next morning says "Oh yeah, this railway service is much better now that the 0.8% twist fault on the up main just before S/443 is fixed"; to them the service is unchanged, but a project manager probably spent a week arranging to replace that section of rail without disrupting service.


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds