Next generation cloud automation: tcp syn flooding ... why does this keep happening?

Linux itself is just a kernel; all of the crap around the kernel, from bash to KDE, is considered part of the distro. Linux distros come in many flavors, shapes and sizes. This is beneficial to everyone, as it allows for a few generic "types" of distros that can then specialize with further customizations. For example, Ubuntu is derived from the Debian base type. Similarly, CentOS is derived from Red Hat Enterprise Linux.

The differences between CentOS and RHEL are mostly related to licensing and support. Red Hat, the company, publishes, maintains and supports the Red Hat Enterprise Linux (RHEL) distribution. Their product isn't Linux; it's the support and maintenance of the distribution around Linux. CentOS is not a product; it's maintained by the community according to their perfectly adequate community standards.

Amazon also makes its own AMI. Their AMI is specifically designed to work in their EC2 environment. The expectation is that this AMI will be used in a high-traffic environment.

When having a discussion about which distribution to go with, it's important to pay attention to the expectations of use. These expectations aren't always as obvious as the expectations around the Amazon AMI. Amazon is making its expectations very clear:

"the instance that you use with this AMI will take significant amounts of traffic from one or many sources"

This expectation is set by engineers who were paid large sums of money to ensure that it is reflected in every decision made by the maintainers of the Amazon AMI. Amazon pinned their reputation on this AMI and they use it in their own production environments, which tells me quite a bit about what went into making this distro happen. They also support it as part of an enterprise support contract. All of these facts make the Amazon AMI a very solid choice for cloud operations.

Other distros, like CentOS, are far less clear on their expectations. For instance, CentOS seems to live in many worlds with no clear, specific role or specialization. However, certain choices made by the maintainers can give us a general idea of what the intent might be.

For example, if a distro chose to protect the system by default from a type of attack known as a "SYN flood", you would see the following:

cat /proc/sys/net/ipv4/tcp_syncookies
1

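This means that tcp_syncookies is enabled. With this setting, once a listening socket's SYN backlog fills up, the kernel assumes it is under a SYN flood and stops queueing new connection requests normally; it answers them with SYN cookies instead.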
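As a rough sketch of how to read that value on any given box, the helper below prints the setting together with its meaning as I understand it from the kernel's ip-sysctl documentation. The function name describe_syncookies and its optional file argument are my own additions for illustration; the argument only exists so the function can be pointed at a sample file.

```shell
#!/bin/sh
# describe_syncookies [FILE]
# Print the tcp_syncookies value found in FILE along with its meaning.
# FILE defaults to the real /proc entry; the parameter is a hypothetical
# convenience so the function can be tried against a sample file.
describe_syncookies() {
    file="${1:-/proc/sys/net/ipv4/tcp_syncookies}"
    value="$(cat "$file")" || return 1
    case "$value" in
        0) echo "0: SYN cookies disabled" ;;
        1) echo "1: SYN cookies sent when a listening socket's SYN backlog overflows" ;;
        2) echo "2: SYN cookies sent unconditionally" ;;
        *) echo "$value: unrecognized value" ;;
    esac
}
```

Calling describe_syncookies with no argument on a Linux host reads the live setting.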

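This is great for desktops, but it's directly in contrast to how servers are supposed to work. A busy server is expected to take large bursts of new connections, often from a small number of sources such as a load balancer, and to the kernel that burst looks just like a flood. This is exactly the opposite of what you want in a high-volume environment.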

It's a very easy fix, and in our case, all I had to do was add a single line to the startup script for the service nodes.

"echo 0 > /proc/sys/net/ipv4/tcp_syncookies\n",

Easy peasy.
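The echo only changes the running kernel, which is fine when a startup script reapplies it on every boot. If you would rather keep the setting in the sysctl configuration, something along these lines should work. This is a sketch, not what we ran: persist_syncookies_off is a hypothetical helper name, and it assumes the usual /etc/sysctl.conf location.

```shell
#!/bin/sh
# persist_syncookies_off [CONF]
# Append "net.ipv4.tcp_syncookies = 0" to CONF unless a tcp_syncookies
# line is already present. CONF defaults to /etc/sysctl.conf (assumed
# location; writing to it needs root).
persist_syncookies_off() {
    conf="${1:-/etc/sysctl.conf}"
    if grep -q '^net\.ipv4\.tcp_syncookies' "$conf" 2>/dev/null; then
        echo "tcp_syncookies already configured in $conf; review that line"
        return 0
    fi
    echo 'net.ipv4.tcp_syncookies = 0' >> "$conf"
}

# Typical use on a running system (as root):
#   sysctl -w net.ipv4.tcp_syncookies=0   # same effect as the echo, right now
#   persist_syncookies_off                # record it in /etc/sysctl.conf
#   sysctl -p                             # reload /etc/sysctl.conf
```

The function deliberately refuses to touch an existing tcp_syncookies line, so it never silently overrides a value someone else set on purpose.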

Here's what you would see in EC2 if you were using an ELB:

Instances are fine, everything is in "In Service" state.
You run ab -n 10000 -c 500 http://your_host/hb
About halfway through the test the ELB goes nuts, and all of the instances are in "Out of Service" state.
The application is fine, but from the load balancer's perspective the operating system seems to be completely offline, even though you can still ssh into the instance and telnet works fine.

What's happening is that the box is basically being DDoS'd by the haproxy instances that run as the ELB, or at least that's how the kernel sees it. This type of mitigation is only useful on desktops, because in the cloud, or in any kind of professionally built data center environment, SYN flood protection is usually handled at other points upstream.
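One way to check whether the kernel really is reacting to the load test this way is to watch its TCP counters before and after the ab run. The sketch below pulls three of them out of /proc/net/netstat. The function name syn_counters and its optional file argument are hypothetical, and the choice of counters is my own guess at what is most useful here.

```shell
#!/bin/sh
# syn_counters [FILE]
# Print the SyncookiesSent, ListenOverflows and ListenDrops counters from
# the TcpExt section of FILE (default: /proc/net/netstat). The section is
# two lines: a header line of names followed by a line of values.
syn_counters() {
    file="${1:-/proc/net/netstat}"
    awk '
        # First TcpExt line: remember the column names.
        $1 == "TcpExt:" && !seen {
            for (i = 2; i <= NF; i++) name[i] = $i
            seen = 1
            next
        }
        # Second TcpExt line: print the values for the columns we want.
        $1 == "TcpExt:" && seen {
            for (i = 2; i <= NF; i++)
                if (name[i] == "SyncookiesSent" || name[i] == "ListenOverflows" || name[i] == "ListenDrops")
                    print name[i], $i
        }
    ' "$file"
}
```

If SyncookiesSent climbs during the test while tcp_syncookies is 1, the kernel is treating the load balancer's traffic as a flood.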

I remember writing about this many years ago while working at Deep Rock Drive. I'm just astonished that we still see this today in professional environments.


Friday, October 4, 2013
