Sunday, September 29, 2013

Chef snippet: Creating a self-signed cert


Use this snippet to create a self-signed certificate in a few lines of Chef:

# Subject string for the certificate request, built from the node's FQDN.
server_ssl_req = "/C=US/ST=Several/L=Locality/O=Example/OU=Operations/CN=#{node[:fqdn]}/emailAddress=root@#{node[:fqdn]}"

execute "Create SSL Certs" do
  command "openssl req -subj \"#{server_ssl_req}\" -new -nodes -x509 -out /etc/nginx/cert.pem -keyout /etc/nginx/key.pem"
  # Guard the resource so the cert is only generated once.
  not_if { ::File.exist?("/etc/nginx/cert.pem") }
end

Saturday, September 28, 2013

Automation in the cloud

Automation and the cloud

Far too often I've worked for companies that are using the cloud incorrectly.  It wasn't until recently that a colleague put it perfectly when he said "we use the cloud like it's just another data center."  I couldn't have put it better myself.

When we talk about this "cloud" stuff we're talking about a fundamental shift in thinking and expectations.  I see this shift in much the same way as the change in expectations between the first time I went to ESPN and going to ESPN now.  About 10 years ago sites were flat, relatively simple and not all that complicated.  Take a look at any property out there today: what you get isn't a web page, it's an immersive experience tailored to your advertising profile.

We typically refer to this shift in expectations as "Web 2.0."  Many years from now someone will make a documentary of sorts that tells the cloud story, and in that story the word DevOps will be used to define the change in expectations we're going through now.  At the beginning of the story we have the traditional Linux/Unix engineers, or as I like to call them, "bare metal folks."  At the end of the story we have the DevOps engineers, or "cloud people" like me.

In my travels I have found that the traditional folks are generally more conservative, and thus an incredible value to me as a balance to my inventor energy.  In most cases, though, these minds forget that the change in expectations is absolutely vital to being successful in the cloud.

If you really do want to make this work there are a few basic ideas that you have to come to grips with.

Nodes are replaceable 

The idea of web000.blah.us-west-1.aws.mydomain.tld is bad.  What you're used to is an environment where a wiki page ( or similar documentation ) details each node, its role, its static IP assignment and other bits of information.  In the cloud world this doesn't exist.

This is one of the reasons why people like me are constantly banging the automation drum.  If a node has an issue in production, you simply trash it.  There is the rare edge case where you might have to take a box out of rotation for investigation, RCA, whatever, but that node should never have a path back into production.  Once it's out, it will never take production traffic again.

This is a significant hurdle for most people to get over, because it means that you have to trust the state of your environment to something like Chef, Puppet, Ansible ( ansibleworks.com ) or Salt.  In any case, trusting that something else is going to be able to do this work is difficult.

Cloud people don't struggle with this as much since we start with a configuration management tool and work from there.  If we know our automation works, then we know we can solve problems by simply trashing anything that appears broken.
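As a rough sketch of what that trust looks like in practice ( my example, not tied to any specific environment ), a Chef recipe like the one below declares the end state of a web node; any replacement instance that converges on the same recipe ends up in the same state, so trashing a broken node costs you nothing.  The package, template and service names here are illustrative:

# Install nginx from the distribution's package repository.
package "nginx"

# Render the node's config from a cookbook template ( hypothetical file ).
template "/etc/nginx/nginx.conf" do
  source "nginx.conf.erb"
  notifies :reload, "service[nginx]"
end

# Make sure nginx starts on boot and is running now.
service "nginx" do
  action [:enable, :start]
end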

Automate as much as you can

My mantra is "If it can't be automated, it shouldn't exist."  That's a pretty lofty goal for sure, but we use this as a target, and it's okay if we don't hit the target every time.  We allow for exceptions and focus our documentation efforts on those exceptions.  Everything else we try to automate as much as we can.

Brian Aker said recently in his keynote at LinuxCon/CloudOpen North America that they removed SSH from production.

Think about how this would work as a design philosophy.  You start with the idea that you will have absolutely no access to a production resource, and no way to enable SSH at any point after it's deployed.

This would require that you have everything correct before production; in other words, your testing, automation, and automated testing are so locked down that you are absolutely confident everything will work in production.
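As a hedged sketch of how that might look in Chef ( my illustration, not how Brian's team actually did it ), you can let the automation itself tear SSH out of the node; the service and package names assume a typical Linux distribution:

# Stop the SSH daemon and keep it from coming back on reboot.
service "sshd" do
  action [:stop, :disable]
end

# Remove the server package entirely so it can't be re-enabled by hand.
package "openssh-server" do
  action :remove
end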

This is the fundamental lesson to be learned from cloud computing.  It's not just a bunch of virtualization and "wasted cycles"; it's an enforcement engine for proper design.  If you can honestly say that you can safely remove SSH from all of your production bare metal resources and be perfectly fine with doing that, then you win.  However, I know for a fact that the vast majority of data center operations would not work without SSH.

Good automation makes the cloud work.  Remember, every time you solve a problem with a bash script a unicorn dies.  Do your part to stop unicorn genocide and ban bash from your configuration management solution.

Virtualization doesn't suck

Let me clarify here: virtualization at the macro level does eat cycles, affecting everything from memory usage to network IO.  We're all well aware of this; we call it a cost, something you pay in order to get something else.  The cost buys us something that most of the bare metal people forget about: an API.

I can create 50 nodes that do something neat with a single command.  You require six months of debate, committees, and purchase-order red tape to get new hardware.  Once the hardware arrives you also eat the cost of racking everything, tracking the assets in whatever horrible abomination tracks those bits, and then, finally ( in most cases ), running a semi-automated process to get the instance up.

After everything is said and done you've spent an enormous amount of time, energy and money on something you could have done in five minutes for far less.

Adding capacity to a site is an example of such a use case.
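Here's a rough sketch of what that single command can look like, using the AWS SDK for Ruby ( the aws-sdk-ec2 gem ); the AMI ID, region and instance type are placeholders:

require "aws-sdk-ec2"

ec2 = Aws::EC2::Client.new(region: "us-west-1")

# Launch 50 identical instances in one API call.
ec2.run_instances(
  image_id:      "ami-xxxxxxxx",  # placeholder AMI
  instance_type: "m1.small",      # placeholder type
  min_count:     50,
  max_count:     50
)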

I've always believed that the capacity of a single piece of hardware will never compare to what can be done with many smaller things.  If you need an actual use case for how this works, take a look at how Amazon does ELB: many virtualized haproxy nodes, not one giant piece of hardware.  Heat ( part of OpenStack ) has similar functionality that looks like ELB and just spins up haproxy nodes.

Summary

I predict that we're going to start seeing an imbalance of sorts when it comes to IT jobs.  The bare metal folks will gravitate towards the few remaining companies that either specialize in a tailored "private cloud" experience ( like Blue Box Group or RackSpace ) or can justify running their own datacenters ( Facebook or Twitter being the obvious examples ).

In either case the pool of available jobs will decrease and employers will require more and more specialization in new hires.  The sector will become a niche offering and eventually suffer from a dire lack of available, qualified people.

While this is happening, the cloud people will continue to accelerate on the cloud platforms and eventually reach critical mass, at which point we will have fully transitioned our expectations from "old and broken ESPN" to "rich and engaging ESPN."  At which point I will eat cake.




CloudFormation and VPC's

CloudFormation with VPC's and VPN's.

We have hardware VPN connections from our datacenter to our end points in the cloud.  VPC's are used on the cloud side because VPC's are awesome.

Everything is connected such that we have a single VPC and multiple subnets that can all route over the VPN back to the bare metal resources.

I use tags on the VPC, Subnet, and NetworkACL objects to determine which resource to connect to.

What would be nice is a way to reference an existing resource with a search query.

For example:

{
  "Resources": {
    "ExistingVPC": {
      "Type": "AWS::EC2::VPC",
      "Search": {
        "Dimension": { "Tag": "primary" }
      }
    }
  }
}

I'm leaving quite a bit out here, but the basic idea is to have a search query to find an existing object.  Additionally you could also include the ability to include the "Properties" hash as a series of overrides.