Of the many discussions I've been part of in my career, one of my favorites is the one around AWS::CloudFormation, specifically the question of which is better: big stacks with lots of things, or many smaller stacks each with fewer things.
I'm going to tackle each side of the argument, and then give what I think is a reasonable conclusion to the question.
tl;dr: Big stacks are better.
Let's start with the argument for more, smaller stacks. The conversation usually opens with the claim that smaller stacks are better because it's easier to change things down the road if something needs upgrading or replacing.
Here is a use case from one of the larger companies that I've worked for. I was working as a DevOps engineer managing five mobile services. Each service had its own prod and pre-prod AWS account. Each account was created and managed by the company's AWS security team. The security team was responsible for ensuring that every service running in any of the AWS accounts ( including pre-prod things ) adhered to the company-wide security standards.
This team was also responsible for making sure that everyone was using their VPN rig, which included things like routes and security groups. Each element ( VPN, route, gateway, NAT ) was configured as a single AWS::CloudFormation stack.
The thinking here was that the security team could change out components as needed. So, for example, if at some point they needed to change a route, they would simply update the route stack and everything would be fine.
Good in theory, but we'll see how this plays out later on...
Now let's contrast that with the big stack model, where we put the majority of our resources into a single stack. We do separate the database and the cache from the application itself, so we still end up with a few stacks. However, the number of stacks in this model is noticeably smaller than with the small stack approach.
The security team would often argue that the problem with this approach is that if you update the application stack something might go wrong. This is an unfortunate line of thinking, as it comes from not understanding what actually happens when things are upgraded. This happens to be far too common in the world of cloud automation.
The argument usually comes down to fear of not knowing what's going to happen when something is changed, so everything gets broken up into smaller chunks so that a single update can't take down the body of the thing. Fear-based cloud automation should be treated like a crime. What's worse is that the documentation for AWS::CF clearly, plainly and obviously points out exactly which changes modify or replace a resource, so everything in this domain is predictable.
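To make the "predictable" point concrete, here is a minimal sketch of how an update could be inspected before it is applied. It assumes a dictionary shaped like the response of boto3's `describe_change_set` call (a `Changes` list whose entries carry a `ResourceChange` with `Action`, `LogicalResourceId` and `Replacement`). The helper name `summarize_changes` and the sample resource names are made up for illustration, and the sample data is hand-written rather than real output.

```python
from typing import Dict, List


def summarize_changes(change_set: Dict) -> Dict[str, List[str]]:
    """Group logical resource IDs by how an update would affect them.

    `change_set` is assumed to look like a describe_change_set response.
    Actions other than Add, Remove and Modify are ignored in this sketch.
    """
    summary: Dict[str, List[str]] = {
        "added": [],
        "removed": [],
        "replaced": [],
        "maybe_replaced": [],
        "modified_in_place": [],
    }
    for change in change_set.get("Changes", []):
        resource = change.get("ResourceChange", {})
        name = resource.get("LogicalResourceId", "<unknown>")
        action = resource.get("Action")
        replacement = resource.get("Replacement")
        if action == "Add":
            summary["added"].append(name)
        elif action == "Remove":
            summary["removed"].append(name)
        elif action == "Modify":
            if replacement == "True":
                summary["replaced"].append(name)
            elif replacement == "Conditional":
                summary["maybe_replaced"].append(name)
            else:
                summary["modified_in_place"].append(name)
    return summary


# Hand-written sample with hypothetical resource names, not real AWS output.
sample = {
    "Changes": [
        {"Type": "Resource", "ResourceChange": {
            "Action": "Modify", "LogicalResourceId": "AppInstance",
            "Replacement": "True"}},
        {"Type": "Resource", "ResourceChange": {
            "Action": "Modify", "LogicalResourceId": "AppSecurityGroup",
            "Replacement": "False"}},
        {"Type": "Resource", "ResourceChange": {
            "Action": "Add", "LogicalResourceId": "AppAlarm"}},
    ]
}

summary = summarize_changes(sample)
print(summary)
```

Anything that lands in `replaced` or `maybe_replaced` is what deserves a second look before the update runs; the rest is changed in place.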
The little stack approach, which was utilized by the security team, has a significant drawback: it moves the logic for what gets created, and in what order, away from AWS::CF and onto the end user. When a stack is created, the AWS::CF engine works out what gets created and when, and uses every aspect of its own internal APIs to create the stack in the most optimal way possible. That's not necessarily the fastest way, either. Optimal could mean many different things for many different scenarios.
The little stack approach forces the end user to come up with their own version of this logic. Now the end user has to create rules for what gets created and when, and also has to deal with optimizations like parallelization.
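As a rough illustration of the logic that ends up in the end user's lap, here is a small sketch that orders a handful of stacks and groups them into waves that could be created in parallel. The stack names come from the elements mentioned earlier, but the dependency edges between them are purely hypothetical, and `creation_waves` is an illustrative helper, not anything AWS provides. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependencies: each stack maps to the stacks it needs first.
STACK_DEPENDENCIES: Dict[str, Set[str]] = {
    "gateway": set(),
    "vpn": set(),
    "nat": {"gateway"},
    "route": {"gateway", "vpn"},
}


def creation_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Return stacks grouped into waves; stacks in one wave have no
    dependencies on each other and could be created in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = creation_waves(STACK_DEPENDENCIES)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Even this toy version ignores rollback, partial failure and updates. Inside a single stack, the engine handles ordering like this for you.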
In my view, the entire reason I like using AWS::CF is that I get to hand off all of the work of creating my infrastructure to my cloud vendor. This means that I don't have to spend any time dealing with rules, or with an engine for creating things. I simply create my stack and fire away.
The little stack approach is dangerous, and it reveals a lack of knowledge of this particular domain of cloud automation.
Tuesday, March 1, 2016
GCP - Create or Acquire
Create or Acquire
Quite possibly the most painful and annoying feature of the GCP compute cloud. This actually happened to us in our production environment. This little feature ended up taking down our production stack for half an afternoon. Thankfully this was before we were fully launched as a product, so it was reasonably non-impacting.
I created a DM ( Deployment Manager ) stack called “prod-gateway-0” which created a compute instance called “prod-gateway-0.” This was very early on during our development of the DM bits, so we were still learning how everything worked. I had created all of the lower “gateway” stacks without any incident, so I had confidence that this wouldn’t be a problem in production.
prod-gateway-0 came up without a problem, but I noticed that I had made a mistake in the bootup sequence. It wasn’t anything catastrophic, but I wanted to take the stack down, rebuild it and make sure it came up the way I intended. It’s important that everything comes up properly so that we don’t have danglers or one-offs that might bite us down the road.
I deleted the stack and immediately noticed that prod had basically blown its brains out. Upon investigation we realized that our old management system ( Ansible ) had created the original prod gateway instance named “prod-gateway-0.”
I had assumed two things:
- Absolutely everything tied to a stack is unique to the stack. This is not true in the case of things that already exist.
- An error would be thrown if the stack tried to create something with the same name as something that already exists.
Neither of these assumptions is true when it comes to GCP-Compute. Strangely enough, both points are true when it comes to disks. Apparently the create-or-acquire behavior only applies to compute instances.
It’s difficult to understand the design of a system that would decide to take ownership of something that already exists, and what’s more, would remove that object if the stack is removed. I would assume that if the stack took ownership of the object, it wouldn’t then delete the object since it wasn’t created by the stack.
GCP support seems to think otherwise. It’s important to remember that things are not as unique as they seem in GCP-Compute land.
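One way to avoid being bitten by this is to check for name collisions before a stack is ever created. The following is a minimal sketch of that idea only: it works on plain lists of names, and how the list of existing instance names is obtained (for example, from an instance listing) is left out and assumed. The function names are hypothetical and nothing here is part of DM itself.

```python
from typing import Iterable, List


def find_name_collisions(planned_names: Iterable[str],
                         existing_names: Iterable[str]) -> List[str]:
    """Return the planned resource names that already exist, sorted."""
    existing = set(existing_names)
    return sorted(name for name in set(planned_names) if name in existing)


def assert_safe_to_create(planned_names: Iterable[str],
                          existing_names: Iterable[str]) -> None:
    """Raise ValueError if the stack would take over an existing resource."""
    collisions = find_name_collisions(planned_names, existing_names)
    if collisions:
        raise ValueError(
            "stack would acquire existing resources: " + ", ".join(collisions)
        )


# Names from the incident above plus a hypothetical second instance.
planned = ["prod-gateway-0", "prod-gateway-1"]
existing = ["prod-gateway-0", "prod-db-0"]

collisions = find_name_collisions(planned, existing)
print(collisions)
```

Calling `assert_safe_to_create(planned, existing)` with these inputs raises a `ValueError`, which is the error I had assumed the stack itself would throw.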