Tuesday, March 1, 2016

Big stack, little stack

Of the many discussions I've been a part of in my career, one of my favorites is the one around AWS::CloudFormation.  Specifically, the question of which is better: big stacks with lots of things, or many smaller stacks, each with fewer things.

I'm going to tackle each side of the argument, and then give what I think is a reasonable conclusion to the question.

tl;dr: Big stacks are better.

Let's start with the argument for many smaller stacks.  Usually the conversation opens with the claim that smaller stacks are better because they make it easier to change or upgrade things down the road.

Here is a use case from one of the larger companies I've worked for.  I was working as a DevOps engineer managing five mobile services.  Each service had its own prod and pre-prod AWS account.  Each account was created and managed by the company's AWS security team.  The security team was responsible for ensuring that every service running in any of the AWS accounts ( including pre-prod things ) adhered to the company-wide security standards.

This team was also responsible for making sure that everyone was using their VPN rig, which included things like routes and security groups.  Each element ( VPN, route, gateway, NAT ) was configured as its own AWS::CloudFormation stack.

The thinking here was that the security team could swap out components as needed.  So, for example, if at some point they needed to change a route, they would simply update the route stack and everything would be fine.

Good in theory, but we'll see how this plays out later on...

Now let's contrast that with the big stack model, where we put the majority of our resources into a single stack.  We do separate the database from the cache from the application itself, so we still end up with a few stacks.  However, the number of stacks in this model is noticeably smaller than in the small-stack approach.

The security team would often argue that the problem with this approach is that if you update the application stack, something might go wrong.  This is an unfortunate line of thinking because it's driven by not understanding what actually happens when things are upgraded.  That happens to be far too common in the world of cloud automation.

The argument usually comes down to fear of not knowing what's going to happen when something is changed, so you break everything up into smaller chunks so that a single update can't take down the whole thing.  Fear-based cloud automation should be treated like a crime.  What's worse is that the AWS::CF documentation clearly, plainly and obviously spells out exactly which property changes replace a resource, modify it in place, or leave it alone ( for example, changing an instance's ImageId forces a replacement, while changing its tags does not ), so everything in this domain is predictable.

The little stack approach, which the security team used, has a significant drawback: it moves the logic for what gets created, and when, out of AWS::CF and onto the end user.  When a stack is created, the AWS::CF engine makes certain optimizations about what gets created and when, and uses every aspect of its own internal APIs to build the stack in the most optimal way possible.  That's not necessarily the fastest way, either; optimal can mean many different things in many different scenarios.

The little stack approach forces the end user to reinvent that logic.  Now the end user has to write the rules for what gets created and when, as well as handle optimizations like parallelization.
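
To make that concrete: within a single template, AWS::CF works out the creation order on its own from the references between resources.  Here's a minimal sketch ( the AMI id is a placeholder ); the Ref is all the engine needs to know that the security group must exist before the instance, and anything unrelated can be built in parallel:

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "AppSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": { "GroupDescription": "App traffic" }
    },
    "AppInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "ImageId": "ami-12345678",
        "SecurityGroups": [ { "Ref": "AppSecurityGroup" } ]
      }
    }
  }
}

Split those two resources into two stacks and that ordering knowledge has to live in your own tooling instead.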

In my view, the entire reason I like using AWS::CF is that I get to hand off all of the work of creating my infrastructure to my cloud vendor.  This means I don't have to spend any time dealing with rules, or building an engine for creating things.  I simply create my stack and fire away.

The little stack approach is dangerous, and it betrays a lack of knowledge of this particular domain of cloud automation.

GCP - Create or Acquire

Quite possibly the most painful and annoying feature of the GCP compute cloud.  This actually bit us in our production environment: this one little feature ended up taking down our production stack for half an afternoon.  Thankfully this was before we had fully launched as a product, so the damage was reasonably contained.

I created a Deployment Manager ( DM ) stack called “prod-gateway-0”, which created a compute instance called “prod-gateway-0.”  This was very early on in our development of the DM bits, so we were still learning how everything worked.  I had created all of the lower “gateway” stacks without incident, so I was confident this wouldn’t be a problem in production.

prod-gateway-0 came up without a problem, but I noticed that I had made a mistake in the boot sequence.  It wasn’t anything catastrophic, but I wanted to take the stack down, rebuild it, and make sure it came up exactly the way I intended.  It’s important that everything come up properly so that we don’t have danglers or one-offs that might bite us down the road.

I deleted the stack and immediately noticed that prod had basically blown its brains out.  Upon investigation we realized that our old management system ( Ansible ) had created the original prod gateway instance, also named “prod-gateway-0.”

I had assumed two things:

  1. Absolutely everything tied to a stack is unique to that stack.  This is not true for things that already exist.
  2. An error would be thrown if the stack tried to create something with the same name as something that already exists.

Neither of these assumptions is true when it comes to GCP compute instances.  Strangely enough, both are true when it comes to disks; apparently the acquire behavior only applies to compute instances.
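
In repro form, the whole incident boils down to something like this ( the config file name is made up; the deployment name is the one from above ):

# DM quietly acquires the pre-existing instance instead of throwing an error...
gcloud deployment-manager deployments create prod-gateway-0 --config gateway.yaml

# ...and then happily deletes it along with the deployment.
gcloud deployment-manager deployments delete prod-gateway-0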

It’s difficult to understand the design of a system that decides to take ownership of something that already exists and, what’s more, removes that object when the stack is removed.  I would have assumed that even if the stack took ownership of the object, it wouldn’t then delete it, since the stack didn’t create it.

Apparently GCP support thinks otherwise.  It’s important to remember that things are not as unique as they seem in GCP-Compute land.

Tuesday, January 26, 2016

The FizzBuzz

http://blog.codinghorror.com/why-cant-programmers-program/

Write a program that prints the numbers from 1 to 100. But for multiples of three print "Fizz" instead of the number and for the multiples of five print "Buzz". For numbers which are multiples of both three and five print "FizzBuzz".

#!/usr/bin/env ruby
(1..100).each do |i|
  if( i % 3 == 0 && i % 5 != 0 )
    puts "Fizz"
  elsif( i % 5 == 0 && i % 3 != 0 )
    puts "Buzz"
  elsif( i % 5 == 0 && i % 3 == 0 )
    puts "FizzBuzz"
  else
    puts i
  end
end


Seems like there should be a more elegant solution to this.  But this only took about 2 minutes to do.  Arg, now I'm curious to know if this would have passed the test!  Ahh well, guess I'll just stuff this in the corner for future reference.
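
For the record, here's the tidier variant I'd probably reach for, leaning on the fact that the multiples-of-both case is just the other two strings concatenated:

#!/usr/bin/env ruby
(1..100).each do |i|
  out = ""
  out << "Fizz" if i % 3 == 0
  out << "Buzz" if i % 5 == 0
  puts out.empty? ? i : out
end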

Monday, January 4, 2016

docker-gen

This was a neat question:

https://stackoverflow.com/questions/34594654/nginx-proxy-running-multiple-ports-tied-to-different-virtual-hosts-on-one-conta

Which got me to these two repos:

https://github.com/jwilder/nginx-proxy
https://github.com/jwilder/docker-gen

This is a neat little project.  It looks like you can embed little bits of configuration hotness into a docker container and basically treat the system like a mini Chef implementation.
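
If I'm reading the README right, the basic usage is roughly this ( image names are straight from the repo examples ):

# Run the proxy with the docker socket mounted read-only; docker-gen watches
# container events and rewrites the nginx config on the fly.
docker run -d -p 80:80 -v /var/run/docker.sock:/tmp/docker.sock:ro jwilder/nginx-proxy

# Any container started with VIRTUAL_HOST set gets picked up automatically.
docker run -d -e VIRTUAL_HOST=whoami.local jwilder/whoami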

That is really cool, I'm going to have to look into this more later...

Saturday, December 13, 2014

Chef + AWS KMS

This was a ton of fun to figure out: a challenge and a journey to work out how best to integrate Chef with the new AWS KMS service.

KMS is an interesting new product from AWS.  It's server-side encryption: you send it a payload of unencrypted bits, and it returns the encrypted payload.  It's not a general-purpose encryption service, though; it isn't designed to encrypt large bundles of data like application packages, or in fact anything over 4 KB.

Because of this, we use KMS to encrypt the encryption key that we then use to decrypt our payloads.  This can get pretty confusing, but in a nutshell here's the workflow:

  1. Create a random key ( up to 4 KB ) that will be used with openssl to encrypt and decrypt large payloads.
  2. Encrypt this key with KMS and store the result in s3.
  3. On the client side, pull the s3 payload down and use KMS to decrypt it, which gives us our decryption key.
  4. Use that key to decrypt things like EDB keys or validation pem files, and possibly even larger payloads like application tarballs.

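Steps 1 and 2 aren't shown in the rake task below, so here's a rough sketch of what they might look like ( the key alias, bucket, and s3 key are placeholders, not our real values ):

require "aws-sdk-core"
require "securerandom"

kms = Aws::KMS::Client.new( :region => "us-east-1" )
s3 = Aws::S3::Client.new( :region => "us-east-1" )

## 1. Generate a random data key; well under the 4 KB KMS limit.
data_key = SecureRandom.random_bytes( 32 )

## 2. Encrypt it with the KMS master key and park the ciphertext in s3.
res = kms.encrypt({ :key_id => "alias/my-master-key", :plaintext => data_key })
s3.put_object({
  :key => "my secret key location",
  :bucket => "logging-preproduction",
  :body => res[:ciphertext_blob]
})
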
Let's kick this off by digging right into the code. This is my rake task for encrypting a payload with KMS:

namespace :encrypt do 
  task :payload, [:filename, :service_name, :env_name] do |t, args|
    cloud = AWSCloudHelper.new( args[:service_name], args[:env_name] )
    local_archive = args[:filename]

    Log.debug( "Getting key from s3." )
    s3 = cloud.get_s3()
    bucket_name = cloud.get_profile_name() ## logging-preproduction 
    enc_secret = s3.get_object({
      :key => "my secret key location",
      :bucket => bucket_name
    })
    Log.debug( "Key get complete." )

    Log.debug( "Decrypting key using KMS." )
    kms = cloud.get_kms().decrypt({ :ciphertext_blob => enc_secret.body.read })  #this is just a helper for getting a Aws::KMS::Client.new() object
    decrypted_secret = kms[:plaintext]
    Log.debug( "Decryption complete." )

    Log.debug( "Encrypting payload." )
    ## Cipher name redacted here; something like 'aes-256-cbc' would be typical.
    cipher = OpenSSL::Cipher.new('super secret encryption method')
    cipher.encrypt
    cipher.key = decrypted_secret
    encrypted = cipher.update( File.binread( local_archive ) ) + cipher.final

    f = File.open( "%s.enc" % local_archive, "wb" )
    f.print( encrypted )
    f.close()

    Log.debug( "Encryption complete." )

    Log.debug( "Pushing to s3." )
    cmd_s3_push = "aws %s s3 cp %s.enc s3://%s/" % [cloud.get_aws_opts, local_archive, bucket_name]
    Log.debug( "CMD(s3_push): %s" % cmd_s3_push )
    system( cmd_s3_push )
    Log.debug( "Push complete." )
  end
end

So, if we're encrypting an EDB key, we would create the key, store it in a file, and do something like:

rake "encrypt:payload[/my_edb.key,logging,preproduction]"

And we would end up with an encrypted payload stored in S3.

This is my lib function for getting kms-encrypted payloads in a chef recipe:

require "aws-sdk-core"

def get_kms_payload( payload )
  ## Placeholder constants; in practice these would come from node attributes
  ## or, better, an IAM instance profile.
  aws_access_key_id = ACCESS_KEY
  aws_secret_access_key = SECRET_KEY
  creds = Aws::Credentials.new( aws_access_key_id, aws_secret_access_key)

  s3 = Aws::S3::Client.new( :credentials => creds, :region => "us-east-1" )

  bucket_name = "%s-%s" % node.chef_environment.to_s.split( "-" )  ## logging-preproduction

  ## Get the encryption key used to encrypt everything.
  kms_payload = s3.get_object({
    :key => "this is where I keep my special secret payload",
    :bucket => bucket_name
  })  

  kms = Aws::KMS::Client.new( :credentials => creds, :region => "us-east-1" )
  res = kms.decrypt({ :ciphertext_blob => kms_payload.body.read })
  kms_encryption_key = res[:plaintext]

  secret_payload = s3.get_object({
    :key => payload,
    :bucket => bucket_name
  })  

  ## Now use the main decryption key with openssl to decrypt
  ## Same redacted cipher as on the encrypt side; the two must match.
  cipher = OpenSSL::Cipher.new('super secret encryption method')
  cipher.decrypt
  cipher.key = kms_encryption_key
  cipher.update( secret_payload.body.read ) + cipher.final
end

And this is my implementation:


(service_name, env_name) = node.chef_environment.to_s.split( "-" )
edb_secret = get_kms_payload( "%s.pem.enc" % env_name )
users = Chef::EncryptedDataBagItem.load( "logging", "users", edb_secret )
magic = Chef::EncryptedDataBagItem.load( "logging", "magic", edb_secret )

There are several neat things about this:

  1. The EDB key is never actually stored on disk, so it's never persisted ( the security folks should enjoy this ).
  2. KMS access is logged via CloudTrail, another +1 for the security folks.
  3. IAM is used to control access to the s3 bucket, files, and of course KMS keys.
  4. Eventually we can extend this to be more dynamic and do something crazy like roll out a new KMS key every time we build a new stack.

Great learnings in this little adventure.

Wednesday, November 20, 2013

Gut check

Let's do a quick gut check on where we are in the evolution of our information technology.

First we start with this math problem:
http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem

Now we look at Moore's law:
http://en.wikipedia.org/wiki/Moore's_law

In the abstract, Moore's law says that the capacity of our compute resources doubles roughly every 18 months; let's extend that and say that the pace of technology adoption, and our overall use of technology, doubles in that same period.

If we start at the epoch of Jan 1, 1970 and count the number of 18-month periods since then, we get about 29.2, so let's call it 29.

This puts us here:
1,073,741,823
http://www.wolframalpha.com/input/?i=%5Csum_%7Bi%3D0%7D%5E%7B29%7D+2%5Ei.%5C%2C+
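
( For anyone who doesn't want to click through, that's just the geometric series: 2^0 + 2^1 + ... + 2^29 = 2^30 - 1 = 1,073,741,823. )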

Let's put this number in context.  The previous iteration, 28, was this number ( almost exactly half, of course ):
536,870,911

The next iteration will be ( roughly double, of course ):
2,147,483,647

These are some pretty huge numbers, and with this model we can put numbers behind how fast things are about to start moving.  Look around and watch how toddlers are running iPads and adapting to technology that the older generation is simply unequipped to deal with.

If this model is accurate, then the pace of technology is going to accelerate to the point where each successive cycle doubles the velocity of the previous one, which accelerates everything even more.

This is a very exciting time to be alive!

Embedded stacks

The theater:
https://github.com/krogebry/pentecost/blob/master/templates/theater.json

This is what I refer to as the "root" template, where we set up all of our core params and start building sub-stacks.

The first subnet, known as the Ops or OpsAuto subnet, is created with this chunk:

"OpsAutoSubnet": {      "Type": "AWS::CloudFormation::Stack",      "Properties": {        "TemplateURL": "https://s3-us-west-2.amazonaws.com/cloudrim/core/subnet-opsauto.json",        "Parameters": {          "VPCId": { "Ref": "VPC" },          "CidrBlock": { "Fn::FindInMap": [ "SubnetConfig", "OpsAuto", "CIDR" ]},          "AvailabilityZone": "us-east-1b",          "InternetGatewayId": { "Ref": "InternetGateway" }        }      }    } 
The source for this can be found here:
https://github.com/krogebry/pentecost/blob/master/templates/core/subnet-opsauto.json

As you can see, the subnet-opsauto.json stack template creates a subnet within the main VPC, then attaches ACL entries and a security group.  This is very handy for encapsulating all of your security rules in one place for a given software package.

Now let's take a look at a generic subnet:
https://github.com/krogebry/pentecost/blob/master/templates/core/subnet-primary.json

I'll eventually get around to cleaning this up so it's more abstract and versatile.  The idea here is that we create a generic pattern for how we expect all of our applications to function.  At the moment I haven't defined any specific ACLs or security groups, so traffic will not be able to flow from one network to the next.
There are two ways of approaching this:

  1. Create a custom subnet definition for each application stack which defines the ACLs and security groups.
  2. Define the same groups in the root template by using Fn::GetAtt to reach into the sub-stack and pull out its Output variables ( see the sketch below ).
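
For option 2, the reach-in looks something like this ( the Output name here is hypothetical; it has to match whatever the sub-stack actually exports ):

"SubnetId": { "Fn::GetAtt": [ "OpsAutoSubnet", "Outputs.SubnetId" ] }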
Either approach is fine; it's up to you to decide which method is going to be better for the long-term health of your organization.
In either case, CI/CD is still a valid possibility.
One final thought on this subject: Huge thanks to the CloudFormation team for being totally awesome!