Wednesday, November 20, 2013

Gut check

Let's do a quick gut check on where we are in the evolution of our information technology.

First we start with this math problem:
http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem

Now we look at Moore's law:
http://en.wikipedia.org/wiki/Moore's_law

In the abstract, Moore's law states that the capacity of our compute resources doubles roughly every 18 months; let's extend that to say that the pace of technology adoption, and our overall use of technology, doubles in that same period.

If we start at the epoch of Jan 1, 1970 and count the number of 18-month periods since then, we get roughly 29.3, so let's round that to 29.

This puts us here:
1,073,741,823
http://www.wolframalpha.com/input/?i=%5Csum_%7Bi%3D0%7D%5E%7B29%7D+2%5Ei.%5C%2C+
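
That WolframAlpha sum is just the closed form of a geometric series:

    \sum_{i=0}^{n} 2^i = 2^{n+1} - 1, \qquad \sum_{i=0}^{29} 2^i = 2^{30} - 1 = 1,073,741,823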

Let's put this number in context: the previous iteration, 28, gives this number ( almost exactly half, of course ):
536,870,911

The next iteration will be ( almost exactly double, of course ):
2,147,483,647

These are some pretty huge numbers, and in this model we can put numbers behind how fast things are about to start moving.  Look around you and watch how toddlers are running iPads and adapting to technology that the older generation is simply unequipped to deal with.

If this model is accurate, then the pace of technology is going to accelerate to the point where each successive cycle doubles the velocity of the previous one, which accelerates everything even more.

This is a very exciting time to be alive!

Embedded stacks

The theater:
https://github.com/krogebry/pentecost/blob/master/templates/theater.json

This is what I refer to as the "root" template, where we set up all of our core params and start building sub-stacks.

The first subnet, known as the Ops or OpsAuto subnet, is created with this chunk:

"OpsAutoSubnet": {      "Type": "AWS::CloudFormation::Stack",      "Properties": {        "TemplateURL": "https://s3-us-west-2.amazonaws.com/cloudrim/core/subnet-opsauto.json",        "Parameters": {          "VPCId": { "Ref": "VPC" },          "CidrBlock": { "Fn::FindInMap": [ "SubnetConfig", "OpsAuto", "CIDR" ]},          "AvailabilityZone": "us-east-1b",          "InternetGatewayId": { "Ref": "InternetGateway" }        }      }    } 
The source for the sub-stack itself can be found here:
https://github.com/krogebry/pentecost/blob/master/templates/core/subnet-opsauto.json

As you can see, the subnet-opsauto.json stack template creates a subnet within the main VPC, then attaches ACL entries and a security group.  This is very handy for being able to encapsulate all of your security rules in one place for a given software package.
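
If you don't want to click through, here's a rough sketch of the shape of that kind of sub-stack template ( resource names and rules here are illustrative, and I've left the ACL entries out for brevity ):

    {
      "Parameters": {
        "VPCId": { "Type": "String" },
        "CidrBlock": { "Type": "String" },
        "AvailabilityZone": { "Type": "String" },
        "InternetGatewayId": { "Type": "String" }
      },
      "Resources": {
        "Subnet": {
          "Type": "AWS::EC2::Subnet",
          "Properties": {
            "VpcId": { "Ref": "VPCId" },
            "CidrBlock": { "Ref": "CidrBlock" },
            "AvailabilityZone": { "Ref": "AvailabilityZone" }
          }
        },
        "SecurityGroup": {
          "Type": "AWS::EC2::SecurityGroup",
          "Properties": {
            "VpcId": { "Ref": "VPCId" },
            "GroupDescription": "Rules for everything living in this subnet",
            "SecurityGroupIngress": [
              { "IpProtocol": "tcp", "FromPort": "22", "ToPort": "22", "CidrIp": "10.0.0.0/16" }
            ]
          }
        }
      },
      "Outputs": {
        "SubnetId": { "Value": { "Ref": "Subnet" } },
        "SecurityGroupId": { "Value": { "Ref": "SecurityGroup" } }
      }
    }
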
Now let's take a look at a generic subnet:
https://github.com/krogebry/pentecost/blob/master/templates/core/subnet-primary.json

I'll eventually get around to cleaning this up so it's more abstract and versatile.  The idea here is that we create a generic pattern for how we expect all of our applications to function.  At the moment I haven't defined any specific ACLs or security groups, so traffic will not be able to flow from one network to the next.
There are two ways of approaching this:
  1. Create a custom subnet definition for each application stack which defines the ACLs and security groups.
  2. Define the same groups in the root template by using Fn::GetAtt to reach into the sub-stack and pull out the Output variables ( see the sketch below ).
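
For option 2, the root template can reach into a sub-stack's Outputs with Fn::GetAtt.  A minimal sketch, assuming the sub-stacks export a SecurityGroupId output like the one above ( "PrimarySubnet" here is a hypothetical sibling of OpsAutoSubnet ):

    "AppIngressFromOps": {
      "Type": "AWS::EC2::SecurityGroupIngress",
      "Properties": {
        "GroupId": { "Fn::GetAtt": [ "PrimarySubnet", "Outputs.SecurityGroupId" ] },
        "IpProtocol": "tcp",
        "FromPort": "9000",
        "ToPort": "9000",
        "SourceSecurityGroupId": { "Fn::GetAtt": [ "OpsAutoSubnet", "Outputs.SecurityGroupId" ] }
      }
    }
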
Either approach is fine; it's up to you to decide which method is going to be better for the long-term health of your organization.
In either case, CI/CD is still a valid possibility.
One final thought on this subject: Huge thanks to the CloudFormation team for being totally awesome!

Embedded CloudFormation: using stacks in a non-compiled context

First off, I want to define the two ways in which I've learned to create CloudFormation templates:

  1. Compiled: this method uses a client/server model that very much looks like the chef-client/chef-server model.  In this method small chunks of JSON are used to kick off a template build.  This template build process creates a large template file from many component parts.
  2. Embedded: this process uses no pre-compile magic, but instead uses a "root" template to kick off n-number of sub-stacks using the AWS::CloudFormation::Stack resource.

Now let's define a use case that can help better illuminate the situation.  I'm going to use my favorite project of all time for this: CloudRim!

CloudRim has a fairly standard layout:
  1. Contained in a VPC with 3 subnets:
    1. Ops ( 10.0.0.0/24 - us-east-1b ): This is where the chef server, proxy, and jenkins slaves live; this is also the "jump box," also known as the "threshold box."
    2. Primary ( 10.0.1.0/24 - us-east-1b ): This is where the primary, production application will run.
    3. Secondary ( 10.0.2.0/24 - us-east-1e ): This is where we put our HA backup for this region.
  2. The application itself is a node.js application that utilizes multiple instances, each instance on its own port; all ports are tied together with ASGs, ELBs, and finally Route53.
  3. The data storage layer is a sharded MongoDB database where the mongos bits run on the application servers.
  4. Everything is configured with chef.
When we use the compiled version we'll end up with a group of stacks that looks like this:
  1. VPC
  2. OpsAuto
  3. Application-A
  4. Application-B
  5. MongoDB-A
  6. MongoDB-B ( repl set )
When using the embedded approach we get a slightly different layout:
  1. Root
    1. VPC Subnet Ops
    2. VPC Subnet Primary
    3. VPC Subnet Secondary
    4. Application-A
    5. Application-B
    6. MongoDB-A
    7. MongoDB-B ( repl set )
They look basically the same; however, one of the advantages of the embedded approach is that everything can be removed by deleting the Root stack.  Obviously that could be a bad thing as well, depending on the use case.

Some people might be tempted to state that the repl set should be handled in a different stack, and I'm inclined to agree with that statement.  However, for the purpose of this document we'll stick with this layout so we can keep everything together in one logical construct.

The embedded approach is very compelling, however, there is one aspect of this that happens to be a significant drawback: troubleshooting.

The embedded approach is what's known as a "derived" approach, which is to say that you have to think of every single element as an abstract idea.  This is very difficult for those who are just starting out with the technology, as it forces you to visualize how things are going to play out without ever seeing the final, compiled version of the template.  Troubleshooting in this environment is very difficult, even for a seasoned veteran like myself.

In contrast, the compiled approach is specifically targeted at avoiding this problem.  The compiled approach is much easier to debug and maintain, and supports a much faster iteration cycle.  Templates are used to make things faster and more agile; the drawback to this speed is that it requires a compile layer ( client / server ) in order to actually work.

Either solution can be integrated into a CI/CD system, and the two have roughly equal pro/con lists.  Choosing which method you go with really comes down to how much work you're willing to do in the long run.

The embedded version requires a deep knowledge of AWS and specifically how CloudFormation works.  However, the compiled version requires running a client/server service that looks and acts much like chef.

If your team(s) are familiar with AWS and have a basic understanding of CF, then the embedded approach does make more sense.  However, for larger, more complex scenarios where many business groups are going to be using this system, a more modular approach with the compiled method could save time and headache in the long run.

Ping me for code snippets.

Saturday, November 9, 2013

More CF Automation Awesomeness

Everything just got condensed to this:

    "DragonGroup": {
      "Name": "HTC::OpsAuto::Plugin::ELBServiceGroup",
      "Type": "HTC::Plugin",
      "Properties": {
        "PortStart": 9000,
        "NumberOfPorts": 3,
        "DNSName": "cloudrim",
        "Weights": [ "40","30","30" ],
        "SecurityGroupName": "LoadBalancerSecurityGroup"
      }
    },

and this:

    "NumServicePorts": {
      "Type": "String",
      "Description": "Number of service instance ports"
    }


and this:

    "ConfigASG": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "Tags": [{ "Key": "DeploymentName", "Value": { "Ref": "DeploymentName" }, "PropagateAtLaunch": "true" }],
        "MinSize": { "Ref": "Min" },
        "MaxSize": { "Ref": "Max" },
        "DesiredCapacity": { "Ref": "DesiredCapacity" },
        "AvailabilityZones": [{ "Ref": "AvailabilityZone" }],
        "VPCZoneIdentifier": [{ "Ref": "PrivateSubnetId" }],
        "LoadBalancerNames": { "HTC::GetVar": "elb_names" },
        "LaunchConfigurationName": { "Ref": "ServiceLaunchConfig" }
      }
    }

I wrote the "HTC::GetVar" chunk so I can create variable data in plugins and pass them back to the stack.  Best part about this is that the variables persist on the object!  Thanks Mongo!

Running single-threaded applications like a boss using CloudFormation and Chef.

NodeJS + CF + Chef = Unicorn Awesomeness!!

The problem statement is this: how do I make use of a single-threaded application in the cloud?  This is great practice for technologies like redis, where there is a specific focus on accomplishing a very direct, targeted objective in a "weird" kind of way.  Redis is also single-threaded, but its focus is on really, really fast indexing of structured data.  In the case of Node it makes sense to run many instances of the runtime on a single compute resource, thus using more of the system's resources to get more work done.

I'm using a formula of n+1, where n is the number of ECPUs ( an m1.large instance in AWS has 2 ECPUs, so we run 3 instances ).  The goal here is to find an automated path to running the cloudrim-api on many ports on a single compute instance, then load balancing the incoming requests across an array of ELBs, each ELB attached to a separate port but the same group of nodes, while maintaining that all traffic coming into any given ELB is always on port 9000.

High-level summary

This is creating several AWS objects:

  • A security group for the load balancer which says "only allow traffic in on port 9000".
  • One Elastic Load Balancer per instance port ( three in this case ), each of which listens on port 9000 but directs traffic to its own port on the target ( 9000, 9001, 9002 ).
  • Each ELB has its own DNS Alias record.
  • A regional DNS entry, created so that Route53 builds a weighted record set with one entry per ELB.
The magic is in the last step: what we end up with is an endpoint, cloudrim.us-east-1.opsautohtc.net, which points at a weighted set of ELBs.  Each ELB points to a specific port on a collection of nodes; the collection of nodes is the same for each ELB, only the ports differ.

This allows us to run many instances of a software package in a dynamic way ( more ECPUs will grow most of the system automatically ).  I'm combining this with runit for extra durability; runit ensures that the process is always running, so if the service crashes, runit will automatically create a new instance almost immediately.


CloudFormation bits:

    "LoadBalancerSecurityGroup": {
      "Type": "AWS::EC2::SecurityGroup",
      "Properties": {
        "VpcId": { "Ref": "VpcId" },
        "GroupDescription": "Enable HTTP access on port 9080",
        "SecurityGroupIngress": [
          { "IpProtocol": "tcp", "FromPort": "9000", "ToPort": "9000", "CidrIp": "0.0.0.0/0" }
        ],
        "SecurityGroupEgress": [ ]
      }
    },
    "ElasticLoadBalancer0": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "Subnets": [{ "Ref": "PrivateSubnetId" }],
        "Scheme": "internal",
        "Listeners": [{ "LoadBalancerPort": "9000", "InstancePort": "9000", "Protocol": "TCP" }],
        "HealthCheck": {
          "Target": "TCP:9000",
          "Timeout": "2",
          "Interval": "20",
          "HealthyThreshold": "3",
          "UnhealthyThreshold": "5"
        },
        "SecurityGroups": [{ "Ref": "LoadBalancerSecurityGroup" }]
      }
    },
    "ElasticLoadBalancer1": {
      "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties": {
        "Subnets": [{ "Ref": "PrivateSubnetId" }],
        "Scheme": "internal",
        "Listeners": [{ "LoadBalancerPort": "9000", "InstancePort": "9001", "Protocol": "TCP" }],
        "HealthCheck": {
          "Target": "TCP:9001",
          "Timeout": "2",
          "Interval": "20",
          "HealthyThreshold": "3",
          "UnhealthyThreshold": "5"
        },
        "SecurityGroups": [{ "Ref": "LoadBalancerSecurityGroup" }]
      }
    },
    "DNSEntry0": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": { "Fn::Join": [ "", [{ "Ref": "DomainName" },"."]]},
        "Comment": "DNS CName for the master redis ELB",
        "RecordSets": [{
          "Name": { "Fn::Join": [ "", [
            "cloudrim0.",
            { "Ref": "StackName" }, ".",
            { "Ref": "DeploymentName" }, ".",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }, "."
          ]]},
          "Type": "A",
          "AliasTarget": {
            "DNSName": { "Fn::GetAtt": [ "ElasticLoadBalancer0", "DNSName" ] },
            "HostedZoneId": { "Fn::GetAtt": [ "ElasticLoadBalancer0", "CanonicalHostedZoneNameID" ] }
          }
        }]
      }
    },
    "DNSEntry1": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": { "Fn::Join": [ "", [{ "Ref": "DomainName" },"."]]},
        "Comment": "DNS CName for the master redis ELB",
        "RecordSets": [{
          "Name": { "Fn::Join": [ "", [
            "cloudrim1.",
            { "Ref": "StackName" }, ".",
            { "Ref": "DeploymentName" }, ".",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }, "."
          ]]},
          "Type": "A",
          "AliasTarget": {
            "DNSName": { "Fn::GetAtt": [ "ElasticLoadBalancer1", "DNSName" ] },
            "HostedZoneId": { "Fn::GetAtt": [ "ElasticLoadBalancer1", "CanonicalHostedZoneNameID" ] }
          }
        }]
      }
    },
    "RegionDNSEntry": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": { "Fn::Join": [ "", [{ "Ref": "DomainName" },"."]]},
        "Comment": "DNS CName for the master redis ELB",
        "RecordSets": [{
          "Name": { "Fn::Join": [ "", [
            "cloudrim.",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }, "."
          ]]},
          "Type": "CNAME",
          "TTL": "900",
          "SetIdentifier": "Array0",
          "Weight": "30",
          "ResourceRecords": [{ "Fn::Join": [ "", [
            "cloudrim0.",
            { "Ref": "StackName" }, ".",
            { "Ref": "DeploymentName" }, ".",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }
          ]]}]
        },{
          "Name": { "Fn::Join": [ "", [
            "cloudrim.",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }, "."
          ]]},
          "Type": "CNAME",
          "TTL": "900",
          "SetIdentifier": "Array1",
          "Weight": "40",
          "ResourceRecords": [{ "Fn::Join": [ "", [
            "cloudrim1.",
            { "Ref": "StackName" }, ".",
            { "Ref": "DeploymentName" }, ".",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }
          ]]}]
        },{
          "Name": { "Fn::Join": [ "", [
            "cloudrim.",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }, "."
          ]]},
          "Type": "CNAME",
          "TTL": "900",
          "SetIdentifier": "Array2",
          "Weight": "30",
          "ResourceRecords": [{ "Fn::Join": [ "", [
            "cloudrim2.",
            { "Ref": "StackName" }, ".",
            { "Ref": "DeploymentName" }, ".",
            { "Ref": "AWS::Region" }, ".",
            { "Ref": "DomainName" }
          ]]}]
        }]
      }
    },
    "ConfigASG": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "Tags": [{ "Key": "DeploymentName", "Value": { "Ref": "DeploymentName" }, "PropagateAtLaunch": "true" }],
        "MinSize": { "Ref": "Min" },
        "MaxSize": { "Ref": "Max" },
        "DesiredCapacity": { "Ref": "DesiredCapacity" },
        "AvailabilityZones": [{ "Ref": "AvailabilityZone" }],
        "VPCZoneIdentifier": [{ "Ref": "PrivateSubnetId" }],
        "LoadBalancerNames": [
          { "Ref": "ElasticLoadBalancer0" },
          { "Ref": "ElasticLoadBalancer1" },
          { "Ref": "ElasticLoadBalancer2" }
        ],
        "LaunchConfigurationName": { "Ref": "ServiceLaunchConfig" }
      }
    }

And now the chef bits:

# Which services should run on this node; defaults to just the API.
services = node["htc"]["services"] || [ "api" ]
Chef::Log.info( "SERVICES: %s" % services.inspect )
# The n+1 formula: one service instance per processor, plus one extra.
num_procs = `cat /proc/cpuinfo |grep processor|wc -l`.to_i + 1
Chef::Log.info( "NumProcs: %i" % num_procs )
num_procs.times do |i|
  Chef::Log.info( "Port: %i" % (9000+i) )
  # One runit service per instance, each listening on its own port.
  runit_service "cloudrim-api%i" % i do
    action [ :enable, :start ]
    options({ :port => (9000+i) })
    template_name "cloudrim-api"
    log_template_name "cloudrim-api"
  end
end

Template:

cat templates/default/sv-cloudrim-api-run.erb
#!/bin/sh
exec 2>&1
PORT=<%= @options[:port] %> exec chpst -uec2-user node /home/ec2-user/cloudrim/node.js

Monday, November 4, 2013

Chef and Jenkins

I create chef bits that create jenkins build jobs.  This way I can "create the thing that creates the things."  I can reset the state of things by running rm -rf *-config.xml jobs/* on the jenkins server, restarting jenkins, then running chef-client; everything gets put back together automatically.  This also allows me to change the jobs in real time, but then reset everything when I'm done.

recipes/jenkins-builder.rb

pipelines = []
dashboards = []

region_name = "us-east-1"
domain_name = "opsautohtc.net"
deployment_name = "Ogashi"
pipelines.push({
  :name => "Launch %s" % deployment_name,
  :num_builds => 3,
  :description => "

Prism

  • Cloud: AWS
  • Region: us-east-1
  • Account: HTC CS DEV
  • Owner: Bryan Kroger ( bryan_kroger@htc.com )
",
  :refresh_freq => 3,
  :first_job_name => "CloudFormation.%s.%s.%s" % [deployment_name, region_name, domain_name],
  :build_view_title => "Launch %s" % deployment_name
})
cloudrim_battle_theater deployment_name do
  action :jenkins_cloud_formation
  az_name "us-east-1b"
  git_url "git@gitlab.dev.sea1.csh.tc:operations/deployments.git"
  region_name region_name
  domain_name domain_name
end
cloudrim_battle_theater deployment_name do
  action :jenkins_exec_helpers
  az_name "us-east-1b"
  git_url "git@gitlab.dev.sea1.csh.tc:operations/deployments.git"
  region_name region_name
  domain_name domain_name
end
dashboards.push({ :name => deployment_name, :region_name => "us-east-1" })

template "/var/lib/jenkins/jenkins-data/config.xml" do
  owner "jenkins"
  group "jenkins"
  source "jenkins/config.xml.erb"
  #notifies :reload, "service[jenkins]"
  variables({ :pipelines => pipelines, :dashboards => dashboards })
end



cloudrim/providers/battle_theater.rb:

def action_jenkins_exec_helpers()
  region_name = new_resource.region_name
  domain_name = new_resource.domain_name
  deployment_name = new_resource.name

  proxy_url = "http://ops.prism.%s.int.%s:3128" % [region_name, domain_name]

  proxy = "HTTP_PROXY='%s' http_proxy='%s' HTTPS_PROXY='%s' https_proxy='%s'" % [proxy_url, proxy_url, proxy_url, proxy_url]

  job_name = "ExecHelper.chef-client.%s.%s.%s" % [deployment_name, region_name, domain_name]
  job_config = ::File.join(node[:jenkins][:server][:data_dir], "#{job_name}-config.xml")
  jenkins_job job_name do
    action :nothing
    config job_config
  end
  template job_config do
    owner "jenkins"
    group "jenkins"
    source "jenkins/htc_pssh_cmd.xml.erb"
    cookbook "opsauto"
    variables({
      #:cmd => "%s sudo chef-client -j /etc/chef/dna.json" % proxy, 
      :cmd => "%s sudo chef-client" % proxy,
      :hostname => "ops.prism",
      :domain_name => domain_name,
      :region_name => region_name,
      :deployment_name => deployment_name
    })
    notifies :update, resources(:jenkins_job => job_name), :immediately
  end

 [...]

end


Sunday, November 3, 2013

Great run...

We had a great first run of the BattleTheater today.  Release 0.0.9 fixes a ton of bugs relating to the engines.

I have found an absolutely brilliant way of running the nodejs services.  I have this in my services.rb recipe:

runit_service "cloudrim-kaiju" do
  log false
  if(services.include?( "kaiju" ))
    action [ :enable, :start ]
  else
    action [ :disable, :down, :stop ]
  end
end

I can run a variable number of nodejs services based on the contents of the node["htc"]["services"] array.
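
The attribute data itself is tiny; something like this in a role or dna.json would do it ( hypothetical values ):

{ "htc": { "services": [ "api", "kaiju" ] } }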

These runit services are templated, but basically I can run a service that does this:

node /home/ec2-user/cloudrim/engines/kaiju.js

And run the Kaiju Engine in its own process.  Runit takes care of making sure the service is always running, and I don't have to mess with init scripts!!

Thank you #opscode !

How do I CloudRim?

Do something like this:

curl -XPOST -H "Content-Type: application/ajax" -d '{"email":"your.email@blah.com"}' Tanaki-cl-ElasticL-10851AT1WINID-1417527178.us-east-1.elb.amazonaws.com:9000/hq/battle

You'll get this back:

{"kaiju":{"hp":90.71948742261156,"xp":0,"str":7,"name":"Knifehead92","level":14,"armor":7,"speed":7,"bonuses":{"ice":0,"fire":0,"earth":0,"water":6},"category":3,"actor_type":"kaiju","_id":"5276b4d902b6008c0e00007d"},"jaeger":{"level":33,"_id":"5276b4d94a2ad64d0e00007a","hp":100,"name":"","weapons":{"left":{"ice":0,"fire":1,"earth":9,"water":0},"right":{"ice":5,"fire":9,"earth":1,"water":0}}}}


Friday, November 1, 2013

CloudRim

GitRepo

CloudRim is a little game I came up with to help people understand high-volume load and scaling techniques in AWS.

The idea here is that I launch a "Battle Theater" into an AWS region, let's say us-east-1 for example.  The BT is basically a MongoDB rig with automatic sharding in place.  Once everything is up I will announce "The Kaiju have landed in USEast!!" and the battle will begin.

Most games are centered around the idea of keeping people off the server and doing as much local caching as possible.  This is not that, by a good long shot.  The idea of this game is to pound the absolute shit out of the server as hard and as fast as you can.  In fact, simply using ab or a script with some threading junk won't be enough.  Even running jmeter on a single box won't be enough.

The goal of this game is to destroy all Kaiju in existence, but the Rift keeps spawning new Kaiju at a fairly high rate.  The Jaegers are also spawned at a similar rate.

There are two winning scenarios in the game: either the Kaiju win and you suck, or the Apocalypse is canceled and the heroes win.

Every time you do battle with the server you are awarded a random number of XP between 1 and 10, so in order to be at the top of the charts you have to do as many battles as possible.

Now here's the catch: this entire game is built around the idea that great automation can do amazing things.  In order to prove this, the BT is only deployed for an hour.  This means that you have less than an hour to spin up your nodes and attack the server as hard and as fast as you possibly can.  Obviously this is going to require coordination on your part.

At the end of the game the scores are tallied up, and the person at the top of the score chart is given the medal of honor or something.

The idea here is to give people a fun, engaging way of learning how automation works and why it's important.  It's also a great way to prove just how unbelievably awesome AWS really is.