Wednesday, December 15, 2010

Chef and encrypted data bags.

As part of rolling out Chef at the new gig, we had a choice - stand up our own Chef server and maintain it or use the Opscode platform. From a cost perspective, the 50 node platform cost was pretty much break even with standing up another EC2 instance of our own. The upshot was that I didn't have to maintain it.

 

However, part of due diligence was making sure everything was covered from a security perspective. We use quite a few hosted/SaaS tools, but this one carried the biggest potential security risk: dealing with sensitive data such as database passwords and AWS credentials. The Opscode platform as a whole is secure. It makes heavy use of SSL, not only for transport layer encryption but also for authentication and authorization. That wasn't a concern. What was a concern was what would happen if a copy of our CouchDB database fell into the wrong hands or a "site reliability engineer" situation happened. That's where the concept of "encrypted data bags" came from for me.

 

Atlanta Chef Hack Day

I had the awesome opportunity to stop by the Atlanta Chef Hack day this past weekend. I couldn't stay long and came in fairly late in the afternoon. However I happened to come in right at the time that @botchagalupe (John Willis) and @schisamo (Seth Chisamore) brought up encrypted data bags. Of course, Willis proceeded to turn around and put me on the spot. After explaining the above use case, we all threw out some ideas but I think everyone came to the conclusion that it's a tough nut to crack with a shitload of gotchas.

 

Before I left, I got a chance to talk with @sfalcon (Seth Falcon) about his ideas. He totally understood the use cases and mentioned that other people had asked about it as well; he had a few ideas, but nothing stood out as the best way.

 

So what are the options? I'm going to list a few here, but first I want to discuss a little bit about the security domain we're dealing with and what inherent holes exist.

 

Reality Checks

  • Nothing is totally secure.

          Deal with it. Even though it's a remote chance in hell, your keys and/or data are going to be decrypted somewhere at some point in time. The type of information we need to read, unfortunately, can't be protected with a one-way hash like MD5 or SHA because we NEED to know what the data actually is. I need that MySQL password to provide to my application server so it can talk to the database. That means it has to be decrypted, and while that's happening and while the data is in use, it's going to exist somewhere it can be snagged.

  • You don't need to encrypt everything

          You need to understand what exactly needs to be encrypted and why. Yes, there's the "200k winter coats to troops" scenario, and every bit of information you expose provides additional material for an attack vector, but really think about what you need to encrypt. Application database account usernames? Probably not. The passwords for those accounts? Yes. Weigh the "value" of the data you're considering encrypting.

  • Don't forget the "human" factor

          So you've got this amazing library worked out, added it to your cookbooks and you're only encrypting what you need to really encrypt. Then some idiot puts the decryption key on the wiki or the master password is 5 alphabetical characters. As we often said when I was a kid, "Smooth move, Ex-lax."

  • There might be another way

          There might be another way to approach the issue. Make sure you've looked at all the options.

 

Our Use Case

So understanding that, we can narrow down our focus a bit. Let's use the case of our application's database password because it's simple enough: it's a single string.

 

Now in a perfect world, Opscode would encrypt each CouchDB database with customer-specific credentials (say, an organizational-level client cert) and discard the credentials once you've downloaded them.

 

That's our first gotcha - What happens when the customer loses the key? All that data is now lost to the world. 

 

But let's assume you were smart and kept a backup copy of the key in a secure location. There's another gotcha inherent in the platform itself - Chef Solr. If that entire database is encrypted, then unless Opscode HAS the key, they can't index the data with Solr, and all those handy searches you're using in your recipes to pull in all your users are gone. Now you'll have to manage the map/reduce views yourself and deal with the performance impact wherever you don't have one of those views in place.

 

So that option is out. The Chef server has to be able to see the data to actually work.

 

What about a master key? That has several problems.

 

You have to store the key somewhere accessible to the client (i.e. in the client's chef.rb or in an external file that your recipes can read to decrypt those data bag items). That raises two questions:

  • How do you distribute the master key to the clients?
  • How do you revoke the master key from the clients and how does that affect future runs? See the previous line - how do you then distribute the updated key?

 

I'm sure someone just said "I'll put it in a data bag" and then promptly smacked themselves in the head. Chicken - meet Egg. Or is it the other way around?

 

You could have the Chef client ASK you for the key (remember Apache SSL startups where the startup script required a password? Yeah, that sucked).

 

 

Going the Master Key Route

So let's assume that we want to go this route and use a master key. We know we can't store it with Opscode because that defeats the purpose. We need a way to distribute the master key to the clients so they can decrypt the data - so how do we do it?

 

If you're using Amazon, you might say "I'll store it in S3 or on an EBS volume". That's great! Where do you store the AWS credentials? "In a data ba...oh wait. I've seen this movie before, haven't I?"

 

So we've come to the conclusion that we must store the master key somewhere ourselves, locally available to the client. Depending on your platform, you have a few options:

  • Make it part of the base AMI
  • Make it part of your kickstart script
  • Make it part of your vmware image

 

All of those are acceptable but they don't deal with updating/revocation. Creating new AMIs is a pain in the ass and you have to update all your scripts with new AMI ids when you do that. Golden images are never golden. Do you really want to rekick a box just to update the key?

 

Now we realize we have to make it dynamic. You could make it a part of a startup script in the AMI, first boot of the image or the like. Essentially, "when you startup, go here and grab this key". Of course now you've got to maintain a server to distribute the information and you probably want two of them just to be safe, right? Now we're spreading our key around again.

 

This is starting to look like an antipattern.

 

But let's just say we got ALL of that worked out. We have a simple easy way for clients to get and maintain the key. It works and your data is stored "securely" and you feel comfortable with it.

 

Then your master key gets compromised. No problem, you think. I'll just use my handy update mechanism to update the keys on all the clients and...shit...now I've got to re-encrypt EVERYTHING and re-upload my data bags. Where the hell is the plaintext of those passwords again? This is getting complicated, no?

 

So what's the answer? Is there one? Obviously, if you were that hypersensitive to the security implications you'd just run your own server anyway. You still have the human factor and backups can still be stolen but that's an issue outside of Chef as a tool. You just move the security up the stack a bit. You've got to secure the Chef server itself. But can you still use the Opscode platform? I think so. With careful deliberation and structure, you can reach a happy point that allows you to still automate your infrastructure with Chef (or some other tool) and host the data off-site.

 

Some options

Certmaster

 

Certmaster spun out of the Func project. It's essentially an SSL certificate server at the base. It's another thing you have to manage but it can handle all the revocation and distribution issues.

Riak

 

This is one idea I came up with tonight. The idea is that you run a very small Riak instance on all the nodes that require the ability to decrypt the data. Every node is a part of the same cluster and this can all be easily managed with Chef. It will probably have a single bucket containing the master key. You get the fault tolerance built in and you can pull the keys as part of your recipe using basic Chef resources. Resource utilization on the box should be VERY low for the Erlang processes. You'll have a bit more network chatter as the intra-cluster gossip goes on though. Revocation is still an issue but that's VERY easily managed since it's a simple HTTP PUT to update. And while the data is easily accessible to anyone who can get access to the box, you should consider yourself "proper f'cked" if that happens anyway.
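
To make that concrete, here's a rough sketch of what pulling the master key from the local Riak node and decrypting a data bag value might look like inside a recipe. The bucket and key names, the aes-256-cbc cipher and the assumption that each value is stored as base64(iv + ciphertext) are illustrative choices on my part, not an existing library:

# Hedged sketch: fetch the master key from the local Riak node over HTTP and
# use it to decrypt a value stored (base64 of iv + ciphertext) in a data bag item.
require 'net/http'
require 'openssl'
require 'base64'

master_key = Net::HTTP.get(URI.parse('http://127.0.0.1:8098/riak/secrets/master_key'))

raw = Base64.decode64(data_bag_item('passwords', 'mysql')['app_password'])
iv, ciphertext = raw[0, 16], raw[16..-1]

cipher = OpenSSL::Cipher.new('aes-256-cbc')
cipher.decrypt
cipher.key = OpenSSL::Digest::SHA256.digest(master_key)
cipher.iv  = iv
mysql_password = cipher.update(ciphertext) + cipher.final

template '/etc/myapp/database.yml' do
  source 'database.yml.erb'
  variables(:password => mysql_password)
  mode '0600'
end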

 

But you still have the issue of re-encrypting the data bags should that need to happen. My best suggestion is to store the encrypted values in a single data bag and add a rake task that does the encryption/revocation for you. Then you minimize the impact of something that simply should not need to happen that often.
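
As a sketch of that idea (pairing with the decryption sketch above), the rake task might keep the plaintext in a local YAML file that never leaves your workstation, encrypt each value with the current master key and then upload the result with knife. The file names, task name and knife invocation here are assumptions:

# Rakefile sketch - assumes plaintext secrets in secrets/plaintext.yml and the
# master key in secrets/master.key, both kept out of version control.
require 'json'
require 'yaml'
require 'base64'
require 'openssl'

desc 'Re-encrypt all secrets with the current master key and upload the data bag'
task :encrypt_secrets do
  key    = OpenSSL::Digest::SHA256.digest(File.read('secrets/master.key'))
  values = YAML.load_file('secrets/plaintext.yml')

  item = values.inject({ 'id' => 'secrets' }) do |bag, (name, plaintext)|
    cipher = OpenSSL::Cipher.new('aes-256-cbc')
    cipher.encrypt
    cipher.key = key
    iv = cipher.random_iv
    bag[name] = Base64.encode64(iv + cipher.update(plaintext) + cipher.final)
    bag
  end

  File.open('data_bags/passwords/secrets.json', 'w') { |f| f.write(JSON.pretty_generate(item)) }
  sh 'knife data bag from file passwords data_bags/passwords/secrets.json'
end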

 

Another option is to still use Riak but store the credentials themselves (as opposed to a decryption key) and pull them in when the client runs. The concern I have there is how that affects idempotence - would it cause the recipe to run every single time just because it can't checksum properly? You can probably get around this with a file on the filesystem telling Chef to skip the update using "not_if".
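
Something like this, roughly - the Riak URL and file paths are illustrative:

# Hedged sketch: write the DB password pulled from Riak to a local file and
# skip the work on subsequent runs if that file already exists.
ruby_block 'fetch_db_credentials' do
  block do
    require 'net/http'
    password = Net::HTTP.get(URI.parse('http://127.0.0.1:8098/riak/credentials/mysql'))
    ::File.open('/etc/myapp/db_password', 'w', 0600) { |f| f.write(password) }
  end
  not_if { ::File.exist?('/etc/myapp/db_password') }
end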

 

Wrap Up

 

As you can see, there's no silver bullet here. Right now I have two needs: storing credentials for S3/EBS access and storing database passwords. That's it. We don't use passwords for user accounts at all. You can't even use password authentication with SSH on our servers. If I don't have your pubkey in the users data bag, you can't log in.

 

The AWS credentials are slowly becoming less of an issue. With the Identity and Access Management (IAM) beta product, I can create limited-use keys that can only do certain things and grant them access to specific AWS products. I can make it a part of node creation to generate that access programmatically. That means I still have the database credentials issue though. For that, I'm thinking that the startup script for an appserver, for instance, will just have to pull the credentials from Riak (or whatever central location you choose) and update a JNDI string. It spreads your configuration data out a bit, but these things shouldn't need to change too often and with a properly documented process you know exactly how to update it.
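
A sketch of what that startup wrapper might look like - the Riak URL, config path and placeholder token are all assumptions for illustration:

#!/usr/bin/env ruby
# Hedged sketch of an appserver startup helper: pull the DB password from the
# central store and substitute it into the JNDI datasource config before the
# app server starts.
require 'net/http'

password = Net::HTTP.get(URI.parse('http://127.0.0.1:8098/riak/credentials/mysql'))

config = File.read('/opt/tomcat/conf/context.xml')
File.open('/opt/tomcat/conf/context.xml', 'w') do |f|
  f.write(config.gsub('@@DB_PASSWORD@@', password))
end

exec '/opt/tomcat/bin/catalina.sh', 'run'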

 

One side effect of all this is that it begins to break down the ability to FULLY automate everything. I don't like running the knife command to do things. I want to be able to programmatically run the same thing that Knife does from my own scripts. I suppose I could simply popen and run the knife commands but shelling out always feels like an anti-pattern to me.

 

I'd love some feedback on how other people are addressing the same issues!

 

Thursday, December 2, 2010

Automating EBS Snapshot validation with @fog - Part 2

This is part 2 in a series of posts I'm doing - You can read part 1 here

Getting started

I'm not going to go into too much detail on how to get started with Fog. There's plenty of documentation on the github repo (protip: read the test cases) and Wesley a.k.a @geemus has done some awesome screencasts. I'm going to assume at this point that you've at least got Fog installed, have an AWS account set up and have Fog talking to it. The best way to verify is to create your .fog yaml file, start the fog command line tool and start looking at some of the collections available to you.

For the purpose of this series of posts, I've actually created a small script that you can use to spin up two ec2 instances (m1.small) running CentOS 5.5, create four (4) 5GB EBS volumes and attach them to the first instance. In addition to the fog gem, I also have awesome_print installed and use it in place of prettyprint. This is, of course, optional but you should be aware.

WARNING: The stuff I'm about to show you will cost you money. I tried to stick to minimal resource usage but please be aware you need to clean up after yourself. If, at any time, you feel like you can't follow along with the code or something isn't working - terminate your instances/volumes/resources using the control panel or command-line tools. PLEASE DO NOT JUST SIMPLY RUN THESE SCRIPTS WITHOUT UNDERSTANDING THEM.

The setup script

The full setup script is available as gist on github - https://gist.github.com/724912#file_fog_ebs_demo_setup.rb

Things to note:

  • Change the key_name to a valid key pair you have registered with EC2
  • There's a stopping point halfway down after the EBS volumes are created. You should actually stop there and read the comments.
  • You can run everything inside of an irb session if you like.

The first part of the setup script does some basic work for you - it reads in your fog configuration file (~/.fog) and creates an object you can work with (AWS). As I mentioned earlier, we're creating two servers - hdb and tdb. HDB is the master server - say, your production MySQL database. TDB is the box that will be used to validate the snapshots.

In the Fog world, there are two big concepts - models and collections. Regardless of cloud provider, there are typically at least two models available - Compute and Storage. Collections are data objects under a given model. For instance, in the AWS world, under the Compute model you might have servers, volumes, snapshots or addresses. One thing that's nice about Fog is that, once you establish your connection to your given cloud, most of your interactions are the same across cloud providers. In the example above, I've created a connection with Amazon using my credentials and have used that Compute connection to create two new servers - hdb and tdb. Notice the options I pass in when I instantiate those servers.

  • image_id
  • key_name
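
Condensed, that part of the setup script looks roughly like this (I'm using the newer Fog::Compute.new spelling here, and the AMI id and key pair name are placeholders - the gist linked above is the authoritative version):

require 'fog'
require 'yaml'

# Build the AWS Compute connection from the credentials in ~/.fog
credentials = YAML.load_file(File.expand_path('~/.fog'))[:default]
AWS = Fog::Compute.new(
  :provider              => 'AWS',
  :aws_access_key_id     => credentials[:aws_access_key_id],
  :aws_secret_access_key => credentials[:aws_secret_access_key]
)

# hdb is the "production" MySQL box, tdb is the snapshot-validation box
hdb = AWS.servers.create(:image_id => 'ami-xxxxxxxx', :key_name => 'my-keypair')
tdb = AWS.servers.create(:image_id => 'ami-xxxxxxxx', :key_name => 'my-keypair')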

If I wanted to make these boxes bigger, I might also pass in 'flavor_id'. If you're running the above code in an irb session, you'll see the server objects dumped with all of their attributes when you instantiate them. Not all of the fields may be populated depending on how long it takes Amazon to spin up the instance. For instance, when you first create 'tdb', you'll probably see "state" as pending for quite some time. Fog has a nice helper method on all models called 'wait_for'. In my case I could do:

tdb.wait_for { print "."; ready?}

And it would print dots across the screen until the instance is ready for me to log in. At the end, it will tell you the amount of time you spent waiting. Very handy. You have direct access to all of those attributes via the instance objects 'tdb' and 'hdb'. You can use 'tdb.dns_name', for example, to get the DNS name for use in other parts of your script. In my case, after the server 'hdb' is up and running, I now want to create the four 5GB EBS volumes and attach them to the instance:
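
Roughly, that loop looks like this (the device naming and exact attribute spellings may differ slightly between Fog versions, so treat it as a sketch):

# Create four 5GB volumes in hdb's availability zone and attach them
%w(sdi sdj sdk sdl).each do |dev|
  volume = AWS.volumes.new(
    :availability_zone => hdb.availability_zone,
    :device            => "/dev/#{dev}",
    :size              => 5
  )
  volume.server = hdb   # binding the volume to the server...
  volume.save           # ...and saving is what actually creates and attaches it
end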

I've provided four device names (sdi through sdl) and I'm using the "volumes" collection to create them (AWS.volumes.new). As I mentioned earlier, all of the attributes for 'hdb' and 'tdb' are accessible by name. In this case, I have to create my volumes in the same availability zone as the hdb instance. Since I didn't specify where to create it when I started it, Amazon has graciously chosen 'us-east-1d' for me. As you can see, I can easily access that as 'hdb.availability_zone' and pass it to the volume creation section. I've also specified that the volume should be 5GB in size.

At the point where I've instantiated the volume with '.new', it hasn't actually been created yet. I want to bind it to a server first, so I simply set the volume.server attribute equal to my server object. Then I 'save' it. If I were to log into my running instance, I'd probably see something like this in the 'dmesg' output now:

sdj: unknown partition table
sdk: unknown partition table
sdl: unknown partition table
sdi: unknown partition table

As you can see from the comments in the full file, you should stop at this point and set up the volumes on your instance. In my case, I used mdadm and created a RAID0 array using those four volumes. I then formatted the array, made a directory and mounted the md0 device to that directory. If you look, you should now have an additional 20GB of free space mounted on /data. Here I might make this the data directory for MySQL (which is the case in our production environment). Let's just pretend you've done all that. I simulated it with a few text files and a quick 1GB dd. We'll consider that the point-in-time that we want to snapshot from. Since there's no actual constant data stream going to the volumes, I can assume for this exercise that we've just locked MySQL, flushed everything and frozen the XFS filesystem. Let's make our snapshots. In this case I'm going to be using Fog to do the snapshots but in our real environment we're using the ec2-consistent-snapshot script from Alestic. First let's take a look at the state of the hdb object:

Notice that the 'block_device_mapping' attribute now consists of an array of hashes. Each hash is a subset of the data about the volume attached to it. If you aren't seeing this, you might have to run 'hdb.reload' to refresh the state of the object. To create our snapshots, we're going to iterate over the block_device_mapping attribute and use the 'snapshots' collection to make those snapshots:
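
Something along these lines (the block_device_mapping hash keys here are an assumption - poke at one of the hashes in irb to confirm the spelling in your Fog version):

hdb.reload  # make sure block_device_mapping reflects the attached volumes

snapshots = hdb.block_device_mapping.map do |mapping|
  snapshot = AWS.snapshots.new
  snapshot.volume_id   = mapping['volumeId']
  snapshot.description = "hdb #{mapping['deviceName']} #{Time.now.strftime('%Y-%m-%d')}"
  snapshot.save
  snapshot
end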

One thing you'll notice is that I'm being fairly explicit here. I could shorthand and chain many of these method calls but for clarity, I'm not.

And now we have 4 snapshots available to us. The process is fairly instant but sometimes it can lag. As always, you should check the status via the .state attribute of an object to verify that it's ready for the next step.

That's the end of Part 2. In the next part, we'll have a full fledged script that does the work of making the snapshots usable on the 'tdb' instance.

Automating EBS Snapshot validation with @fog - Part 1

Background

One thing that's very exciting about the new company is that I'm getting to use quite a bit of Ruby and also the fact that we're entirely hosted on Amazon Web Services. We currently leverage EBS, ELB, EC2, S3 and CloudFront for our environment. The last time I used AWS in a professional setting, they didn't even have Elastic IPs, much less EBS with snapshots and all the nice stuff that makes it viable for a production environment. I did, however, manage to keep abreast of changes using my own personal AWS account.

Fog

Of course the combination of Ruby and AWS really means one thing - Fog. And lots of it.

When EngineYard announced the sponsorship of the project, I dove headlong into the code base and spent what time I could trying to contribute code back. The half-assed GoGrid code in there right now? Sadly, some of it is mine. Time is hard to come by these days. Regardless, I'm no stranger to Fog, and when I had to dive into the environment and start getting it documented and automated, Fog was the first tool I pulled out. When the challenge of verifying our EBS snapshots (of which we currently have a little over 700) came up, I had no choice but to automate it.

Environment

A little bit about the environment:

  • A total of 9 EBS volumes are snapshotted each day
  • 8 of the EBS volumes are actually RAID0 MySQL data stores across two DB servers (so 4 disks on one/4 disks on another)
  • The remaining EBS volume is a single MySQL data volume
  • Filesystem is XFS and backups are done using the Alestic ec2-consistent-snapshot script (which currently doesn't support tags)

The end result of this is to establish a rolling set of validated snapshots. 7 daily, 3 weekly, 2 monthly. Fun!

Mapping It Out

Here was the attack plan I came up with:

  • Identify snapshots and groupings where appropriate (raid0, remember?)
  • create volumes from snapshots (see the sketch after this list)
  • create an m1.xlarge EC2 instance to test the snapshots
  • attach volume groups to the test instance
  • assemble the array on the test instance
  • start MySQL using the snapshotted data directory
  • run some validation queries using some timestamp columns in our schema
  • stop MySQL, unmount volume, stop the array
  • detach and destroy the volumes from the test instance
  • tag the snapshots as "verified"
  • roll off any old snapshots based on retention policy
  • automate all of the above!
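
To give a flavor of the "create volumes from snapshots" step, here's a rough sketch using the same Fog volume model shown in Part 2 above - the device name and size are assumptions:

# Turn a verified snapshot back into a volume and attach it to the test instance
volume = AWS.volumes.new(
  :snapshot_id       => snapshot.id,
  :availability_zone => tdb.availability_zone,
  :device            => '/dev/sdi',
  :size              => 5
)
volume.server = tdb
volume.save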

I've got lots of code samples and screenshots so I'm breaking this up into multiple posts. Hopefully part 2 will be up some time tomorrow.

Tuesday, November 9, 2010

Fix it or Kick It and the ten minute maxim

One of the things I brought up in my presentation to the Atlanta DevOps group was the concept of "Payment". One of the arguments that people like to trot out when you suggest an operational shift is that "We can't afford to change right now". My argument is that you CAN'T afford NOT to change. It's going to cost you more in the long run. The problem is that in many situations, the cost is detached from the original event.

Take testing. Let's assume you don't make unit testing an enforced part of your development cycle. There are tons of reasons people do this but much of it revolves around time. We don't have time to write tests. We don't have time to wait for tests to run. We've heard them all. Sure you get lucky. Maybe things go out the door with no discernible bugs. But what happens 3 weeks down the road when the same bug that you solved 6 weeks ago crops up again? It's hard to measure the cost when it's so far removed from the origination.

Configuration management is the same way. I'm not going to lie. Configuration management is a pain in the ass especially if you didn't make it a core concept from inception. You have to think about your infrastructure a bit. You'll have to duplicate work initially (i.e. templating config files). It's not easy but it pays off in the long run. However as with so many things, the cost is detached from the original purchase.

Fix it?

Walk with me into my imagination. A scary place where a server has started to misbehave. What's your initial thought? What's the first thing you do? You've seen this movie and done this interview:

  • log on to the box
  • perform troubleshooting
  • think
  • perform troubleshooting
  • call vendor support (if it's an option)
  • update trouble ticket system
  • wait
  • troubleshoot
  • run vendor diag tools

What's the cost of all that work? What's the cost of that downtime? Let's be generous. Let's assume this is a physical server and you paid for 24x7x4 hardware support and a big old RHEL subscription. How much time would you spend on each task? What's the turn around time to getting that server back into production?

Let's say that the problem was resolved WITHOUT needing replacement hardware but came in at the four hour mark. That's three hours that the server was costing you money instead of making you money. Assuming a standard SA salary of $75k/year in Georgia (roughly $36/hour), those four hours of one admin's time work out to about $150. That's just doing a base salary conversion, not calculating all the other overhead associated with staffing an employee. What if that person consulted with someone else during that time, a coworker at the same rate, for two of those hours? Now you're at roughly $225. Not too bad, right? Still a tangible cost. Maybe one you're willing to eat.

But let's assume the end result was to wipe and reinstall. Let's say it takes another hour to get back to operational status. Whoops. Forgot to make that tweak to Apache that we made a few weeks ago. Let's spend an hour troubleshooting that.

But we're just talking manpower at this point. This doesn't even take into account end-user productivity, loss of customers from degraded performance or any number of other issues. God forbid that someone misses something that causes problems to other parts of the environment (like not setting the clock and inserting invalid timestamps into the database or something. Forget that you shouldn't let your app server handle timestamps). Now there's cleanup. All told, your people spent 5 hours to get this server back into production while you've been running in a degraded state. What does that mean when our LOB is financial services and we have an SLA with attached penalties? I'm going to go easy on you and let you off with $10k per hour of degraded performance.

Get ready to credit someone $50k or, worse, cut a physical check.

Kick it!

Now I'm sure everyone is thinking about things like having enough capacity to maintain your SLA even with the loss of one or two nodes but be honest. How many companies actually let you do that? Companies will cut corners. They roll the dice or worse have a misunderstanding of HA versus capacity planning.

What you should have done from the start was kick the box. By kicking the box, I mean performing the equivalent of a kickstart or jumpstart. You should, at ANY time, be able to reinstall a box with no user interaction (other than the action of kicking it) and return it to service in 10 minutes. I'll give you 15 minutes for good measure and bad cabling. My RHEL/CentOS kickstarts are done in 6 minutes on my home network and most of that time is the physical hardware power cycling. With virtualization you don't even have a discernible bootup time.

Unit testing for servers

I'll go even farther. You should be wiping at least one of your core components every two weeks. Yes. Wiping. It should be a part of your deploy process in fact. You should be absolutely sure that, should you ever need to reinstall under duress, you can get that server back into service in an acceptable amount of time. Screw the yearly DR tests. I'm giving you a world where you can perform a DR test every couple of weeks as a matter of standard operation. All it takes is a little bit of up front planning.

The 10 minute maxim

I have a general rule. Anything that has to be done in ten minutes can be afforded twenty minutes to think it through. Obviously, it's a general rule. The guy holding the gun might not give you twenty minutes. And twenty minutes isn't a hard number. The point is that nothing is generally so critical that it has to be SOLVED that instant. You can spend a little more time up front to do things right or you can spend a boatload of time on the backside trying to fix it.

Given the above scenario, you would think I'm being hypocritical or throwing out my own rule. I'm not. The above scenario should have never happened. This is a solved problem. You should have spent 20 minutes actually putting the config file you just changed into puppet instead of making undocumented ad-hoc changes. You should have spent an hour when bringing up the environment to stand up a CM tool instead of just installing the servers and doing everything manually. That's the 10 minute maxim. Take a little extra time now or take a lot of time later.

You decide how much you're willing to spend.

Monday, November 8, 2010

Transitions

I haven't had a chance to mention this but those of you who I'm connected with on LinkedIn are aware that I'm starting with a new company on Wednesday. I'm taking a few days to get some house work done and then diving in. I don't like switching companies in general but I'm really excited about this opportunity. In addition to having almost a blank slate, I'm working with a much smaller team and a chance to contribute back to the community. It's also a chance for me to work in the Atlanta startup scene; something I've been hoping to do for a few years now.

So what about the previous company? Well they're looking to back fill my position. Please feel free to contact me if you're interested. I can put you in touch with the right people. Fair warning, it's a challenging place to work. They'll tell you the same thing. I've blogged about working at a "traditional" company before right here so you can go back and glean information from that.

Tuesday, November 2, 2010

Using Hudson and RVM for Ruby unit testing

As with everything lately, something popped up on Twitter that prompted a blog post. In this case, @wakaleo was looking for any stories/examples for his Hudson book. I casually mentioned I could throw in some notes about how we use Hudson on the Padrino project.

Prerequisites

Here's what you'll need:

  • A working Hudson install (able to run shell build steps)
  • RVM installed under the Hudson user
  • The Ruby VMs you want to test against

I'll leave you to get Hudson working. There are prebuilt packages for every distro under the sun. If you can't get past this step, you'll need to rethink a few things.

Setting up RVM

Once you have it installed, log in as your Hudson user and set up RVM.

RVM Protip - If there are any gems (like say Bundler) that you ALWAYS install, edit .rvm/gemsets/default.gems and .rvm/gemsets/global.gems and add them there. In my examples, I did not do that.

You'll want to go ahead and install all the VMs you plan on testing against. We use 1.8.7, 1.9.1, 1.9.2, JRuby, RBX and REE:

for i in 1.8.7 1.9.1 1.9.2 jruby ree rbx; do rvm install ${i}; done

This will take a while. When it's done, we can dive into configuring our job in Hudson.

What is the Matrix?

So you've got Hudson running and RVM all set up? Open the Hudson console and create a new job of type "Build multi-configuration project". From the job configuration screen, you'll want to set some basics - repository, scm polling and the like. The key to RVM comes under "Configuration Matrix"

 

The way any user-defined variables work in Hudson, whether a build parameter or matrix configuration, is that you provide a "key" and then a value for that key. The value for that key is accessible to your build steps as a sigil variable. So if your key is my_funky_keyname_here, you can reference $my_funky_keyname_here in your build steps to get that value. With a configuration matrix, each permutation of the matrix provides the value for that key in the given permutation. So if I have:

foo as one axis with 6 values (1, 2, 3, 4, 5, 6) and bar with 3 values (1, 2, 3)

each combination of foo and bar will be available to my build steps as $foo and $bar. The first run will have $foo as 1 and $bar as 1. Second run will have $foo as 2 and $bar as 1. On and on until the combinations are exhausted.

This makes for some REALLY powerful testing matrices. In our case, however, we only need one axis - rubyvm

Hudson Protip - Don't get creative with your axis or parameter names. In our case, we'll be performing shell script steps. Don't call your axis "HOME" because that will just confuse things. Just don't do it.

So now we've added an axis called 'rubyvm' and provided it with values '1.8.7 1.9.1 1.9.2 jruby rbx ree'. As explained, this means that our build steps will iterate over each value of 'rubyvm' for us and repeat our build steps.

Configuring your job

Now that you've got your variables in place, you can write the steps for your job. This took me a little bit of time to work out the best flow. There were some things with how RVM operates with the shell that caught me off-guard initially (the rvm command being a function alias versus an executable). I've broken the test job into three steps:

  • Create my gemset, install bundler and run bundle install/bundle check
  • Run my unit tests
  • Destroy my gemset

In addition to taking advantage of the variable provided by the configuration matrix, we're also going to take advantage of some variables exposed by Hudson in a given job run - $BUILD_NUMBER. Using these two bits of information, we can build a gemset name for RVM that is unique to that run and that ruby vm.

Step 1:

#!/bin/bash -l
rvm use $rubyvm@padrino-$rubyvm-$BUILD_NUMBER --create
gem install bundler
bundle install
bundle check

This uses the --create option of RVM to create our gemset. If our build number is 99 and our ruby vm is ree, we're creating a gemset called padrino-ree-99 for ree. Pretty straightforward.

Next we install bundler and then run the basic bundler tasks. All operations are performed in the workspace for your hudson project. This is typically the root directory of your SCM repository. If the root of your repo doesn't contain your Gemfile and Rakefile, you'll probably want to make your first step a 'cd' to that directory.

The reason for using a full shebang line is to make sure that RVM instantiates properly.

Step 2:

#!/bin/bash -l
rvm use $rubyvm@padrino-$rubyvm-$BUILD_NUMBER
rake test

Each build step is a distinct shell session. For that reason we need to "use" the previously created gemset. Then we run our rake tasks.

Step 3:

#!/bin/bash -l
rvm use $rubyvm@global
rvm --force gemset delete padrino-$rubyvm-$BUILD_NUMBER

This is the "cleanup" step. This cleans up our temporary gemsets that we created for the test run. My understanding was the each step was "independent". Should the middle step fail, the final step would still be executed. This doesn't appear to be the case anymore. For this reason, you'll probably want to occasionally go in and clean up gemsets from failed builds. If your build passes, the gemset will clean itself up. There's probably justification for some sort of "cleanup" job here but I haven't gotten around to trying to pass variables as artifacts to other build steps.

Now you can run the job and watch as Hudson gleefully executes your test cases against each ruby vm. How many of those run concurrently is dependent on how many workers you have configured globally in Hudson.

Unit Testing Protip - One thing you'll find out early on is how concurrent your unit tests REALLY are. In the case of Padrino, ALL of our unit tests were using a hardcoded path (/tmp/sample_project) for testing. My first major step once I got added to the project was to refactor ALL of our tests to make that dynamic so that we could run more than one permutation at a time. You can see an example of how I did that here. Essentially I created an instance variable for our temp directory using UUID.new.generate. It was the quickest way to resolve the problem. If your tests aren't capable of running in parallel, that's one way to address it.
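
The gist of that change looks roughly like this (the class and directory names are illustrative, not the actual Padrino test code):

require 'uuid'
require 'fileutils'
require 'test/unit'

class SampleProjectTest < Test::Unit::TestCase
  def setup
    # unique per-run temp dir instead of a hardcoded /tmp/sample_project
    @apptmp = "/tmp/#{UUID.new.generate}"
    FileUtils.mkdir_p(@apptmp)
  end

  def teardown
    FileUtils.rm_rf(@apptmp)
  end

  def test_generates_project_in_its_own_directory
    assert File.directory?(@apptmp)
  end
end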

One thing to be aware of: if you have intensive unit tests and your hudson server isn't very powerful, you simply may not have the capacity to run multiple tests at the same time. I had to spin up some worker VMs on other machines around the house to serve as Hudson slave nodes. Our unit tests were actually taking LONGER when we tried to run them in parallel because of the strain of compiling native extension gems and actually running the tests.

Optional profit! step

Code coverage is important. However it makes NO sense to run code coverage tasks on EVERY VM permutation. You only need to run it once (unless you have some VM-dependent code in your application). What I've done is take advantage of "Post build actions" to kick off a second job I've defined. This job does nothing but run our code coverage rake tasks. Steps 1 and 3 are the same as above without the rubyvm variable. Step 2 is different:

#!/bin/bash -l
rvm use 1.8.7@padrino-rcov-$rubyvm-$BUILD_NUMBER
bundle exec rake hudson:coverage:clean
bundle exec rake hudson:coverage:unit

We've broken the coverage tests into a unique rake task so they don't impact normal testing. This creates a code coverage report that's visible in Hudson under that project's page. Currently we don't run the coverage report job unless the primary job finishes.

Wrap up

That's pretty much it in a nutshell. I'm looking to move Hudson to a more powerful VM here at the house as soon as the hardware comes in. I should be able to then run all the tests across all VMs at one time. Screenshots for each of the steps described in this post are available here

 

Thursday, October 28, 2010

Designed for Developers - Why people keep asking you to use Github

I'll be the first to admit that I'm a Github fanboy. The shocker is that my love of Github has nothing to do with the DVCS underneath. While Git plays a major part in what makes Github so great, the bigger reason Github is so successful is this:

Github is designed for developers

What do I mean by that? Let's compare a series of screenshots from various code hosting sites:

 

Code Hosting Solutions comparison

I want you to take a look at the screenshots very carefully, especially the "project" pages. What's the one thing you notice about Github compared to the others (excluding BitBucket)? What's the focus of the project?

It's all about the code

You'll see quite clearly that, of all the sites, only Github (and BitBucket) make the code itself the focus of the project. Not only is the focus of the project the code but everything about the code is about the community. I can "watch" a developer or project. I can easily see from the first page how to download the codebase. However the biggest part of what makes Github a success is one button:

Fork

From the start of a project page, not only can I easily browse the code and get the information I need to check out the code, but I'm also invited, with a single button, to become a contributor to that project. Immediately, I'm a potential contributor. If I change something and push the code back to my fork, I can push one button and send a message to the project maintainers asking them to merge the changes back in. As a project maintainer, I have an easy way to evaluate the impact of the change and communicate with the requester and other team members about said change. At the bottom of the pull request page, I'm provided the information on how to easily merge those changes into my main tree.

Designed for Developers

I've been on a bit of a tear lately about usability in developer-targeted products. The latest target of my ire has been Atlassian. Let me clarify that I think Atlassian makes some wonderful products. Confluence is one of the best wikis out there. JIRA is a great issue tracking system for Developers.

However, Atlassian has some "duds" in my opinion. The biggest thorn in my side these days is Bamboo. Bamboo is Atlassian's Continuous Integration server. Like most Atlassian products, its primary target is Java developers. Everything about Bamboo is designed around the Java development toolchain - Maven, Ant and the like. But I don't have a problem with that. What I have a problem with is the over-complication. I grabbed the latest beta of Bamboo at the recommendation of one of the Bamboo developers who heard my rant on Twitter one day. He asked for some feedback and I provided it in a very detailed email. I'm happy to say that the new interface for adding build plans in Bamboo is much simpler than previous versions. I can't do screenshots of our company Bamboo install but previous versions had a VERY complicated multi-tab build plan configuration.

One point I mentioned in my email is that Bamboo felt like it lacked a focus. Jira was very clearly about Issues. That was the "unit of work". Confluence was very clearly about being a wiki. That was its "unit of work". Bamboo didn't have a singular focus. It was a CI server but what was the unit of work? A build plan? Test results? Fisheye integration? It wasn't clear.

Compare that with Hudson which had a very clear focus. The strength in Hudson is that it performs tasks. Those tasks are typically centered around CI but they don't have to be. In Hudson I can define a job that does nothing more than list directories. I don't even need to back it with a VCS. Bamboo, sadly, in the beta version still hasn't gotten this part right. I can't define a build plan without having a repository somewhere. It still assumes that I want to define all my work inside of an ant script. Using the "shell" builder is still VERY limiting. 

You can see some sample comparison shots between the two here. I'll try to actually setup a repo that Bamboo can use and do a deeper comparison later. 

So what's the focus of Google Code, Launchpad...

Going back to code hosting and comparing Github to the others, I think it's clear that they lack a focus. They try to do too much. They "feel" like they were designed by project managers and targeted at them. Maybe it was a faulty assumption that to effectively manage a large project, you had to have all of the extra stuff. I don't know. Launchpad and others DO some things better than Github. Issue tracking is one. Github issue tracking is a pretty weak area for them. However here's where Github understands its focus and strengths.

Where Github lacks, it makes up for in integration. Github doesn't TRY to be the project manager's tool. It doesn't try to be a good issue tracker. What it DOES do is say "I suck at this. My focus is on the code and making working with and contributing to the code dead simple. I'll add hooks for the other stuff"

And they do. Github has a boatload of service hooks for everything from issue tracking to project management to irc and IM. They even have a "generic" hook that will submit JSON to a url for you so you can write your own receiver.

About BitBucket, backend technology and focus

I haven't mentioned much about BitBucket. The main reason is that at this point, BitBucket is simply attempting to copy features from Github, except using Mercurial in the background. Sadly, this isn't enough I think. If my only reason for using BitBucket is the DVCS tool then I honestly might as well use Github. I'll get more engagement there. See this quote from Mark Philips from Basho about why they moved from BitBucket to Github:

Why? There are several reasons, the primary of which is that GitHub, the application, lends itself to more collaboration when developing open source software. Again, this was a decision made on the basis of community development; technically-speaking we were satisfied with what Bitbucket offered.

The issue wasn't the technology. Mercurial and Git are pretty much at feature parity (as is Bazaar). One thing Mercurial doesn't do out of the box is cherry-picking, but it's supported with extra configuration. Mercurial has hg incoming, which lets you see what people are working on. Git has staging. Mercurial has better Windows support than Git. It's really six of one, half a dozen of the other.

However what BitBucket DOESN'T have is the community. You see, BitBucket was playing catchup to Github. Simply copying the social aspects of Github isn't enough. Github has too much momentum precisely because they had the focus right from the start - code is king.

As a developer, my key focus is my code. It's what says the most about me. As a developer who wants to attract other developers, the best way to do that is showing the code and making that contribution as easy as possible. Github gets that.

That's why people keep asking you to switch to Github.

Friday, October 22, 2010

Potato Candy - A family recipe

With Halloween right around the corner and Thanksgiving beyond that, it's getting about the time of year when I get to make Potato Candy. Yes, candy made from potatoes.

I don't know the real story behind it. Ever since I was a little pile of baby fat, it's something the kids in my family have eaten. My uncle only made it for Thanksgiving and I think Christmas get-togethers. I've tried to find a bit of history about it over the years but nothing ever concrete. My uncle's family is Irish so that's as stereotypical of a reason as any. What I did seem to track down is that it's pretty unique to the Southeast. We do things weird here, ya'll.

Not long after I married my wife (a Michigan native), her aunt was putting together a family cookbook. Now that I was part of the family I got to contribute a few things. I had my mom and step-mom provide a few entries but I reserved one for myself - Potato Candy. Since the "secret" is out and because freaking @jtimberman got me thinking about candy, I figured I'd add it here for all my interweb friends.

Ingredients

  • 1 Potato about the size of your fist. Seriously. Don't get it any bigger. If you've got big hands, find someone with normal sized hands and compare.
  • 1 jar of peanut butter. Creamy not Crunchy. The last thing you want to deal with when making this stuff is nuts. Trust me.
  • 2 bags of powdered sugar. Yes, you will probably use ALL of it.
  • Wax paper and plenty of counter top space

Peel and boil the potato as you would to make mashed potatoes. When it gets sufficiently soft, mash that bastard up. No lumps. Again, trust me. As smooth as you can get it.

Dump it into a large mixing bowl and reach for the strongest and sturdiest spoon/stirring instrument you can find. Start folding in the first bag of powdered sugar.

This is where it gets fun. As the powdered sugar gets mixed in, this thing is going to get thick and heavy very quickly. It's going to be VERY hard to mix. Did you trust me on the sturdy spoon part? You should have. Don't even think about putting this in an electric mixer. It will burn out the motor. I've literally broken 1/4 inch dowel wooden spoons in this stuff. Your arm is going to hurt. You're going to have to put your back into it.

When you physically can't mix it ANYMORE put it aside for a minute. Spread out a nice sized area on the counter with wax paper and cover it in powdered sugar. This crap is sticky and you're going to need to manipulate it. Once you've gotten the workspace ready, start spreading the "mash" on the wax paper. Usually about 1/4 to 1/2 inch thick is good. You'll probably screw it up the first time around. I did.

Open the jar of peanut butter and start spreading it on the mash. Peaks are okay but you really want to get a good layer on there.

Now, the hard part

Somehow you're going to need to roll from one end of this beast to the other. Like a jelly roll. It's really hard and don't feel too bad if it isn't pretty. The end result is still good. You'll probably want to cover your hands in powdered sugar.

Once you've got it rolled up, flatten it back out. Stick it in the fridge overnight. The next day, cut it into smallish 1.5inch slices and enjoy.

As I said earlier, I've tried to do some research each year. The best picture I can find outside of making some myself is this one.

You probably won't be able to eat more than one or two pieces. It's REALLY rich and really thick. If you give it to kids, do it early in the day so they have time to burn it off.

Enjoy ya'll!

Thursday, October 21, 2010

PyCon DevOps piggy back

So I had a random idea the other night and like any other random idea I immediately sent it to Twitter.

This of course brought feedback which is the whole point, right?

The idea was to have a Velocity style conference in the South East. We all know my love for Atlanta and my half-disdain/half-jealousy of the West coast. So I threw the idea out on twitter and immediately got my first reply from Joe Heck with a bit of reality thrown in:

@lusis nice idea. critical mass with either be easy or impossible to get. You might consider riffing on existing conferences ... PyCon2011

Awesome idea, so I headed off to read up on how PyCon does that kind of thing. I shot off an email to the pycon-organizers mailing list and got some really nice responses. I also got a few private tweets from people on the list as well.

The end result is this. If I want to hitchhike on the back of PyCon for a devops-related conference, here are the requirements/suggestions:

  • Involve Python in some way
  • Will need to take advantage of the Open Spaces system

This essentially means unless I (or someone else) is giving a full blown talk on Python and DevOps, it will be an ad-hoc thing. We can't reserve the spaces until the day of the conference. I'm also not sure how big the spaces are. I think this is the same place LISA was held years ago so you might be able to snag a dividable room segment?

So what does everyone think? I'm considering giving a talk on the state of devops toolchains in Python (func, cobbler, fabric, kokki, overmind, whatever else) but I don't know that I'm ready for that yet after a single LUG presentation ;)

I know that Mitchell H. of Vagrant fame was considering heading into town for it. Vagrant isn't just for Rubyists ;)

I'm open to ideas. I'd love to just have the conference I sent the tweet about but when I really think about it, I don't think I can pull something like that off in this amount of time.

Many thanks to the pycon-organizers folks for the input - Doug Hellmann, Vern Ceder and Jesse Noller. Also to Dean Goodmanson for his feedback via Twitter.

Tuesday, October 12, 2010

Latest Vogeler update - MongoDB, protobufs, Riak and war!

I wanted to take a minute to post an update about Vogeler to those who are following the project. Let's get the easy stuff out of the way - it's not abandoned. Far from it.

There have been several reasons why I haven't made any commits lately, not the least of which is that both kids have been sick recently and I haven't been able to get a good solid block of time to work on it.

Technical Hurdles

Another reason is that I almost went down a rabbit hole with regards to swappable persistence. In the process of refactoring the persistence backend, I realized it should be fairly easy, using the model I put into place, to go ahead and implement MongoDB and Riak support. I started with MongoDB and promptly hit a wall: MongoDB does not allow dots in key names. When I ran into that issue, I realized that I had made some dangerous assumptions based on the fact that I started with Couchdbkit as the interface to CouchDB:

I was using an ORM when I should have used a lower level driver. You see couchdbkit does some nice stuff like translating native Python datatypes to the appropriate datatypes. If I define a row as having DictProperty(), couchdbkit converts that into the commensurate CouchDB JSON datatypes. If I use ListProperty(), the same thing. This is really evidenced in Futon and makes using Futon as your interface to Vogeler very appealing. However this is VERY couchdb specific.

The pymongo driver, however, didn't like my strategy of dumping execution results that way. You can see the "gist" of what I'm talking about here.


I brought the issue up on the MongoDB mailing list here. I opened an issue for myself to braindump my thoughts. One of my biggest goals (data transparency) was starting to fall apart for me. I decided to shelve MongoDB for a moment and look at Riak. I wanted to make sure that I at least thought about how a generic model would work across multiple document stores. That's when I ran into the biggest cockblock:

protobuf

I'm almost firmly convinced that protobuf is a piece of trash. Google has some smart people but protobuf is something that quite obviously came out of the mind of someone who was sent off to "solve the RPC problem". There are quite a few issues I have with protobuf:

  • Despite being a "universal" format, it works well in exactly TWO languages - C++ and Java. Everything else is an afterthought. Don't get me started on Python support. The one guy at Google who supports protobuf on Python can't make it work on anything but Python 2.5 because that's all Google uses. He's unwilling to cut a new PyPi package just to fix all the 2.5 assumptions because he doesn't want to bump the version number. You can't even install it on anything higher than 2.5 without hacking setup.py.
  • You have to precompile your protos before use. I understand what Google is trying to accomplish but seriously? So I have to build the protobuf compiler to compile protos to ship with my code. There's a reason why people like FFI folks.

There are alternatives like Apache Avro that have promise but they also have their own issues. However, Basho has committed to using protobuf which does make sense. Write your own serialization framework or use an existing one? Easy answer when Google wrote one for you.

So I started to noodle out what route I wanted to take when something else came out of left field.

Sgt. First Class Lance Vogeler

I have a search setup in Tweetdeck on my Droid for Vogeler. It was nice to stay on top of people mentioning the project. The name for the project came out of me pretty much immersing myself in the latest S.M. Stirling Emberverse books. One of the characters was named Ingolf Vogeler. I really enjoyed the books and liked the name so I picked it. I'm also considering using Ritva for another project.

So one day my phone starts going nuts with Vogeler alerts. I was already getting the occasional history tweet about the real Ingolf Vogeler but it turns out a soldier from Georgia, of all places, was KIA in Afghanistan. He did 8 tours in Afghanistan and 4 tours in Iraq. Politics aside (I'm personally entirely against these campaigns), I didn't want to "pollute" the twitter stream. Regardless of what I think of the current military climate in my country, I have the utmost respect for most of the members in our armed forces.

However what struck me most is that SFC Vogeler left behind a wife. A wife carrying his unborn child. That pretty much did me in. As a father myself, I was pretty torn up thinking about this happening to my wife. Yes, it was a known risk but that doesn't make it any less sad. I decided, in addition to making a donation of my own to his family and holding off on Vogeler until it wasn't alerting on Twitter so much, that I would think hard about what my software means.

Steve Jobs asked a guy who emailed him this question: "What have you created lately?" Someone else recently said that entrepreneurs are busy creating the next social media app that means fuckall when they could be effecting change with the software they write. That got me thinking. Could I somehow help this family with my project that happened to share a name with them? The best I could come up with is this:

If you use Vogeler, are interested in it or just feel like a random act of kindness, please make a small donation to the Vogeler family. My wife and I agreed that should I ever make ANY money off of the project in any identifiable form that I would donate what I could to the family. Vogeler is just a small project. I have no grand aspirations of getting integrated into some mainstream project. I'm just trying to scratch an itch - a niche itch at that. One reason I'm so gung-ho about DevOps is that, as a family man, I don't WANT to be dealing with stupid shit taking time away from my family. I've done it and I'm done with it. If my phone goes off, it's not going to be from some stupid mistake that I made editing a config file or lack of metrics causing an "oh shit we're out of space" moment. I'm past that in my career and I'm past having to work places where that's the norm rather than the exception. My family is first and foremost and anything I can do to keep it that way, I'm going to do it.

So I'm taking the Vim approach to Vogeler. I'm not going to ask anyone to go against his conscience. If you feel like a small donation to this family implies consent to the stupidity of my government then I fully understand. But if you think that open source software and the broader open source community can make a difference in more than just writing software, throw a small donation their way.

I'm going to be starting back up on Vogeler now. I've decided that for now, I'm going to attempt to keep things as generic as possible but continue to code against CouchDB. I'll keep revisiting MongoDB and Riak support but the primary target is CouchDB. If any Basho folks are reading this, if you can remove the hard dep in setup.py on protobuf, that would be awesome. You can't even install from PyPi with it in there anyway. If the MongoDB folks take a gander at this, can you do something about the dots in key names? Thanks!

Friday, October 8, 2010

Why I hope BankSimple succeeds but fear they won't have a choice in the matter

My current political affiliation is not the most popular with people right now. That's understandable. I tend to approach things from a rational and unemotional perspective. Call me callous. Call me a dipshit. I'm consistent if nothing else.

But the issue that's been bugging me lately transcends party or political affiliation. It's one that touches me as a technologist, an IT worker and a citizen.

DevOps and the changing IT landscape

I've blogged/tweeted/talked/rambled enough about DevOps to make people sick. In some regards we risk crossing the line with our enthusiasm and killing interest in the topic. Ignore for a minute that much of what makes up "DevOps" is not new ground. That's not to say that the "movement" hasn't introduced some amazing things.

The point is that aspects of DevOps as a philosophy are changing how people approach IT holistically. Automation. Rapid iteration. Frequent changes. Breaking down organizational silos.

So how does this play into BankSimple?

It actually plays into it in two ways. The first is that, like DevOps, BankSimple isn't doing anything "new" per se. If you're talking "old" business, banking ranks right up there with prostitution and shares more in common with it than most would want to admit. What's different is how BankSimple is doing it. The BankSimple guys want to make money, no doubt, but they're approaching it from a different angle. It's been formed, not by the traditional cadre, but by "techies". It's almost "Agile Banking".

You can guarantee that because of that perspective and the history of the cofounders, the business and by extension the IT aspect will be run differently.

If you can't beat 'em, buy 'em.

I'm a hands-off kind of guy when it comes to my government. Many people thought that, based on concepts like "hope" and "change", the government was going to turn around and focus on its customers - the citizens. It hasn't yet turned out that way.

Someone asked an interesting question at the last DevOps meetup in Atlanta.

"Do you think traditional companies feel threatened by startups who use more agile methods to beat them to market?"

The answer:

"Not really. If they get to be too big of a threat, just buy them out.".

That, of course, assumes they're willing to sell. Some people might have that luxury, but when the VCs start to chime in, that choice may be stripped from you. On the flipside, some might believe that selling will destroy what they've worked so hard to build, yet still feel an obligation to the people who started with them to sell.

    But let's assume that, in the case of BankSimple, they have the goal of REALLY changing banking and finance (which I think they do). Assume they don't have to sell (which I don't know).

    If you can't buy 'em, buy a lobbyist

    I was talking to an individual a few weeks back about the idea that traditional companies would get eaten alive by leaner, more agile startups. We were talking specifically about the financial sector but his comment applies more broadly. So I, with a head full of steam, raving about agile operations and time to market, was hit head-on with this reality in one comment:

    "Yeah but do they have lobbyists?"

    Every company I've worked at, save one, has had lobbyists on the books. At one company we actually had different colored checks that we printed for lobbyists vice expense checks. Some people think lobbyists just "lobby" for things like deregulation and "leave my business alone" but that's not the case.

    Without going into too much detail, at one company we actually LIKED a certain amount of regulation. You see, a certain amount of regulation is just enough to restrict entry into the market by new players. We liked that each state had its own set of regulations - that in one state we were classified under one set of laws while in another state we were classified differently. This ARTIFICIALLY raises the barrier to entry for competitors. It keeps them small. It relegates them to one of two positions:

    • Small enough for us not to care because they can't afford to expand into new markets
    • Big enough to justify buying them to eliminate the competition.

    Sure, sometimes a company will succeed and grow big enough to be REAL competition but that's fine too. Helps prevent antitrust investigations. It's a win-win all around.

    Not all lobbyists are bad. I would argue that the ACLU is a "good" lobbyist. But lobbyists can also exist to help preserve an existing business, either by demanding hands-off or by demanding more regulation under the guise of "consumer protection".

    Sometimes the tree of innovation must be watered with the blood of failed business models

    I sent this out as a tweet earlier tonight. With no disrespect meant to the original quote that inspired it, I firmly believe this to be true. Without even discussing politics or regulation or monopolies, the fact of the matter is that to progress as a society some things have to die. Horse and buggy. Media distribution. Publishing. All of these changes help move us to greater things. Many a business is built on convenience and inefficiencies in the supply chain. FedEx and UPS would probably not have been needed had the post office not sucked so bad. In an age when information passed slowly, traditional news organizations made sense. BankSimple is part of that progression just as DevOps is part of that transition in IT.

    But I hope that Alex and the gang aren't as naive as some people (myself included) have been about DevOps. Just as DevOps working its way into traditional organizations will have to deal with the boogeymen of HIPAA, SOX and PCI DSS, so will BankSimple have to deal with an establishment that, if threatened, sadly has the power to essentially make them "illegal".

    We need them to succeed so we can move on to better and brighter things.

     

    Wednesday, October 6, 2010

    .plan

    TODO

     

     

     

     

    Wednesday, September 29, 2010

    Distributions and Dynamic Languages - A Manifesto

    Background

    There's been a lot of talk recently on Twitter and in various posts across the Intertubes about how the various distributions handle dynamic languages and the package system those languages use. This has been a sore spot for me for a LONG time. Recently I had a chance to "stub out" my feelings in a comment on HN. I've been meaning to write this post for a few weeks but just haven't had the time. I'm making the time now.

    Distro vendors find themselves in an interesting spot. In general, the difference between Linux distributions has boiled down to a few categories:

    • Support
    • Management tools
    • Package format
    • Default desktop

    For the home/desktop user, the last two (and more importantly the last one) are the biggest deciding factors. For the "enterprise" user, the first is typically key. But not all enterprises are enterprises. Would anyone argue that Facebook or Google or Twitter are not enterprise users? Of course not. However those companies don't tend to need the same level of support and have the same hang-ups as Coca-Cola or Home Depot. The latter two companies are the traditional enterprise that does things like troubleshoot servers when they fail. The former are the forward thinking companies that say "Fuck it. Pull the server and put another one in. We don't have time for this 'bench' shit."

    In the same vein, the first group of companies are the kind that use Linux as a platform whereas the second group uses RedHat or Suse as an OS to host JBoss or Oracle or DB2. Those vendors say "We run on distros X and Y and we only support those". You don't have a choice in the second group. The first group may have standardized on a distro but the distro itself is largely irrelevant. Those companies use Chef and Puppet and similar tools to totally abstract that out. The distro becomes a commodity. They just want Linux.

    This is the new type of company and this is the type of company that distro vendors have to worry about.

    So having said that, how does all of this tie into the dynamic language debacle of late? Increasingly, applications in the PaaS/SaaS space are being written in dynamic languages. The product is just different from Oracle or DB2. So these companies need to consider which distro will make using those dynamic languages as easy as possible. Frankly, they've all pretty much fucked it up. The main reason? Traditional software products.

    The biggest selling point of an enterprise distro was support. That, or the fact that you were required to run RedHat or Suse for your RAC cluster. One of the main reasons that enterprise distros were able to be supported platforms for Oracle or DB2 is that they "stabilized" things. In this case that meant long term support (LTS) models and a consistent base operating system. If you ported your product to run on RHEL4, you could guarantee that RedHat would never break compatibility for the life of that product support cycle (I think it's 7 years right now?). You could also be assured that version X of a package would be available for the platform should you need it.

    The Problem

    That worked fine for binary COTS products. Not so fine for the world of dynamic languages where new versions of a Gem or Python package come out daily. And ESPECIALLY not when the language package system allows for multiple versions of the same package to be installed alongside each other. But is this really a big deal? The distros can just upgrade Python to 2.7, right? Nope, and the reason why?

    Management tools

    I don't fault the distro vendors for using Python (as an example) as the higher level management language for the OS. In fact, having now gotten into Python, I think it's a wonderful idea. It is, language wars aside, a very approachable and consistent language. It allows them to quickly iterate on those tools and, especially in the case of Python, the core language changes very little. It's mature.

    So now distro vendors have gone and written core parts of the operating system to use Python. Combine that with the package manager restrictions and LTS and you have a system where, if you upgrade Python, you've broken the system beyond repair. This is why RHEL5 is still on Python 2.4.

    This is where we find ourselves today. Distro vendors have to continually package all the Python modules they want to supply in native package format against the version of the runtime they use. Eventually the module/gem maintainer is going to stop supporting that module on such old runtimes. Now they essentially have to maintain backports for the life of the LTS term. This is madness. Why would you put yourself in this situation? I didn't know this but FreeBSD evidently solved this problem a while ago by moving all core scripts away from Perl.

    The Manifesto

    So here's my manifesto. My suggestion, if you will, as a long time Linux user, enterprise customer and dynamic language programmer.

    Stop it. Get out of the game now. As much as you would like to think your customers care about LTS for Perl/Python/Ruby, they don't. Your LTS is irrelevant six months after you cut a new release of a distro. RHEL6 is shipping with Ruby 1.8.6. Seriously? Not even 1.8.7? I understand that distros have a long development cycle for new versions, which is exactly why I'm saying get out. You can't keep up.

    But what about our management tools?

    I've solved that for you too. system-python, system-ruby, system-perl. Isolate them. Treat them as you would /opt/python or /opt/ruby. Make them untouchable. Minimize your reliance on any module/gem/library you don't directly maintain (e.g. a gtk python module). Understand that you will be wasting resources backporting that module for 5 or 7 years. No more '/usr/bin/env python'. Shebang that bastard to something like '/usr/lib/system-python/bin/python'.
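
    To make that concrete, here's a minimal sketch of what an isolated management tool might look like. The /usr/lib/system-python path is just the illustrative layout from above, not any distro's real filesystem.

    #!/usr/lib/system-python/bin/python
    # Hypothetical shebang for a distro management tool under the isolation
    # scheme described above. The interpreter path is an assumption.
    import sys

    # Whatever the user installs as /usr/bin/python or in /opt has no effect
    # here; this tool always runs on the vendored, untouchable interpreter.
    print("running on: %s" % sys.executable)
    print("version: %s.%s.%s" % sys.version_info[:3])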

    So now that you've isolated that dependency, what about people who don't WANT to compile a new ruby or python vm? How do you provide value to them? The ActiveState model. /usr/lib/python27, /usr/lib/python31, /usr/lib/ruby187.

    But wasn't the point of this whole discussion around DLR package management? We don't want to maintain a package per vm version of some library.

    Then don't.

    This is where the onus is on the language writers. Your package format needs to FULLY support installing from a locally hosted repo of some kind. You may not believe it but not every server has internet access. At our company, NONE of the servers can get to the Internet. They still serve content TO the internet but can't get out. Not by proxy. Not at all.

    We're essentially forced to download python packages or jar files and copy them to a maven server or host them from apache to use them internally. Either that, or package them as RPMs. With the python packages, it's especially annoying because, while pip will happily pull from any apache-served directory of tarballs, we can't push from setup.py to it. We don't have ANY metadata associated with them at all.
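
    For what it's worth, the "dumb" half of this already works today. Here's a minimal sketch of serving a directory of sdist tarballs over HTTP so that pip's --no-index/--find-links options can install from it with no internet access. The /srv/packages path, hostname and port are made-up examples; the missing piece is everything this post complains about (metadata, setup.py upload).

    # Minimal sketch: serve a directory full of sdist tarballs so pip can
    # install from it offline. Python 2 stdlib only; paths are illustrative.
    import os
    import SimpleHTTPServer
    import SocketServer

    os.chdir('/srv/packages')   # directory of .tar.gz sdists you rsync'd in
    handler = SimpleHTTPServer.SimpleHTTPRequestHandler
    httpd = SocketServer.TCPServer(('0.0.0.0', 8080), handler)

    # Clients would point pip at it, e.g.:
    #   pip install --no-index --find-links=http://pkghost:8080/ somepackage
    httpd.serve_forever()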

    So Ruby/Python/Perl guys, you need to either provide a PyPi/Gem server package that operates in the same way as your public repos do or make those tools operate EXACTLY the same with a local file path as they do with a URL. Look at createrepo for RPMs for an idea of how it can work if you need to. Additionally, tools like RVM and virtualenv really need to work with distro vendors. RVM does a stellar job at this point. Virtualenv has a way to go.

    So now the distro vendors have things isolated. They ship said language repo server and by default point all the local language package tools to that repo path or server. Now if the user chooses to grab module X from PyPi to host locally, they've made that decision. It doesn't break the OS. You don't offer support for it unless you really want to and this whole fucking problem goes away.

    EDIT:

    I realize I'm not saying anything new here. I also realize that distro vendors realize that the distro itself is a commodity. RedHat figured that out a long time ago. Look at the JBoss purchase and everything since then. Additionally, virtualization removes any reason you might have for picking distro X over distro Y because of hardware support in the distro.

    Wednesday, September 22, 2010

    Hiring for #devops - a primer

    I've written about this previously as part of another post but I've had a few things on my mind recently about the topic and needed to do a brain dump.

    As I mentioned in that previous post, I'm currently with a company where devops is part of the title of our team. I won't go into the how and why again for that use case. What I want to talk about is why organizations are using DevOps as a title in both hiring and as an enumerated skillset.

    We know that what makes up DevOps isn't anything new. I tend to agree with what John Willis wrote on the Opscode blog about CAMS as what it means to him. The problem is that even with such a clear cut definition, companies are still struggling with how to hire people who approach Operations with a DevOps "slant". Damon Edwards says "You wouldn't hire an Agile" but I don't think that's the case at all. While the title might not have Agile in it, it's definitely an enumerated skill set. A quick search on Monster within a 10 mile radius of my house turned up 102 results with "Agile" in the description, such as:

    • experienced Project Manager with heavy Agile Scrum experience
    • Agile development methodologies 
    • Familiar with agile development techniques
    • Agile Scrum development team 

    Yes, it's something of a misuse of the word Agile in many situations but the fact of the matter is that when a company is looking for a specific type of person, they tend to list that as a skill or in the job description. Of course Agile development is something of a formal methodology whereas DevOps isn't really. I think that's why I like the term "Agile Operations" more in that regard. But in the end, you don't have your "Agile Development" team and so you really wouldn't have your "Agile Operations" team. You have development and you have operations.

    So what's a company to do? They want someone who "does that devops thing". How do they find that person? Some places are listing "tools like puppet, chef and cfengine" as part of skill sets. That goes a long way to helping job seekers key off of the mindset of an organization but what about the organization? How do they determine if the person actually takes the message of DevOps to heart? I think CAMS provides that framework.

    Culture and Sharing

    What kind of culture are you trying to foster? Is it one where Operations and Development are silos, or one that, as DevOps promotes, tears down the artificial barriers between the groups? Ask questions of potential employees that attempt to draw that out of them. Relevance to each role is in parentheses.

    • Should developers have access to production? Why or why not? (for Operations staff)
    • Should you have access to production? Why or why not? (for Development staff)
    • Describe a typical release workflow at a previous company. What were the gaps? Where did it fail? (Both)
    • Describe your optimal release workflow. (Both)
    • Have you ever been to a SCRUM? (Operations)
    • Have you ever had operations staff in a SCRUM? (Development)
    • At what point should your team start being involved/stop being involved in a product lifecycle? (Both)
    • What are the boundaries between Development and Operations? (Both)
    • Do you have any examples of documentation you've written? (Both)
    • What constitutes a deployable product? (Both)
    • Describe your process for troubleshooting an outage. What's the most important aspect of an outage? (Both)

    Automation and Metrics

    This is somewhat equivalent to a series of technical questions. The key is to deduce the thought process a person uses to approach a problem. Some of these aren't devops specific but have ties to it. Obviously these might be tailored to the specific environment you're hiring for.

    • Describe your process for troubleshooting an outage. What's the most important aspect of an outage? (Both)
    • Do you code at all? What languages? Any examples? Github repo? (Operations)
    • Do you code outside of work at all? Any examples? Github repo? (Development)
    • Using pseudo-code, describe a server. An environment. A deployable. (Operations)
    • How might you "unit test" a server? (Operations) - see the sketch after this list
    • Have you ever exposed application metrics to operations staff? How would you go about doing that? (Development)
    • What process would you use to recreate a server from bare metal to running in production? (Operations)
    • How would you automate a process that does X in your application? How do you expose that automation? (Development)
    • What does a Dashboard mean to you? (Both)
    • How would you go about automating production deploys? (Both)
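
    Since "unit test a server" throws some people, here's a hedged sketch of the kind of answer I'd be happy with - plain unittest asserting facts about a box, not any particular framework. The package name and port are assumptions for illustration.

    # A rough idea of "unit testing" a server with nothing but the stdlib.
    import socket
    import subprocess
    import unittest

    class TestWebServer(unittest.TestCase):
        def test_nginx_package_installed(self):
            # assumes an RPM-based box; swap in dpkg-query etc. as needed
            self.assertEqual(subprocess.call(['rpm', '-q', 'nginx']), 0)

        def test_port_80_listening(self):
            conn = socket.create_connection(('localhost', 80), timeout=2)
            conn.close()

    if __name__ == '__main__':
        unittest.main()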

    A few of these questions straddle both aspects. Some questions are "trick questions". I'm going to assume that these questions are also tailored to the specifics of your environment. I'm also assuming that basic vetting has been done.

    So what are some answers I like to hear vice ones I don't ever want to hear? Anything that sounds like an attitude of "pass the buck" is a red flag. I really like seeing an operations person who has some sort of code they've written. I also like the same from developers outside of work. I don't expect everyone to live, breathe and eat code but I've known too many people who ONLY code at work and have no interest in keeping abreast of new technologies. They might as well be driving a forklift as opposed to writing code.

    I think companies will benefit more from a "technologist" than someone who is only willing to put in 9to5 and never step outside of a predefined box of responsibilities. I'm not suggesting that someone forsake family life for the job. What I'm saying is that there are people who will drag your organization down because they have no aspirations or motivations to make things better. I love it when someone comes in the door and says "Hey I saw this cool project online and it might be useful around here". I love it from both developers and operations folks.

    Do with these what you will. I'd love to hear other examples that people might have.

    Sunday, September 12, 2010

    Follow up to #vogeler post

    Patrick Debois was kind enough to comment on my previous post and asked some very good questions. I thought they would fit better in a new post instead of a comment box so here it is:

    I read your post and I must say I'm puzzled on what you are actually achieving. Is this a CMDB in the traditional way? Or is it an autodiscover type of CMDB, that goes out to the different systems for information? In the project page you mention à la mcollective. Does this mean you are providing GUI for the collected information? Anyway, I'm sure you are working on something great. But for now, the end goal is not so clear to me. Enlighten me!

    Good question ;) I think it sits in an odd space at the moment because it tries to be flexible and by design could do all of those things. Mentioning Mcollective may have clouded the issue but it was more of a nod to similar architectural decisions - using a queue server to execute commands on multiple nodes.

    My original goal (outside of learning Python) was to address two key things. I mentioned these on the Github FAQ for Vogeler but it doesn't hurt to repost them here for this discussion: 

    • What need is Vogeler trying to fill?

    Well, I would consider it a "framework" for establishing a configuration management database. One problem that something like a CMDB can create is that, in trying to meet every individual need, it tends to over-complicate. One thing I really wanted to do was avoid forcing you into my model and instead provide ways for you to customize the application.

    I went the other way. Vogeler at the core, provides two things – a place to dump “information” about “things” and a method for getting that information in a scalable manner. By using a document database like CouchDB, you don’t have to worry about managing a schema. I don’t need to know what information is actually valuable to you. You know best what information you want to store. By using a message queue with some reasonable security precautions, you don’t have to deal with another listening daemon. You don’t have to worry about affecting the performance of your system because you’re opening 20 SSH connections to get information or running some statically linked off-the-shelf binary that leaks memory and eventually zombies (Why hello, SCOM agent!).

    In the end, you define what information you need, how to get it and how to interpret it. I just provide the framework to enable that.

    So to address the question:

    If we're being semantic, yes it's probably more of a configuration database than a configuration MANAGEMENT database. Autodiscovery, though not in the traditional sense, is indeed a feature. Install the client, stand up the server side parts and issue a facter command via the runner. You instantly have all the information that facter understands about your systems in CouchDB viewable via Futon. I could probably easily write something that scanned the network and installed the client but I have a general aversion to anything that sweeps networks that way. More than likely, you would install Vogeler when you kicked a new server and managed the "plugins" via puppet.

     

    I hope that makes sense. Vogeler is the framework that allows you to get whatever information about your systems you need, store it, keep that information up to date and interpret it however you want. That's one reason I'm not providing a web interface for reporting right now. I just simply don't know what information is valuable to you as an operations team. Tools like puppet, cfengine, chef and the like are great and I have no desire to replace them but you COULD use this to build that replacement. That's also why I use facter as an example plugin with the code. I don't want to rewrite facter. It just provides a good starting tool for getting some base data from all your systems.

    Let's try a use case:

    I need to know which systems have X rpm package installed.

    You could write an SSH script, hit each box and parse the results or you could have Vogeler tell you. Let's assume that the last run of "package inventory" was a week ago:

    vogeler-runner -c rpms -n all

    The architecture is already pretty clear. Runner pushes a message on the broadcast queue, all clients see it ('-n all' means all nodes online) and they in turn push the results into another queue. Server pops the messages and dumps them into the CouchDB document for each node. You could then load up Futon or a custom interface you wrote and load the CouchDB design doc that does the map reduce for that information. You have your answer.
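
    To make the "design doc" part less hand-wavy, here's a hedged sketch of what that view and query could look like using the couchdb-python package. The database name, the "rpms" field and the view names are assumptions for illustration, not Vogeler's actual schema.

    # Illustrative only: a CouchDB view mapping package name -> node id,
    # then the "which nodes have package X?" question asked against it.
    import couchdb

    db = couchdb.Server('http://localhost:5984')['vogeler']

    db['_design/packages'] = {
        'views': {
            'by_rpm': {
                'map': """
                    function(doc) {
                      if (doc.rpms) {
                        doc.rpms.forEach(function(pkg) {
                          emit(pkg, doc._id);   // package name -> node
                        });
                      }
                    }
                """
            }
        }
    }

    for row in db.view('packages/by_rpm', key='httpd-2.2.15-5.el6.x86_64'):
        print(row.value)   # each node that reported that rpm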

    Now let's try something a little more complicated:

    I need to know what JMX port all my JBoss instances are listening on in my network.

    Well, I don't provide a "plugin" for you to get that information, a key for you to store it under in CouchDB or a design doc to parse it by default. But I don't need to. We take the Nagios approach. You define what command returns that information. A shell script, a python script, a ruby script - whatever works for you. All you need to tell me is what key you want to store it under and something about the basic structure of the data itself. Maybe your script emits JSON. Maybe it emits YAML. Maybe it's a single string. Maybe you run multiple JBoss instances per machine, each listening on different JMX ports (as opposed to aliasing IPs and using the standard port). I'll take that and create a new key with that data in the Couch document for that system. You can peruse it with a custom web interface or, again, just use Futon.
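
    As a hedged example of what such a user-written collector might look like (the "jmx_ports" key name, the netstat scraping and the "any listening java port counts" shortcut are all my assumptions here, not anything Vogeler dictates):

    #!/usr/bin/env python
    # Crude illustrative collector: emit the ports java processes listen on
    # as JSON on stdout, under a key the user picked ("jmx_ports").
    import json
    import subprocess

    output = subprocess.Popen(['netstat', '-tlnp'],
                              stdout=subprocess.PIPE).communicate()[0]

    ports = set()
    for line in output.splitlines():
        if 'java' in line:                  # naive: every listening java port
            local_addr = line.split()[3]    # e.g. 0.0.0.0:1090
            ports.add(int(local_addr.rsplit(':', 1)[1]))

    print(json.dumps({'jmx_ports': sorted(ports)}))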

    Does that help?

     

    Notes on #vogeler and #devops

    UPDATE: There's some additional information about Vogeler in the followup post to this one.

    Background

    So I've been tweeting quite a bit about my current project Vogeler. Essentially it's a basic configuration management database built on RabbitMQ and CouchDB. I had to learn Python for work and we may or may not be using those two technologies, so Vogeler was born.

    There's quite a bit of information on Github about it but essentially the basic goals are these:

    • Provide a place to store configuration about systems
    • Provide a way to update that configuration easily and scalably
    • Provide a way for users to EASILY extend it with the information they need

    I'm not doing a default web interface or much else right now. There are three basic components - a server process, a client process and a script runner. The first two don't act as traditional daemons but instead monitor a queue server for messages and act on them.

    In the case of the client, it waits for a command alias and acts on that alias. The results are stuck on another queue for the server. The server sits and monitors that queue. When it sees a message, it takes it and inserts it in the database with some formatting based on the message type. That's it. The server doesn't initiate any connections directly to the clients and neither do the clients talk directly to the server. All messages that the clients see are initiated by the runner script only.
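
    If it helps, here's an in-process mock of that flow using nothing but stdlib queues in place of RabbitMQ and CouchDB. It's not Vogeler's code - just the shape of runner -> clients -> server, with made-up message contents.

    # Toy model of the message flow: runner broadcasts, clients respond,
    # server drains the result queue and would upsert a document per node.
    import json
    import Queue   # 'queue' on Python 3

    broadcast = [Queue.Queue() for _ in range(3)]   # fanout: one per client
    results = Queue.Queue()                         # clients -> server

    def runner(command):
        for q in broadcast:                         # '-n all': every node
            q.put(command)

    def client(node_id, inbox):
        command = inbox.get()                       # e.g. 'rpms'
        results.put(json.dumps({'node': node_id,
                                'command': command,
                                'output': ['fake-rpm-1.0']}))

    def server():
        while not results.empty():
            doc = json.loads(results.get())
            print('would update CouchDB doc for node %s' % doc['node'])

    runner('rpms')
    for i, q in enumerate(broadcast):
        client(i, q)
    server()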

    That's it in a nutshell.

    0.7 release

    I just released 0.7 of the library to PyPi (no small feat with a teething two year old and a 5 month old) and with it, what I consider the core functionality it needs to be useful for people who really are interested in testing it. Almost everything is configurable now. Server, Client and Runner can each specify where the components they need live on the network. CouchDB and RabbitMQ are running in different locations from the server process? No problem. Using authentication in CouchDB? You can configure that too. Want to use different RabbitMQ credentials? Got it covered.

    Another big milestone was getting it working with Python 2.6. No distro out there that I know of is using 2.7, which is what I was using to develop Vogeler. The reason I chose 2.7 is that it was the version we standardized on and, since I was learning a new language and 2.7 was a bridge to 3, it made sense. But when I started looking at trying the client on other machines at home, I realized I didn't want to compile and set up the whole virtualenv thing on each of them. So I got it working with 2.6, which is what Ubuntu is using. For CentOS and RedHat testing, I just used ActivePython 2.7 in /opt/.

    Milestones

    As I said 0.7 was a big milestone release for me because of the above things. Now I've got to do some of the stuff I would have done before if I hadn't been learning a new language:

    • Unit Tests - These are pretty big for me. Much of my work on Padrino has been as the Test nazi. Your test fails, I'm all up in your grill.
    • Refactor - Once the unit tests are done, I can safely begin to refactor the codebase. I need to move everything out of a single .py with all the classes. This also paves the way for allowing swappable messaging and persistence layers (see the sketch after this list). This is where unit tests shine, IMHO. Additionally, I'll finish up configuration file setup at this point.
    • Logging and Exception handling - I need to set up real loggers and stop using print messages. This is actually pretty easy. Exception handling may come as a result of the refactor but I consider it a distinct milestone.
    • Plugin stabilization - I'm still trying to figure out the best way to handle default plugins and what basic document layout I want.
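
    For the swappable persistence idea, something like the following is what I have in mind - a hedged sketch, not the actual Vogeler layout. The class and method names are assumptions, and the CouchDB class leans on the couchdb-python package used for illustration earlier.

    # Sketch of a persistence interface the refactor could introduce, so a
    # MongoDB or Riak backend could slot in beside CouchDB later.
    class PersistenceBackend(object):
        def save_node(self, node_id, key, data):
            raise NotImplementedError

        def get_node(self, node_id):
            raise NotImplementedError

    class CouchPersistence(PersistenceBackend):
        def __init__(self, url='http://localhost:5984', dbname='vogeler'):
            import couchdb                      # assumed dependency
            self._db = couchdb.Server(url)[dbname]

        def save_node(self, node_id, key, data):
            doc = self._db.get(node_id, {'_id': node_id})
            doc[key] = data                     # e.g. 'rpms': [...]
            self._db.save(doc)

        def get_node(self, node_id):
            return self._db[node_id]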

    Once those are done, I should be ready for a 1.0 release. However, before I cut that release, I have one last test.....

    The EC2 blowout

    This is the part I'm most excited about. When I feel like I'm ready to cut 1.0, I plan on spinning up a few hundred EC2 vogeler-client instances of various flavors (RHEL, CentOS, Debian, Ubuntu, Suse...you name it). I'll also stand up distinct RabbitMQ, CouchDB and vogeler-server instances.

    Then I fire off the scripts. Multiple vogeler-runner invocations concurrently from different hosts and distros. I need to work out the final matrix but I'll probably use Hudson to build it.

    While you might think that this is purely for load testing, it's not. Load testing is a part of it but another part is seeing how well Vogeler works as a configuration management database - the intended usage. What better way than to build out a large server farm and see where the real gaps are in the default setup? Additionally, this will allow me to really standardize on some things in the default based on the results.

    At THAT point, I cut 1.0 and see what happens.

    How you can help

    What I really need help with now is feedback. I've seen about 100 or so total downloads on PyPi across releases but no feedback on Github yet. That's probably mostly due to such minimal functionality before now and the initial hurdle. I've tried to keep the Github docs up to date. I think if I convert the Github markdown to rst and load it on PyPi, that will help.

    I also need advice from real Python developers. I know I'm doing some crazy stupid shit. It's all a part of learning. Know a way to optimize something I'm doing? Please tell me. Is something not working properly? Tell me. I've tried to test in multiple virtualenvs on multiple distros between 2.6 and 2.7 but I just don't know if I've truly isolated each manual test.

    Check the wiki on github and try to install it yourself. Please!

    I'm really excited about how things are coming along and about the project itself. If you have ANY feedback or comments whatsoever, please pass it on even if it's negative. Feel free to tell me that it's pointless but at least tell me why you think so. While this started out as a way to learn Python, I really think it could be useful to some people and that's kept me going more than anything despite the limited time I've had to work on it (I can't work on it as part of my professional duties for many reasons). I've been trying to balance my duties as a father of two, husband and Padrino team member along with this, and I think my commitment (4AM...seriously?) is showing.

    Thanks!