My post yesterday touched on a software development subject that crystallizes many of the process breakdowns I see in too many organizations: much time is spent measuring developer output, while the overall cycle from idea to users goes unmeasured. Once an organization begins to measure that cycle, the next step is to measure the activities within it.

Of all the phases in a typical software delivery cycle, the most costly in improperly automated environments is deploying to production. We spend hours writing unit tests, maybe some integration tests, and perhaps even a full automated acceptance suite, yet significant time is still spent getting that code to work right in its eventual “production” environment.

Some signs that this might be happening to you:

  • Deploying to production keeps folks working long past the planned duration, involves numerous personnel, and is a high-stress event.
  • Code that was accepted in test doesn’t work in staging or production.
  • Things that work in production after the latest deployment don’t work in the other environments, and an operations person has to be contacted to find out what they changed recently.

Before I go much further, let’s define what I mean by production. In an IT department with internal applications, production may be a farm of web servers and a database cluster serving one instance of several applications used by the organization. For a shrink-wrapped product, production will be your users’ computers. The cost to cycle time of not properly testing your application in its environment before delivering it can be significant.

Since production environments are the backbone of a company’s IT operations, operations personnel (or those of your customers) are motivated to keep things as stable as possible. Developers, however, are motivated by their ability to enact change in the form of new features. This creates a conflict of interest, and most organizations’ answer is to lock down production environments so that only operations personnel can access them. An alternative strategy, one outlined in continuous delivery, is to treat the work operations does to set up and maintain their environment with the same rigor and process as the software being deployed to it.

Life before source control – are we still there?

Consider an example. An organization has four environments – development, test, staging, and production. Development is an environment in which programmers can make whatever changes are needed to support ongoing work. Test should be the same environment, but used for running tests and manually checking out the application as a user would. Staging should be the final place your code goes to verify a successful deployment, and production simply a copy of staging. You may already be thinking, “I can’t afford a staging environment that has the same hardware as production!”

It’s acceptable for staging not to have the exact specifications of production, but you should at minimum have two nodes for every scalable point in the topology. If production has a cluster of 4 databases, staging needs 2. If production has a farm of 10 web servers, staging needs 2. With this environment in place, you are still testing the scaled points in your architecture, but without the cost of maintaining an entire cluster. This is obviously easier with virtualization, but take care not to use a staging environment that is significantly more or less powerful than production if you use it for capacity and performance testing. You cannot take a staging environment with half the servers of production and simply double the performance you measure to assume production will provide twice the capacity; computing resources do not scale linearly, as one might assume.

Continuing with the example, consider what work would be like without source control. When you make a change to your code, you would have to manually apply that change on each developer’s machine. Maybe you could make things a bit easier by creating a document that tells developers how to make the changes themselves. This is ridiculous, right? Sadly, this is exactly how many organizations treat the environment. A change made in one environment is manually made in all the others, and the opportunity for lag between those changes, and for human error, is large.

Making the environment a controlled asset

The way out of this mess is to start thinking about the environment as a product that deserves the same process oversight as the software being deployed to it. We spend so much time making sure the code developers write is tested, yet it’s just as easy to break production with one bad configuration change. To get around this, we need to change the way the environment is managed and leverage automation.

  1. Create baseline operating system images for each node type required by your application (database server, web server, etc.). These images should have the operating system, and any other software that takes a long time to install, already set up. Don’t pre-configure anything in these images that can change from one environment (dev/test/prod, etc.) to the next.
  2. Create deployment scripts that you can point at a networked computer or VM using datacenter management software (Puppet, System Center, etc.). These scripts should install the baseline image on the target computer. Work with operations to determine the best scripting technology to use for them. Operations personnel typically hate XML, but using PSake (a PowerShell build and deployment tool) or rake is usually acceptable.
  3. Create deployment scripts that run after the datacenter management step and configure the environment to suit your software. This includes setting up permissions, adding users to groups, and making configuration changes to your frameworks (.NET machine.config, Java classpath, Ruby system gems, etc.).
  4. Create configuration settings that are specific to each of your environments. This would optimally be one database table, XML file, or properties file with the settings that change from one environment to the next. Put your database connection strings, load balancer addresses, web service URLs, etc. in one place. I’ll do a future post on this point alone.
  5. Create deployment scripts that apply the configuration settings to the target environment.
  6. Store all of these assets in source control (other than maybe the OS images, which should live on a locked-down asset repository or filesystem share).
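
Steps 4 and 5 above can be sketched roughly as follows. This is a minimal illustration only, assuming one JSON settings file per environment and a simple token-replacement templating scheme; the file names, keys, and `@KEY@` convention are hypothetical, not a prescription.

```python
import json
from pathlib import Path

def load_settings(environment: str, settings_dir: str = "config") -> dict:
    """Load the single per-environment settings file (step 4)."""
    path = Path(settings_dir) / f"{environment}.json"
    return json.loads(path.read_text())

def apply_settings(template_path: str, output_path: str, settings: dict) -> None:
    """Render a config template, replacing @KEY@ tokens with values (step 5)."""
    text = Path(template_path).read_text()
    for key, value in settings.items():
        text = text.replace(f"@{key}@", str(value))
    Path(output_path).write_text(text)
```

The point of keeping one file per environment is that the template, and the script applying it, are identical everywhere; only the values vary.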

Once this is in place, you should be able to point at any computer or VM on your network that IT has set up for remote management and target a build at it. The build should lay down the OS image and run all your deployment scripts. From this point forward, the only way any change should be made to the environment is through source control.
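
The end-to-end flow could be driven by something along these lines – a sketch only, where each script name is a placeholder for whatever your datacenter management tooling (Puppet, System Center, etc.) actually provides:

```python
import subprocess

def deploy_environment(target_host: str, environment: str, runner=subprocess.run) -> None:
    """Run the deployment stages in order against a remotely managed target.

    The script names are hypothetical; substitute calls into your real tooling.
    """
    stages = [
        ["./install_baseline_image.sh", target_host],                   # step 2: baseline OS image
        ["./configure_for_application.sh", target_host],                # step 3: app-specific setup
        ["./apply_environment_settings.sh", target_host, environment],  # step 5: env-specific config
    ]
    for stage in stages:
        # check=True aborts the deployment at the first failed stage
        runner(stage, check=True)
```

Because the driver takes the environment name as a parameter, the same build can target development, test, staging, or production without modification.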

This change provides us with a number of benefits:

  • Operations personnel improve their career skills by learning to write scripts that automate environment changes, and those scripts can be reused in all of the other environments. If you want to change the configuration of the database, for example, this change, once made in source control, will propagate to ALL environments deployed from the same build.
  • Developers can look in source control to see the configuration of the environment. No more sending an email to operations to find out what varies in production from the other environments.
  • Deploying new builds will test the latest code, with the latest database changes, along with any environment changes. This is the only way to really test how your application will run in production. Any problems found in staging would otherwise have surfaced in production, so you get a chance to fix them without the stress that production adds.

There are a couple more things to mention here. First, if you are deploying shrink-wrapped software, you probably have many target environments. To really deliver quality with as few surprises to your customers as possible, you should set up automated builds like this for each variation you might deploy to. Determine the minimum hardware requirements for your customers, test at that minimum configuration, and also test any variances in environment. If you support two versions of SQL Server, for example, you really should be testing deployment on an environment with each of those versions.
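
Enumerating those variations explicitly keeps the matrix honest. A small sketch, where the listed operating systems and SQL Server versions are examples rather than your actual support matrix:

```python
import itertools

# Hypothetical support matrix for a shrink-wrapped product; substitute the
# operating systems and dependency versions you actually support.
OS_VERSIONS = ["Windows Server 2003", "Windows Server 2008"]
SQL_VERSIONS = ["SQL Server 2005", "SQL Server 2008"]

def deployment_matrix():
    """Return one (os, sql) pair per environment variation to verify."""
    return list(itertools.product(OS_VERSIONS, SQL_VERSIONS))
```

Each pair in the matrix gets its own automated deployment verification build, so a combination you claim to support is never one you have silently stopped testing.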

One more thing – for organizations in which production settings must not be visible to everyone, simply keep a separate source control repository or folder with the configuration settings for production, and give your build permission to pull from that repository (just the configuration) when setting up a production node. Developers will still need elevated permissions, or to coordinate with more-privileged operations personnel, to answer their questions about how production is set up, but the code for applying configuration settings to the other environments remains accessible in source control, simply with different values than production.
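
One way to sketch that split is to resolve the settings location from the target environment, falling back to a restricted path only for production. The paths and the environment name check here are assumptions for illustration:

```python
from pathlib import Path

# Hypothetical locations: the main repository holds settings for every
# environment except production, which lives in a locked-down repository
# that only the build account (and operations) can read.
MAIN_CONFIG_ROOT = Path("config")
RESTRICTED_CONFIG_ROOT = Path("/secure/prod-config")

def settings_path(environment: str) -> Path:
    """Resolve where the build should read settings for an environment."""
    if environment == "production":
        return RESTRICTED_CONFIG_ROOT / "production.json"
    return MAIN_CONFIG_ROOT / f"{environment}.json"
```

Everything else about the deployment stays identical; only the location the build reads values from changes for production.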

Once you have an automated mechanism for setting up and configuring your environment from a build, you need a way to piggyback that process on your continuous integration server. I’ll leave that for my next post.