Reducing MTTR with Automated Server Provisioning

dwilson on Fri May 18 2007 12:12:08 GMT+0100 (GMT)
If you follow this - oddly green - blog you'll have seen posts from management, developers, more developers and even more managers, but none from the sysadmins. Why is this? Obviously we actually have real work to do, although considering I'm a sysadmin I might be a little biased.
So what kind of thing have we been spending our time on recently? Over the weekend Bob and I spent some time investigating how we could speed up the deployment of an entire staging environment for Zimki. There are a number of reasons for doing this. In addition to all those lovely terms that management like (and that also come in handy at bonus reviews), such as operational efficiency and business continuity planning, it noticeably reduces our Mean Time To Recovery (MTTR).
Our plans for this prototype involved automated installation using FAI (Fully Automatic Installation, hence the name); we decided to use Debian Etch for the experiment. We currently have an in-house solution for deploying the dedicated hosts that run our products, but it's not an ideal fit for the infrastructure hosts, and because it's proprietary we have to do all the hard work ourselves!
For the actual systems management we're very close to settling on Puppet as our tool of choice. While we've already invested some time in CFEngine and have a very basic deployment, Puppet has a much nicer feel and is easier to get up and running, which is a great thing when your scarcest resource is time. Over the weekend we completed proofs of concept for BIND deployments (including custom configs and zones from our internal SVN repo), centralised loghosts and the client-side part of our (actually very comprehensive) Nagios setup.
So what does this get us? Depending on which parts of the experiment pass review and get pulled back into the standard build, we're closer to an automated deployment of an entire staging environment (which was the ultimate goal).
It also has the more immediate gain of giving us finer-grained control over which services run on which machines. Although we already have most critical services in paired or clustered configurations, being able to reduce the amount of standby hardware (and, where a short outage is acceptable, remove it completely by bringing the MTTR down to an acceptable value) is a saving we can make straight away. As an added bonus, it makes replacing the paired servers easier.
One of the more pleasant side effects of this kind of project is the way it challenges your assumptions. Where do you pull your configs down from? How many of them are abstracted enough to be reused without modification? Fotango is fortunate in that we invest in our own tools, and we've got a very nice piece of config generation software that makes a lot of this trivial for us.
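
To put rough numbers on the standby hardware trade-off, here is a minimal Python sketch. It is not part of our tooling: the function names, the incident durations and the 60 minute tolerance are all made up for illustration. The only things it relies on are the usual definitions, MTTR as the average time to recover per incident and steady-state availability as MTBF / (MTBF + MTTR).

```python
def mttr_minutes(recovery_times):
    """Mean Time To Recovery: average recovery time per incident, in minutes."""
    if not recovery_times:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(recovery_times) / len(recovery_times)


def availability(mtbf, mttr):
    """Steady-state availability, MTBF / (MTBF + MTTR), both in the same unit."""
    return mtbf / (mtbf + mttr)


def standby_needed(rebuild_minutes, acceptable_outage_minutes):
    """A standby box is only worth keeping if an automated rebuild
    takes longer than the outage the service can tolerate."""
    return rebuild_minutes > acceptable_outage_minutes


if __name__ == "__main__":
    # Hypothetical figures, not measurements from our environment.
    manual_rebuilds = [240, 180, 300]   # minutes per hand-built recovery
    automated_rebuild = 30              # minutes for an automated install
    mtbf = 30 * 24 * 60                 # assume one failure a month, in minutes

    manual_mttr = mttr_minutes(manual_rebuilds)
    print("manual MTTR:", manual_mttr)
    print("availability, manual:", availability(mtbf, manual_mttr))
    print("availability, automated:", availability(mtbf, automated_rebuild))
    print("standby needed (60 min tolerance):",
          standby_needed(automated_rebuild, 60))
```

The point of the last function is the one made above: once an automated rebuild fits inside the outage a service can tolerate, the idle standby machine stops earning its keep.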