dwilson
on Fri May 18 2007 12:12:08 GMT+0100 (GMT)
If you follow this - oddly green - blog you'll have seen posts from
management, developers, more developers, even more managers, but none
from the sysadmins. Why is this? Obviously we actually have real work to
do - although, being a sysadmin, I might be a little biased. So
what kinds of things have we been spending our time on recently?
Over the weekend Bob and I spent some time investigating how we could
speed up the deployment of an entire staging environment for Zimki.
There are a number of reasons for doing this. In addition to all
those lovely terms that management like (and that come in handy at
bonus reviews), such as operational efficiency and business continuity
planning, it has a noticeable impact on reducing our Mean Time To
Recovery (MTTR).
Our plans for this prototype involved automated installation (we decided
to use Debian Etch for the experiment) using FAI, Fully
Automatic Installation (hence the name). We currently have an in-house
solution for deploying the dedicated hosts that run our products, but it's
not an ideal fit for the infrastructure hosts, and it's proprietary, so we
have to do all the hard work ourselves!
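As a rough illustration of how FAI drives an install, hosts are assigned to classes and each class pulls in its own package list. A package config file might look something like this (the class name and package selection here are invented for illustration, not our actual config):

```
# /srv/fai/config/package_config/STAGINGHOST  (hypothetical class)
# Packages installed on any host belonging to the STAGINGHOST class.
PACKAGES aptitude
openssh-server
ntp
puppet
```

Similar per-class hooks exist for partitioning, debconf preseeding and post-install scripts, which is what makes a fully hands-off build of a whole environment practical.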
For the actual systems management we're very close to settling on Puppet as our tool
of choice. While we've already invested some time in CFEngine and have a
very basic deployment, Puppet has a much nicer feel and is easier to get
up and running, which is a great thing when your scarcest resource is
time. Over the weekend we completed proofs of concept for BIND
deployments (including custom configs and zones from our internal SVN
repo), centralised loghosts, and the client-side part of our (actually
very comprehensive) Nagios setup.
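By way of illustration, the BIND proof of concept boils down to a small Puppet class along these lines (the paths, mount point and server name are invented for this sketch; it's not our actual manifest):

```puppet
# Hypothetical sketch of a BIND class.
class bind {
    package { "bind9": ensure => installed }

    # named.conf and the zone files come from a checkout of the
    # internal SVN repo, served to clients by the puppetmaster.
    file { "/etc/bind/named.conf":
        owner   => "root",
        group   => "bind",
        mode    => 644,
        source  => "puppet://puppetmaster/files/bind/named.conf",
        require => Package["bind9"],
        notify  => Service["bind9"],
    }

    service { "bind9":
        ensure  => running,
        require => Package["bind9"],
    }
}
```

The nice property is the dependency graph: changing the config file automatically restarts the service, and nothing is attempted before the package is installed.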
So what does this get us? Depending on which parts of the experiment
pass review and get pulled back into the standard build, we're closer
to an automated deployment of an entire staging environment (which was
the ultimate goal). It also has the more immediate benefit of allowing us
finer-grained control over which services run on which machines. Although
we already have most critical services in paired or clustered
configurations, the ability to reduce the amount of standby hardware -
and, in cases where a short outage is acceptable, remove it completely
by bringing the MTTR down to an acceptable value - is an immediate gain. It
also has the added bonus of making the replacement of the paired
servers easier.
One of the more pleasant side effects of this kind of project is the
challenge it poses to your assumptions. Where do you pull your configs
down from? How many of them are abstracted enough to be reused without
modification? Fotango is fortunate in that we invest in our own tools,
and we've got a very nice piece of config generation software that makes
a lot of this trivial for us.