recent posts for dwilson

System Administration - Tactical vs Strategic

dwilson on Fri May 25 2007 09:58:58 GMT+0100 (GMT)

Contrary to the impression my previous sysadmin blog posts may have given, we (unfortunately) don't spend all our time in the office trying new software and evaluating new hardware - even we need a lunch break. Instead we're working behind the scenes on Zimki and our other websites to keep things running happily, teaming up with our developers to answer any support issues that you lovely customers send our way and, sometimes, even completing milestones in our longer-running projects. Honest.

Despite the often-recited "herding cats" analogy, creating an operationally efficient systems team is a pretty straightforward thing to do. Notice that I didn't say it'd be easy. It mostly requires an understanding that our workload has two main forms: tactical and strategic. I'm not including firefighting here - that's a topic for something longer than even one of my blog posts.

Tactical work is what most sysadmins spend their days doing: helping customers (a much nicer term than users), fixing problems as they appear, making small tweaks and changes, and so on. These tasks are often important to other people but they rarely help us achieve our own goals or complete our project work. The projects themselves, which are the strategic part of the workload, are every bit as important as user requests - they're just not as visible.

At Fotango the systems team is currently four people (we're looking for a fifth) and the work breakdown on a typical day looks a lot like this:

One person monitors the request tracking system. We track customer issues escalated by our excellent (and very patient) front-end support, as well as all internal requests.

We've found that people have an expected response time for tasks. By assigning a dedicated person we keep our response time low while not constantly interrupting anyone on a more involved task. The systems support person can also help with other, less focus-demanding, tasks.

The second tactical role is the on-call bunny. She's the first line for issues that crop up in the systems themselves. Problems detected via Nagios, suspicious lines in logs, performance bottlenecks and load spikes are all part and parcel of this role.
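For the curious, the bulk of those alerts come from fairly ordinary Nagios service checks. A minimal sketch of the sort of definition involved - the host, command and contact group names here are illustrative rather than lifted from our real config:

    define service {
        use                   generic-service   ; assumes a local service template
        host_name             web01             ; hypothetical host
        service_description   Current Load
        check_command         check_nrpe!check_load
        contact_groups        sysadmins         ; this is what pages the on-call bunny
    }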

The other half of the team mostly work on our longer-term projects, attend meetings that require a sysadmin to be present, or perform daily maintenance. These are often concentration-demanding tasks (apart from the meetings) that are made much easier by having the other two provide an interruption shield. Of course, a big problem will drag them back into the trenches, but there shouldn't be enough big problems to make this a real issue.

So now you know it's not all glamour in the systems team. Next time your page loads quickly and with no problems, spare a thought for the effort we've put in so you don't have to.


Reducing MTTR with Automated Server Provisioning

dwilson on Fri May 18 2007 12:12:08 GMT+0100 (GMT)

If you follow this - oddly green - blog you'll have seen posts from management, developers, more developers, even more managers, but none from the sysadmins. Why is this? Obviously we actually have real work to do - although, considering I'm a sysadmin, I might be a little biased. So what kind of thing have we been spending our time on recently?

Over the weekend Bob and I spent some time investigating how we could speed up the deployment of an entire staging environment for Zimki. There are a number of reasons for doing this: in addition to all those lovely terms that management like (and that come in handy at bonus reviews), such as operational efficiency and business continuity planning, it has a noticeable impact on reducing our Mean Time To Recovery (MTTR).

Our plans for this prototype involved automated installation (we decided to use Debian Etch for the experiment) using FAI, Fully Automatic Installation (hence the name). We currently have an in-house solution for deploying the dedicated hosts that run our products, but it's not an ideal fit for the infrastructure hosts and, being proprietary, we have to do all the hard work ourselves!
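If you've not met FAI before, installs are driven by a class-based config space: hosts pick up classes, and the classes pull in package lists, disk layouts, scripts and so on. A rough sketch of a package list, assuming the stock config space layout - the class name and package selection are illustrative, not our actual build:

    # package_config/ZIMKI_STAGING -- package list for a hypothetical class
    PACKAGES install
    bind9
    syslog-ng
    nagios-nrpe-server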

For the actual systems management we're very close to settling on Puppet as our tool of choice. While we've already invested some time in CFEngine and have a very basic deployment, Puppet has a much nicer feel and is easier to get up and running, which is a great thing when your scarcest resource is time. Over the weekend we completed proofs of concept for BIND deployments (including custom configs and zones from our internal SVN repo), centralised loghosts and the client-side part of our (actually very comprehensive) Nagios setup.
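To give a flavour of why Puppet felt so quick to get going with, here's roughly what the BIND proof of concept boils down to. This is a minimal sketch rather than our real manifest - the class name, file paths and fileserver mount are illustrative, and in practice the configs and zones come out of the SVN checkout mentioned above:

    class bind {
        package { "bind9":
            ensure => installed,
        }

        # Config pulled from a checkout served by the puppetmaster's
        # fileserver; "bind" is an assumed mount name.
        file { "/etc/bind/named.conf.local":
            owner   => "root",
            group   => "bind",
            mode    => "644",
            source  => "puppet:///bind/named.conf.local",
            require => Package["bind9"],
            notify  => Service["bind9"],
        }

        service { "bind9":
            ensure => running,
            enable => true,
        }
    }

The loghost and Nagios client proofs of concept follow much the same package/file/service pattern.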

So what does this get us? Depending on which parts of the experiment pass review and get pulled back into the standard build, we're closer to an automated deployment of an entire staging environment (which was the ultimate goal). It also has the more immediate benefit of giving us finer-grained control over which services run on which machines. Although we already run most critical services in paired or clustered configurations, bringing the MTTR down to an acceptable value lets us reduce the amount of standby hardware and, in cases where a short outage is acceptable, remove it completely. As an added bonus it also makes replacing the paired servers easier.

One of the more pleasant side effects of this kind of project is the way it challenges your assumptions. Where do you pull your configs down from? How many of them are abstracted enough to be reused without modification? Fotango is fortunate in that we invest in our own tools, and we've got a very nice piece of config generation software that makes a lot of this trivial for us.
