If something weird is happening with a server, never think "It'll just be an hour or two." Never think "If I'm going to be in the server room anyway, I might as well do foo to another box while I'm at it." Since I thought both of these foolish things yesterday, it's clear that there are definitely areas of Linux system administration that I'm no good at and that are needlessly complicated, and that I'm an inveterate optimist when it comes to these things.
The CodeYard server -- a five-year-old IBM x306 with hard drives showing over 30000 hours of continuous operation and uptimes of over 500 days -- slowed to a crawl, then rebooted yesterday. Sjors pinged me by phone, so I biked to the University to take a look with him. While I was en route, the box kernel-panicked again while running fsck(8). Ugh.
Now, working on a server that has two partially-mirrored 250GB SATA-150 hard drives and only 1GB of RAM (seriously, when we got this machine it was a sensible box for supporting medium-sized workgroups; now my phone has more oomph) just takes forever. It never takes just an hour or two: wait for the GEOM mirror to complete, then for fsck(8) to finish, and then .. bam, another kernel panic. By the end of the day we hadn't really pinned down what was causing the problem, but a memory test seems in order.
All the data -- students' SVN and git repositories -- on the machine seems safe, but we've pretty much shut down all the services the box offers through its various service jails until we get things sorted out.
So one failure doesn't a Murphy's day make. The second is that my laptop -- which worked in the morning and didn't when I got to the server room -- has suddenly forgotten that it has a display panel attached: I don't see a thing, not even BIOS POST messages. It still seems to boot into Fedora OK, and apparently I can even log in to my wonderful pink desktop (now there's a blessing in disguise) -- I just can't see any of it. This particularly puts a crimp in the plan to use the laptop as a KDE demonstration machine during the NLUUG fall conference. I might end up lugging a desktop machine along instead.
In parallel with all this I did some upgrades on the EBN machine, which was foolish of me. That server had been running off a spare laptop drive for some time now -- a situation that was bound to come crashing down at some point. So the plan was simple: add a 500GB data disk, put back the Sun 10kRPM SAS disk that came out of the machine some time ago, copy the boot environment to the SAS disk, reboot, done.
Three things I'd forgotten: dump + restore no longer works, making disks bootable is non-trivial, and initrd is some brain-dead invention intended to prevent you from moving things around effectively. Give me FreeBSD, which will at least boot (quickly), then complain and let you type in the root directory for single-user mode in a human-friendly fashion.
In the end I dd'ed the old disk onto the new one, then did a chroot and ran mkinitrd. It just doesn't seem right. Maybe I've missed a really obvious manpage somewhere explaining how the boot process works nowadays and how to migrate an installation to a different disk (lazyweb!). Tracking down the remaining references to the old disk took a bit longer, but the machine is up and running again. Now my next challenge is to convince the disk subsystem that I hot-attached a new drive (which would be /dev/sdf) that is physically identical to /dev/sde, and then dd everything over again so there's a spare boot disk.
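For the record, the whole exercise boiled down to something like the following. This is a sketch, not a recipe: the device names (/dev/sde, /dev/sdf), the partition number, and the SCSI host number are examples that will differ per machine, and on newer distributions dracut has replaced mkinitrd.

```shell
# Clone the old disk onto the new one, byte for byte
# (partition table, boot loader and all). Destructive!
dd if=/dev/sde of=/dev/sdf bs=1M

# Chroot into the copied system so mkinitrd and the boot
# loader run against the new disk's environment.
mount /dev/sdf1 /mnt
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
chroot /mnt

# Inside the chroot: rebuild the initrd for the running
# kernel and reinstall GRUB on the new disk.
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
grub-install /dev/sdf
exit

# Ask the SCSI layer to rescan for a hot-attached drive;
# host0 is a guess -- check dmesg for the right host.
echo "- - -" > /sys/class/scsi_host/host0/scan
```

After that, any /etc/fstab or boot-loader entries still naming the old disk need hunting down by hand, which is where most of the remaining time went.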
Plenty of things to go wrong. In retrospect, the old Nethack adage serves best (e.g. when going down stairs while burdened with a cockatrice corpse): "just don't do that."