The EBN machine, which hosts api.kde.org (sadly under-maintained) and www.englishbreakfastnetwork.org (Krazy code checking tools) and solaris.bionicmutton.org (OpenIndiana packages of KDE4) and some other things like an anonsvn mirror (note to self and/or sysadmin: need to set up anongit mirror, too), is back up. The usual “oh, drat” accompanied the downtime, like needing to do a failsafe boot because coming up to a full OpenSUSE desktop login (headless!) causes a panic.
The university where the EBN is hosted is having some scheduled downtime to replace the air-conditioning units. The maintenance window starts in 10 hours and is 4 hours long, so from 7am to 11am on Saturday, February 12th in the Europe/Amsterdam timezone (GMT+1 right now). I don’t know how long the work will take, nor when exactly the machine will be back up (it takes a while to press the “on” button on so many machines), but it’ll be down for a bit. I expect to HACF the machine early tomorrow morning, say 6:50am.
I wrote this on Saturday afternoon, but didn’t hit “post”: After running fsck repeatedly until it finally stopped finding unreferenced files, I’m hopefully calling the disk array fixed on the EBN. I’m still going to reconfigure the server in some way, but for the time being I’ve restarted the VM for api.kde.org and the EBN. That means that api.kde.org is accessible again and the EBN is running. I’m hoping that the NIC and RAID will hold out. It will take a while for API regeneration to finish as well as a new round of Krazy checks to run.
And I would add this today: the RAID array didn’t survive the night, with new read timeouts on the disk followed by mirror corruption and end of story. So I rebooted, dropped that VM again and the machine is currently running only essential services. We’re in the process of moving things off of the machine now so that it is easier to reinstall, with fewer tasks. Then we’ll hopefully have something usable again.
Well, it took ages, but the EBN and the different VMs it hosts are back. Add “sysadmin” to the list of occupations I probably shouldn’t attempt without (1) more training (2) a stricter schedule. The NLUUG spring conference on systems administration was quite educational — and fun, too, chatting with various companies and learning about NanoBSD and ZFS — but it didn’t give me any magical beans to fix what ailed the EBN.
So what was the problem? Well, the whole thing started (yay, placing the blame!) with Bertjan, who wanted a newer Qt version on the EBN for his software quality checking tools. The EBN ran 6.2-R, and the necessary Qt versions and stuff are not supported on that OS anymore. While the EOL for FreeBSD 6 is still six months away, the ports maintainers don’t necessarily want to support that. So we needed to update the OS to something newer.
There are tools to do that now, but I’ve never used them, and anyway I don’t think they support FreeBSD 6. So that means lots of “make buildworld buildkernel installkernel installworld” kinds of steps. First off I found that doing the compilations took a lot longer than I expected (or hoped). So where I planned to go 6-6.4-7.3-8.0 in one day, the fact was that just compiling was going to take longer than that. I couldn’t precompile everything either with the machine still up, because FreeBSD 8 doesn’t compile in a FreeBSD 6 environment. Hence the multiple steps. Note to self: update more frequently to avoid this kind of large upgrade.
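As a rough illustration of why the hops add up, here is a dry-run sketch that only prints the steps for each stage. The hop list comes from the plan above; the /usr/src location, the reboot placement and the configuration-merge step are assumptions based on the usual source-upgrade routine, not a record of what was actually typed on the EBN.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the steps for each upgrade hop.
# HOPS is taken from the post; everything else is an assumption.
HOPS="6.4 7.3 8.0"

plan_hop() {
    release="$1"
    echo "== hop to FreeBSD ${release} =="
    echo "# assumption: the ${release} sources are already in /usr/src"
    echo "cd /usr/src"
    echo "make buildworld"
    echo "make buildkernel"
    echo "make installkernel"
    echo "# reboot into the new kernel before installing the world"
    echo "make installworld"
    echo "# merge configuration changes, then reboot again"
}

plan_upgrade() {
    for release in $HOPS; do
        plan_hop "$release"
    done
}

plan_upgrade
```

Three hops means three full buildworld and buildkernel runs and at least as many reboots, which is where the one-day estimate falls apart.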
Second problem was that the jails (virtual machines) on the server were poorly set up. They all had their own copies of the world. I hadn’t realized that a 6.2 jail wouldn’t work in a 7.3 host (for instance, ps fails and lots of other system tools don’t like it). If I had spent more time thinking, I would have realized that I could installworld to each jail again and things would be ok. Note to self: set up jails with an easily upgradeable world, as described in lots of best-practices documents on jails.
So I upgraded the host onwards to FreeBSD 8.0. Another long long compile, with no GNU screen to make it easier to deal with. Thank goodness for the ILOM and the system console redirection it provides.
Of course, then I went on to make delete-old-libs, which meant that the ports on the system — all of which were compiled against the 6.2 libraries — didn’t work anymore. Note to self: see that little note “in case no 3rd party program uses them anymore”? Keep it in mind next time.
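One way to see which installed programs are left with dangling library references is to ask ldd. The following is a minimal sketch, not what was run on the EBN: the function name `unresolved_libs` is made up for illustration, and it assumes that ldd reports a missing library with the words "not found".

```shell
#!/bin/sh
# Sketch: list executables under a directory that reference a shared
# library the runtime linker can no longer find.
# unresolved_libs is a hypothetical helper name.
unresolved_libs() {
    dir="$1"
    find "$dir" -type f -perm -u+x 2>/dev/null | while IFS= read -r f; do
        # Non-ELF files (scripts and so on) produce no match and are skipped.
        if ldd "$f" 2>/dev/null | grep -q 'not found'; then
            echo "$f"
        fi
    done
}

# Example use (the path is an assumption):
#   unresolved_libs /usr/local/bin
```

Anything this prints needs a rebuild, or the old library put back, before it will start again.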
So, after about two days, I had a base system updated to 8.0, no working jails at all, and all ports — both in the host and in the jails — broken. At this point, I started doing two things in parallel. Note to self: don’t. I started rebuilding the ports in the host system, and reconfiguring the jails to have a single base installation with just /home, /etc, /var and /usr/local local to each jail, using nullfs mounts; I also decided to drop the starting of jails in /etc/rc.local and to use the jail-launching support that is now built in (but which wasn’t, as far as I know, available in 6.0 which is when I first configured the machine). Note to self: that was actually a good idea, and thanks also to Sjors who reminded me of the jail_* variables.
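For reference, a jail set up this way might look roughly like the fragment below. It is a sketch only: the jail name `ebn`, the /jails paths and the address are invented for illustration, and the variable names follow the jail_* pattern from the rc.conf defaults as I remember them, so check them against /etc/defaults/rc.conf on the release in use.

```shell
# /etc/rc.conf fragment (plain sh assignments).
# Jail name, paths and IP address are hypothetical.
jail_enable="YES"
jail_list="ebn"
jail_ebn_rootdir="/jails/ebn"
jail_ebn_hostname="www.englishbreakfastnetwork.org"
jail_ebn_ip="192.0.2.10"
jail_ebn_devfs_enable="YES"
jail_ebn_mount_enable="YES"      # mount the entries listed in the fstab below
jail_ebn_fstab="/etc/fstab.ebn"

# /etc/fstab.ebn would then carry the nullfs mounts, along these lines:
#   /jails/base             /jails/ebn            nullfs  ro  0 0
#   /jails/ebn-rw/etc       /jails/ebn/etc        nullfs  rw  0 0
#   /jails/ebn-rw/var       /jails/ebn/var        nullfs  rw  0 0
#   /jails/ebn-rw/home      /jails/ebn/home       nullfs  rw  0 0
#   /jails/ebn-rw/usrlocal  /jails/ebn/usr/local  nullfs  rw  0 0
```

The point of the layout is that a single installworld into the shared base updates every jail at once, while each jail keeps its own configuration, data and ports.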
So, rebuilding ports after a big step like that is complicated by the fact that perl, ruby, php and python all needed to be recompiled and portupgrade -apP sometimes doesn’t quite get it right. In any case I needed to rebuild the ruby stack first to get a working portupgrade. The other three languages were a mess, with some modules of the languages disappearing at inopportune points along the upgrade path. Basically I did portupgrade -apP ; pkgdb -F ; portinstall <something missing> an awful lot until things were working again. This morning I finally got rid of the last missing PHP 5.3 modules which brought the EBN parts back to life. Note to self: read UPDATING twice before doing this again.
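The repeated round described above can be written down as a small dry-run helper. This is a sketch: `rebuild_round` is a made-up name, it only prints the commands (pkgdb -F asks questions, so it does not script well), and the port origin passed in is a placeholder rather than a real port.

```shell
#!/bin/sh
# Dry-run sketch of one rebuild round: print the commands in order.
# rebuild_round is a hypothetical helper; arguments are port origins
# that turned out to be missing after the previous round.
rebuild_round() {
    echo "portupgrade -apP"
    echo "pkgdb -F"
    for origin in "$@"; do
        echo "portinstall ${origin}"
    done
}

# Placeholder origin, not a real port:
rebuild_round category/some-missing-port
```

Run a round, see what is still broken, add it to the list and go again.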
Of course, all that would have been less problematic if the disk array hadn’t given out twice during the whole operation. Once the ridiculously heavy load on the machine caused a panic and once the power on one of the disks fluctuated enough to cause another panic. Running fsck on a 600GB filesystem with 14M inodes is not quick (especially if there’s a few directories with 1M files in each, as is the case with KDE SVN mirrors). Note to self: badger more people about a better disk array for KDE.
Combine all that with sickness and family time and that’s why it took a week. I’m blogging this for the notes to self for the next time I run an upgrade (resolution: when FreeBSD 8.1 comes out) and to notify folks that things should be back to normal. (If not, drop me a note in comments). On the positive side, the server is better organized now, disk usage is down a little bit, and future upgrades should be much easier.
There are a few things I learned today: One is that FreeBSD with UFS2 is a little slow when dealing with directories with over a million files in them. The KDE SVN (created way back in the SVN 1.4 or earlier days) is set up like that, with one flat directory structure. As a consequence, copying an SVN repo mirror from one place on the disk to another is rather slow. Moving it (within the same filesystem) would be a lot faster, but I wanted a copy. Second is that the EBN machine has grown SVN mirrors and experiments and KDE checkouts (of the whole thing) like mushrooms after rain. I’ll have to clean some of that up, not so much for the disk space, but for tidiness. Third is that while copying three distinct million-file trees in parallel, your disk array will have a power hiccup, panicking the machine and leading to another two days of fsck. So more waiting for the EBN to come back, which is particularly annoying since I had the other virtual machines on the system back up and running, so that Sebas had his website back, the KDE4-Solaris packages were available again, and Claudia could share documents with the rest of the board.
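To find out where the million-file directories are hiding before starting a copy, something like the following would do. It is a minimal sketch: `big_dirs` is a made-up helper name, and it counts entries by counting lines, so file names containing newlines would be miscounted.

```shell
#!/bin/sh
# Sketch: print "count path" for every directory under a root that has
# at least a given number of direct entries.
# big_dirs is a hypothetical helper name.
big_dirs() {
    root="$1"
    threshold="$2"
    find "$root" -type d | while IFS= read -r d; do
        # Count direct children only; $(( )) strips wc's padding.
        n=$(( $(find "$d" -mindepth 1 -maxdepth 1 | wc -l) ))
        if [ "$n" -ge "$threshold" ]; then
            echo "$n $d"
        fi
    done
}

# Example use (path and threshold are assumptions):
#   big_dirs /home/svn 100000
```

On a tree like the SVN mirrors this takes a while itself, since it has to read every directory once.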
Fourth is that Mystic Kriek is really quite tasty, in a pink-and-foamy-cherry-coke-with-alcohol kind of way.
Speaking of pink, I got word that my talk for Akademy has been accepted, with the condition that I must bring my pink whip. Paul Adams has nothing to do with that, I’m sure. However, I need to point out that I got a new whip in Kano last month, made of rolled up goat hide. White, plumed, a little bit more floppy than the nylon-core things we’re used to at Akademy. We’ll see how that turns out as a speaker motivational tool.
Overconfidence goeth before a fall, they say. The software upgrades on the EBN machine are taking a good deal longer than intended. Part of this is some unexpected trips I had to make, but mostly it’s just that 6.2-6.4-7.3-8.0 is a lot of buildworlds (which take surprisingly long!) and reboots, followed by the realization that the jails need upgrading too to work at all (although the ports may continue to work, the base system doesn’t).
All in all it’s just taking a lot longer than intended, but it’s coming back bit-by-bit, rest assured.
Also, I should say that Sun’s ILOM really rocks for remote management, since I needed a great deal of console access to get this done, and I don’t feel like sitting in a cold and drafty datacentre to do so — not with this ongoing hacking cough and headache, no sirree.
The server running the EBN (a Sun X4200 running FreeBSD, soon to be running OpenSolaris in a VM) is getting a bit long in the tooth, software-wise, and it turns out that it can no longer even run all of the software needed for improvements to the EBN. Bertjan has been bugging me to update it, which I can’t until I update the whole machine from 6-CRUFTY to 8-STABLE, so I’m going to plan some downtime for the EBN machine: this weekend, 8 and 9 May 2010, from 12 noon (GMT) on the 8th until midnight (GMT) on the 9th. That should give me enough time to bring the machine down, make additional backups, upgrade the heck out of it (all except hardware, unless someone cares to donate a pair of ECC Registered DDR2-800 DIMMs) and bring it back up. There may be some additional downtime on Monday (but only brief) as some disks are swapped and I correct some historical mistakes in the machine’s hardware configuration regarding disk layout and management.
Sites affected: the EBN itself (www.englishbreakfastnetwork.org) and the KDE4-OpenSolaris package site (solaris.bionicmutton.org) and some personal sites, including Sebas’ vizzzion.org, bionicmutton.org and euroquis.nl.
After a great deal of procrastinating, I’ve rearranged my home office again and restored my FreeBSD machine to its rightful place under the desk. It must have been switched off for several months now, as the update (buildworld for 7- and 8- as well as portupgrade -aPP) is taking forever. There’s a reason for running through this routine: there are a few things I want to test with systems upgrades and jails. In addition, I’d like to document how the EBN is set up, to the level of detail of including package names. This is part of an effort to clean up the EBN code and separate the tools from the website a little; that in turn is in advance of adding some new features to Krazy / EBN in general.
It’s kind of nice to be back on FreeBSD again. I’ll have to take note and compare the FreeBSD KDE packages with the OpenSolaris ones I produce.
PS. There’s nothing like cleaning your desk with a shovel and just dumping everything in a box never to be looked at again. I did retrieve several Bluetooth dongles and SD cards that I’d thought lost.
The English Breakfast Network — which hosts the KDE code checking site, vizzzion.org, an anonsvn mirror for KDE, my irssi-in-screen instance and a bunch of other stuff — is down following a power outage at the university. While the older CodeYard machine came back up with no problems (yay FreeBSD 6.1!? too bad about the 3 years uptime, maybe) the EBN is stuck somewhere. I’ll have to fiddle around to find the ILOM password, I guess, or in worst case go over there and sort it out at the console (which is not a little trip I would look forward to, as it’s hella stormy today). Expect medium to major delays.