
July 24, 2003

Posting shortage

Most of you may not have noticed that the posting has been a bit light in the last couple of days. I have been snowed under at work to the extent that a horrible day yesterday has given me some time-in-lieu (or is that time in the loo?). So I intend to catch up on a number of things today.

For the technically minded, the problem yesterday was a Sun Ultra 10 (about the smallest Sparc-based box you could get at the time) which has two internal IDE drives, and one failed - the one holding the root partition (non-UNIX geeks, feel free to click the next button and read about the ice fishing blonde). Well, before swapping the drive out, I wanted to recover the rest of the data - after all this is an "inessential" system - no DR facilities, no cold standby, no spare parts - system owner was happy to have a 4 to 5 day outage ;-). Well, that was until the system was down and then they wanted it back yesterday (or rather the day before yesterday now).
No worries, I hear the readers say, whack on an external drive and dd the data. First problem is that the external drive is SCSI and the Ultra 10 doesn't have SCSI - that's right, must be the only Sparc box ever created without it, but it doesn't have it. Second problem is that the new disk will have a different geometry (heads/cylinders/etc) and so dd might not be the best tool for the copy. Thirdly (and most importantly) the failure (possibly only one block) is in a low-numbered area, apparently in the middle of the primary inode table, so we have to be really careful here (fsck wants to unlink half the filesystem).
Rummage, rummage, rummage. I am sure me mate Mike (who is a Kiwi and thus I cannot post jokes like this) has a box full of old PCI cards - let's see, token ring, qe4, token ring, token ring, some sort of fibre card (probably ATM), more token ring (they liked token ring at this place), ahh here we go, a dual network/scsi card - nope, that's sbus. Hmmm.... here we go, this one is pci and looks like a network/scsi combo. Find the Swiss army knife (all the way from Switzerland - it is white instead of red), open up the box (which is on a shelf at head level), whack in the card, screw it all up, plug it in and pray. Oh yeah. Need some disks. Got a couple in the cupboard, never know when you might need them. Let's see, 2 x 9Gb should do and there is a six pack out the back (SCSI enclosure that holds 6 disks). Turns out the enclosure has 3 x 4 Gb disks already in it. Cool. Plug em all in, attach the SCSI cable, power it up, got a green light on the back (SCSI termination light), must be OK.
Now the computer is in the server room and it is headless, plugged into a terminal server. The terminal server can be accessed via reverse telnet or ssh and that gives you a console session. Only problem is that it sometimes doesn't respond to a control-break signal (to drop the box to the boot prom). You have to time the power on/off carefully and then run like the blazes to get back to your desk before the system actually boots. Done that, run reconfigure (to detect the new hardware) and it can't see the disks. Panic. Check with prtconf/prtdiag - we can see the second ethernet port (on the new card) and a scsi port with "(no driver loaded)". This could mean that no driver was required because there was no device out there or, more often, that the driver was not available. More panic.
Now let's be rational. This box has no scsi devices and it is running ROS (Really Old Solaris - 2.6), so maybe there is no SD/ST driver in 2.6 when you do a clean install on an Ultra 10 (these days we jumpstart boxes so we know what drivers are installed). Hmmm. I can find more Ultra 10's but they are running Solaris 8. Wonder what module holds the driver, check in /var/sadm/install/contents for sd.conf and find it is in SUNWcar - bugger, that is a core module and it is always installed. Mike (always good with the old suggestions, which is why I can't tell this joke again) figures we might try a terminator on the SCSI chain. Search, search, search. Not on anyone's desk - of course I am the only one here. One guy is sick, two in Wollongong, one in Homebush and just me here. Not in the compactus. Not in the cupboard. Get some lunch ready..... Wait a minute. Why would I need a terminator with active termination on? Let's go check the six pack. Back in the computer room (through all these card coded doors) and ... hmmm ... why are there no lights on the front of the six pack? Check the back, yep got lights (scsi termination) but nothing at the front (disk power). Ahhhh. Dud power cable. The scsi termination gets power from the SCSI bus but the disks need 240V. Chuck power cable, install new one. Five lights on the front (2 x 9 Gb disks, 3 x 4 Gb disks). Power cycle and run ...
Now we are cooking with gas. Format, partition, newfs.... mount as /new. Copy the other partitions using anything at all (tar will work). Now what about root? Can't use tar because of the devices. Have to use cpio but can't remember the syntax (special levy placed on brothels). Find another server, read man pages. Okey doke. Run cpio. Wait. Eat cold lunch. Wait some more. Fend off repeated requests for "are we done yet" (sort of like a two year old in the car - "are we there yet dad?" "No son. We haven't pulled out of the driveway yet."). The people asking are being pressured by the system owner who is (hopefully) regretting not spending the money on DR. This system is scheduled for an upgrade in the next 6 months or so and this might help with the planning process.
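For anyone else who can never remember the cpio syntax, here is a rough sketch of the usual pass-mode pipeline, wrapped in a little Python helper that only builds the command string (it does not run anything). This is the commonly documented form, not necessarily the exact command I typed on the day; the helper name cpio_copy_command is made up for illustration, and it is worth checking the find and cpio flags against the man pages on your own release.

```python
import shlex


def cpio_copy_command(src: str, dest: str) -> str:
    """Build (but do not run) a shell pipeline that copies one filesystem.

    Assumption: the classic find | cpio pass-mode idiom is what is wanted.
      find -xdev   stay on the source filesystem (do not cross mount points)
      find -depth  list a directory's contents before the directory itself
      cpio -p      pass mode: copy files straight into the destination tree
      cpio -d      create leading directories as needed
      cpio -m      keep the original modification times
    """
    return (
        f"cd {shlex.quote(src)} && "
        f"find . -xdev -depth -print | cpio -pdm {shlex.quote(dest)}"
    )


if __name__ == "__main__":
    # Copy the root filesystem onto the new disk mounted at /new.
    print(cpio_copy_command("/", "/new"))
```

Staying on one filesystem matters here: /new is mounted under the very root being copied, so without that restriction the copy would try to descend into its own destination.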
Copy done. Install new boot blocks. Run around to the computer room for a power cycle. Get the prom to boot from the new disk. What is the path for the wretched thing? Can't work it out. Boot back onto the old (dodgy) disk. Check /dev/dsk/c1t1d0s0 to see where it points to. Cut and paste the full name. Back in the computer room for another power cycle. Boot correctly off the new disk. Errors all over the shop. Some problem with /tmp (move to /tmp.old, make a new /tmp). Some problems with /proc (doesn't exist so create it). Halt the box (worked without a power cycle). Fix the devalias for default boot device and reboot. Yeehaa! All systems are go. Still need to restore /usr/dt so trot round to the storage/backup group and get them to do that.
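The "see where it points to" step is just following a symlink: the entries under /dev/dsk are links into the /devices tree, and the part after /devices is the physical path the prom cares about. A minimal sketch of that lookup, assuming the link target contains a /devices/ component; the function name and the example target in the comment are hypothetical, not the real path from this box, and the driver name in the last component may still need adjusting before the prom will accept it.

```python
import os


def physical_device_path(dev_link: str) -> str:
    """Return the part of a /dev/dsk symlink target that follows /devices.

    Assumption: the link target looks something like
    ../../devices/pci@1f,0/pci@1/scsi@1/sd@1,0:a (a made-up example).
    """
    target = os.readlink(dev_link)
    marker = "/devices/"
    idx = target.find(marker)
    if idx == -1:
        raise ValueError(f"{dev_link} does not point into a devices tree: {target}")
    # Keep the slash that starts the physical path, drop the "/devices" prefix.
    return target[idx + len("/devices"):]


if __name__ == "__main__":
    link = "/dev/dsk/c1t1d0s0"
    if os.path.islink(link):
        print(physical_device_path(link))
```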
Finish cold lunch. And cold coffee. Some problems. Apparently the application won't start. Something is missing. Probably a symbolic link. They can't see the old disk because it didn't mount (I arranged to mount it as /old). Check vfstab and realise I forgot to change the 'no' to 'yes' for 'mount this' on the old root and var filesystems. Fixed. Missing files can be recovered one at a time or we could try just restoring over the root filesystem. Storage guys give it a go and the whole thing blows up when it tries to replace libc.so .... System is now back where it was around 11 am.
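The vfstab slip is easy to make because "mount at boot" is just the sixth of seven whitespace-separated fields (device to mount, device to fsck, mount point, FS type, fsck pass, mount at boot, mount options). A small sketch of flipping that field for chosen mount points; the function name is invented for illustration and the device names in the usage example are placeholders, not the ones from this box.

```python
from typing import Iterable


def set_mount_at_boot(vfstab_text: str, mount_points: Iterable[str],
                      value: str = "yes") -> str:
    """Set the mount-at-boot field for the given mount points.

    Comment lines and lines without exactly seven fields are left untouched.
    """
    wanted = set(mount_points)
    out = []
    for line in vfstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if not is_comment and len(fields) == 7 and fields[2] in wanted:
            fields[5] = value  # sixth field: mount at boot
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Placeholder entries: the old root mounted as /old, the new root as /.
    sample = (
        "#device\tdevice\tmount\tFS\tfsck\tmount\tmount\n"
        "/dev/dsk/c0t0d0s0\t/dev/rdsk/c0t0d0s0\t/old\tufs\t2\tno\t-\n"
        "/dev/dsk/c1t1d0s0\t/dev/rdsk/c1t1d0s0\t/\tufs\t1\tno\t-\n"
    )
    print(set_mount_at_boot(sample, ["/old"]), end="")
```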
Start all over again. Finally about 7 pm it appears to be up and working. It is now back to where it was at the YeeHaa point above PLUS some additional files have been restored, including at least one of the missing items (a link from the root directory into the application directory). My migraine, which has been getting worse all day, has reached the point where I can no longer read the screen (or post blog items) so I am going home.

[PS: Apparently the application is now working - as of Thursday morning and I have a brief "thank you" email from my boss].

Posted by Ozguru at July 24, 2003 09:07 AM

