Long time no post! Last Sunday I spotted the following email in my junk mail folder:
Date: Sun, 21 May 2006 19:08:39 +0200
From: root <root@metis.lenznet>
To: <lenz@localhost.metis.lenznet>
Subject: SMART error (OfflineUncorrectableSector) detected on host: metis
This email was generated by the smartd daemon running on:
host name: metis
DNS domain: lenznet
NIS domain:
The following warning/error was logged by the smartd daemon:
Device: /dev/hda, 1 Offline uncorrectable sectors
For details see host's SYSLOG (default: /var/log/messages).
You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.
That did not sound too good. In addition, the system had failed to resume from a suspend to disk earlier that day: the kernel hit disk read errors while trying to load the suspended image from the swap partition. Fortunately a fresh reboot still worked, and I ran a more thorough analysis of the disk drive using smartctl -t long /dev/hda. Various Open Source tools from a SUSE Linux 10.1 rescue system (which boots off the first installation CD) helped me to back up and restore my data without losing anything (except for some time, of course).
The long self-test took about 80 minutes to finish. Inspecting the SMART attributes and error log afterwards with smartctl -a /dev/hda revealed the following problems:
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
[...]
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
[...]
Error 1 occurred at disk power-on lifetime: 9812 hours (408 days + 20 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 04 4e d0 a4 e2 Error: UNC 4 sectors at LBA = 0x02a4d04e = 44355662
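The two counters and the LBA in the output above can also be pulled out mechanically. The following is only a minimal sketch: the function names smart_bad_sectors and lba_to_offset are made up for this example, the raw value is assumed to be the last field of each attribute line (true for the lines quoted here, not necessarily for every attribute or drive), and the sector size is assumed to be 512 bytes.

```shell
#!/bin/sh
# Hypothetical helpers, not part of smartmontools.

# Read attribute lines (as printed by "smartctl -A") from stdin and print
# "<name> <raw value>" for every bad-sector counter that is above zero.
# Assumption: the raw value is the last whitespace-separated field.
smart_bad_sectors() {
    awk '($1 == 5 || $1 == 197 || $1 == 198) && ($NF + 0 > 0) { print $2, $NF }'
}

# Convert an LBA (hex or decimal) into "<sector number> <byte offset>".
# Assumption: 512-byte sectors.
lba_to_offset() {
    lba=$(( $1 ))
    echo "$lba $(( lba * 512 ))"
}

# The two attribute lines quoted in this post, used as sample input.
sample='197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1'

printf '%s\n' "$sample" | smart_bad_sectors
lba_to_offset 0x02a4d04e
```

On a live system the first function would be fed from the real tool, for example smartctl -A /dev/hda | smart_bad_sectors, run as root.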
In other words, the disk had developed at least one sector that it could no longer read or remap on its own, and I took that as a sign that it was about to fail. Time to replace the drive! Too bad that I was on the road, visiting my parents, and without a recent backup: I had left at home both of the external USB hard disk drives that I regularly use to back up my home directory and other data directories.
Fortunately, I had convinced my dad to buy an external 300GB USB disk for himself just the day before! As the file systems themselves did not yet show any signs of corruption, I simply used rsync to create backup copies of all relevant directories. In addition, I used dd_rescue to create a dump of the entire disk. With the exception of one single block, it produced a full image of the disk drive, including both the Windows XP partition and the Linux LVM partition, which contained the SUSE 10.0 and 10.1 root file systems, my encrypted home file system and some data volumes.
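The order of those two steps (file-level copy first, block-level image second) can be sketched roughly as follows. This is not the exact command line I typed: the function name backup_plan, the mount point /mnt/usb, the source directory /home/ and the image name hda.img are all assumptions for illustration. By default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only. RUN defaults to "echo", so the commands are printed, not run.
# Call the script with RUN= (empty) to actually execute them.
RUN=${RUN-echo}

# backup_plan <failing disk> <mount point of the backup disk>
backup_plan() {
    # 1. File-level copy while the file systems are still readable.
    #    /home/ is a placeholder for "all relevant directories".
    $RUN rsync -a /home/ "$2/home/"
    # 2. Block-level image of the whole disk. dd_rescue keeps going
    #    after read errors instead of aborting at the first one.
    $RUN dd_rescue "$1" "$2/hda.img"
}

# /mnt/usb is a hypothetical mount point for the external USB disk.
backup_plan /dev/hda /mnt/usb
```

Doing the rsync pass first means the most important data is already safe before the long, disk-stressing full read begins.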
While making backups and running more thorough SMART tests with the Linux smartmontools (all from the SUSE Linux 10.1 rescue CD), I called the IBM hotline to tell them about my problem. Unfortunately they did not really trust my words and did not accept the output of smartctl as evidence that the drive was actually about to fail (I probably should have lied and told them that the disk was completely dead already). The hotline person insisted that I install the Hitachi Drive Fitness Test (even though my hard disk drive was manufactured by Fujitsu) and analyse the drive with this tool. This would then give me an error code that the hotline person needed in order to arrange a replacement.
The software is available in various formats: a Windows EXE file that creates the boot floppies, a plain floppy disk image and an ISO CD image. I was not too keen on burning a CD just to boot the program, and I did not have a floppy drive handy either. Again, Linux came to the rescue! Fortunately the drive was not completely dead yet and I was still able to boot from it, so I looked for a way to make the GRUB boot manager boot a floppy disk image stored on disk. My first attempts to simply pass the disk image as a kernel or chainloader option unfortunately did not work: the boot always failed. Some research then led me to the memdisk tool, which is part of the syslinux package.
From the README:
MEMDISK is meant to allow booting legacy operating systems via PXE,
and as a workaround for BIOSes where ISOLINUX image support doesn't
work.
MEMDISK simulates a disk by claiming a chunk of high memory for the
disk and a (very small - 2K typical) chunk of low (DOS) memory for the
driver itself, then hooking the INT 13h (disk driver) and INT 15h
(memory query) BIOS interrupts.
So I could have probably used ISOLINUX as well, but this is what worked for me. I copied the memdisk binary as well as the floppy disk image containing the Drive Fitness Test into the /boot directory (which is located on the /dev/hda3 partition in my case) and added the following to /boot/grub/menu.lst:
title Hitachi DFT
kernel (hd0,2)/memdisk
initrd (hd0,2)/dft32_v406_b00_install.img
This added a new entry to the GRUB boot menu which booted the drive fitness test from the floppy image just fine!
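A note on the (hd0,2) notation in the entry above: GRUB legacy counts both disks and partitions from zero, so /dev/hda3 becomes (hd0,2), and because /boot is a partition of its own here, the files sit at the top level of that partition. The small sketch below generates the same stanza from a Linux device name. The function names are made up for this example, only hdX-style names are handled, and it assumes that the BIOS disk order matches the Linux device letters, which is not guaranteed on every machine.

```shell
#!/bin/sh
# Hypothetical helpers for building a GRUB legacy memdisk entry.

# Translate a Linux IDE partition name (hda3, /dev/hdb1, ...) into GRUB
# legacy notation. Assumption: hda is the first BIOS disk, hdb the second.
grub_legacy_root() {
    dev=${1#/dev/}            # hda3
    part=${dev##*[a-z]}       # 3
    disk=${dev%"$part"}       # hda
    letter=${disk#hd}         # a
    idx=$(awk -v l="$letter" 'BEGIN { print index("abcdefghijklmnopqrstuvwxyz", l) - 1 }')
    echo "(hd$idx,$(( part - 1 )))"
}

# memdisk_entry <title> <boot partition> <floppy image file name>
# Prints a menu.lst stanza; it does not modify any file.
memdisk_entry() {
    root=$(grub_legacy_root "$2")
    printf 'title %s\nkernel %s/memdisk\ninitrd %s/%s\n' "$1" "$root" "$root" "$3"
}

memdisk_entry "Hitachi DFT" /dev/hda3 dft32_v406_b00_install.img
```

The output can be reviewed and then appended to /boot/grub/menu.lst by hand, which is safer than letting a script edit the boot configuration of a machine with a failing disk.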
The test then churned along for a while, finally resulting in the very helpful error code 0x72. I called IBM again, informing them of the result of this test. This finally convinced them to send me a replacement disk drive via UPS, which arrived the other day. It was not a new drive - it was labelled as "used component" and SMART reported some 150 hours of operation. But at least it did not indicate any suspicious errors and I was able to restore the entire hard disk just by dumping the previously saved disk image using the dd command (again from the SUSE Linux 10.1 rescue system) onto the new drive. To be on the safe side, I compared the file systems with the copies I created with rsync before, which did not reveal any differences.
Once I was confident that I had not lost any data, I re-inserted the defective drive and wiped it with dd if=/dev/zero of=/dev/hda bs=8k before packing it up to return it to IBM.
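For reference, the restore and the comparison described above boil down to something like the sketch below. The function names are invented for this example, and the paths in the usage comment are placeholders rather than the ones I used. Two caveats: reading back right after writing may be served from the kernel's cache rather than from the platters, and diff -r compares file contents only, not ownership or permissions. cmp -n is assumed to be available, as it is in GNU diffutils.

```shell
#!/bin/sh
# Hypothetical helpers; double-check the target before running dd for real.

# restore_and_verify <image file> <target device or file>
# Writes the image to the target, then compares the first <image size>
# bytes of the target against the image.
restore_and_verify() {
    img=$1
    target=$2
    dd if="$img" of="$target" bs=8k || return 1
    sync
    size=$(( $(wc -c < "$img") ))
    cmp -n "$size" "$img" "$target" && echo "verified $size bytes"
}

# compare_trees <restored directory> <rsync backup directory>
# Exits non-zero and lists the differing files if the contents differ.
compare_trees() {
    diff -r -q "$1" "$2"
}

# Example usage (placeholder paths, not executed here):
#   restore_and_verify /mnt/usb/hda.img /dev/hda
#   compare_trees /mnt/restored/home /mnt/usb/home
```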
Lessons learned from this experience:
- Back up early and often, and keep a recent backup within reach
- Monitor the health of your disk drives by using the smartmontools
- Always have some kind of rescue boot system handy, in case the disk fails to boot up completely (The SUSE Linux 10.1 rescue system is very well rounded and hosts a nice collection of useful tools)
- External USB disk drives are small, easy to use, cheap and fast for performing backups and restores
- dd_rescue is helpful for creating disk images from defective drives
- memdisk from the syslinux toolkit is useful for booting floppy disk images directly from hard disk
- Even though IBM promotes Linux and OSS, they don't really live up to this in every sector of their enterprise yet. They should have some more faith in their Linux users and take it seriously when one of them reports a problem. I wasted a day just to get the result from the drive fitness test that was required to prove that the disk was indeed faulty.