Recovering Failed Fault Tolerant Disks
If you have your data on a single drive or a non-fault tolerant volume such as a spanned volume or a striped volume, you expect to lose data if a drive fails. Because disk failure is an unavoidable fact of computer life, I assume that you have a good backup system and a plan for restoring data quickly. If a single disk in a striped volume fails, for example, you must delete the volume from the remaining disks, replace the disk, and rebuild the volume.
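The delete-and-rebuild cycle for a striped volume can also be driven from the command line with the diskpart tool included with Windows Server 2003. The following is a sketch only; the disk and volume numbers are placeholders, so confirm them against LIST DISK and LIST VOLUME output before running anything like this.

```text
REM rebuild-stripe.txt -- run with: diskpart /s rebuild-stripe.txt
REM Volume 2 and disks 1-3 are example numbers; verify with LIST VOLUME
REM and LIST DISK first.
select volume 2
delete volume
select disk 3
convert dynamic
create volume stripe disk=1,2,3
assign letter=S
```

After the volume is recreated, format it and restore the data from your backup set.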
On the other hand, if you don't want to deal with masses of panicked users and their crazed managers who will gather outside the server room like enraged French revolutionaries looking for guillotine fodder, you'll want to put your data on a fault tolerant subsystem of one form or another. This topic covers putting a system back in a stable condition following a disk failure and then recovering the system to normal operation. This includes the following operations:
Replacing a failed disk in a RAID 5 volume
Building a fault tolerant boot floppy
Replacing a failed disk in a mirrored volume
Moving dynamic disks between computers
Replacing a Failed Disk in a RAID 5 Volume
When a disk fails in a RAID 5 volume, you will get a very small and very temporary information balloon from a drive icon in the system tray. The message states A disk that is part of a fault-tolerant volume can no longer be accessed. The message comes from a process called FT Orphan. This is a special process that logically disconnects the drive from the system to eliminate the possibility of data corruption.
The file system on the volume with the failed disk continues to be active. Your only indication of the failure (unless you have installed a third-party utility to alert you of error log entries) is a slight decrease in I/O performance.
When you discover that you have a failed disk, open the Disk Management console. You'll get a display that looks something like that in Figure 14.7. Each disk for the volume shows a Failed Redundancy status and the failed drive shows a red Stop indicator.
Figure 14.7. Disk Management console showing failed disk in RAID 5 volume.
Thanks to the fault tolerant nature of RAID 5, the system remains operational. However, you have now entered a statistical universe where the numbers are not in your favor. The next drive crash will cause data loss. If the drives were all manufactured in the same batch, your time might run out very quickly, depending on the cause of the crash.
Obtain a spare drive that has at least as much capacity as the drive you are replacing. It should be configured for the same SCSI ID to simplify installation, although this is not a requirement.
Use the Disk Management console to check the SCSI ID assigned to the dead drive. Right-click the status block and select PROPERTIES from the flyout menu. The SCSI ID (called the Target ID) and the Logical Unit Number (LUN) are listed. I recommend that you paste a screen print of this window on the server so that you have a reference when you replace the disk. The snarl of SCSI cables inside the machine can lead you astray unless you have a good map. Nothing is quite so embarrassing as replacing the wrong drive.
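If you prefer the command line, diskpart can report information about each disk as a cross-check against the Properties dialog. This is a sketch; the disk number is an example, and the exact fields that DETAIL DISK reports vary by release, so compare its output against the Target ID and LUN shown in the GUI.

```text
C:\> diskpart
DISKPART> list disk
DISKPART> select disk 2
DISKPART> detail disk
DISKPART> exit
```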
After you have the replacement drive in your hands and your users have left for the day, you're ready to get to work. Down the server and replace the drive. Test the drive operability using any IDE or SCSI hardware utilities you like.
Now restart and let the operating system load. The RAID 5 volume will initialize and the file system should mount. Open the Disk Management console. The display should look something like that in Figure 14.8.
Figure 14.8. Disk Management console showing replacement disk with Unknown status and the RAID 5 array with a Failed Redundancy status.
The RAID 5 volume still shows a Failed Redundancy status. A status block for the missing disk opens because its information is contained in the LDM database on the other disks. The replacement disk is brand new, so it does not have a fault tolerant signature or a Master Boot Record. The system lists its status as Unknown. Follow Procedure 14.13.
It sometimes happens that the LDM does not initialize correctly when loading the Disk Management console. The RAID 5 volume may show Healthy even though it is not. If this happens, select ACTION | RESCAN DISKS from the menu, close the Disk Management console, and open it again. You may need to do this a couple of times to get the display to show a Failed Redundancy status.
Procedure 14.13 Replacing a Failed Disk in a RAID 5 Volume
Write a signature to the new disk by following the wizard instructions.
Upgrade the disk to a dynamic disk.
Right-click the RAID 5 volume and select REPAIR VOLUME from the flyout menu.
Select the new disk to use as a replacement for the failed disk. The new disk now becomes part of the RAID 5 volume and the system begins regenerating. This can take a long time, sometimes hours. It will take much longer if users access the drive.
While the regeneration is in progress, right-click the status block for the missing disk and select REMOVE DISK from the flyout menu. (Make absolutely sure you have the correct disk.) The status block disappears and the graphic display rearranges to show the new drive configuration.
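The same replacement sequence can be scripted with diskpart, assuming the Windows Server 2003 version of the tool (see HELP REPAIR in your build to confirm the syntax). The disk and volume numbers below are placeholders.

```text
REM repair-raid5.txt -- run with: diskpart /s repair-raid5.txt
REM Disk 3 is the replacement drive; volume 1 is the RAID 5 volume.
REM Confirm both numbers with LIST DISK and LIST VOLUME first.
select disk 3
convert dynamic
select volume 1
repair disk=3
REM Once regeneration starts, select the missing disk shown by
REM LIST DISK and remove its stale entry with DELETE DISK.
```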
Building a Fault Tolerant Boot Floppy
If you mirror your boot volume (the most popular fault tolerant choice), one of the most important tools you have for recovering from a failure is a fault tolerant boot floppy. The secondary drive is not necessarily bootable, so you need a way to boot the system to the mirrored volume on the secondary drive if the primary drive fails.
Even if the secondary drive is bootable, you or a colleague may have forgotten to modify the Boot.ini file to point at the secondary volume.
A fault tolerant boot floppy also comes in handy if you experience problems with the MBR or boot sector on a server that prevents the machine from booting. Viruses are one common cause for this problem.
A fault tolerant boot floppy does not contain Windows Server 2003 itself. It holds copies of the system files normally found at the root of the hard drive and uses them to bring up the operating system from the mirrored volume.
Procedure 14.14 shows a brief set of steps for creating a fault tolerant boot floppy. Chapter 3, "Adding Hardware," contains information about ARC paths and Boot.ini entries.
Procedure 14.14 Building a Fault Tolerant Boot Floppy
Format a floppy. You cannot use a floppy preformatted under MS-DOS because the boot sector must be written to look for Ntldr rather than Io.sys. A floppy formatted on any NT-based machine, including NT4, will work.
Copy the system files to the root of the A: drive. These files are Ntldr, Ntdetect.com, and Boot.ini. If the server boots from a SCSI controller with its BIOS disabled, also copy the controller driver to the floppy renamed as Ntbootdd.sys.
Use ATTRIB to remove the read-only attribute from Boot.ini.
Edit the Boot.ini file on the floppy to include the ARC path of the boot volume on the second drive. This would look something like this:
multi(0)disk(0)rdisk(1)partition(1)\Windows="Windows Server 2003 Mirrored Secondary Disk"
You might also want to change the timeout setting to -1. This disables the countdown timer.
Restart the computer and boot from the fault tolerant boot floppy.
When the BOOT menu appears, select the second disk. The system will boot from the secondary disk. At this point, the floppy is no longer needed. Remove it from the drive.
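Putting the pieces together, the Boot.ini on the floppy might look something like the following. The ARC paths assume two drives on the first controller with the system on the first partition of each; adjust the rdisk() values to match your own scan order.

```text
[boot loader]
timeout=-1
default=multi(0)disk(0)rdisk(0)partition(1)\Windows
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\Windows="Windows Server 2003 Primary Disk"
multi(0)disk(0)rdisk(1)partition(1)\Windows="Windows Server 2003 Mirrored Secondary Disk"
```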
Replacing a Failed Disk in a Mirrored Volume
If you lose a disk that is part of a mirrored volume, the system responds as it did for a failed disk in a RAID 5 volume. When the system attempts to write to the volume and fails to get a response from the disk, the FT Orphan process disconnects the system from the drive and announces this via a System Tray icon. The FT Orphan process locks the Registry on the failed drive, if possible, so that even if you get the drive back in service, the system will refuse to load the operating system from it.
When you open the Disk Management console following the drive failure, you'll get a display like that in Figure 14.9. The failed drive has a Missing status. The mirrored volume shows a Failed Redundancy status. The secondary drive moves to the top of the drive list. This may be different in your system, depending on your SCSI ID configuration.
Figure 14.9. Disk Management console showing failed primary drive in a mirrored volume.
As you can see by the figure, it can be difficult to determine exactly which drive failed. Keep careful records of the SCSI IDs or IDE controller numbers. As with the RAID 5 failure, you do not need to take immediate corrective action. As many administrators will attest, however, you take a big chance if you wait too long.
Obtain a new disk that is at least the size of the one you're replacing. Configure it for the same SCSI ID or IDE master/slave configuration to simplify recovery. When you're ready to replace the drive, follow Procedure 14.15.
Procedure 14.15 Replacing a Failed Disk in a Mirrored Volume
Down the server and replace the drive.
Restart and boot using a fault tolerant boot floppy. If you replaced the drive using the same SCSI ID, pick the Boot.ini menu item corresponding to the original rdisk() value of the secondary drive. If you used a different SCSI ID, you need to figure out the rdisk() value based on the SCSI scan order. Use your SCSI adapter's configuration utility to see the scan order, then modify the Boot.ini file on the fault tolerant boot floppy accordingly.
After the operating system finishes loading, open the Disk Management console. The new drive does not have a fault tolerant signature or a copy of the LDM database, so an Initialize and Convert Disk Wizard opens to walk you through applying the signature and converting the disk to a dynamic disk.
Once you've completed the Wizard, the Disk Management console is visible. You might be surprised to see that the old disk still appears in the display along with the new disk. This is to remind you of the original configuration. Figure 14.10 shows an example.
Figure 14.10. Disk Management console following disk replacement of a failed mirrored drive prior to regenerating the new disk.
Right-click the mirrored volume and select REMOVE MIRROR from the flyout menu. The Remove Mirror window opens.
Select the missing disk from the list and click Remove Mirror. The system prompts for verification. Click Yes. The remaining disk now shows a Healthy status.
Right-click the status block for the missing disk and select REMOVE DISK from the flyout menu. The disk disappears immediately.
If you have verified that the new primary disk is bootable, remirror the volume to the new drive using the instructions in Procedure 14.8, "Creating Mirrored Volumes." If the new primary disk is not bootable, you'll need to get a good backup and then reinstall the operating system and recover from tape to get a bootable primary disk. After this is done, remirror the volume.
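The remove-and-remirror sequence can also be sketched with diskpart. The m0 identifier for a missing disk and the nokeep flag follow the diskpart conventions for this release as I recall them; confirm with HELP BREAK before relying on this, and treat the disk and volume numbers as placeholders.

```text
REM fix-mirror.txt -- run with: diskpart /s fix-mirror.txt
REM Volume 0 is the mirrored volume; m0 is the missing disk as
REM shown by LIST DISK; disk 1 is the replacement drive.
select volume 0
break disk=m0 nokeep
select disk m0
delete disk
select disk 1
convert dynamic
select volume 0
add disk=1
```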
Moving Dynamic Disks Between Computers
It sometimes happens that a server or workstation goes to that big byte bucket in the sky. (This usually happens about a half hour before your plane is due to leave on that vacation you've been planning for the past year.) If the problem is not with the storage system, one quick recovery method that might get you to the plane on time is to move the data disks to a new machine.
Moving disks between machines can cause other problems. Windows Server 2003 has a lot of information in the Registry that is hardware-dependent. If you move the boot disk (the disk with the operating system files) to a different platform, expect to see lots of Plug and Play (PnP) activity when you start the machine. You may need to supply hardware drivers. You may get blue screen stop errors if the memory management subsystem cannot interpret the chipset or memory configuration. You will certainly get a failure if the new server requires a different Hardware Abstraction Layer (HAL).
Moving data disks between machines is a much simpler matter. If the disk is a basic disk, the system sees the new disk, reads the partition table, and assigns the next available drive letters to any partitions it finds.
Moving dynamic disks, however, especially dynamic disks that contain volumes that span disks, is a bit more complicated. You'll need to merge the LDM database entries for the disks into the LDM database of the machine where you install them.
The operating system identifies disks with an outside disk group name as foreign disks. The purpose of the steps shown in Procedure 14.16 is to import the LDM information on those disks so that the disk group name can be changed and the system will accept the new entries.
Procedure 14.16 Moving a Fault Tolerant Volume to Another Computer
Down the two servers and transfer the drives. You might have to make room for new drives. You might need to rework the terminators and assign new SCSI IDs and so forth. The objective is to keep the drives together, if possible, although this is not absolutely necessary.
After the drives have been installed, test them to make sure that they are connected and that you know the order of their installation. The LDM permits the sequence of disks to be changed, but you make your job more difficult if the Disk Management console display has the foreign disks distributed willy-nilly.
Boot the operating system and make sure that the system loads. The data on the new drives will not be available until you import the disks. Also, any share points you have for directories on the drives will need to be recreated.
Open the Disk Management console. After initialization, the graphical display looks something like that in Figure 14.11. The disks from the other computer are flagged as Foreign.
Figure 14.11. Disk Management console showing a dynamic disk moved from another machine and introduced into a new machine as a foreign disk.
Right-click the status block of the foreign disk or disks and select IMPORT FOREIGN DISKS from the flyout menu. The Import Foreign Disks window opens showing the name of the original server from which the disks came. This information comes from the LDM database at the end of the disks.
Click OK. The system analyzes the disks, and then the Verify Volumes on Foreign Disks window opens.
The system may report the Data Condition as Data Incomplete. This indicates that you did not move all the disks in the disk group. This is expected if the boot/system disk in the original server was a dynamic disk, or if there were other dynamic disks in the original server that you intentionally didn't move. Make sure that you have all the disks that participate in the volume you're moving. You are permitted to move a subset of a disk group, but you'll need to do a few more steps.
Click OK. The system warns you that it might not be able to recover data if you had a Data Incomplete status in the preceding window.
Click OK to acknowledge the warning. The system imports the disks and then attempts to build the volumes and initialize the file system. The status may go to Failed on the disks. Don't worry (at least not yet). This is normal if you did not include all disks in the disk group in the transfer.
Right-click the status block for any of the new disks and select REACTIVATE DISK from the flyout menu. The system will think a long time and you'll hear lots of disk activity. If the reactivation is successful, a drive letter for the volume appears and the status changes to Regenerating. This regeneration takes a long time and consumes many CPU cycles. The file system is active during this time and you can access files, but this is not recommended because it slows down regeneration. After regeneration has completed, the new volumes show a status of Healthy.
There is a chance that any existing dynamic disks in the new machine will show an Error status after the import in the status block of the disk. This is because their copy of the LDM database has values that they cannot interpret. If this happens, right-click the status block for the disk and select REACTIVATE DISK from the flyout menu. This should immediately correct the problem.
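diskpart can perform the same import, assuming its Windows Server 2003 command set. Select any one of the foreign disks and issue IMPORT; the rest of the disk group comes along with it. The disk number is an example.

```text
C:\> diskpart
DISKPART> list disk
DISKPART> select disk 3
DISKPART> import
DISKPART> rescan
DISKPART> exit
```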