How to Repair and Re-build a Failed or Damaged RAID-5 Array

As the system sees the RAID controller rather than the individual disks behind it, a little digging behind the scenes revealed a problem with the /twl0 RAID device. One of the disks had failed and, although a backup disk had been swapped in, the array was showing a ‘degraded’ status. Even worse, the server didn’t have the tw_cli software installed to properly control and interrogate the drives.
Here is where a cascade of problems started. The server was not configured to communicate (or tunnel) through our internet proxy, so I had to manually search for the tw_cli RPM on Google, download it, and install it with rpm -ivh.
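For reference, once you have copied the package across from a machine that can reach the internet, the install itself is a one-liner. The RPM file name below is only a placeholder, so substitute whatever package matches your distribution and architecture:

# rpm -ivh tw_cli-linux-x86_64.rpm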
Once the software was installed I could disconnect and then pull the failed and degraded disks for swap-out.
I began by searching for and downloading the manual for the tw_cli software. Then I did a few internet searches and assembled all the commands I needed to perform the drive swap-out.
The process turned out to be a little more complicated than I first imagined, but that makes it a better demonstration and tutorial.
To get information on the RAID units managed by the controller, I ran the following command (as root):
tw_cli /c0 show
(As a side note, the RPM package installs tw_cli as /sbin/tw_cli, but /sbin is in the root user’s path, so I can call the application directly.)
The 3ware card is the first RAID controller on the system (c0).
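(If you are not sure which controller number your card has been given, running the show command with no controller argument lists every controller the driver can see; the exact output varies with the CLI version.)

# tw_cli show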

The /c0 show command yielded the following output:

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       931.312   RiW    ON     
u1    RAID-5    DEGRADED       -       -       64K     11175.8   RiW    ON     
u2    RAID-5    INOPERABLE     -       -       64K     11175.8   RiW    ON     

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SAS   0   -            SEAGATE ST31000424SS
p1    OK             u0   931.51 GB SAS   1   -            SEAGATE ST31000424SS
p2    OK             u1   2.73 TB   SATA  2   -            Hitachi HDS723030AL 
p3    OK             u1   2.73 TB   SATA  3   -            Hitachi HDS723030AL 
p4    SMART-FAILURE  u1   2.73 TB   SATA  4   -            Hitachi HDS723030AL 
p5    SMART-FAILURE  u1   2.73 TB   SATA  5   -            Hitachi HDS723030AL 
p6    OK             u1   2.73 TB   SATA  6   -            Hitachi HDS723030AL 
p7    OK             u1   2.73 TB   SATA  7   -            Hitachi HDS723030AL 
As you can see, two drives, p4 and p5, are reporting problems. The large storage array is unit u1 (roughly 11 TB). Normally both failing drives would still belong to this unit. However, when the 3ware controller swapped out the failed disk p5, it moved it into its own unit, u2, and this is where it gets a little complicated.
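If you want more detail on the degraded unit itself, the unit can also be queried directly; the exact columns shown depend on the firmware and CLI version:

# tw_cli /c0/u1 show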

Removing Damaged Disks from the RAID Array

Now, to be able to physically pull the disks from the system, they first need to be taken out of the RAID configuration. To do that, remove each disk with the tw_cli software by running the following command:

# tw_cli /c0/p4 remove
Exporting port /c0/p4 …
Done

On the p4 (degraded) disk, this worked a charm. However, when I tried to do the same on the p5 disk, by running:

tw_cli /c0/p5 remove 

I got the following error:

Remove operation invalid for unit 

Then I remembered that the controller had moved the failed disk into its own array, which I initially took to be a second controller, /c2. However, running:

tw_cli /c2/p5 remove 

also gave an error, since there is no controller c2: the disk had actually been moved into its own unit, u2, on controller c0.

A little more digging on the internet revealed that in later versions of the 3ware firmware a failed disk can be assigned its own RAID-5 unit, particularly if a disk failure led to a shutdown (as in my case). In these instances, the new unit first needs to be deleted, so that the disk can be assigned the role of ‘spare’ in the original RAID. Only then can the disk be removed.
Disk p5 is now the only member of unit u2. To free it, the whole unit u2 has to be deleted. To do this, issue the following command (remember, you are deleting the extra RAID unit here, not the disk):
# tw_cli maint deleteunit c0 u2
Deleting unit c0/u2 … Done.

Now, re-running the show command:

# tw_cli /c0 show 

reveals the following about the p5 drive:

p5    OK             -    2.73 TB   SATA  5   -            Hitachi HDS723030AL 

Now it can be added back to controller c0 as a spare for the u1 array:

# tw_cli /c0 add type=spare disk=5
Creating new unit on controller /c0 … Done.
The new unit is /c0/u2

OK, it looks like not much has happened: we have just deleted unit u2 and then added the disk straight back in as a spare (which itself shows up as a new unit, u2). However, whereas a complete unit like u2 can only be deleted, not removed, a ‘spare’ drive can be removed at any time.

So, if you now run

tw_cli /c0/p5 remove 

you get the message:

Exporting port /c0/p5 … Done 

The drive has now been removed cleanly. Both p4 (from the degraded unit) and p5 (from the inoperable unit) can now be taken out of the system. Simply pull out the drive bays, remove the screws holding the old drives in place, fit the new drives, and slot the bays back into the machine.

Re-building the RAID with the New Drives

Now issue the following command to make sure the new drives have been recognized:

# tw_cli /c0 show

In the VPort list, both of the new drives should now show up, but they are not yet part of the array. To add the drives to the array, put them in as hot spares:

# tw_cli /c0 add type=spare disk=4:5

(This adds both drives at the same time; if you are only replacing one drive, specify just that drive’s port number.)
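For example, if only the drive on port 4 had been replaced, the command would simply be:

# tw_cli /c0 add type=spare disk=4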

The controller will now add both drives as spares; it will then swap one of them into the degraded array and start rebuilding it.
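If the rebuild does not kick off on its own, the tw_cli guide documents a command for starting it manually; it takes roughly the following form, but check the manual for your firmware version before using it:

# tw_cli /c0/u1 start rebuild disk=4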

If you issue a show command, you will see output similar to the following:

# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-1    OK             -       -       -       931.312   RiW    ON     
u1    RAID-5    DEGRADED       -       -       64K     11175.8   RiW    ON     

VPort Status         Unit Size      Type  Phy Encl-Slot    Model
------------------------------------------------------------------------------
p0    OK             u0   931.51 GB SAS   0   -            SEAGATE ST31000424SS
p1    OK             u0   931.51 GB SAS   1   -            SEAGATE ST31000424SS
p2    OK             u1   2.73 TB   SATA  2   -            Hitachi HDS723030AL 
p3    OK             u1   2.73 TB   SATA  3   -            Hitachi HDS723030AL 
p4    OK             u1   2.73 TB   SATA  4   -            Hitachi HDS723030AL 
p5    Spare          u1   2.73 TB   SATA  5   -            Hitachi HDS723030AL 
p6    OK             u1   2.73 TB   SATA  6   -            Hitachi HDS723030AL 
p7    OK             u1   2.73 TB   SATA  7   -            Hitachi HDS723030AL 

If you want to monitor the rebuild process, issue the following command:

watch /sbin/tw_cli /c0 show 

This will re-run the show command every 2 seconds and update the output continually. Exit by pressing [Ctrl]-[C].
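If a two-second refresh is overkill, watch can also be told to poll less often and to highlight what changed between updates, for example:

watch -n 10 -d /sbin/tw_cli /c0 show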

Whilst the array is re-building (and this can take a while for a large array), do not use the server for anything else. Indeed, it’s probably best to do this either over the weekend or at the end of the working day, so the server has uninterrupted time to re-build the array.

If you are paranoid, restart the server in single-user mode before attempting any of the above. If you are using LILO as the boot loader, press [Ctrl]-[x] at the LILO boot prompt to exit the graphical screen. At the boot: prompt, type:

linux single 

If you are using GRUB as the boot loader, use the following steps to boot into single-user mode:

  1. If you have a GRUB password configured, type p and enter the password.
  2. Select Red Hat Enterprise Linux with the version of the kernel that you wish to boot and type a to append the line.
  3. Go to the end of the line and type single as a separate word (press the [Spacebar] and then type single). Press [Enter] to exit edit mode.
  4. Back at the GRUB screen, type b to boot into single-user mode.
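Alternatively, if the server is already up and you would rather not reboot, you can usually drop a running SysV-style system into single-user mode directly (this stops most services, so warn any logged-in users first):

# init 1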


Please note that all the commands above were issued as the root user (alternatively, you can run them with sudo from any account with sudo privileges).