Replacing a dead drive on an Adaptec RAID array

We monitor our nodes closely, especially the RAID arrays, to keep every VPS host machine in top shape for our clients. Monitoring scripts regularly email each array's health status to our NOC team. When we detect a problem, our technicians proactively fix it by replacing the affected hard drive. We normally use hot-swap bays, so a faulty drive can be swapped for a new one in the background with as little service interruption as possible for our clients.

In this post I will show you how easy it is to replace a faulty hard drive on an Adaptec hardware RAID controller (under Linux) using the arcconf utility. We use RAID10 on most of our VPS host nodes, with a minimum of 4 drives. Such an array can survive up to 2 simultaneous drive failures, provided the failed drives belong to different RAID1 mirror pairs; losing both drives of the same mirror pair takes the whole array down.
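
To make the failure tolerance concrete, here is a small sketch that enumerates every possible two-drive failure in a 4-drive RAID10. The drive numbering (drives 0 and 1 form mirror group 0, drives 2 and 3 form mirror group 1) and the function names are assumptions made for illustration only; they are not read from the controller.

```shell
#!/bin/sh
# Illustrative numbering: drives 0,1 = mirror group 0; drives 2,3 = mirror group 1.
mirror_group() { echo $(( $1 / 2 )); }

# The array survives a double failure only if the two failed drives
# sit in different mirror groups.
survives() {
    [ "$(mirror_group "$1")" -ne "$(mirror_group "$2")" ]
}

# Walk through every unordered pair of failed drives.
for a in 0 1 2 3; do
    for b in 0 1 2 3; do
        [ "$a" -lt "$b" ] || continue
        if survives "$a" "$b"; then
            result="array survives"
        else
            result="array fails"
        fi
        echo "drives $a+$b failed: $result"
    done
done
```

Of the six possible pairs, four leave the array running and two (both drives of the same mirror) do not, which is why a degraded array should be repaired promptly.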

To verify the RAID status:

# ./arcconf GETCONFIG 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Optimal
Channel description : SAS/SATA
Controller Model : Adaptec 2405
Controller Serial Number : 2D0811E8B58
Physical Slot : 6
Temperature : 47 C/ 116 F (Normal)
Installed memory : 128 MB
Copyback : Disabled
Background consistency check : Disabled
Automatic Failover : Enabled
Global task priority : High
Performance Mode : Default/Dynamic
Stayawake period : Disabled
Spinup limit internal drives : 0
Spinup limit external drives : 0
Defunct disk drive count : 0
Logical devices/Failed/Degraded : 1/0/1
SSDs assigned to MaxCache pool : 0
Maximum SSDs allowed in MaxCache pool : 8
MaxCache Read Cache Pool Size : 0.000 GB
MaxCache flush and fetch rate : 0
MaxCache Read, Write Balance Factor : 3,1
NCQ status : Enabled
Statistics data collection mode : Enabled
----------------------------------------------------------------------

Logical device information

----------------------------------------------------------------------
Logical device number 0
Logical device name : RAID10
RAID level : 10
Status of logical device : Degraded
Size : 1906678 MB
Stripe-unit size : 256 KB
Read-cache mode : Enabled
MaxCache preferred read cache setting : Disabled
MaxCache read cache setting : Disabled
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : Yes
Failed stripes : No
Power settings : Disabled
--------------------------------------------------------
Logical device segment information
--------------------------------------------------------
Group 0, Segment 0 : Present (Controller:1,Enclosure:0,Slot:1) WD-WMATV6849433
Group 0, Segment 1 : Missing
Group 1, Segment 0 : Present (Controller:1,Enclosure:0,Slot:2) WD-WMATV6823225
Group 1, Segment 1 : Present (Controller:1,Enclosure:0,Slot:3) WD-WMATV6827620

As you can see above, the RAID array is degraded because one drive is suddenly marked as missing. Normally all you need to do is identify the faulty hard drive in the drive bays and replace it with one from our spare pool. One way to make sure a remote hands technician won't mix things up and pull out the wrong drive (trust me, this does happen) is to issue the command below; the drive's LED will immediately start blinking, giving a visual reference for which drive to pull out.

Syntax

ARCCONF IDENTIFY <Controller#> DEVICE <Channel#> <ID#>

So in our case:

ARCCONF IDENTIFY 1 DEVICE 0 0
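
Before anyone pulls a drive, it can help to cross-check which segment the controller reports as missing against the slots that are still present. The following is a minimal sketch that filters GETCONFIG output read from standard input; the function names are hypothetical, and the parsing assumes the segment lines look like the ones shown in this post.

```shell
#!/bin/sh
# Helpers that read "arcconf GETCONFIG <n>" output on stdin.

# Print every logical device segment that is not in the "Present" state
# (for example "Missing" or "Rebuilding").
non_present_segments() {
    awk -F' : ' '/Group [0-9]+, Segment [0-9]+/ && $2 !~ /^Present/ { print }'
}

# Print the slot number of every segment that is still present.
present_slots() {
    sed -n 's/.*: Present (.*Slot:\([0-9][0-9]*\)).*/\1/p'
}

# Example (needs a real controller, so it is left commented out):
# ./arcconf GETCONFIG 1 | non_present_segments
# ./arcconf GETCONFIG 1 | present_slots
```

With the output above, the first helper prints the "Missing" line for Group 0, Segment 1, and the second lists slots 1, 2 and 3, so the technician should not touch any of those three bays.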

 

Normally the Adaptec controller rebuilds the array automatically once a hot spare is available or a replacement drive is inserted. However, you can still force a rebuild onto the newly inserted drive by assigning it as a hot spare for the logical drive with this command:

# ./arcconf SETSTATE 1 DEVICE 0 0 HSP LOGICALDRIVE 0

Let's check the RAID status once again after issuing the command above:

 

# ./arcconf GETCONFIG 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Optimal
Channel description : SAS/SATA
Controller Model : Adaptec 2405
Controller Serial Number : 2D0811E8B58
Physical Slot : 6
Temperature : 47 C/ 116 F (Normal)
Installed memory : 128 MB
Copyback : Disabled
Background consistency check : Disabled
Automatic Failover : Enabled
Global task priority : High
Performance Mode : Default/Dynamic
Stayawake period : Disabled
Spinup limit internal drives : 0
Spinup limit external drives : 0
Defunct disk drive count : 0
Logical devices/Failed/Degraded : 1/0/1
SSDs assigned to MaxCache pool : 0
Maximum SSDs allowed in MaxCache pool : 8
MaxCache Read Cache Pool Size : 0.000 GB
MaxCache flush and fetch rate : 0
MaxCache Read, Write Balance Factor : 3,1
NCQ status : Enabled
Statistics data collection mode : Enabled
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : 5.2-0 (18252)
Firmware : 5.2-0 (18252)
Driver : 1.1-5 (24702)
Boot Flash : 5.2-0 (18252)

----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
Logical device name : RAID10
RAID level : 10
Status of logical device : Degraded
Size : 1906678 MB
Stripe-unit size : 256 KB
Read-cache mode : Enabled
MaxCache preferred read cache setting : Disabled
MaxCache read cache setting : Disabled
Partitioned : Yes
Protected by Hot-Spare : No
Bootable : Yes
Failed stripes : No
Power settings : Disabled
--------------------------------------------------------
Logical device segment information
--------------------------------------------------------
Group 0, Segment 0 : Present (Controller:1,Enclosure:0,Slot:1) WD-WMATV6849433
Group 0, Segment 1 : Rebuilding (Controller:1,Enclosure:0,Slot:0) WD-WCAW34988757
Group 1, Segment 0 : Present (Controller:1,Enclosure:0,Slot:2) WD-WMATV6823225
Group 1, Segment 1 : Present (Controller:1,Enclosure:0,Slot:3) WD-WMATV6827620

As you can see, the replacement drive in slot 0 is now rebuilding. Another way to verify this is to check the controller's running tasks:

# ./arcconf GETSTATUS 1
Controllers found: 1
Logical device Task:
Logical device : 0
Task ID : 103
Current operation : Rebuild
Status : In Progress
Priority : High
Percentage complete : 12
Command completed successfully.
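
If you would rather not re-run the command by hand, the progress figure can be pulled out of that output with a small filter. This is only a sketch: `rebuild_percent` is a hypothetical helper name, the field layout is assumed to match the output shown above, and the 60-second polling interval is arbitrary.

```shell
#!/bin/sh
# Read "arcconf GETSTATUS <n>" output on stdin and print the percentage
# of a running rebuild. Prints nothing if no rebuild task is listed.
rebuild_percent() {
    awk -F' *: *' '
        /Current operation/                      { op = $2 }
        /Percentage complete/ && op ~ /^Rebuild/ { print $2 + 0 }
    '
}

# Example polling loop (needs a real controller, so it is commented out):
# while pct=$(./arcconf GETSTATUS 1 | rebuild_percent) && [ -n "$pct" ]; do
#     echo "rebuild at ${pct}%"
#     sleep 60
# done
```

The loop ends once GETSTATUS no longer lists a rebuild task, at which point GETCONFIG should be checked to confirm the logical device is back to Optimal.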

 

Here is a summary of the most commonly used commands:

  • get current tasks (rebuild, etc)
    arcconf getstatus 1
  • get current logical device
    arcconf getconfig 1 ld
  • look at physical devices
    arcconf getconfig 1 pd
  • look at dead drives
    arcconf getlogs 1 dead tabular
  • look at devices with problems
    arcconf getlogs 1 device tabular
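
The monitoring scripts mentioned at the start of this post are not shown here, but a check of that kind can be sketched with the same commands. The version below is an assumption-laden illustration, not our production script: it assumes arcconf is in the current directory, the controller is number 1, and a mail command is installed. `NOC_EMAIL`, `logical_status` and `check_raid` are hypothetical names.

```shell
#!/bin/sh
# Sketch of a cron-driven RAID health check (illustrative only).

# Read "arcconf getconfig <n> ld" output on stdin and print the status
# of each logical device, one per line (e.g. "Optimal", "Degraded").
logical_status() {
    sed -n 's/^[[:space:]]*Status of logical device[[:space:]]*:[[:space:]]*//p' |
        sed 's/[[:space:]]*$//'
}

# Exit status: 0 = every logical device is Optimal,
#              1 = at least one is not Optimal,
#              2 = no status line found (treat as unknown, not healthy).
check_raid() {
    statuses=$(logical_status)
    [ -n "$statuses" ] || return 2
    if printf '%s\n' "$statuses" | grep -qv '^Optimal$'; then
        return 1
    fi
    return 0
}

# Example cron usage (commented out; NOC_EMAIL is a placeholder):
# if ! ./arcconf getconfig 1 ld | check_raid; then
#     ./arcconf getconfig 1 | mail -s "RAID alert on $(hostname)" "$NOC_EMAIL"
# fi
```

Treating missing output as a failure is deliberate: if arcconf itself stops working, the check should raise an alert instead of silently reporting a healthy array.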

Source: http://techhighlight.com/hosting/replacing-dead-drive-on-adaptec-raid-array/