Install MegaCli software
You can download the latest release from LSI here: http://www.lsi.com/support/Pages/Download-Search.aspx . Search for “megacli”.
There is also an excellent driver repository here: http://www.thomas-krenn.com/de/download.html (in German).
To install the software, unpack the zip file and run the install command for your OS. On an RPM-based Linux system, the command is usually:
# rpm -Uhv ./MegaCLI/MegaCli_Linux/MegaCli-8.05.71-1.noarch.rpm
Preparing...                ########################################### [100%]
   1:MegaCli                ########################################### [100%]
The software is installed at /opt/MegaRAID/MegaCli/MegaCli, so you either need to add the directory /opt/MegaRAID/MegaCli to your $PATH or use the absolute path name on every invocation.
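If you would rather not touch $PATH, a small wrapper function does the same job. This is only a sketch: the directory comes from the text above, while the function names find_megacli and megacli, the MEGACLI_DIR variable, and the MegaCli64 binary name (which I would expect on 64-bit installs) are assumptions.

```shell
# Directory from the text; override MEGACLI_DIR for other layouts (assumption).
MEGACLI_DIR="${MEGACLI_DIR:-/opt/MegaRAID/MegaCli}"

# find_megacli: print the path of the first executable candidate.
# MegaCli is the name used in the text; MegaCli64 is an assumed alternative.
find_megacli() {
    for name in MegaCli64 MegaCli; do
        if [ -x "$MEGACLI_DIR/$name" ]; then
            printf '%s\n' "$MEGACLI_DIR/$name"
            return 0
        fi
    done
    return 1
}

# megacli: hypothetical wrapper that forwards all arguments to the binary.
megacli() {
    bin=$(find_megacli) || {
        echo "MegaCli not found in $MEGACLI_DIR" >&2
        return 1
    }
    "$bin" "$@"
}
```

Put this in your shell profile and you can type megacli -AdpAllInfo -aAll instead of the full path.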
Check status of MegaRAID controller and disks
To find out whether there are any failed disks, run the command below. This will return a detailed status of all LSI MegaRAID controllers on this system, including the status of their disks. Look for the “Device Present” section, which looks like this for a RAID array without any problems:
# /opt/MegaRAID/MegaCli/MegaCli -AdpAllInfo -aAll
...
Device Present
================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 5
  Disks           : 4
  Critical Disks  : 0
  Failed Disks    : 0
If one or more virtual drives are “Degraded”, a physical disk has a problem and should be replaced. Most RAID configurations other than RAID-0 (striping) provide redundancy, so the virtual drive is still accessible and no data has been lost at this point. However, the failed disk should be replaced as soon as possible.
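For a cron job or monitoring check, the counters in the “Device Present” section can be tested automatically. The sketch below assumes the output format shown above (a counter name, a colon, a number); check_device_present is a hypothetical helper name, and it only reads text from standard input, so it never talks to the controller itself.

```shell
# check_device_present: read "MegaCli -AdpAllInfo -aAll" output on stdin,
# print every problem counter that is non-zero, and return 1 if any was found.
check_device_present() {
    awk -F: '
        BEGIN { bad = 0 }
        /^[ \t]*(Degraded|Offline|Critical Disks|Failed Disks)[ \t]*:/ {
            name = $1; count = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the counter name
            gsub(/[ \t]/, "", count)            # strip blanks around the number
            if (count + 0 > 0) { printf "%s: %d\n", name, count; bad = 1 }
        }
        END { exit bad }
    '
}

# Usage (needs a controller):
#   /opt/MegaRAID/MegaCli/MegaCli -AdpAllInfo -aAll | check_device_present \
#       || echo "RAID needs attention"
```

With several controllers the counters of all adapters are checked, but the output does not say which adapter a counter belongs to.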
Disable RAID controller sound alarm
The LSI MegaRAID SAS 9260-4i controller will emit a periodic and very annoying beep when there is a problem. It is so loud that you might get a phone call from the data center staff demanding that you turn it off. Here’s the command to do that:
# /opt/MegaRAID/MegaCli/MegaCli -AdpSetProp -AlarmSilence -aALL
Adapter 0: Set alarm to Silenced success.
The alarm is silenced for the current problem only. Any status change makes it beep again, and that includes adding a new disk and bringing it online. So you may have to execute this command again after running other commands. It is also possible to disable the alarm completely.
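The permanent switch is, as far as I know, another -AdpSetProp property. The sketch below wraps the alarm commands in one hypothetical helper named alarm. Only the silence form is taken from the text above; the property names AlarmDsbl, AlarmEnbl and AlarmDsply are assumptions that you should confirm against the help output of your MegaCli version before relying on them.

```shell
# Path from the text; MEGACLI can be overridden, e.g. for a dry run with echo.
MEGACLI="${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli}"

# alarm: show, silence, disable or enable the alarm on all adapters.
alarm() {
    case "$1" in
        show)    "$MEGACLI" -AdpGetProp AlarmDsply -aALL ;;    # assumed property name
        silence) "$MEGACLI" -AdpSetProp -AlarmSilence -aALL ;; # form used in the text
        disable) "$MEGACLI" -AdpSetProp AlarmDsbl -aALL ;;     # assumed property name
        enable)  "$MEGACLI" -AdpSetProp AlarmEnbl -aALL ;;     # assumed property name
        *)       echo "usage: alarm show|silence|disable|enable" >&2; return 2 ;;
    esac
}
```

Setting MEGACLI=echo before calling alarm disable prints the arguments instead of running them, which is a cheap way to review the command first. Keep in mind that a permanently disabled alarm will also stay quiet for the next failure.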
Identify failed disk drive
The next step is to figure out which disk has to be replaced. This is critical – you don’t want to accidentally pull a good disk drive from an already degraded RAID array. Most RAID configurations can handle one missing disk, but not two.
The MegaRAID controllers provide an event log which can be saved to disk. Here is what has happened to the RAID array (older messages at the bottom):
- the disk returned a sense status of b/00/00 for a command (several older events not shown here). The event contains the Enclosure Index (252) and the Slot Number (0) which we need in the next steps
- as a consequence the physical disk #05 was marked as “failed”
- which in turn causes the virtual drive this disk is a member of to be marked as “degraded”
# /opt/MegaRAID/MegaCli/MegaCli -AdpEventLog -GetLatest 100 -f events.log -aALL
# more events.log

seqNum: 0x0000af33
Time: Mon Aug 26 08:44:03 2013
Code: 0x000000fb
Class: 2
Locale: 0x01
Event Description: VD 00/1 is now DEGRADED
Event Data:
===========
Target Id: 0

seqNum: 0x0000af32
Time: Mon Aug 26 08:44:03 2013
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/1 from OPTIMAL(3) to DEGRADED(2)
Event Data:
===========
Target Id: 0
Previous state: 3
New state: 2

seqNum: 0x0000af31
Time: Mon Aug 26 08:44:03 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 24
New state: 17

seqNum: 0x0000af30
Time: Mon Aug 26 08:44:03 2013
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 05(e0xfc/s0) Path 4433221103000000, CDB: 2e 00 3a 38 1b c7 00 00 01 00, Sense: b/00/00
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
CDB Length: 10
CDB Data:
002e 0000 003a 0038 001b 00c7 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 000b 0000 0000 0000 0000 000a 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x0000af2f
Time: Mon Aug 26 08:44:02 2013
Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 05(e0xfc/s0) Path 4433221103000000, CDB: 2e 00 3a 38 1b c7 00 00 01 00, Sense: b/00/00
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
CDB Length: 10
CDB Data:
002e 0000 003a 0038 001b 00c7 0000 0000 0001 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
0070 0000 000b 0000 0000 0000 0000 000a 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
...
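In this case, the disk in Enclosure 252, Slot Number 0 is the leftmost disk bay in this particular server, but of course it might be different for other servers. The (e0xfc/s0) in the event description carries the same information: 0xfc is 252 in hexadecimal, and s0 is slot 0.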
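If the log is long, the failed disk can be picked out mechanically. The sketch below assumes event descriptions in the "State change on PD ... to FAILED" form shown above; failed_pds is a hypothetical helper name. It converts the hexadecimal enclosure number into the decimal [enclosure:slot] pair that the MegaCli commands expect.

```shell
# failed_pds: read an event log on stdin and print one line per
# "State change on PD NN(e0xHH/sN) ... to FAILED" event.
failed_pds() {
    sed -n 's|.*State change on PD \([0-9A-Fa-f]*\)(e\(0x[0-9A-Fa-f]*\)/s\([0-9]*\)) from .* to FAILED.*|\1 \2 \3|p' |
    while read -r pd encl slot; do
        # printf %d accepts the 0x prefix and prints the decimal value
        printf 'PD %s -> [%d:%s]\n' "$pd" "$encl" "$slot"
    done
}

# Usage: failed_pds < events.log
```

For the log above this reports PD 05 at [252:0]. A disk that failed more than once within the saved events is listed once per event.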
If you did not document your map of enclosures, slot numbers and physical disk locations, it is sometimes possible to find the broken disk with this command:
# /opt/MegaRAID/MegaCli/MegaCli -pdLocate -start -PhysDrv \[252:0\] -aALL
# /opt/MegaRAID/MegaCli/MegaCli -pdLocate -stop -PhysDrv \[252:0\] -aALL
This will attempt to flash the LED of the disk bay. The parameters are the enclosure ID and the slot number. The square brackets must be escaped with a backslash (or quoted); otherwise the shell may interpret them as a filename pattern instead of passing them to the MegaCli command unchanged.
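You can see the effect of the escaping without a RAID controller. The following sketch uses plain echo in a scratch directory that contains a file named "2" (the file name is chosen on purpose, since 2 is one of the characters inside the brackets); the variable names are only for illustration.

```shell
# Scratch directory with a file named "2", one of the characters in [252:0].
demo_dir=$(mktemp -d)
touch "$demo_dir/2"

# Unquoted, the shell treats [252:0] as a filename pattern and expands it.
unquoted=$(cd "$demo_dir" && echo [252:0])
# Escaped or quoted, the argument reaches the command unchanged.
escaped=$(cd "$demo_dir" && echo \[252:0\])
quoted=$(cd "$demo_dir" && echo '[252:0]')

echo "unquoted: $unquoted"
echo "escaped:  $escaped"
echo "quoted:   $quoted"

rm -rf "$demo_dir"
```

The unquoted form turns into 2 here, while the other two stay [252:0]. In a directory without a matching file name, bash passes the unquoted pattern through unchanged, which is why leaving out the backslashes often seems to work until one day it does not.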
The problem with -pdLocate is that the LED of a broken disk might not flash at all, or that it might flash very similarly to the LEDs of the other bays. So you need to be very cautious here.
The best method is to identify each disk bay before there is a problem, and to write it down.
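A starting point for such a map is the controller's own list of physical drives. The sketch below is based on assumptions: that MegaCli -PDList -aALL is available in your version, and that its output contains lines starting with "Enclosure Device ID", "Slot Number", "Firmware state" and "Inquiry Data" (the latter usually holding serial number and model). pd_map is a hypothetical helper that only reformats text from standard input.

```shell
# pd_map: condense a physical drive listing on stdin into one
# tab-separated line per drive: [enclosure:slot], state, inquiry data.
# The four field names matched below are assumptions about the output format.
pd_map() {
    awk -F': *' '
        function emit() {
            if (slot != "")
                printf "[%s:%s]\t%s\t%s\n", encl, slot, state, inq
            slot = ""; state = ""; inq = ""
        }
        /^Enclosure Device ID/ { emit(); encl = $2 }   # starts a new drive record
        /^Slot Number/         { slot = $2 }
        /^Firmware state/      { state = $2 }
        /^Inquiry Data/        { inq = $2 }
        END { emit() }
    '
}

# Usage (needs a controller):
#   /opt/MegaRAID/MegaCli/MegaCli -PDList -aALL | pd_map > disk-map.txt
```

The list still has to be matched to the physical bays by hand, for example by comparing the serial numbers on the disk labels, so do this while all disks are healthy.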
Replace disk
Now that you have found the broken disk, replace it with a new one. If you do not have the same type of disk, it is also possible to use any other disk that has at least the same size. Depending on the RAID controller configuration, the controller will activate the new disk automatically and start the rebuild process.
You can monitor the progress of the rebuild using the event log:
seqNum: 0x0000afa6
Time: Mon Aug 26 23:13:48 2013
Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/1 is now OPTIMAL
Event Data:
===========
Target Id: 0

seqNum: 0x0000afa5
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/1 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3

seqNum: 0x0000afa4
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 20
New state: 24

seqNum: 0x0000afa3
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

seqNum: 0x0000afa2
Time: Mon Aug 26 23:13:48 2013
Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/1
Event Data:
===========
Target Id: 0

seqNum: 0x0000afa1
Time: Mon Aug 26 23:13:21 2013
Code: 0x00000067
Class: -1
Locale: 0x02
Event Description: Rebuild progress on PD 05(e0xfc/s0) is 99.94%(45514s)
Event Data:
===========

...

seqNum: 0x0000af3e
Time: Mon Aug 26 10:42:40 2013
Code: 0x00000067
Class: -1
Locale: 0x02
Event Description: Rebuild progress on PD 05(e0xfc/s0) is 0.99%(473s)
Event Data:
===========

seqNum: 0x0000af3d
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 16
New state: 20

seqNum: 0x0000af3c
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000067
Class: -1
Locale: 0x02
Event Description: Rebuild progress on PD 05(e0xfc/s0) is 0.00%(0s)
Event Data:
===========

seqNum: 0x0000af3b
Time: Mon Aug 26 10:34:47 2013
Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

seqNum: 0x0000af3a
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 0
New state: 16

seqNum: 0x0000af39
Time: Mon Aug 26 10:34:47 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 1
New state: 0

seqNum: 0x0000af38
Time: Mon Aug 26 10:34:47 2013
Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 05(e0xfc/s0) Info: enclPd=fc, scsiType=0, portMap=01, sasAddr=4433221103000000,0000000000000000
Event Data:
===========
Device ID: 5
Enclosure Device ID: 252
Enclosure Index: 1
Slot Number: 0
SAS Address 1: 4433221103000000
SAS Address 2: 0

seqNum: 0x0000af37
Time: Mon Aug 26 10:34:47 2013
Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0

seqNum: 0x0000af36
Time: Mon Aug 26 10:26:10 2013
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 05(e0xfc/s0) from FAILED(11) to UNCONFIGURED_BAD(1)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
Previous state: 17
New state: 1

seqNum: 0x0000af35
Time: Mon Aug 26 10:26:10 2013
Code: 0x000000f8
Class: 0
Locale: 0x02
Event Description: Removed: PD 05(e0xfc/s0) Info: enclPd=fc, scsiType=0, portMap=01, sasAddr=4433221103000000,0000000000000000
Event Data:
===========
Device ID: 5
Enclosure Device ID: 252
Enclosure Index: 1
Slot Number: 0
SAS Address 1: 4433221103000000
SAS Address 2: 0

seqNum: 0x0000af34
Time: Mon Aug 26 10:26:10 2013
Code: 0x00000070
Class: 1
Locale: 0x02
Event Description: Removed: PD 05(e0xfc/s0)
Event Data:
===========
Device ID: 5
Enclosure Index: 252
Slot Number: 0
The rebuild process can take many hours or even days (in this case, about 12.6 hours for a 250 GByte volume, going by the 45514 seconds reported in the event log).
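The "Rebuild progress" events contain both the percentage and the elapsed seconds, so you can extrapolate a rough remaining time. The sketch below assumes the event format shown in the log (newest event first) and a constant rebuild speed, which is a simplification; rebuild_eta is a hypothetical helper name.

```shell
# rebuild_eta: read an event log on stdin (newest event first) and print
# progress plus a linear estimate of the remaining time for the first
# "Rebuild progress on PD ... is P%(Ns)" event found.
rebuild_eta() {
    awk '
        /Rebuild progress on PD/ && match($0, /is [0-9.]+%\([0-9]+s\)/) {
            s = substr($0, RSTART + 3, RLENGTH - 3)   # e.g. "99.94%(45514s)"
            split(s, part, /[%(s)]+/)                 # part[1] = percent, part[2] = seconds
            pct = part[1] + 0
            elapsed = part[2] + 0
            if (pct > 0)
                printf "%s%% done, %ds elapsed, ~%ds remaining\n",
                       part[1], elapsed, elapsed * (100 - pct) / pct
            else
                printf "%s%% done, %ds elapsed\n", part[1], elapsed
            exit                                      # only the newest event matters
        }
    '
}

# Usage: rebuild_eta < events.log
```

The estimate is only as good as the assumption of constant speed. With the numbers from the log above, 0.99% after 473 seconds extrapolates to a total of about 13.3 hours, while the rebuild actually finished after about 12.6 hours.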
The new disk should be a brand new one, or at least zeroed.
Handling Foreign disks
If you insert a disk that has already been used, it is not accepted automatically; instead it is marked as “foreign”. This often happens when you remove a disk and put it right back in. Since the disk was not kept up to date by the RAID controller while it was out (even if only for a few seconds), a full rebuild is required.
Use the commands below:
# /opt/MegaRAID/MegaCli/MegaCli -PDMakeGood -PhysDrv \[252:0\] -aALL
# /opt/MegaRAID/MegaCli/MegaCli -CfgForeign -Clear -aALL
# /opt/MegaRAID/MegaCli/MegaCli -PDHSP -Set -PhysDrv \[252:0\] -aALL
The first command marks the re-inserted disk drive as “good” and usable, the second clears the foreign RAID information, and the third designates the disk as a “Hot Spare”. Since a disk is missing from the RAID array, the hot spare is automatically added to the array and the rebuild process starts.
source:schirmacher.de/display/Linux/Replace+failed+disk+in+MegaRAID+array