Software RAID in Linux is implemented by the multiple devices (MD) driver. MD devices can be managed via the mdadm utility. Read the man page for more details on usage.
The state of the active Linux software RAID devices can be viewed by running:
cat /proc/mdstat
When an error is returned to the operating system on a member device of an MD device, the MD driver marks that member device as failed. An example of a failed device is:
# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 11
md4 : active raid1 sdb1 sda1(F)
      513984 blocks [2/1] [_U]
unused devices: <none>
In the output above, the (F) indicates that the device /dev/sda1 is in a failed state.
The array member status [_U] shows which member devices are up: each position stands for one member, U means up and _ means down. In this case it shows that only the second numbered member is up. If you are confused, simply look for the (F).
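On a machine with several arrays it can help to pull out only the failed members. The following is a minimal sketch, not part of the original procedure: the function name failed_members is hypothetical, and it assumes the mdstat layout shown above, where array lines start with "mdN :" and failed members carry an (F) suffix.

```shell
# Print "<array> <member>" for every member flagged (F) in an mdstat-format file.
# Usage: failed_members [FILE]   (FILE defaults to /proc/mdstat)
# failed_members is a hypothetical helper name, not an mdadm feature.
failed_members() {
    awk '
        # Array lines look like: "md4 : active raid1 sdb1 sda1(F)"
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)/) {
                    member = $i
                    sub(/\[[0-9]+\]/, "", member)  # drop a role index such as [0], if present
                    sub(/\(F\)/, "", member)       # drop the failure flag itself
                    print $1, member
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Run against the example output above, this would print the array name followed by the failed partition, one pair per line.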
The procedure is pretty similar for ATA, SATA, SAS and SCSI systems, but SCSI, SAS & SATA let you hot add/remove drives. This step only deals with hot removal.
Determine the device name of the failed disk. It is absolutely imperative that you do not stuff this step up.
Stop smartd, as it will otherwise prevent us from pulling the device from the RAID array, then confirm that it is no longer running:
service smartd stop || /etc/init.d/smartd stop
ps -C smartd
Gather the list of all MD devices that contain the failed disk:
grep FAILED /proc/mdstat
Where FAILED is the name of the disk that has failed, e.g. sda. Note that /proc/mdstat lists member names without the /dev/ prefix.
Remove the drive from all MD devices. For all partitions on the drive with the failure run:
mdadm -f /dev/mdX /dev/FAILED_PARTX
mdadm -r /dev/mdX /dev/FAILED_PARTX
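When the failed disk has partitions in several arrays, typing each pair of commands by hand is error prone. The sketch below only prints the commands for review and does not run them. The function name removal_commands is hypothetical, and it assumes partitions are named as the disk name followed by a number (sda1, sda2), which does not hold for every device type.

```shell
# Print (do not run) the mdadm commands that fail and remove every partition
# of DISK from the arrays listed in an mdstat-format file.
# Usage: removal_commands DISK [FILE]   e.g. removal_commands sda
# removal_commands is a hypothetical helper name.
removal_commands() {
    awk -v disk="$1" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*$/, "", member)   # strip a role index such as [0]
                sub(/\(.*$/, "", member)   # strip a flag such as (F)
                part = substr(member, length(disk) + 1)
                # Keep only names of the form <disk><partition number>
                if (index(member, disk) == 1 && part ~ /^[0-9]+$/) {
                    printf "mdadm -f /dev/%s /dev/%s\n", $1, member
                    printf "mdadm -r /dev/%s /dev/%s\n", $1, member
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

Review the printed lines against the device name you determined earlier before running any of them.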
- For hotswap SCSI/SATA/SAS, tell Linux to remove the device:
Firstly, get the drive id:
# cat /proc/scsi/scsi
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: MAW3073NC        Rev: 0104
  Type:   Direct-Access                    ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: FUJITSU  Model: MAW3073NC        Rev: 0104
  Type:   Direct-Access                    ANSI SCSI revision: 03
In this case, we want to remove the first drive:
echo "scsi remove-single-device" 0 0 0 0 > /proc/scsi/scsi
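The four numbers in the command above are the host, channel, id and lun of the drive, in that order, taken from the matching "Host:" line. As a rough sketch, the hypothetical helper below (scsi_addresses is not an existing tool) prints those four numbers for each device in a /proc/scsi/scsi-format file, assuming the "Host: scsiN Channel: NN Id: NN Lun: NN" layout shown above.

```shell
# Print "host channel id lun" for each device in a /proc/scsi/scsi-format file.
# Usage: scsi_addresses [FILE]   (FILE defaults to /proc/scsi/scsi)
# scsi_addresses is a hypothetical helper name.
scsi_addresses() {
    awk '
        # Lines look like: "Host: scsi0 Channel: 00 Id: 01 Lun: 00"
        $1 == "Host:" {
            host = $2
            sub(/^scsi/, "", host)                    # "scsi0" -> "0"
            print host + 0, $4 + 0, $6 + 0, $8 + 0    # adding 0 strips leading zeros
        }
    ' "${1:-/proc/scsi/scsi}"
}
```

For the listing above, the first line of output corresponds to the first drive and the second line to the second drive.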
- Now remove the failed disk from the machine.
At this stage you should have grub installed on the remaining hard disks that contain /boot.
- Insert the new hard disk into the server. It should be of equal or greater capacity, and of equal or greater spindle speed (RPM).
Now you need to re-detect the hard disk on the SCSI channel you removed one from earlier:
echo "scsi add-single-device" 0 0 0 0 > /proc/scsi/scsi
Confirm you can see the new hard disk when you run cat /proc/scsi/scsi
Partition Table Setup & Raid Re-Sync
Now that you have the replacement drive installed in the machine, you want to set up the partition table on the disk so you can begin a RAID re-sync.
The easiest way to copy the partition table from one disk to another is to use sfdisk.
To copy the partition table from /dev/sdb to /dev/sda you would run:
sfdisk -d /dev/sdb | sfdisk /dev/sda
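A re-sync only starts once the new partitions have been added back to their arrays. The text does not show that command, so the following is a sketch under stated assumptions: it uses mdadm's manage-mode -a (--add) option, it assumes the new disk has the same partition numbering as the surviving disk (which holds after the sfdisk copy above), and the function name readd_commands is hypothetical. It prints the commands for review and does not run them.

```shell
# Print (do not run) one "mdadm -a" command per partition of GOOD_DISK found
# in an mdstat-format file, substituting the same partition number on NEW_DISK.
# Usage: readd_commands GOOD_DISK NEW_DISK [FILE]   e.g. readd_commands sdb sda
# readd_commands is a hypothetical helper name.
readd_commands() {
    awk -v good="$1" -v newdisk="$2" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*$/, "", member)   # strip a role index such as [1]
                sub(/\(.*$/, "", member)   # strip a flag such as (F)
                part = substr(member, length(good) + 1)
                # Keep only names of the form <good disk><partition number>
                if (index(member, good) == 1 && part ~ /^[0-9]+$/)
                    printf "mdadm -a /dev/%s /dev/%s%s\n", $1, newdisk, part
            }
        }
    ' "${3:-/proc/mdstat}"
}
```

Check each printed pairing of array and partition before running it, since adding a partition to the wrong array will overwrite its contents.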
Check that the array is re-syncing (re-silvering) by looking at /proc/mdstat. MD will queue the partitions so that the drives aren't hammered by several parallel reads/writes.
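While an array is rebuilding, /proc/mdstat normally carries a progress line containing "recovery =" or "resync =" followed by a percentage. The sketch below extracts that percentage per array. The function name resync_progress is hypothetical, and the exact layout of the progress line is an assumption that may differ between kernel versions.

```shell
# Print "<array> <percent>" for every array that shows a recovery or resync
# progress line in an mdstat-format file.
# Usage: resync_progress [FILE]   (FILE defaults to /proc/mdstat)
# resync_progress is a hypothetical helper name.
resync_progress() {
    awk '
        # Remember which array the following lines belong to
        $2 == ":" && $1 ~ /^md/ { array = $1 }
        # Progress lines look like: "[=>....]  recovery =  8.5% (44096/513984) ..."
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print array, $i
        }
    ' "${1:-/proc/mdstat}"
}
```

Running it repeatedly, for example under watch, gives a compact view of how far each array has progressed.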
At this stage you should have a machine suffering from heavy I/O load as the MD driver re-syncs the RAID array. If the load is having too much of an impact on your business operations, you can slow down the rate of syncing.
To check the maximum rate of the RAID re-sync on the md1 device you would run:
cat /sys/block/md1/md/sync_speed_max
To change this value (given in kilobytes per second) you would use the echo command like so:
echo "50000" > /sys/block/md1/md/sync_speed_max
Now that the drive has been swapped, you want to start smartd again:
service smartd start