Replacing Failed Disks
Figure out which disk failed (check dmesg and /var/log/messages).
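As a cross-check, md itself marks a faulty member with "(F)" in /proc/mdstat. The sketch below uses md_failed_members, a hypothetical helper that is not part of mdadm, to print the array and member name for each such entry. It assumes the usual "md2 : active raid5 sdg[5](F) ..." line layout.

```shell
# md_failed_members [MDSTAT]: print "<array> <member>" for every RAID member
# that the kernel has flagged as faulty, shown as "(F)" in /proc/mdstat.
md_failed_members() {
    awk '/^md[0-9]+ : / {
        # Start at field 3 to skip "md2" and ":"; members look like "sdg[5](F)".
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\[[0-9]+\].*\(F\)/) {
                dev = $i
                sub(/\[.*$/, "", dev)   # drop "[5](F)" to leave "sdg"
                print $1, dev
            }
        }
    }' "${1:-/proc/mdstat}"
}
```

Run it with no argument to read /proc/mdstat, or pass the path of a saved copy of the file.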
Once the bad disk has been identified, remove it from the array
Mark the disk as failed (if it isn't already)
mdadm --manage $ARRAY --fail $DISK
mdadm --manage /dev/md2 --fail /dev/sdg
Once the disk is marked as failed, remove it from the array
mdadm --manage $ARRAY --remove $DISK
mdadm --manage /dev/md2 --remove /dev/sdg
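The two steps can be wrapped in a small function so the commands can be previewed before anything changes. This is a minimal sketch: fail_and_remove and run are hypothetical helpers, and setting DRY_RUN=1 (also an assumption of this sketch) only prints the mdadm commands instead of running them.

```shell
# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fail_and_remove ARRAY DISK: mark DISK as failed in ARRAY, then remove it.
fail_and_remove() {
    local array=$1 disk=$2
    if [ -z "$array" ] || [ -z "$disk" ]; then
        echo "usage: fail_and_remove ARRAY DISK" >&2
        return 2
    fi
    # --remove is attempted even if --fail reports an error (for example
    # because the disk is already marked failed); md will not remove a
    # member that is still active, so this step cannot detach a working disk.
    run mdadm --manage "$array" --fail "$disk"
    run mdadm --manage "$array" --remove "$disk"
}
```

For example, DRY_RUN=1 fail_and_remove /dev/md2 /dev/sdg prints the two commands above; drop DRY_RUN to actually run them.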
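Check /proc/mdstat (cat /proc/mdstat) and make sure the disk has actually been removed from the array.
In the array's status field, for example [UUUUU_], each "_" marks a member that is missing from the array.
Additionally, make sure the failed disk is no longer listed in the array at all, for example as a spare, which /proc/mdstat marks with "(S)" after the device entry (e.g. sdg[6](S)).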
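A sketch of this check, assuming the standard /proc/mdstat layout: the member list is on the "mdN :" line and, for redundant RAID levels, the status field ends the following line. md_check_removed is a hypothetical helper; it exits 0 when the disk is gone, 1 when it is still listed in any state (including as a spare), and 2 when the array is not found.

```shell
# md_check_removed ARRAY DISK [MDSTAT]: print ARRAY's status field
# (e.g. "[UUUUU_]") and report whether DISK is still listed in the array.
# Exit status: 0 = not listed, 1 = still listed, 2 = array not found.
md_check_removed() {
    local array=${1#/dev/} disk=${2#/dev/}
    awk -v md="$array" -v disk="$disk" '
        $1 == md && $2 == ":" {
            found = 1
            for (i = 3; i <= NF; i++) {
                dev = $i
                sub(/\[.*$/, "", dev)     # "sdg[5](S)" -> "sdg"
                if (dev == disk) listed = 1
            }
            # For raid1/4/5/6/10 the next line ends with the status field.
            if ((getline line) > 0) {
                n = split(line, f, " ")
                print "status: " f[n]
            }
        }
        END {
            if (!found) { print md ": not found"; exit 2 }
            if (listed) { print disk ": still listed in " md; exit 1 }
            print disk ": not listed in " md
        }' "${3:-/proc/mdstat}"
}
```

For example, md_check_removed /dev/md2 /dev/sdg reads the live /proc/mdstat.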
Hot swap the physical disk
Check dmesg to see if the new disk has been assigned a device name; if so, skip the troubleshooting steps below.
If the newly inserted disk is not detected, use the following steps to correct the situation:
Determine the failed disk's serial number by running:
ls -l /dev/disk/by-id/
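The by-id names for ATA and SCSI disks usually include the model and serial number, and each one is a symlink to the kernel name (for example ../../sdg). A minimal sketch that picks out the names for one disk, assuming that symlink layout; by_id_names is a hypothetical helper, and its optional second argument exists only so it can be pointed at another directory.

```shell
# by_id_names DISK [DIR]: list the /dev/disk/by-id names that point at the
# whole disk DISK (e.g. sdg). Partition links (-part1 -> sdg1) are skipped.
by_id_names() {
    local dev=${1#/dev/} dir=${2:-/dev/disk/by-id} link target
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        if [ "${target##*/}" = "$dev" ]; then
            echo "${link##*/}"
        fi
    done
}
```

For example, by_id_names sdg lists every by-id name for sdg; the serial number is normally the last part of the ata- or scsi- name.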
Check dmesg for SCSI errors to find the disk's SCSI ID, e.g. "sd 8:0:0:0: SCSI error: return code = 0x08000002" (here the ID is 8:0:0:0).
Note: If you don't see any errors like this, you can find the ID by looking for the correct block:sd? symlink
ls -l /sys/bus/scsi/devices/*/block*
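Both lookups can be scripted. The sketch below assumes the dmesg message format quoted above and the sysfs layout that the ls command relies on. Depending on the kernel version, the disk appears either as a block:sdg symlink or as a block/sdg entry, so both are checked. The function names are hypothetical. On most kernels, readlink -f /sys/block/sdg/device is another route: the last component of the path it prints is the SCSI ID.

```shell
# scsi_ids_with_errors: read kernel messages on stdin and print each distinct
# SCSI ID (host:channel:target:lun) from lines such as
# "sd 8:0:0:0: SCSI error: return code = 0x08000002".
scsi_ids_with_errors() {
    sed -n 's/.*sd \([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\): SCSI error.*/\1/p' |
        sort -u
}

# scsi_id_for_disk DISK [DIR]: print the SCSI ID whose sysfs entry owns DISK
# (e.g. sdg). Checks both the "block:sdg" symlink and the "block/sdg" layout.
scsi_id_for_disk() {
    local dev=${1#/dev/} root=${2:-/sys/bus/scsi/devices} entry
    for entry in "$root"/*; do
        if [ -L "$entry/block:$dev" ] || [ -e "$entry/block/$dev" ]; then
            echo "${entry##*/}"
        fi
    done
}

# Usage:
#   dmesg | scsi_ids_with_errors
#   scsi_id_for_disk sdg
```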
Note which host number the disk is showing up as.
ls -l /sys/bus/scsi/devices/[id]
The path this symlink points to contains "host#" (for example host8); note it for later.
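The host number is also the first field of the SCSI ID itself: the ID is host:channel:target:lun, so 8:0:0:0 is on host8. A sketch of both ways to get it; the helper names are hypothetical.

```shell
# scsi_host_from_id H:C:T:L: print "hostH" (the first field is the host number).
scsi_host_from_id() {
    case $1 in
        [0-9]*:*:*:*) echo "host${1%%:*}" ;;
        *) echo "not a host:channel:target:lun address: $1" >&2; return 1 ;;
    esac
}

# scsi_host_from_path PATH: pull "hostN" out of the symlink target shown by
# ls -l /sys/bus/scsi/devices/ID (or printed by readlink).
scsi_host_from_path() {
    printf '%s\n' "$1" | grep -o 'host[0-9][0-9]*' | tail -n 1
}

# Usage:
#   scsi_host_from_id 8:0:0:0
#   scsi_host_from_path "$(readlink /sys/bus/scsi/devices/8:0:0:0)"
```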
Remove the disk from the system
echo x > /sys/bus/scsi/devices/[id]/delete
echo x > /sys/bus/scsi/devices/8:0:0:0/delete
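A guarded version of the delete step, as a sketch: scsi_delete is a hypothetical helper that refuses anything that does not look like a SCSI ID or has no delete attribute. The SYSFS variable (default /sys) is an assumption added only so the function can be tried against a scratch directory. It writes 1, the value most guides use, rather than the x shown above.

```shell
# scsi_delete H:C:T:L: ask the kernel to drop this SCSI device, like
# "echo x > /sys/bus/scsi/devices/ID/delete". SYSFS may point at a test tree.
scsi_delete() {
    local id=$1 sysfs=${SYSFS:-/sys}
    case $id in
        [0-9]*:[0-9]*:[0-9]*:[0-9]*) ;;
        *) echo "not a SCSI address: $id" >&2; return 1 ;;
    esac
    local attr="$sysfs/bus/scsi/devices/$id/delete"
    if [ ! -e "$attr" ]; then
        echo "no such SCSI device: $id" >&2
        return 1
    fi
    echo 1 > "$attr"    # the source writes "x"; 1 is what most guides use
}
```

For example, scsi_delete 8:0:0:0 is equivalent to the command above.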
Physically replace the disk with the serial number noted earlier. Make sure to label the tray with the new disk's serial number in case it needs to be replaced in the future.
Have the OS rescan the SCSI bus for the new disk, where host[n] is the host number noted earlier:
echo "- - -" >/sys/class/scsi_host/host[n]/scan
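The rescan step can be wrapped the same way. In "- - -" the three dashes are wildcards for channel, target and LUN, so every device on that host is scanned. scsi_rescan_host is a hypothetical helper, and its SYSFS override (default /sys) is an addition for testing only.

```shell
# scsi_rescan_host hostN: ask one SCSI host adapter to rescan its bus.
# "- - -" = any channel, any target, any LUN.
scsi_rescan_host() {
    local host=$1 sysfs=${SYSFS:-/sys}
    case $host in
        host[0-9]*) ;;
        *) echo "expected a name like host8, got: $host" >&2; return 1 ;;
    esac
    local attr="$sysfs/class/scsi_host/$host/scan"
    if [ ! -e "$attr" ]; then
        echo "no such SCSI host: $host" >&2
        return 1
    fi
    echo "- - -" > "$attr"
}
```

For example, scsi_rescan_host host8 rescans the host used in the examples above; check dmesg afterwards as usual.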
Check dmesg and make sure the disk comes back online. If it does not, the server may need to be rebooted and the bus rescanned again.
Figure out what the new device name is (check dmesg, /var/log/messages)
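If the list of sd* disks is saved before the new disk is inserted, comparing it with the list afterwards shows the new name directly. A minimal sketch with hypothetical helper names, assuming whole disks appear under /sys/block as sda, sdb, and so on; it cannot help if no "before" list was taken.

```shell
# list_sd_disks [DIR]: print the whole sd* disks (sda, sdb, ...) that the
# kernel currently knows about, sorted so the output can be fed to comm.
list_sd_disks() {
    local dir=${1:-/sys/block} d
    for d in "$dir"/sd*; do
        if [ -e "$d" ]; then
            echo "${d##*/}"
        fi
    done | sort
}

# new_disks BEFORE AFTER: print names that are in AFTER but not in BEFORE.
new_disks() {
    comm -13 "$1" "$2"
}

# Usage:
#   list_sd_disks > /tmp/disks.before    # before inserting the new disk
#   list_sd_disks > /tmp/disks.after     # after it has been detected
#   new_disks /tmp/disks.before /tmp/disks.after
```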
Add the new disk back to the array
mdadm --manage $ARRAY --add $DEVICE
This adds the disk to the array and starts rebuilding the array onto it. You can check the progress of the sync by running cat /proc/mdstat.
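To follow the rebuild of one array without reading the whole file, the progress figure can be pulled out of /proc/mdstat. A sketch assuming the usual "recovery = 12.6% (...)" progress line; md_rebuild_progress is a hypothetical helper that exits 1 when no recovery, resync, reshape or check is running. Alternatively, watch -n 10 cat /proc/mdstat refreshes the whole file every ten seconds.

```shell
# md_rebuild_progress ARRAY [MDSTAT]: print the rebuild progress of ARRAY
# (e.g. "recovery = 12.6%"); exit 1 if no rebuild is shown for it.
md_rebuild_progress() {
    local array=${1#/dev/}
    awk -v md="$array" '
        $1 == md && $2 == ":" { inblock = 1; next }
        /^md[0-9]+ : / || NF == 0 { inblock = 0 }   # next array or blank line
        inblock && match($0, /(recovery|resync|reshape|check) *= *[0-9.]+%/) {
            print substr($0, RSTART, RLENGTH)
            found = 1
        }
        END { exit (found ? 0 : 1) }' "${2:-/proc/mdstat}"
}
```

For example, md_rebuild_progress /dev/md2 prints the current percentage for md2 only.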
Ran into an issue where a drive would not mount on startup; the system seemed to think it was part of a software RAID.
This produced errors in dmesg and during boot:
device-mapper: multipath: version 1.0.5 loaded
device-mapper: table: device /dev/mapper/ddf1_4c534920202020201000006010001012471147112c8150e7 too small for target
device-mapper: table: 253:1: linear: dm-linear: Device lookup failed
device-mapper: ioctl: error adding target to table
The following process got the server to boot, though the error came back on reboot.
Boot with the "single" and "fastboot" kernel parameters.
Identify the ddf1 RAID device (its name appears in the dmesg errors above) and remove it:
dmsetup remove ddf1_4c534920202020201000006010001012471147112c8150e7
Your partition should mount now
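The device-mapper name can be taken straight from the kernel messages instead of being copied by hand. A sketch assuming messages in the format shown above; ddf1_maps_from_log is a hypothetical helper, and dmsetup ls should list the same name before it is removed.

```shell
# ddf1_maps_from_log: read kernel messages on stdin and print each distinct
# device-mapper name mentioned as /dev/mapper/ddf1_<id>.
ddf1_maps_from_log() {
    grep -o '/dev/mapper/ddf1_[0-9A-Za-z]*' | sed 's|^/dev/mapper/||' | sort -u
}

# Usage:
#   dmesg | ddf1_maps_from_log
#   dmsetup ls                   # the same name should be listed here
#   dmsetup remove NAME          # NAME as printed by ddf1_maps_from_log
```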
Put the system back into multi-user mode
Permanently remove the RAID metadata
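These notes do not record which command removed the metadata. Mapping names beginning with ddf1_ are typically created by dmraid, so one option, assuming dmraid is installed and is what set up the mapping, is dmraid -r -E on the affected disk, which erases the RAID metadata dmraid finds there. The sketch below uses hypothetical helper names. It lists what dmraid sees, then erases it, and with DRY_RUN=1 it only prints the commands.

```shell
# run: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# erase_raid_metadata /dev/sdX: show, then erase, the RAID metadata that
# dmraid finds on one disk. Destructive: check the device name first.
erase_raid_metadata() {
    local disk=$1
    case $disk in
        /dev/?*) ;;
        *) echo "usage: erase_raid_metadata /dev/sdX" >&2; return 2 ;;
    esac
    run dmraid -r "$disk" || return        # stop here if dmraid reports an error
    run dmraid -r -E "$disk"               # dmraid may ask for confirmation
}
```

For example, DRY_RUN=1 erase_raid_metadata /dev/sdb prints both dmraid commands without touching the disk.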