Linux MD

Replacing Failed Disks

Figure out which disk failed (check dmesg and /var/log/messages).
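If the kernel has already marked the disk as faulty, /proc/mdstat shows it with an (F) suffix. The helper below is a minimal sketch for listing such members. The function name failed_members is made up for this page, and the parsing assumes the usual "mdN : active raidX dev[n](F) ..." line layout.

```shell
# List array members flagged (F) (faulty) in mdstat-formatted text.
# Usage: failed_members [FILE]   (FILE defaults to /proc/mdstat)
# Prints one "ARRAY DEVICE" pair per faulty member, e.g. "md2 sdg".
failed_members() {
    awk '/^md[^ ]* :/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the "[12](F)" suffix
                print $1, dev
            }
    }' "${1:-/proc/mdstat}"
}
```

Running failed_members with no argument reads the live /proc/mdstat. An empty result only means md has not flagged anything yet, so dmesg is still worth checking.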

Once the bad disk has been identified, remove it from the array

Mark disk as failed (if it isn't already)

mdadm --manage $ARRAY --fail $DISK

Example:

mdadm --manage /dev/md2 --fail /dev/sdg

Once the disk is marked as failed, remove it from the array

mdadm --manage $ARRAY --remove $DISK

Example:

mdadm --manage /dev/md2 --remove /dev/sdg
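The fail and remove steps can be chained so the remove only runs if the fail succeeded. This is a sketch, not a tested tool: md_drop_disk and the DRY_RUN variable are names invented here, and with DRY_RUN=1 the commands are only printed.

```shell
# Mark DISK as failed in ARRAY, then remove it from the array.
# With DRY_RUN=1 the mdadm commands are printed instead of executed.
# Usage: md_drop_disk ARRAY DISK
md_drop_disk() {
    array=$1
    disk=$2
    if [ -z "$array" ] || [ -z "$disk" ]; then
        echo "usage: md_drop_disk ARRAY DISK" >&2
        return 2
    fi
    run=""
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run="echo"
    fi
    # --remove is only attempted if --fail succeeded
    $run mdadm --manage "$array" --fail "$disk" &&
        $run mdadm --manage "$array" --remove "$disk"
}
```

For example, DRY_RUN=1 md_drop_disk /dev/md2 /dev/sdg prints the two mdadm commands from the examples above without touching the array.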

Check and make sure the disk has actually been removed from the array.

cat /proc/mdstat

Look for something similar to this. Each "U" is an active member, and the "_" marks the slot of the disk that is missing from the array.

[U_UUUUUUUUUU]

Additionally, make sure that the failed disk no longer appears in the array's device list, for example as a spare: "sdg[12](S)"
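The check above can be scripted. A minimal sketch follows. The function name disk_removed is made up here, and it only looks at the "mdN : ..." device line, so the [U_U...] status string still needs a visual check.

```shell
# Succeeds (exit 0) if DISK is no longer listed on ARRAY's line in mdstat text.
# Usage: disk_removed ARRAY_NAME DISK_NAME [FILE]   e.g. disk_removed md2 sdg
# Returns 1 if the disk is still listed (in any state), 2 if FILE is unreadable.
disk_removed() {
    listed=$(awk -v md="$1" -v dev="$2" '
        $1 == md && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                name = $i
                sub(/\[.*/, "", name)   # "sdg[12](S)" -> "sdg"
                if (name == dev) found = 1
            }
        }
        END { print found + 0 }
    ' "${3:-/proc/mdstat}") || return 2
    [ "$listed" = 0 ]
}
```

Usage: disk_removed md2 sdg && echo "sdg is gone from md2".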

Hot swap the physical disk

Check dmesg to see if the new disk has been assigned a device name. If so, skip the troubleshooting steps below and continue with figuring out the new device name.

If the newly inserted disk is not detected, use the steps below to correct the situation.

Determine the disk's serial number by running

ls -l /dev/disk/by-id/

Check the logged SCSI errors for the SCSI id of the disk (8:0:0:0 in this example: "sd 8:0:0:0: SCSI error: return code = 0x08000002")

Note: If you don't see any errors like this, you can find the ID by looking for the correct block:sd? symlink

ls -l /sys/bus/scsi/devices/*/block*

Note which host number the disk is showing up as.

ls -l /sys/bus/scsi/devices/[id]

The path this symlink points to contains the "host#". Remember it for later.
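The host number can also be pulled out with a couple of small helpers. This is a sketch under two assumptions: the SCSI id has the usual host:channel:target:lun form, and the resolved sysfs path contains a "hostN" directory. The function names and the sample path in the comments are illustrative only.

```shell
# Print the host number from a SCSI id of the form host:channel:target:lun.
# scsi_host_of 8:0:0:0  prints  8
scsi_host_of() {
    echo "${1%%:*}"
}

# Print the first "hostN" component of a resolved sysfs device path.
# host_from_path /sys/devices/pci0000:00/0000:00:1f.2/host8/target8:0:0/8:0:0:0
# prints  host8   (the path above is a made-up example)
host_from_path() {
    printf '%s\n' "$1" | tr '/' '\n' | grep -E '^host[0-9]+$' | head -n 1
}

# On a live system (assumes readlink -f is available):
#   host_from_path "$(readlink -f /sys/bus/scsi/devices/8:0:0:0)"
```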

Remove the disk from the system

echo x > /sys/bus/scsi/devices/[id]/delete

Example:

echo x > /sys/bus/scsi/devices/8:0:0:0/delete

Physically replace the disk that has the serial number noted earlier. Make sure to label the tray with the new disk's serial number in case it needs to be replaced in the future.

Have the OS rescan the SCSI bus for the new disk. host[n] is the host number noted earlier.

echo "- - -" >/sys/class/scsi_host/host[n]/scan

Check dmesg and make sure the disk comes back online. If it does not, the server may need to be rebooted and the device rescanned.
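Instead of re-reading dmesg by hand after the rescan, a short polling loop can wait for the device node to show up. This is only a sketch. wait_for_path is a made-up helper, and the by-id name in the usage comment is a placeholder for the real model and serial of the new disk.

```shell
# Wait until PATH exists, checking once per second.
# Usage: wait_for_path PATH [TRIES]   (TRIES defaults to 30)
# Returns 0 as soon as PATH exists, 1 if it never appeared.
wait_for_path() {
    path=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        if [ -e "$path" ]; then
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Example (placeholder name, substitute the new disk's by-id entry):
#   wait_for_path /dev/disk/by-id/ata-EXAMPLE_MODEL_EXAMPLESERIAL 60 || echo "disk not seen"
```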

Figure out what the new device name is (check dmesg, /var/log/messages)

Add the new disk back to the array

mdadm --manage $ARRAY --add $DEVICE

This adds the disk back to the array and starts a rebuild onto it. You can check the progress of the sync by running cat /proc/mdstat
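While the array rebuilds, /proc/mdstat carries a "recovery = N%" (or "resync = N%") line under the array. A small sketch for pulling out just the percentage is below. rebuild_progress is a name invented here, and the parsing assumes that usual line format.

```shell
# Print the rebuild percentage for ARRAY from mdstat-formatted text.
# Usage: rebuild_progress ARRAY_NAME [FILE]   e.g. rebuild_progress md2
# Prints nothing if no recovery or resync is in progress for that array.
rebuild_progress() {
    awk -v md="$1" '
        /^md[^ ]* :/ { cur = $1 }          # remember which array we are in
        cur == md && /(recovery|resync) *=/ {
            if (match($0, /[0-9.]+%/))
                print substr($0, RSTART, RLENGTH)
        }
    ' "${2:-/proc/mdstat}"
}
```

Something like "watch -n 10 cat /proc/mdstat" gives the same information interactively.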

Reference links

https://ata.wiki.kernel.org/index.php/Libata_error_messages

DM RAID

Ran into an issue where a drive would not mount on startup because the system seemed to think it was part of a software RAID.

Errors seen in dmesg and on boot:

device-mapper: multipath: version 1.0.5 loaded
device-mapper: table: device /dev/mapper/ddf1_4c534920202020201000006010001012471147112c8150e7 too small for target
device-mapper: table: 253:1: linear: dm-linear: Device lookup failed
device-mapper: ioctl: error adding target to table

Used the following process to get the server to boot, although the error kept coming back on reboot. Boot with the single and fastboot boot options.

Show the ddf1 raid device:

dmsetup ls 

Remove it

dmsetup remove ddf1_4c534920202020201000006010001012471147112c8150e7 

or with

dmsetup remove_all

Your partition should mount now

mount -a 

Put the system into multi-user mode

init 3

Permanently remove the RAID metadata

dmraid -rE
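Since dmsetup remove_all tears down every device-mapper mapping it can (LVM volumes included), it may be safer to pick out only the ddf1 mappings. A minimal sketch, assuming dmsetup ls prints one mapping per line with the name in the first column. dm_names_with_prefix is a made-up helper name.

```shell
# Print device-mapper names that start with PREFIX.
# Reads `dmsetup ls` style output on stdin (name in the first column).
# Usage: dmsetup ls | dm_names_with_prefix ddf1_
dm_names_with_prefix() {
    awk -v p="$1" 'index($1, p) == 1 { print $1 }'
}

# On a live system (as root), remove only the matching mappings:
#   dmsetup ls | dm_names_with_prefix ddf1_ | while read -r name; do
#       dmsetup remove "$name"
#   done
# Before erasing, list which disks carry RAID metadata:
#   dmraid -r
```

dmraid -rE asks for confirmation before wiping the metadata. Checking the dmraid -r output first confirms that only the intended disk is affected.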