vimwiki/Replacing A Failed Disk in a mdadm RAID.md




If disk errors are reported, the disk may have hardware problems. Check dmesg for errors of the following type:

`[737961.360080] raid5_end_read_request: 64 callbacks suppressed`
`[737961.360087] md/raid:md125: read error corrected (8 sectors at 2722701256 on sdc1)`
`[737961.360093] md/raid:md125: read error corrected (8 sectors at 2722701264 on sdc1)`
`[737961.360095] md/raid:md125: read error corrected (8 sectors at 2722701272 on sdc1)`
`[737961.360098] md/raid:md125: read error corrected (8 sectors at 2722701280 on sdc1)`
`[737961.360100] md/raid:md125: read error corrected (8 sectors at 2722701288 on sdc1)`
`[737961.360102] md/raid:md125: read error corrected (8 sectors at 2722701296 on sdc1)`
`[737961.360105] md/raid:md125: read error corrected (8 sectors at 2722701304 on sdc1)`
`[737961.360107] md/raid:md125: read error corrected (8 sectors at 2722701312 on sdc1)`
`[737961.360109] md/raid:md125: read error corrected (8 sectors at 2722701320 on sdc1)`
`[737961.360112] md/raid:md125: read error corrected (8 sectors at 2722701328 on sdc1)`
`[742462.760119] md: md125: data-check done.`
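
To see which member device is accumulating the corrected errors, the log lines can be tallied per device. This is a minimal sketch that assumes the message format shown above; the function name `count_corrected_reads` is only illustrative.

```shell
# Tally "read error corrected" messages per RAID member device.
# Reads kernel log text on stdin; assumes the md/raid message format shown above.
count_corrected_reads() {
    awk '/read error corrected/ {
             dev = $NF              # last field looks like "sdc1)"
             sub(/\)$/, "", dev)    # drop the trailing parenthesis
             count[dev]++
         }
         END { for (d in count) print d, count[d] }'
}
```

It could be used as `dmesg | count_corrected_reads`, which prints one line per device followed by its count.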

Use SMART to investigate the hard drive.

`$ smartctl -i /dev/sdc`

The drive can be tested with the following command:

`$ smartctl -t long /dev/sdc`

The long test takes a while; a quicker short test (`-t short`) is also available.
The results can be viewed using:

`$ smartctl -l selftest /dev/sdc`
` `
`smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.36.3.el7.x86_64] (local build)`
`Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org`
``
`=== START OF READ SMART DATA SECTION ===`
`SMART Self-test log structure revision number 1`
`Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error`
`# 1  Extended offline    Completed: read failure       40%     21930         2722703304`
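
When the log holds many entries, it can help to filter for the ones that did not finish cleanly. The sketch below assumes that entry lines start with `#` and that a clean run is reported as `Completed without error`; the helper name `failed_selftests` is hypothetical.

```shell
# Print self-test log entries that did not complete cleanly.
# Reads `smartctl -l selftest` output on stdin; entry lines start with "#".
failed_selftests() {
    grep '^#' | grep -v 'Completed without error'
}
```

For example, `smartctl -l selftest /dev/sdc | failed_selftests` would print the `read failure` entry shown above and nothing for a healthy drive.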

The self-test reports a read failure, so this drive needs to be replaced. To identify the physical drive, use hdparm to get its serial number:

`$ hdparm -i /dev/sdc | grep SerialNo`
`Model=ST2000DM001-1ER164, FwRev=CC27, SerialNo=Z4Z5QAY5`
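
If only the serial number itself is wanted (for example, to match against the label on the drive), it can be cut out of that line. A small sketch, assuming the `SerialNo=` field format shown above; `serial_from_hdparm` is an illustrative name.

```shell
# Extract the serial number from `hdparm -i` output read on stdin.
# Assumes the "SerialNo=<value>" field format shown above.
serial_from_hdparm() {
    sed -n 's/.*SerialNo=\([^ ,]*\).*/\1/p'
}
```

Usage would look like `hdparm -i /dev/sdc | serial_from_hdparm`.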

Before shutting down and replacing the drive, use mdadm to mark it as failed and then remove it
from the RAID:

`$ mdadm --manage /dev/md125 --fail /dev/sdc1`
`$ mdadm --manage /dev/md125 --remove /dev/sdc1`

Before the old drive is removed the partition table can be dumped using:

`$ sfdisk -d /dev/sdc > sdc.out`

Once the new drive has been swapped in, the old partition table can then be used on the new drive:

`$ sfdisk /dev/sdc < sdc.out`

The new disk is now ready to be included in the raid:

`$ mdadm --manage /dev/md125 --add /dev/sdc1`

Finally, the progress of the rebuild can be monitored using:

`$ cat /proc/mdstat`
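
To pull just the percentage out of that output, for a script or a status line, something like the following may work. It is a sketch that assumes the kernel reports the rebuild as `recovery = N%`, as it typically does; the function name `rebuild_progress` is hypothetical.

```shell
# Print the rebuild percentage from /proc/mdstat-style text on stdin.
# Prints nothing when no recovery is in progress.
rebuild_progress() {
    sed -n 's/.*recovery = *\([0-9.]*%\).*/\1/p'
}
```

It could be run as `rebuild_progress < /proc/mdstat`, or the raw file can simply be polled with `watch cat /proc/mdstat`.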