Discovering a Bad Disk
zpool status
I check my ZFS pools regularly. Proxmox is nice because it will email you about pool events, but other systems might not. Even so, I make a habit of manually checking the size and health of my storage pools once a week.
~$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 8.74T 1.81T 8.53T /tank
~$ zpool status
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 2.20M in 0 days 00:00:10 with 0 errors on Sat Jun 26 04:06:30 2021
config:
NAME                        STATE     READ WRITE CKSUM
tank                        ONLINE       0     0     0
  raidz1-0                  ONLINE       0     0     0
    wwn-0x5000c50079975b01  ONLINE       0     0     3
    wwn-0x5000039fe3cf4293  ONLINE       0     0     0
    wwn-0x5000039fe3cfcb2f  ONLINE       0     0     0
    wwn-0x5000039fe3cf3491  ONLINE       0     0     0
    wwn-0x5000039fe3cfbe45  ONLINE       0     0     0
errors: No known data errors
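Proxmox emails me when a pool has a problem, but on systems that don't do that, a small cron job can approximate it. Here is a minimal sketch, assuming cron and working local mail delivery, with an example path and schedule you should adjust to taste. "zpool status -x" prints "all pools are healthy" when everything is fine, so mail only goes out when it says something else:

# /etc/cron.d/zfs-health (example path) - check daily at 08:00, mail root if any pool is unhealthy
0 8 * * * root /sbin/zpool status -x | grep -qv 'all pools are healthy' && /sbin/zpool status | mail -s 'ZFS pool needs attention' root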
smartctl
We can see from the zpool status that one of the disks has a problem. The pool is reporting an unrecoverable error, and one disk shows 3 checksum errors. But is the disk actually bad? I can check it with smartctl.
I added my pool disks by ID, so zpool status reports them using those IDs. To check a disk with smartctl, pass it the /dev/disk/by-id/ path for that same ID.
Note: On some OSes, smartctl needs to be installed first. On Ubuntu it is part of the smartmontools package: "sudo apt install smartmontools".
~$ sudo smartctl -x /dev/disk/by-id/wwn-0x5000c50079975b01
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-77-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-1ER166
Serial Number: Z500JERL
LU WWN Device Id: 5 000c50 079975b01
Firmware Version: CC25
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sat Jun 26 18:15:07 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 89) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 330) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 106 099 006 - 29690410
3 Spin_Up_Time PO---- 094 093 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 77
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 083 060 030 - 214055033
9 Power_On_Hours -O--CK 046 046 000 - 47624
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 76
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 098 098 000 - 2
188 Command_Timeout -O--CK 100 100 000 - 0 0 0
189 High_Fly_Writes -O-RCK 084 084 000 - 16
190 Airflow_Temperature_Cel -O---K 070 057 045 - 30 (Min/Max 27/33)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 37
193 Load_Cycle_Count -O--CK 098 098 000 - 5695
194 Temperature_Celsius -O---K 030 043 000 - 30 (0 15 0 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
240 Head_Flying_Hours ------ 100 253 000 - 46235h+43m+57.716s
241 Total_LBAs_Written ------ 100 253 000 - 16226994503
242 Total_LBAs_Read ------ 100 253 000 - 11227817981448
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
The part of the smartctl output we are interested in is the attributes table. The flags tell us how to read each row: attributes marked "P - prefailure warning" are the ones whose normalized VALUE dropping toward THRESH predicts imminent failure, while attributes marked "R - error rate" track how often a particular kind of error occurs. The error-rate attributes are the ones worth a closer look here.
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 106 099 006 - 29690410
7 Seek_Error_Rate POSR-- 083 060 030 - 214055033
189 High_Fly_Writes -O-RCK 084 084 000 - 16
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
These four rows help explain what is going on with this disk. The normalized values are still above the manufacturer's thresholds (read errors at 106/6, seek errors at 83/30, high fly writes at 84/0, CRC errors at 200/0), which is why the overall self-assessment still reports PASSED. But the normalized High_Fly_Writes value has dropped to 84 with a raw count of 16, and UDMA_CRC_Error_Count is 0, which rules out cabling as the source of the checksum errors. Combined with the Reported_Uncorrect count of 2 in the full table, it seems like ZFS was right, and this drive is starting to fail.
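If you want a second opinion before committing to a replacement, smartctl can print just the attribute table with "-A", and a short self-test only takes a couple of minutes on this drive (per the recommended polling time above). I didn't need the extra evidence here, but the commands look like this:

~$ sudo smartctl -A /dev/disk/by-id/wwn-0x5000c50079975b01
~$ sudo smartctl -t short /dev/disk/by-id/wwn-0x5000c50079975b01
~$ sudo smartctl -l selftest /dev/disk/by-id/wwn-0x5000c50079975b01

The last command prints the self-test log once the test has finished.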
Replacing the Disk
zpool offline & shutdown
I'm out of SATA ports in this system, so I can't attach the new disk alongside the old one. Instead, we will offline the failing disk, shut down and physically replace it, then run the "zpool replace" command.
~$ sudo zpool offline tank wwn-0x5000c50079975b01
~$ sudo zpool status
pool: tank
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 2.20M in 0 days 00:00:10 with 0 errors on Sat Jun 26 04:06:30 2021
config:
NAME                        STATE     READ WRITE CKSUM
tank                        DEGRADED     0     0     0
  raidz1-0                  DEGRADED     0     0     0
    wwn-0x5000c50079975b01  OFFLINE      0     0     3
    wwn-0x5000039fe3cf4293  ONLINE       0     0     0
    wwn-0x5000039fe3cfcb2f  ONLINE       0     0     0
    wwn-0x5000039fe3cf3491  ONLINE       0     0     0
    wwn-0x5000039fe3cfbe45  ONLINE       0     0     0
errors: No known data errors
~$ sudo shutdown now
Replace the disk
Now turn off the system and physically replace the disk. It's a good idea to write down the bad disk's ID before pulling it, so you can confirm the correct drive was removed. At this point I also write "BAD" in red sharpie on the label of the bad disk, which makes it easy to identify should it ever turn up again. Take a good look at the new drive too, and write down or photograph any identifying numbers; we will pick one of those IDs to use when the drive is added to the pool.
My replacement disk for this 3TB Seagate is going to be a new 4TB Toshiba. The Toshiba doesn't have a wwn number printed on it, so I take note of the other hardware IDs. In this case we will use the ata- ID for ZFS.
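Once the system is back up, there are a couple of ways to match the serial number from the photo to a device node. I use the /dev/disk/by-id listing below, but lsblk can also print models and serials directly, which makes a quick cross-check (treat this as a sketch; the columns available can vary a bit by distro):

~$ lsblk -o NAME,MODEL,SERIAL,SIZE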
zpool replace
After booting the system, another "zpool status" confirms the correct drive was pulled. Now find the new disk's ID and run zpool replace.
~$ ll /dev/disk/by-id/ata-*
lrwxrwxrwx 1 root root 9 Jun 26 19:06 /dev/disk/by-id/ata-CT240BX200SSD1_1625F01DF837 -> ../../sda
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-CT240BX200SSD1_1625F01DF837-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-CT240BX200SSD1_1625F01DF837-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-CT240BX200SSD1_1625F01DF837-part3 -> ../../sda3
lrwxrwxrwx 1 root root 9 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46H2EY2GS -> ../../sde
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46H2EY2GS-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46H2EY2GS-part9 -> ../../sde9
lrwxrwxrwx 1 root root 9 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46H2KNSGS -> ../../sdc
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46H2KNSGS-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46H2KNSGS-part9 -> ../../sdc9
lrwxrwxrwx 1 root root 9 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46M3MM7AS -> ../../sdf
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46M3MM7AS-part1 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46M3MM7AS-part9 -> ../../sdf9
lrwxrwxrwx 1 root root 9 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46M3S1WAS -> ../../sdd
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46M3S1WAS-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_DT01ACA300_46M3S1WAS-part9 -> ../../sdd9
lrwxrwxrwx 1 root root 9 Jun 26 19:06 /dev/disk/by-id/ata-TOSHIBA_HDWQ140_Y026K4H8FBJG -> ../../sdb
The disk at sdb is the new Toshiba drive: its hardware ID matches the photo of the label taken earlier. Another hint is that it doesn't yet have "part1" or "part9" partitions like the other pool disks do.
~$ sudo zpool replace tank wwn-0x5000c50079975b01 /dev/disk/by-id/ata-TOSHIBA_HDWQ140_Y026K4H8FBJG
~$ zpool status
pool: tank
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sat Jun 26 19:16:51 2021
787G scanned at 615M/s, 364G issued at 285M/s, 10.9T total
72.3G resilvered, 3.25% done, 0 days 10:49:28 to go
config:
NAME                                    STATE     READ WRITE CKSUM
tank                                    DEGRADED     0     0     0
  raidz1-0                              DEGRADED     0     0     0
    replacing-0                         DEGRADED     0     0     0
      wwn-0x5000c50079975b01            OFFLINE      0     0     0
      ata-TOSHIBA_HDWQ140_Y026K4H8FBJG  ONLINE       0     0     0  (resilvering)
    wwn-0x5000039fe3cf4293              ONLINE       0     0     0
    wwn-0x5000039fe3cfcb2f               ONLINE       0     0     0
    wwn-0x5000039fe3cf3491              ONLINE       0     0     0
    wwn-0x5000039fe3cfbe45              ONLINE       0     0     0
errors: No known data errors
Resilvering
The pool is now in a "resilvering" state, which can take hours to days depending on how much data there is and how fast the disks are. The missing data is reconstructed from the remaining disks in the pool and written onto the new disk, bringing the pool back into compliance with the raidz1 redundancy strategy.
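To keep an eye on the resilver without sitting at the console, I just re-run zpool status periodically; something like watch does the job (the 60-second interval is arbitrary):

~$ watch -n 60 zpool status tank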
It is important not to put unnecessary load on the zpool while resilvering is taking place. If another disk in the pool fails during this process, data will be lost, since raidz1 has only one disk of parity. A raidz2 pool is less fragile here, as it can lose two disks before losing data. With only 5 disks in this pool, and a backup policy that has already copied the data off of it, I feel confident I could recover from a failure during resilvering.
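Once the resilver finishes and the pool reports ONLINE again, I like to run a scrub to confirm that every block on the pool reads back cleanly. That's my own habit rather than a required step:

~$ sudo zpool scrub tank
~$ zpool status tank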
Resources
https://help.ubuntu.com/community/Smartmontools
https://docs.oracle.com/cd/E19253-01/819-5461/gazgd/index.html