Hot-swapping Failed Hard Drive in RAID-Z Pool (TrueNAS)

The other day I logged in to my TrueNAS admin interface and was welcomed by multiple warnings. Basically, one of my disks was having a bad time. This post talks about my experience swapping that disk.

I connected to the server via SSH and checked the status of /dev/ada3 using smartctl.

# smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD80EFZX-68UW8N0
Serial Number:    XXXXXXXX
LU WWN Device Id: 5 000cca 263c99eb7
Firmware Version: 83.H0A83
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun  9 09:19:56 2023 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       37597
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       87
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
194 Temperature_Celsius     0x0002   185   185   000    Old_age   Always       -       35 (Min/Max 19/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       176
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       275
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     37597         -
# 2  Short offline       Completed: read failure       10%     37468         -
# 3  Short offline       Completed: read failure       90%     37300         -
# 4  Short offline       Completed: read failure       90%     37132         -
# 5  Short offline       Completed: read failure       10%     36964         -
# 6  Short offline       Completed: read failure       10%     36797         -
# 7  Short offline       Completed: read failure       90%     36628         -
# 8  Short offline       Completed: read failure       90%     36460         -
# 9  Short offline       Completed: read failure       90%     36292         -
#10  Short offline       Completed without error       00%     36124         -
#11  Short offline       Completed without error       00%     35956         -
[...]

At this point, given the high values of Current_Pending_Sector and Offline_Uncorrectable, along with failed SMART self-tests, it is clear that this disk should be replaced quickly. It is still functional, and the pool’s integrity is not yet affected, but this should not stay unaddressed.

As you can see from the LifeTime column, the disk started failing shortly after 4 years of service. It is no longer under warranty. I could have expected a slightly longer life time from a NAS-quality disk, especially as the load it receives is not very intense.

On a side note, the disk was a simple WD Red NAS 8TB drive (5400RPM). This model is no longer advertised on WD’s website. The Red NAS series now ends at 6TB. The 8TB model is under a new category called Red NAS Plus and comes with a slightly unusual 5640RPM. All those models use SMR technology, unlike smaller size models which may use CMR and for which you should pay attention to the exact model number.

Fortunately, this disk is part of a RAID-Z2 pool, meaning it has two disks worth of redundancy. Even it stops working, the pool will be able to operate (albeit in a degraded state).

I ordered a replacement drive, and then started the process of replacing a failed disk by following the official instructions.

One concern I had is related to the GELI encryption: What happens to a pool that’s encrypted? Can I still swap the drive easily? The answer is yes.

First step is to take the failing disk offline.

Of course, you would not assume everything with TrueNAS goes absolutely smoothly. There got to be some errors. And the first error indeed came: I cannot take the disk offline: Cannot allocate memory occured while trying to swapoff /dev/mirror/swap0.eli.

My server should have enough RAM to support its activities without swap, so I don’t understand why I get this error. I stopped a number of my jails and tried again, it finally worked. A new alert shows the pool is going from Active to Degraded state.

Second, I unplugged the disk’s power cable, then data cable, swapped in the new disk and plugged it in (data then power). TrueNAS will keep track of what you are doing.

Third, I clicked on Replace for this disk in the pool, and selected ada3 (I plugged the disk in the same place as the old one).

Unexpectedly, I needed to re-enter the GELI passphrase used in the pool. Since I needed to type it twice, I was unsure whether mistakenly using a different passphrase than the rest of the pool was going to create any issue. It does not seem that this new passphrase will be checked against the existing one, so be careful.

The disk then got formatted quickly, and resilvering started!

I thought this was all, but somehow in the process of taking the failed disk offline, I think some processes got killed. My virtual machines showed an Off state, although they were still running. I restarted the middleware service (service middlewared restart). Then, they were showed as On, but I couldn’t VNC To them, instead I got a libvirtError: internal error: client socket is closed. Cool…

To access the VMs using VNC, it is still possible to grab the VNC port (different than the web port you can use for VNC) and connect with a VNC client. My jails are suddenly unreachable even after a restart. Definitely something wrong happened… Except by restarting the server, I am not sure to address this problem.

Otherwise, the experience was rather good. Fingers crossed everything will be back to normal in a day after resilvering.

UPDATE: I had to restart the NAS for all jails and virtual machines to work properly. Resilvering resumed by itself after a while, and it’s now finished!

state: ONLINE
scan: resilvered 6.04T in 1 days 11:27:51 with 0 errors on Sun Jun 11 01:33:44 2023

Leave a comment