An Impromptu Introduction to ZFS Drive Replacement at 1am

Last night at 1 minute past midnight, just before heading to bed, I received the following notification from my main production Proxmox system:

SMART error (Health) detected on host: chonk

Chonk is where I host everything intended to be publicly accessible, including the BookStack site & infrastructure, among many other sites and applications. It’s a dedicated Hetzner server which has two Samsung NVMe SSDs combined in a ZFS mirror for redundancy.

For some context, I have little prior experience with the failure patterns of NVMe SSDs. I also have little prior experience with the specifics of ZFS, having only done some research years ago along with basic creation/setup tasks. I have never delved into live pool handling & drive replacements. Now was the time to quickly get educated on both. I do have solid off-system daily backups, in multiple locations, of all machines on the Proxmox system, so restoring would always be possible, but it would still be a pain in the backside and more work to set everything up again from blank drives & backups.

The first step was to understand if the detected drive errors were problematic. The drive seemed to be working okay, and the zpool didn’t report any errors. SMART tests were reporting a failure though, stating that “NVM subsystem reliability has been degraded”. It was also reporting 4 “Media and Data Integrity Errors”. While I was not sure if these could be signs of a temporary issue of some kind, I decided it was not worth the risk and that I should contact Hetzner support.
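For reference, the checks involved were along these lines; the device and pool names below (/dev/nvme0 and rpool) are illustrative rather than my exact setup:

```bash
# Overall SMART health, error log and wear counters for an NVMe drive
smartctl -a /dev/nvme0

# Check whether ZFS itself has logged any read/write/checksum errors
zpool status -v rpool
```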

My urgency to replace the drive increased when looking at the other, still-good drive. The drive model was the same, and the SMART usage stats were similar. My fear was that continued use, plus the mirror rebuild a replacement would require, might push that other drive into producing issues too, resulting in a snowball of failures.

At 1am, I submitted my support request with Hetzner, which had a dedicated option for disk failures with inputs for the drive serial number and SMART output. I chose the option to replace the drive ASAP, which indicated this would typically happen within 2-4 hours of submitting the request. My plan now was to stay up and get this done quickly, during the night when downtime isn’t so considerable & stressful. I also chose for the machine to be powered off for the replacement, which was required for NVMe drives.

After submitting the request, I started a fresh backup run of VMs/containers, since the existing ones were a day old and the usual daily backup wasn’t scheduled for another hour. With that ongoing, I started to educate myself on ZFS, and in particular drive replacements, watching YouTube guides and reading documentation.
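On Proxmox, a manual backup run like this can also be kicked off from the command line with its vzdump tool; a minimal sketch, assuming a backup storage named "backup-storage":

```bash
# Back up all VMs/containers to the named storage, using snapshot mode
# so guests keep running during the backup ("backup-storage" is an example name)
vzdump --all --mode snapshot --storage backup-storage
```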

Luckily, I found Proxmox themselves have great ZFS documentation including the steps for replacing a drive. Without this, I’m sure that I would have ended up with a problematic boot scenario as most of the content I consumed, including Proxmox specific ZFS drive replacement videos, did not factor in the scenario of mirrored bootable drives.

Just one hour later my monitoring systems alerted me that my sites had gone down, a sign the drive replacement was in progress. I waited a while until I regained SSH and web access, and there I could see the new healthy drive!

I started carefully following the Proxmox guidance. I copied the partition table across from the good drive with re-issued GUIDs. While I was double-checking some factors for the main ZFS zpool replace command, more monitoring alerts sounded. The system was down again. Oh dear!
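In the Proxmox documentation, that partition-table step looks roughly like this (the device names here are placeholders for the healthy drive and the new drive):

```bash
# Copy the partition layout from the healthy drive to the new drive,
# then randomise the GUIDs on the new drive so they don't clash
sgdisk /dev/nvme1n1 -R /dev/nvme0n1
sgdisk -G /dev/nvme0n1
```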

A minute later I received the following from Hetzner:

We have replace the the fault NVMe and start our rescue-system because the OS was not start.

Ah, so it looks like they didn’t detect the system as being up for some reason, and rebooted it into recovery while I was attempting to fix things! I felt lucky this was between commands, and that I wasn’t in the middle of an ongoing restoration action. A login & reboot later, my Proxmox system was back up.

Once I had access again, I ran the last few commands to replace the partition in the zpool mirror, and to re-install the bootloader. This was the only trouble I had following the Proxmox guide, since I originally misidentified my bootloader as plain grub instead of grub via their proxmox-boot-tool, but luckily they have safeguards in place which stopped my wrong commands from taking effect.
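Those final commands follow the Proxmox guide along these lines; the pool name, device paths and partition numbers below are illustrative, and the trailing "grub" mode on init applies to systems booting grub via proxmox-boot-tool (check the docs for your version):

```bash
# Attach the new drive's ZFS partition in place of the failed one in the mirror
# (pool name and partition paths are examples; ZFS is commonly on partition 3)
zpool replace -f rpool /dev/disk/by-id/nvme-OLD_DRIVE-part3 /dev/disk/by-id/nvme-NEW_DRIVE-part3

# Recreate and register the boot partition (ESP) on the new drive
# (the ESP is commonly partition 2 on a Proxmox ZFS install)
proxmox-boot-tool format /dev/disk/by-id/nvme-NEW_DRIVE-part2
proxmox-boot-tool init /dev/disk/by-id/nvme-NEW_DRIVE-part2 grub
```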

The rebuild of the mirror (resilvering) was surprisingly quick. I was expecting this to take hours, but it completed the 353GB of used data in just under 8 minutes. I guess that’s the benefit of fast modern NVMe SSDs.
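Resilver progress, including speed and estimated completion, can be watched by re-running the pool status check; pool name again illustrative:

```bash
# Re-check the pool every 10 seconds to watch resilver progress and completion
watch -n 10 zpool status rpool
```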

All seems to be good now. Overall this was an interesting impromptu lesson in ZFS and drive failure handling. Maybe I should have been more prepared for this, and trialled such scenarios beforehand, but you can never be ready for everything. From the experience, I have gained extra respect for, and interest in, ZFS and the value of drive redundancy. I have redundant drives for my personal NAS, but redundancy will probably now be a more significant factor for my other homelab systems too.