zfs: adding existent segment to range tree

Hi! It’s been a while. I tend to leave these articles half-finished and they end up rotting. I wanted to carry this one forward since everyone loves a good data loss story.

For the first time in a decade, my primary mixed-use data pool has experienced a critical failure and the entire pool is faulted.

Note: I’m far from an authoritative source on ZFS; I know just enough to get myself into trouble. There are plenty of smarter folks who have worked through this. I didn’t want another write to touch my pool, so this was thrown together haphazardly and details might be a bit sporadic.

Background

I run a small NAS for myself and a few others to store data on. I also have a Kubernetes cluster that mounts storage from this server. As you can tell from the title, I use the absolutely fantastic OpenZFS to manage all of this data. I run a few ZFS pools, but we’ll be focusing on my largest / most complex pool.

The host system was running on TrueNAS SCALE 25.04.2.4, though I’ll likely be moving away from that solution soon for (mostly) unrelated reasons.

How did we get here?

It was a weekday like any other. I was sending over a couple of terabytes to be written to the archive with moderate compression (ZSTD-6) when I got an alert that a disk had flipped over to FAULTED status. After 34 checksum errors, it was pulled out of the array, with ZFS no longer trusting the device and suggesting a resilver. I was able to log into the machine, saw a few kernel ATA faults, and found that the drive had stopped responding to S.M.A.R.T. queries.

Unfortunate, but we plan for this. I keep a spare drive on hand for the dreadful moments when I need to resilver. Looks like I was burning my spare today. So, I did. Normally I would investigate this a bit deeper, but I wasn’t going to be able to troubleshoot the disk itself until the next day. By that time, the resilver would be complete and I could burn-test the problem disk to see what was up.
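
For reference, the swap itself was the standard replace-and-resilver dance. A rough sketch below, with a hypothetical pool name and placeholder device IDs (mine differ):

# Confirm which member faulted and review the checksum error counts
zpool status -v tank

# Swap the faulted disk for the on-hand spare and kick off the resilver
zpool replace tank /dev/disk/by-id/ata-FAULTED_DISK /dev/disk/by-id/ata-SPARE_DISK

# Re-run status to watch scan progress and the issue rate
zpool status -v tank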

At some point, I got another alert suggesting that the host restarted unexpectedly. Well, that’s not something you want to hear while it’s resilvering.

I dropped what I was doing and checked on the system to see what was up. It was at this point I realized that this pool was quite full: 94.6% capacity used. This is totally my fault, as I had been slacking on purging data for some time. The capacity alert had been firing for... longer than I’d like to admit.

That’ll probably cause some issues, and it’s something I should have considered more carefully before starting the resilver. The system came back up on its own, but I noticed the resilver was incredibly slow, likely due to the aforementioned capacity problem (and the sudden restart probably didn’t help).

So, what was my next action? Well, deleting some data, of course! I wasn’t sure about the reset, but I was fairly certain that the resilver lethargy came from the lack of free space. It is fairly well known that ZFS performance falls off a cliff past the 80% capacity mark, with the exact slowdown depending on the pool.

I started by evicting a few hundred gigabytes of unimportant data and removing the associated snapshots. This helped considerably, and the resilver was now issuing around 150 MB/s. That was probably good enough. It’d still take a couple of days, but the pace was much better.

I kept going, since I already knew of some other things that could go. In fact, there was a whole (de-duplicated) dataset that could go, and it would free a few terabytes. One dataset destroy later, and the disks were churning away at deleting a bunch of data on top of the resilver job. It seemed fine enough. I left it for a while, without much thought.
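
For context, the cleanup was nothing exotic; roughly the commands below, with hypothetical dataset and snapshot names:

# See where the space actually lives (data vs. snapshots vs. children)
zfs list -o space -r tank

# Drop snapshots pinning old blocks, then the whole disposable dataset
zfs destroy tank/media@old-snapshot
zfs destroy -r tank/dedup-experiment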

I received another watchdog alert: the system appeared to have restarted again. Even worse, the alerts kept coming. It did not come back up this time.

It would be a few hours before I could realistically inspect this and see how broken things were.

Image: Cryptic ZFS trace

Seems ominous enough.

What happened?

I was up until 4 in the morning.

After a long night of research while hunched beside my server cabinet, I had a few guesses as to what went wrong. None of this is heavily validated, and this post is NOT going to dive deep into dissecting a failed pool. For my purposes, this is a production server and we needed to get things back online.

I knew the disks were very occupied with three major tasks: freeing blocks from the deleted data and snapshots, processing the destroyed dataset (which had dedup enabled), and, most importantly, resilvering. During all of this, the system experienced a hard reset. Notably, I did not have any useful logs about why the system reset, only that it did.

My best guess at this point was some kind of metadata corruption. At the time, I suspected a hardware problem - bad memory, an overheating HBA, or simply aging hardware. The problem is, I couldn’t reproduce any real hardware issue. Pulling the array would allow the system to boot again, and heavy activity across the HBA to other pools was fine. I run bi-weekly scrubs, so if I had an overheating issue, it picked a very unfortunate time to show up.

Recovery

I shut down the host and booted a live environment, then poked at the pool a bit, trying to import it manually. On import, ZFS was able to immediately identify the pool and its resilvering state, even recognizing the (previously) faulted member. Interestingly, the faulted member seemed to work again. Read-write imports would fail (with the same range tree panic), but read-only imports would work right away. Phew.
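
There’s nothing clever about a read-only import; roughly what I ran from the live environment, with a hypothetical pool name and altroot:

# Read-write imports hit the range tree panic; read-only comes up fine
zpool import -o readonly=on -R /mnt/recovery tank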

In an (over-)abundance of caution, I kept the pool imported read-only and copied everything off. This provided a good opportunity to adjust the storage configuration anyway. Given the fragility of the situation, I was very hesitant to try any real “recovery” methods, as they practically all boiled down to setting zfs_recover=1 and zil_replay_disable=1. This does seem to be a common enough problem among TrueNAS users. However, this pool was in the middle of resilvering, not just some standard data I/O, and I was highly skeptical of discarding the intent log with that process in mind.
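
For completeness, the commonly suggested recovery path sets those two OpenZFS module parameters before attempting a normal import. I did not take this path, and you should read the linked issues before considering it; this is only a sketch of what the suggestion typically looks like on Linux:

# Turn certain fatal assertions into warnings and skip ZIL replay (last resort)
echo 1 > /sys/module/zfs/parameters/zfs_recover
echo 1 > /sys/module/zfs/parameters/zil_replay_disable

# Then attempt a regular (read-write) import
zpool import tank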

In hindsight, copying everything off and effectively downing this pool for ten(!) days might have been overdramatic. However, I’ve had painful data loss from simple power outages before, courtesy of good ol’ btrfs. I wasn’t about to aggravate the situation at all, even if continuing might have been a perfectly safe thing to do. In the moment, I simply wasn’t sure, and I opted for the safest path.

For those of you reading this in a similar situation, I suggest you rebuild the pool from scratch. I know, it’s terrible, and it may not even be practical for you. I was floundering with about 50 TiB of data that suddenly needed to be re-homed temporarily. I’ve linked some of the related GitHub issues below, which imply that you can continue using the pool in a recovery mode until ZFS re-commits the corrupted spacemap entries. It might take a day. It might take weeks. It really depends on your pool and its activity, or so it seems.

Given more time, I would have liked to toy with the array state further. However, I knew the copy operations would be measured in days, not hours, and needed this to be back up soon.

For my copy process, I created new zpools with other disks I had available and kept the original pool imported read-only. It took a while, but the data extracted successfully (via zfs send). I was a bit skeptical of trusting zfs send, but reasoned that any serious problems would very likely surface as errors during replication. That never happened, and a general inspection of critical data on the pool seemed successful.
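
The copy itself was plain snapshot replication; a minimal sketch, assuming a hypothetical source pool tank, a temporary pool rescue, and pre-existing snapshots (the pool is read-only, so new snapshots can’t be taken):

# Replicate each dataset tree from its most recent existing snapshot
zfs send -R tank/media@latest | zfs recv -u -d rescue
zfs send -R tank/home@latest | zfs recv -u -d rescue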

How do we avoid it in the future?

I can take some guesses, but my overall takeaway is mixed.

RAID is not a backup.

Neither are snapshots.

I’ve maintained off-site replicas for a while now, so this would not have been a total loss for me. I would have lost some data, and it would have taken a significant amount of time to re-curate it back to where it was. I was in the middle of some migrations here, which involved a lot of tagging and organizing that would’ve been lost.

Not to mention, the off-site disks would need to be removed from the server and shipped. Ouch. The real inconvenience here was simply a prolonged service outage, which could have been reduced or mitigated in some way with a local replication plan.

We’ll be doing that now.
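
A local replica doesn’t need to be fancy, either; periodic incremental sends to a second pool in the same chassis (or a box nearby) would have covered this outage. A rough sketch with hypothetical pool, dataset, and snapshot names:

# Take the new snapshot, then send everything since the previous one
zfs snapshot -r tank/important@backup-now
zfs send -R -I @backup-prev tank/important@backup-now | zfs recv -u -d -F localbackup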

Pull the old disk.

It might’ve been worth removing the broken disk. I can’t prove that it contributed to the problem, especially since it now seems stable, though I do not intend to trust it until after burn testing. Historically, I’ve not had issues leaving a problem disk in place while resilvering, but the erratic behavior of this disk during troubleshooting was a variable I did not remove.
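
Taking a flaky member out of the I/O path is a one-liner, provided the pool still has enough redundancy without it; hypothetical names again:

# Stop issuing I/O to the suspect disk
zpool offline tank /dev/disk/by-id/ata-FAULTED_DISK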

Respect your pool configuration.

Running a pool at 95% capacity was right stupid. A dead-end issue thread briefly touched on it, and notably that reporter also had excessive storage use. Based on my research (and my own experience here), red-lining capacity on a pool may be outright dangerous for the spacemap. It’s not the whole story, but your risk may meaningfully increase.
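
The uncomfortable numbers were sitting in plain sight the whole time; a quick check along these lines (hypothetical pool name) is all it takes:

# Capacity and fragmentation at a glance
zpool list -o name,size,allocated,free,capacity,fragmentation tank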

De-duplication might’ve eaten away some of my safety margin here, too. I didn’t really need de-duplication; it was mostly a test against a particular data set. It’s plausible that the de-duplicated dataset was harboring corruption, and this episode finally tripped it.
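
If dedup is in play, it’s also worth knowing what it’s actually buying you and how large the dedup table has grown; a couple of read-only checks, again with a hypothetical pool name:

# Overall dedup ratio for the pool
zpool get dedupratio tank

# Dedup table (DDT) histogram and size statistics
zdb -DD tank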

It’s very possible that neither one of these had anything to do with it. I derive these points purely from my experience and anecdotal issues. They might not be true at all.

Final thoughts

This issue has some loose traction, and the reception seems mixed on how best to proceed. In general, some folks share my mindset of “I wouldn’t trust the pool any further - get the array rebuilt.” Others were happy to carry on with zfs_recover=1 and zil_replay_disable=1. This seemed quite successful for bitwise0perator in their issue, and perhaps it is the path I should have taken.

In my case, though, I was very hesitant to let the pool continue writing. I knew corruption had occurred, and a disk was resilvering at the time. Either ZFS would discard the intent log without issue and merely have to redo some work, or I would be in for a very rude awakening. It seemed like quite the coin toss.

This was inconvenient because too much hinged on the array. This was worrisome because no replicas existed on-site. It was recoverable, with lots of time, even if the disk pool had been a total loss. For me, though, it was a realization that my backup policy has been drifting and needs to be corrected.

A few days after the array was rebuilt and service was restored, I found this log from a couple months ago.

kernel: task:txg_sync        state:D stack:0     pid:5564  ppid:2      flags:0x00004000
kernel: Call Trace:
kernel:  <TASK>
kernel:  __schedule+0x349/0x950
kernel:  schedule+0x5b/0xa0
kernel:  schedule_timeout+0x98/0x160
kernel:  ? __pfx_process_timeout+0x10/0x10
kernel:  io_schedule_timeout+0x50/0x80
kernel:  __cv_timedwait_common+0x12a/0x160 [spl]
kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
kernel:  __cv_timedwait_io+0x19/0x20 [spl]
kernel:  zio_wait+0x10f/0x220 [zfs]
kernel:  dsl_pool_sync+0xb9/0x410 [zfs]
kernel:  spa_sync_iterate_to_convergence+0xdc/0x200 [zfs]
kernel:  spa_sync+0x30a/0x5e0 [zfs]
kernel:  txg_sync_thread+0x1ec/0x270 [zfs]
kernel:  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
kernel:  thread_generic_wrapper+0x5e/0x70 [spl]
kernel:  kthread+0xe8/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x34/0x50
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>

There are other pools on this system, so it’s not guaranteed that this trace is related. A new issue showed up while I was authoring this article, which hints that a txg_sync race condition may be to blame. It’s plausible this corruption had been silent for some time and was finally disturbed by this whole process.

I cobbled this together over the course of a couple of weeks, so things might be a bit disjointed and I might’ve left out some details. By now, the issue is resolved, and I’m entertaining new storage plans. I hope this was at least marginally helpful to someone else going through the same thing.

References

These were notable sources during my troubleshooting period: