Kernel 6.8.12-8 zfs 2.2.7 crashed

s_peter · Feb 8, 2025

My Proxmox VE 8.3 with the latest upgrades just crashed on zfs. There are no any log messages, just this screenshot from the IPMI terminal.
In this state it became totally unresponsive (both remotely and locally).

I have two zfs pools:
1. rpool: raidz1 3 x 4TB NVME drives
2. strorage: raidz1 8 x 12TB SATA hard disks

The trick is that some partitions of the NVME drives are ZIL/LOGS and L2ARC caches for storage pool.
Any ideas or similar cases? I had one crash with the previous kernel as well.

waltar · Feb 8, 2025

I would assume your nvme's are consumer kind of and were rosted by pve db and parity writes.

s_peter · Feb 8, 2025

waltar said:
I would assume your nvme's are consumer kind of and were rosted by pve db and parity writes.

A kernel crash cannot be explained by this. Consumer SSDs have shorter lifetime, but this is a software bug in the ZFS module.
Please help me to understand how this crash relates to Consumer SSDs.

Kingneutron · Feb 9, 2025

You have a hung task for more than 120 seconds writing to a zvol, this indicates that ZFS was having trouble with I/O to the drive.

> rpool: raidz1 3 x 4TB NVME drives
> storage: raidz1 8 x 12TB SATA hard disks
> The trick is that some partitions of the NVME drives are ZIL/LOGS and L2ARC caches for storage pool

You built it wrong. Full stop. Re-architect.

RAIDZx is not good for VMs, and re-using partitions on the rpool drives for ZIL/LOG/L2ARC is probably causing I/O contention and general confusion.

[TUTORIAL] Thread 'FabU: Can I use ZFS RaidZ for my VMs?'

Jan 1, 2025

Assumption: you use at least four identical devices for that. Mirrors, RaidZ, RaidZ2 are possible - theoretically.

Technically correct answer: yes, it works. But the right answers is: no, do not do that! The recommendation is very clear: use “striped mirrors”. This results in something similar to a classic Raid10.

(1) RaidZ1 (and Z2 too) gives you the IOPS of a single device, completely independent of the actual number of physical devices. For the “four devices, mirrored” approach this will double --> giving twice as many Operations per Second. For a large-file...

--What I would recommend:

o Mirror for rpool, and use different make/model SSD so they don't both wear out around the same time (Think EVO and Pro, you want one to wear out faster.) Backup the ZFS rpool to the 3rd nvme drive if you want, or repurpose it.

o Mirrors for LXC/VM vdisk backing storage, so interactive response is better

o RAIDZ2 for bulk storage / media, where interactive response is not an issue. You might have a bad time with "raidz1 8x12TB SATA hard disks" when things start failing, especially if they're not NAS-rated disks. Desktop-class hard drives can cause Weird Behavior with ZFS when they start failing; the firmware is different from NAS.

The odds of a 2nd disk (especially if it's over ~2-4TB) falling over during replacement / resilver are not in your favor.

o Separate devices for ZIL / SLOG (if you even need these, generally you don't unless NFS / lots of sync writes), and L2ARC.
You can try moving the L2ARC to e.g. 64GB PNY USB3 thumbdrives. Inexpensive, disposable, pool doesn't fall into a black hole if they fail, easily replaced if you have spares (buy a 4-5 pack.) L2ARC survives a reboot, where ARC does not.

https://search.brave.com/search?q=z...summary=1&conversation=8cc2378dc0a0e3207dd594

o If you have a lot of small files and your scrubs are taking more than ~24 hours, add-in a mirrored Special SSD device. Again, different make/model to minimize double-failure odds.

https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954

o Consider adding a hotspare to the pool if you have extra drive bay(s) - with 12TB disks you want at least 1-2 spares lying around if you can afford it. Waiting for a replacement drive to show up in the mail is nail-biting time, and hoping the pool doesn't alter the deal and fail any further.

--When you get back up and running, check the Wearout indicator in Nodes / (nodename) / Disks. If any are above ~50-80%, proactively replace. With SSD/nvme, you want a high TBW rating if you're not going with Enterprise-level.

s_peter · Feb 10, 2025

Hi @Kingneutron, thanks for your reply, I appreciate that it had many good points.

>You built it wrong. Full stop. Re-architect.
It seems you are optimizing for your needs. I'm fully satisfied with my setup, it was created for the biggest potential storage space.
I'm not complaining here because of the performance or data loss, it is ok.

My setup is a home server with zero load. When this crash happened 0 VM ran on that.
I'm just looking operational stability with 0 software crashes.

I'm a Linux veteran but new to Proxmox. After two months it seems that the no-subscription repository is an experimental test lab as many users complained about the similar crashes. Or maybe it is just an unlucky period for Proxmox with the combination of kernel 6.8 and zfs 2.2.7.

s_peter · Mar 15, 2025

It seems that moving to the new Opt-in Linux 6.11 Kernel :

proxmox-kernel-6.11.11-1-pve-signed/stable,now 6.11.11-1 amd64 [installed,automatic]

improved the situation. I haven't received any ZFS crash in the last couple of weeks.

Search

Search

Kernel 6.8.12-8 zfs 2.2.7 crashed

s_peter

New Member

waltar

Renowned Member

s_peter

New Member

Kingneutron

Renowned Member

[TUTORIAL] Thread 'FabU: Can I use ZFS RaidZ for my VMs?'

s_peter

New Member

s_peter

New Member

We value your privacy