Hello all,
I've been running Proxmox for quite some time with no problems. Lately I got a couple of new servers, so I've been moving things around. One of the things I did was enable lz4 compression and native ZFS encryption. I was using both before without trouble, but I remade the pools.
Previously I used znapzend for backups from the SSDs to local disks and pve-zsync for remote backups. Both worked fine for a long time.
Now I use znapzend with two destinations: one to local disks and the other to a server on the LAN connected over SSH. I still use pve-zsync for the remote backups.
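For context, the two-destination policy was created with znapzendzetup, roughly like this (the retention plans, the local backup pool name "tank/backup" and the hostname "backuphost" are illustrative placeholders, not my exact values):

znapzendzetup create --recursive \
  SRC '1d=>1h,7d=>1d' rpool/enc \
  DST:local '1d=>1h,7d=>1d' tank/backup/enc \
  DST:lan '1d=>1h,7d=>1d' root@backuphost:tank/backup/enc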
On one of my new servers, with a mirror of 1.6 TB HPE SATA SSDs, I got a ZFS error: "status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup."
What struck me was that the errors shown were only in snapshots, and the READ, WRITE and CKSUM counts were all 0 on every disk. Interestingly, listing the snapshots gives an I/O error:
root@infpmx01:~# zfs list -t snapshot rpool/enc/vm-110-disk-0
cannot iterate filesystems: I/O error
NAME                                                      USED  AVAIL  REFER  MOUNTPOINT
rpool/enc/vm-110-disk-0@rep_default_2022-06-06_23:45:33  1.20M      -  16.9G  -
rpool/enc/vm-110-disk-0@rep_default_2022-06-07_02:30:27  1.16M      -  16.9G  -
If you try to delete the affected snapshots, they aren't found.
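An attempt looks roughly like this (snapshot name abbreviated, error wording from memory; the point is that destroy reports no matching snapshot):

zfs destroy rpool/enc/vm-110-disk-0@rep_default_...
could not find any snapshots to destroy; check snapshot names.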
Anyway, I just assumed I had a bad disk or a failing drive backplane, but within a week the same thing happened to my original server, which I know is good. Exactly the same way:
root@infpmx01:~# zpool status -v rpool
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub canceled on Wed Jun 8 23:52:47 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        rpool                       ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x55cd2e404b51d71e  ONLINE       0     0     0
            wwn-0x55cd2e404b55cea6  ONLINE       0     0     0
            wwn-0x55cd2e404b564309  ONLINE       0     0     0
            wwn-0x55cd2e404b57c399  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        rpool/enc/vm-110-disk-0@2022-06-08-220000:<0x0>
        rpool/enc/vm-104-disk-0@rep_default_2022-06-08_19:30:56:<0x0>
        rpool/enc/vm-110-disk-0@rep_default_2022-06-08_20:45:40:<0x0>
        rpool/enc/vm-111-disk-0@rep_default_2022-06-08_21:15:25:<0x0>
        rpool/enc/vm-103-disk-0@2022-06-08-220000:<0x0>
I didn't make the znapzend connection (two destinations) until I started writing this, but I now have a suspicion that znapzend is causing the corruption. Any other ideas? I remade the pool because I changed from raidz2 to raidz1 (same ashift=12), and the SMART stats look perfectly fine on all the disks. If the backplane were failing, the spinning-disk pool should be affected too, and if an individual drive were failing there would be checksum errors. I don't believe I have a failing drive; I think this is a bug.
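For reference, this is the kind of check I mean by SMART stats (device path copied from the pool layout above, adjust for your own disks), plus a fresh scrub to re-verify the pool:

smartctl -a /dev/disk/by-id/wwn-0x55cd2e404b51d71e
zpool scrub rpool
zpool status -v rpool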