ZFS errors

promoxer

Member
Apr 21, 2023
Code:
root@pve:/# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat Sep  9 10:05:15 2023
        7.84G scanned at 50.5M/s, 2.77G issued at 17.8M/s, 7.84G total
        0B repaired, 35.32% done, 00:04:51 to go
config:


        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          sda3      ONLINE       0     0    12


errors: Permanent errors have been detected in the following files:


        rpool/ROOT/pve-1:<0x25899>

I detected this error and decided to perform a `zpool scrub rpool`.
How do I remove the file in question?
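
An entry shown as `<0x25899>` instead of a path is an object number that ZFS could not translate to a file name, which can happen when the file no longer exists. A minimal sketch of a common cleanup sequence follows, assuming the pool name `rpool` from the output above and that any damaged file has already been restored or deleted. The `run` wrapper and `clear_and_rescrub` are hypothetical helper names, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: clear the error counters and re-verify the pool.
# ASSUMPTION: the pool is called rpool, as in the status output above.
POOL="${POOL:-rpool}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clear_and_rescrub() {
    # Reset the READ/WRITE/CKSUM counters shown by zpool status.
    run zpool clear "$POOL"
    # Re-read and verify every block (wait for a running scrub to finish first).
    run zpool scrub "$POOL"
    # Check scrub progress and whether the error list still has entries.
    run zpool status -v "$POOL"
}

clear_and_rescrub
```

The error list may only shrink after a later scrub has completed, so checking `zpool status -v` again after the scrub finishes is part of the sequence.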
 
Code:
# creating a snapshot
zfs destroy vpool/vm-101-disk-1@backup
zfs snapshot vpool/vm-101-disk-1@backup

# zipping
zfs send vpool/vm-101-disk-1@backup | gzip > /tmp/email.vma.gz

# overwriting
rm /var/lib/vz/dump/email.vma.gz
mv /tmp/email.vma.gz /var/lib/vz/dump/email.vma.gz

# destroying the snapshot
zfs destroy vpool/vm-101-disk-1@backup

This script causes the errors and they keep recurring whenever the script runs, any idea why?
 
Maybe run dd against the virtual disk (inside or outside the VM) and see whether it hits the same problems? Your script reads the whole virtual disk, so it will run into any corruption on the underlying drive. Or maybe it's the mv to /var/lib/vz/dump that causes the error? Checksum errors show up when data is read from the drive. Have you run a long SMART self-test to check the drive? The SATA cable might also be unreliable.
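
The checks suggested above could be sketched roughly as follows. The zvol path `/dev/zvol/vpool/vm-101-disk-1` and the disk `/dev/sda` are assumptions inferred from the names in this thread, and `run` and `read_checks` are hypothetical helper names. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: full read test of the virtual disk plus a SMART self-test.
# ASSUMPTIONS: the VM disk is a zvol exposed at /dev/zvol/vpool/vm-101-disk-1
# and the pool member sda3 lives on /dev/sda. Adjust both for your system.
ZVOL="${ZVOL:-/dev/zvol/vpool/vm-101-disk-1}"
DISK="${DISK:-/dev/sda}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

read_checks() {
    # Read the whole virtual disk and discard the data; ZFS verifies the
    # checksum of every block it reads.
    run dd if="$ZVOL" of=/dev/null bs=1M
    # See whether the read added CKSUM counts or error entries on any pool.
    run zpool status -v
    # Start a long SMART self-test, then view the results once it has finished.
    run smartctl -t long "$DISK"
    run smartctl -a "$DISK"
}

read_checks
```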
 
Hmm I don't think there is a wire dedicated to checksum on the SATA cable.

rpool is the Proxmox disk, so there should be a decent number of reads and writes, but there are only checksum errors, and these errors have been reproducible since I added the script. It feels more like a bug, or I'm not getting the commands exactly right.
 
The ZFS checksums don't match the data read by ZFS. This can be caused by an undetected write error, or by the data being corrupted by the drive (medium or controller) or by the wires and connectors (cable, motherboard, corrosion, signal degradation). You can disable checksums for rpool if they are bothering you.
Please check the health of your drive (and cables and connections), because it looks like it's returning corrupt data. Maybe your system memory has an issue that is causing corruption? I don't see how your script can introduce ZFS checksum errors, but maybe someone else can.
 
There seems to be a bias towards my hardware without even asking about the physical conditions of my environment. How often do corrosion and signal degradation occur on a modern motherboard?

Anyway, this script no longer produces errors. The only difference is that I write directly to the destination folder instead of moving the file over from the tmp folder at the end.

Code:
zfs destroy vpool/vm-101-disk-1@backup
zfs snapshot vpool/vm-101-disk-1@backup
zfs send vpool/vm-101-disk-1@backup | gzip > /var/lib/vz/dump/temp-email.vma.gz
rm /var/lib/vz/dump/email.vma.gz
mv /var/lib/vz/dump/temp-email.vma.gz /var/lib/vz/dump/email.vma.gz
zfs destroy vpool/vm-101-disk-1@backup
 
I'm just trying to give you ideas about what could be causing checksum errors. Since you get fewer (or even no) errors when you change your script to read less from the disk, it suggests that the data read is sometimes returned corrupted (since the checksum does not match) but the drive itself does not notice it (as there are no I/O errors). This indicates a silent (intermittent) problem somewhere between (and including) the main memory and the drive medium. I fear that it will get worse over time and you will lose or corrupt data, but I'm happy you resolved your issue and don't have any problems with data loss.
 
My interest is in why the first script causes problems and the second does not. And please take note of this important observation: both results are consistently reproducible, like an on/off switch (thus ruling out most hardware issues).
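
One difference between the two scripts that may be worth checking: `mv` within a single filesystem is a plain rename, while `mv` across filesystems copies the file and then deletes the source, so the whole archive is read back from disk. The sketch below reports which case applies to a pair of directories. It assumes GNU `stat` (as on Debian-based systems), and `same_fs` and `describe_mv` are hypothetical helper names. Whether /tmp and /var/lib/vz/dump sit on different datasets on this system is not known from the thread.

```shell
#!/bin/sh
# Sketch only: tell whether `mv` between two directories is a rename or a copy.
# ASSUMPTION: GNU stat is available (stat -c %d prints the device number).

same_fs() {
    # Two paths are on the same filesystem when their device numbers match.
    [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

describe_mv() {
    if same_fs "$1" "$2"; then
        echo "rename: no file data is read or rewritten"
    else
        echo "copy+delete: the source file is read back in full"
    fi
}
```

For the two scripts in this thread, the comparison would be `describe_mv /tmp /var/lib/vz/dump` for the first and `describe_mv /var/lib/vz/dump /var/lib/vz/dump` for the second.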
 
