[SOLVED] Input/Output Error ZFS

vanlampe

New Member
Apr 22, 2023
Hello everyone,

I am fairly new to the Proxmox world and am currently having a problem with a ZFS storage.

The ZFS storage was created with two Seagate IronWolf 4 TB drives in mirror mode.

The affected VM has two disks: one for the operating system on a separate SSD (lvm-thin storage) and one 1 TB disk for the data, placed as a directory storage on top of the ZFS dataset. Inside the VM I mounted the 1 TB disk as an ext4 partition.

When creating backups of a VM I always get the error ERROR: Backup of VM 100 failed - job failed with err -5 - Input/output error.

I also get the same error when trying to move the data from one VM to another via rsync (on the same dataset).
Interestingly, the error does not occur when I sync the data via rsync from the VM to my local PC.

I have already tried to recreate the ZFS pool and the datasets, unfortunately without success. The error does not seem to be related to a broken HDD, as the drives work perfectly fine without ZFS (Power_On_Hours 1529) and the SMART values are also fine.

I would appreciate any tips on troubleshooting and fixing the problem.

Best regards,
Frank

Code:
root@pve:~# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 6.0.19-edge)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-2
pve-kernel-6.1: 7.3-6
pve-kernel-6.1.15-1-pve: 6.1.15-1
pve-kernel-6.0-edge: 6.0.19-1
pve-kernel-6.0.19-edge: 6.0.19-1
pve-kernel-5.15.107-1-pve: 5.15.107-1
pve-kernel-5.15.104-1-pve: 5.15.104-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-2
pve-firewall: 4.3-1
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
Hi,
please post the storage config (cat /etc/pve/storage.cfg), the VM config (qm config 100) and attach the journal since boot (journalctl -b > journal.txt) as well as the backup task log, which you can download by double-clicking the task in the WebUI.

Further, check and post the zpool status.

Edit: Also, did you run out of space? Check via zfs list
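For reference, the requested commands collected in one place (just a sketch; run as root on the PVE host, the redirect simply writes the journal to a file that can be attached here):

Code:
cat /etc/pve/storage.cfg
qm config 100
journalctl -b > journal.txt
zpool status
zfs list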
 
Hi Chris, thanks for the quick response.

Here are the outputs of the mentioned commands:

Code:
root@pve:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

lvmthin: vm-store
        thinpool vm-store
        vgname vm-store
        content rootdir,images
        nodes pve

zfspool: ZFS01
        pool ZFS01
        content images,rootdir
        mountpoint /ZFS01
        nodes pve

dir: ZFSData01
        path /zfsdata
        content iso,vztmpl,images,backup,snippets,rootdir
        prune-backups keep-all=1
        shared 0

dir: Backup-ZFS
        path /zfsbackup
        content iso,vztmpl,images,backup,rootdir,snippets
        prune-backups keep-all=1
        shared 0

dir: Backup-USB
        path /mnt/pve/Backup-USB
        content backup,images,snippets,rootdir,iso,vztmpl
        is_mountpoint 1
        nodes pve

Code:
root@pve:~# qm config 100
agent: 1
boot: order=scsi0;net0
cores: 2
memory: 6144
meta: creation-qemu=7.2.0,ctime=1681479414
name: debian-11
net0: virtio=3A:60:9E:1D:98:C7,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
parent: Initital
scsi0: vm-store:vm-100-disk-0,aio=threads,iothread=1,size=20G
scsi2: ZFSData01:100/vm-100-disk-0.qcow2,cache=writethrough,discard=on,iothread=1,size=1000G
scsihw: virtio-scsi-single
smbios1: uuid=ec3d069e-d241-4de2-8402-e1b21f32f3e4
sockets: 1
tags: seafile
vmgenid: ba8b2e94-7515-46f4-b4fe-d7d9bdfb153e

There seems to be an error in the zfs pool:
Code:
zpool status
  pool: ZFS01
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                                 STATE     READ WRITE CKSUM
        ZFS01                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST4000VN006-3CW104_ZW60BZDY  ONLINE       0     0     0
            ata-ST4000VN006-3CW104_ZW60BY6Z  ONLINE       0     0     0

Code:
NAME           USED  AVAIL     REFER  MOUNTPOINT
ZFS01          253G  3.27T       96K  /ZFS01
ZFS01/Data01   250G  3.27T      250G  /zfsdata
ZFS01/Data02  2.43G  3.27T     2.43G  /zfsbackup

I am not really sure how to fix this, but I am pretty sure those errors are quite fresh. I will try to recreate the affected disk.

Maybe I should also mention that I had problems with my motherboard that resulted in CPU errors. I replaced the mainboard yesterday but did not recreate the ZFS pool afterwards.
 

Attachments

status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
Yes, it seems like some hardware is misbehaving. Also perform a memory check and test your hard disks again before recreating the ZFS pool and restoring from backup.
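One possible way to run those checks from the shell (just a sketch; the device IDs are taken from the zpool status output above, and memtester is an extra Debian package that only tests part of the RAM while the system is running, so an offline memtest86+ run from the boot menu is more thorough):

Code:
# start a long SMART self-test on both mirror members, check results later
smartctl -t long /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW60BZDY
smartctl -t long /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW60BY6Z
smartctl -a /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW60BZDY

# re-read all data on the pool and verify it against its checksums
zpool scrub ZFS01
zpool status -v ZFS01

# rough in-OS memory test (2 GiB, one pass)
apt install memtester
memtester 2G 1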

On a side note: why do you use a dir-based storage to create a qcow2 on top of a ZFS dataset instead of using a zvol directly? Is this intentional? I would rather recommend using zvols for VM disks on a ZFS-backed storage.
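If you decide to switch, an existing disk can be moved to the zfspool storage while the VM is running; on a zfspool target it ends up as a raw zvol. A sketch, using the VM and storage names from this thread:

Code:
# move the 1 TB data disk of VM 100 from the qcow2/dir storage to ZFS01
qm move_disk 100 scsi2 ZFS01
# the old qcow2 is kept as an "unused" disk entry and can be removed
# afterwards, or pass "--delete 1" to remove it right away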
 
Okay, I will do that and report the results, thanks :)

There is no specific reason, I did it this way because I did not know better. But I think I also read that it is not possible to create snapshots when using zvols, is that correct?
 
All right, thanks for the clarification.

Just to be sure: I would create a ZFS pool in mirror mode, then datasets if necessary, and then add the storage as ZFS via the GUI?
 
You can create the ZFS pool directly via the WebUI by going to <Host> > Disks > ZFS and creating the zpool from there (if the disks are not recognized as unused, delete the partitions on them beforehand via gdisk). Once you have created the zpool, you can add it as storage via Datacenter > Storage > Add > ZFS; make sure you have Disk image and Container selected for Content and you are good to go. From there on you can use the ZFS-backed storage for your VM/CT disks.
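Roughly the same thing on the command line, in case the WebUI is not an option (just a sketch; zpool create wipes the disks, and the device IDs are simply the ones from earlier in the thread):

Code:
# create a mirrored pool from the two drives (destroys existing data on them!)
zpool create -o ashift=12 ZFS01 mirror \
    /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW60BZDY \
    /dev/disk/by-id/ata-ST4000VN006-3CW104_ZW60BY6Z

# register the pool as a storage for VM disk images and containers
pvesm add zfspool ZFS01 --pool ZFS01 --content images,rootdir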
 
Hi Chris,
the smartctl scan was fine, so I recreated the ZFS pool as mentioned. The backup still fails with the same error message.

Code:
INFO: starting new backup job: vzdump 104 --quiet 1 --mailnotification always --mailto frank@lambrette.net --mode snapshot --compress zstd --storage Backup-USB --notes-template '{{guestname}}'
INFO: Starting Backup of VM 104 (qemu)
INFO: Backup started at 2023-04-28 01:00:00
INFO: status = running
INFO: VM Name: seafile
INFO: include disk 'scsi0' 'vm-store:vm-104-disk-0' 20G
INFO: include disk 'scsi1' 'data:vm-104-disk-0' 1000G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/pve/Backup-USB/dump/vzdump-qemu-104-2023_04_28-01_00_00.vma.zst'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '4e344565-3485-4f4e-8b20-682f807b989e'
INFO: resuming VM again
INFO:   0% (988.1 MiB of 1020.0 GiB) in 3s, read: 329.4 MiB/s, write: 211.7 MiB/s
INFO:   1% (10.4 GiB of 1020.0 GiB) in 42s, read: 246.7 MiB/s, write: 221.9 MiB/s
INFO:   2% (20.5 GiB of 1020.0 GiB) in 1m 50s, read: 152.5 MiB/s, write: 138.7 MiB/s
INFO:   2% (21.9 GiB of 1020.0 GiB) in 2m 3s, read: 109.5 MiB/s, write: 105.8 MiB/s
ERROR: job failed with err -5 - Input/output error
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 104 failed - job failed with err -5 - Input/output error
INFO: Failed at 2023-04-28 01:02:05
INFO: Backup job finished with errors
TASK ERROR: job errors


Now I am getting some new errors; I will attach a screenshot of the serial console output of the affected VM.

I think I will reinstall Proxmox to make sure there were no errors during the installation process.
 

Attachments

  • Screenshot 2023-04-28 091319.png (351.4 KB)
Hi, please post an up-to-date journal journalctl -b > journal.txt and zpool status. The information from the screenshot is incomplete and not sufficient to help you narrow down the issue.

All I can tell is that there seems to be a kvm related backtrace.
 
Hi Chris, thanks for the super fast response :)


Code:
root@pve:~# zpool status
  pool: ZFSDATA
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME                                 STATE     READ WRITE CKSUM
        ZFSDATA                              ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST4000VN006-3CW104_ZW60BZDY  ONLINE       0     0     4
            ata-ST4000VN006-3CW104_ZW60BY6Z  ONLINE       0     0     2

errors: No known data errors

Please find the journal attached. I had to compress it because otherwise it is too large for upload.
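(Side note: per the action line in the status output, once the underlying cause is fixed the error counters can be reset and the pool re-verified, e.g.:)

Code:
zpool clear ZFSDATA
zpool scrub ZFSDATA
zpool status -v ZFSDATA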
 

Attachments

Okay, so according to your logs the kernel has issues with memory pages being in a bad state. So first of all I would recommend upgrading your motherboard to the latest available firmware [0] and running a prolonged memory test. This is not related to your drives but seems to be a memory issue, so maybe check the physical connection of the DIMM slots and/or clean them. Also, try to install and run the latest opt-in kernel by running apt update && apt install pve-kernel-6.2 and reboot using this kernel.
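The kernel switch spelled out step by step (a sketch; pve-kernel-6.2 is the opt-in kernel package for PVE 7.4):

Code:
apt update
apt install pve-kernel-6.2
reboot
# after the reboot, confirm the new kernel is actually running
uname -r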

[0] https://www.asus.com/us/motherboard...x670-p/helpdesk_bios/?model2Name=PRIME-X670-P
 
Thanks for the tips, I will flash the new BIOS and update to the opt-in kernel this afternoon. I installed the edge kernel because I thought it might be more compatible with my new hardware, but I'll give the opt-in kernel a try.
 
Thanks to your hint about the faulty RAM I was able to solve the problem: the mainboard was running the RAM at too low a frequency. I pinned the frequency to 5200 MHz according to the specification and no more errors occur. Backups also finally run through without any problems now.
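For anyone running into the same thing: the speed the modules actually run at can also be checked from within Linux, e.g. with dmidecode (assuming the package is installed):

Code:
dmidecode --type memory | grep -i "speed"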

Without your hint, I probably wouldn't have figured it out so quickly, since I thought the errors were caused by the CPU. So thanks a lot for your help :)

Have a nice Weekend!

Best regards,
Frank