io_uring feedback

chrcoluk

what kind of data corruption? Could you provide a few more details about the configuration, i.e. what storage was used, what disk controllers, disk settings? Best to open a separate thread and mention me there with @fiona
@fiona

Hi

The confirmed corruption was on the boot files in a Windows guest. It happened multiple times after upgrading from Proxmox 6.x to 7.x, and I then noticed the new io_uring default; as soon as I changed it to aio=native, the problems stopped and stayed stopped.

I then set up a new Windows VM using io_uring purely to test whether the problem would come back, this time using different physical drives as well. The boot files got corrupted again.

VM configuration:

6 GB RAM, no balloon
1 socket, 4 cores, CPU type EPYC
SeaBIOS
VirtIO GPU, 16 MB
q35 machine
VirtIO SCSI
zvol drive, 50G size, ssd=1, discard on, cache=none, throttled to 30000 write IOPS and 500 MB/s writes; the zvol has a 64k volblocksize. The pool is a ZFS mirror with 2 SSDs. No SMART errors, no scrub errors.

Proxmox 7.1-12, so it needs updating. Out of my 3 Proxmox hosts, this is the one I updated to 7.x first; I can upgrade it to 7.3 and retest with io_uring.
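
For reference, a minimal sketch of what that disk line looks like in /etc/pve/qemu-server/<vmid>.conf after switching to native AIO (the storage and volume names below are placeholders, not copied from my actual config):

scsi0: local-zfs:vm-100-disk-0,aio=native,cache=none,discard=on,iops_wr=30000,mbps_wr=500,size=50G,ssd=1

The same setting is exposed per disk in the GUI as "Async IO".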
 
Found this: I can confirm 'detect-zeroes=unmap' is configured by Proxmox on the VM, and this is an issue affecting even the latest QEMU 7.2 according to those discussions. However, I am not using the VirtIO block device; I am using VirtIO SCSI instead.

https://gitlab.com/qemu-project/qemu/-/issues/1404

According to this, it's only useful for legacy OSes that have no native unmap/trim support, and it can be compute-intensive as well.

https://serverfault.com/a/1022675/588681
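
For anyone wanting to check this on their own setup, the generated QEMU command line can be inspected, e.g. (100 is a placeholder VM ID):

qm showcmd 100 --pretty | grep detect-zeroes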
 
Hi,
The confirmed corruption was on the boot files in a Windows guest; it happened multiple times after upgrading from Proxmox 6.x to 7.x ... I can upgrade it to 7.3 and retest with io_uring.
What kernel were you using at the time the issues appeared?
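It is shown in the first line of pveversion -v output, or can be checked directly on the host, for example:

uname -r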

Found this: I can confirm 'detect-zeroes=unmap' is configured by Proxmox on the VM, and this is an issue affecting even the latest QEMU 7.2 according to those discussions. However, I am not using the VirtIO block device; I am using VirtIO SCSI instead.

https://gitlab.com/qemu-project/qemu/-/issues/1404
We have not released our version of QEMU 7.2 yet (prior versions are not affected by this bug), and the version we release will already include a fix for it, see here.
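
To see which QEMU build a node currently has installed, something like this can be used:

pveversion -v | grep pve-qemu-kvm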
 
It looks like I ran into a similar issue last weekend. I had to restore 2 VMs from my backup in order to get things working again: Windows 2012 and Windows 2019. Windows 2019 went straight into repair mode.

proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-2
pve-kernel-5.15: 7.3-1
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.195-1-pve: 5.4.195-1
pve-kernel-5.4.189-2-pve: 5.4.189-2
pve-kernel-5.4.189-1-pve: 5.4.189-1
pve-kernel-5.4.178-1-pve: 5.4.178-1
pve-kernel-5.4.174-2-pve: 5.4.174-2
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.106-1-pve: 5.4.106-1
ceph: 15.2.17-pve1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve3

2019 VM configuration:

BIOS UEFI
Machine: pc-i440fx-5.2
Virtio SCSI Single with 2 disks:
Cache write back
Discard enabled
IO Thread enabled
SSD Emulation enabled
Backup enabled
Async IO: Default (io_uring)
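
As a rough sketch (pool and volume names are placeholders), those settings correspond to drive lines like the following in qm config <ID>; no aio= option on the line means the "Default (io_uring)" setting applies:

scsi0: ceph-pool:vm-101-disk-0,cache=writeback,discard=on,iothread=1,size=100G,ssd=1
scsi1: ceph-pool:vm-101-disk-1,cache=writeback,discard=on,iothread=1,size=500G,ssd=1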

Checked all other VMs; Async IO is set to Default (io_uring).

The cluster has been running Proxmox 7.3 for over 2 weeks. When I look at the time the servers last reported in, my backup was still running.

The syslog comes back with the following:

Jan 28 05:37:56 BENS-NODE02 kernel: [47204.701852] CIFS: VFS: \\IPadress sends on sock 00000000ce9e3f20 stuck for 15 seconds
Jan 28 05:37:56 BENS-NODE02 kernel: [47204.701936] CIFS: VFS: \\IPadress Error -11 sending data on socket to server
Jan 28 05:37:56 BENS-NODE02 pvestatd[2820]: unable to activate storage 'Synology-NAS' - directory '/mnt/pve/Synology-NAS' does not exist or is unreachable
Jan 28 05:37:56 BENS-NODE02 pvestatd[2820]: status update time (54.554 seconds)
Jan 28 05:38:00 BENS-NODE02 pvestatd[2820]: got timeout
Jan 28 05:38:00 BENS-NODE02 pvestatd[2820]: unable to activate storage 'Synology-NAS' - directory '/mnt/pve/Synology-NAS' does not exist or is unreachable
Jan 28 05:38:10 BENS-NODE02 pvestatd[2820]: got timeout
Jan 28 05:38:10 BENS-NODE02 pvestatd[2820]: unable to activate storage 'Synology-NAS' - directory '/mnt/pve/Synology-NAS' does not exist or is unreachable
Jan 28 05:38:19 BENS-NODE02 pvestatd[2820]: got timeout
Jan 28 05:38:19 BENS-NODE02 pvestatd[2820]: unable to activate storage 'Synology-NAS' - directory '/mnt/pve/Synology-NAS' does not exist or is unreachable
Jan 28 05:38:30 BENS-NODE02 pvestatd[2820]: got timeout

Still, it's strange that 2 VMs got corrupted while 7 others did not?
 
Hi,
It looks like I ran into a similar issue last weekend. I had to restore 2 VMs from my backup in order to get things working again: Windows 2012 and Windows 2019. Windows 2019 went straight into repair mode.
do you know what the corruption looked like? Did the VMs stop working right away or during the next boot (lost partition table?)?

Jan 28 05:37:56 BENS-NODE02 kernel: [47204.701852] CIFS: VFS: \\IPadress sends on sock 00000000ce9e3f20 stuck for 15 seconds
Jan 28 05:37:56 BENS-NODE02 kernel: [47204.701936] CIFS: VFS: \\IPadress Error -11 sending data on socket to server
Jan 28 05:37:56 BENS-NODE02 pvestatd[2820]: unable to activate storage 'Synology-NAS' - directory '/mnt/pve/Synology-NAS' does not exist or is unreachable
Might've been a network issue/hang. Is Synology-NAS the CIFS mount that errored out? Do your VMs' disks reside on that storage? That could be the root cause of the corruption in your case.
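
To double-check the state of that storage from the node, something like the following can be used:

pvesm status
ls /mnt/pve/Synology-NAS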

Still, it's strange that 2 VMs got corrupted while 7 others did not?
Maybe they were in a more consistent (file system) state when the storage disconnect happened so it didn't affect them as badly?
 
Good morning Fiona,

My RMM software reported that the 2012 & 2019 servers weren't responding at all. I did a reboot; after that, it went straight into repair mode.
After starting the 2012 server, the Windows logo was briefly shown and then the screen went black.

Is Synology-NAS the CIFS mount that errored out? That's correct.
Do your VMs' disks reside on that storage? No, I use the Synology NAS to store my backups. I did both restores from the Synology NAS.
I am running a 3-node Ceph cluster.
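
For context, the Synology share is just a CIFS backup storage; a sketch of its /etc/pve/storage.cfg entry (server address, share name and username are placeholders):

cifs: Synology-NAS
        path /mnt/pve/Synology-NAS
        server 192.168.1.10
        share proxmox-backups
        content backup
        username backup-user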

Regarding the corruption with Async IO set to Default (io_uring): I am seeing several posts mentioning possible corruption because of this setting?

I am going to test my Synology NAS with my test server running kernel 6.1-2.1. If the "does not exist or is unreachable" error stays away, then I will have to upgrade my cluster servers to a newer kernel.
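
If it helps anyone else, installing the opt-in newer kernel on Proxmox VE 7.x should roughly come down to the following (assuming the pve-kernel-6.1 opt-in package; a reboot is required afterwards):

apt update
apt install pve-kernel-6.1
reboot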
 
The system which had the problem has been updated from Proxmox 7.1 to 7.3, and I have created a snapshot, so I am prepared to try it again on a Windows guest with io_uring and fall back to the snapshot if it breaks.
 
The system which had the problem has been updated from Proxmox 7.1 to 7.3, and I have created a snapshot, so I am prepared to try it again on a Windows guest with io_uring and fall back to the snapshot if it breaks.
If you manage to trigger the issue again, please share the output of pveversion -v and qm config <ID> with the affected VM's ID and the relevant part of the storage configuration (/etc/pve/storage.cfg), i.e. for the storage the VM uses.
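I.e., roughly (with 100 as a placeholder for the affected VM's ID):

pveversion -v
qm config 100
cat /etc/pve/storage.cfg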
 
It's been a while, but that's because this kind of testing really needs time. I haven't yet seen signs of new corruption, and I have now assigned io_uring to multiple drives. So either it's solved or it went away for whatever reason; I will never know, I guess, but I will report back if it becomes a problem again.
 
