KVM guests freeze (hung tasks) during backup/restore/migrate

Thanks for confirming.

I did a check with our new cluster and all the drive caches are on by default.
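For anyone who wants to run the same check, a rough sketch of how it can look (device names are placeholders and need adapting; hdparm covers SATA drives, sdparm the SAS case):

# report whether the on-drive write cache is enabled (SATA drive, /dev/sda is a placeholder)
hdparm -W /dev/sda

# same check for a SAS drive via the SCSI caching mode page (WCE = write cache enable)
sdparm --get=WCE /dev/sda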

From what I've discovered so far, these proprietary cards from Dell and HP seem to be a common thread in these types of issues.

Cheers
g
You're welcome. It seems counterintuitive to disable any cache on the controller card to speed up the whole zpool. I guess this has something to do with ZFS needing access to the individual blocks. What I don't understand then is: why doesn't the same explanation apply to the disks' caches? Those caches have to be turned on - at least if you want to speed up your zpool.

Anyway, I am glad I could solve the initial issues with the PERC controller. I just wonder why this information is nowhere to be found in the various threads in different forums...
 
Hi @rakurtz

I would say that the controller cache is another level of cache that can't be controlled by ZFS, while the drive cache may act in a different way.

RAID controller cache is specifically designed to be a middle-man cache, while the drive cache sits directly on the drive.

ZFS uses RAM for cache, and with direct access to the drives it can manage what it needs to; with no direct access to the controller, it has no way to manage the controller cache.
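As a small illustration of the "ZFS uses RAM for cache" part: on a Linux host with OpenZFS you can look at the ARC directly (a rough sketch, assuming the standard kstat path and the arc_summary tool shipped with OpenZFS):

# current ARC size and configured maximum, in bytes
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

# or the human-readable summary
arc_summary | head -n 40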

This is my understanding; happy to be corrected by someone who knows the more granular details.

Reading through these forums and others on the TrueNAS website, I'm seeing the same issues pop up over and over with these controllers from HP and Dell that can be switched between RAID and HBA mode; they always run into some sort of problem.

The best controller cards are dedicated HBAs, like an LSI/Broadcom 9300 8-port SAS.

Cheers
G
 
Apparently, things are different with PVE 7.

All VMs now seem to use io_uring async I/O by default:

"QEMU 6.0: The latest QEMU version with new functionalities is included in Proxmox VE 7. This
includes support for the Linux IO interface ‘io_uring’. The asynchronous I/O engine for virtual drives
will be applied to all newly launched or migrated guest systems by default"


I need to manually set aio=threads in the VM config to make the workaround described above work again.
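For reference, roughly how that looks on my side (VMID 100, the storage name and disk size are placeholders; note that the full drive line has to be restated when using qm set):

# show the current drive line of the VM (VMID 100 is a placeholder)
qm config 100 | grep '^scsi0'

# set aio=threads on that drive, keeping the rest of the options as they were
qm set 100 --scsi0 local-lvm:vm-100-disk-0,aio=threads,size=32G

# or edit /etc/pve/qemu-server/100.conf directly so the line reads:
# scsi0: local-lvm:vm-100-disk-0,aio=threads,size=32G

# a full stop/start of the guest (not just a reboot from inside it) is needed
# before the new aio mode is actually used by the QEMU process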

It seems there is something weird with aio.

Why do VMs heavily stutter/lag when the disk subsystem is put under load AND the VMs do I/O through async interfaces like io_submit() or io_uring?

What can explain this behaviour?

Is it a bug or "by design"?

Is there a way to switch back to the previous default behaviour, i.e. use aio=threads instead of aio=io_uring?
 