OK, I think I've diagnosed it. I'm not 100% sure on this yet, but I have gone from an OOM on every single run to no OOM at all.
Summary of the issue again.
Host machine:
- Proxmox 6.4
- 5.11 kernel (also happened on the 5.4 kernel when retested)
- 32 GB RAM
- No other running VMs
- zvol-backed storage
- sync=standard for writes
- QEMU cache mode set to none (no cache)
- OS type set to "other"

Guest:
- Windows 10 21H2
- 40 GB virtual disk
- Default write cache enabled, with flushes for sync writes
- 8 GB memory

ARC:
- Capped to 8 GB
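For reference, the ARC cap here is the usual zfs_arc_max module parameter; roughly how an 8 GB cap is applied on a Proxmox host (value is 8 GiB expressed in bytes):

    # runtime change, takes effect immediately
    echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

    # persist across reboots
    echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
    update-initramfs -u -k all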
So, as above, I was getting an OOM every time the write test in CrystalDiskMark ran. Oddly this doesn't happen on smaller virtual disks, only the boot disk.
When I changed the OS type in the machine config to Windows 10, it got even worse: instead of just the guest being killed, the entire Proxmox server went down with a kernel panic / memory deadlock. (This makes me wonder how much the OS type setting adjusts things on the host to cause this behaviour.)
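If anyone wants to confirm the same symptom on their host, the guest OOM shows up in the host kernel log as the oom-killer taking out the kvm process; something like this during/after the benchmark is enough to spot it:

    # check the host kernel log for oom-killer activity
    dmesg -T | grep -i -E 'out of memory|oom'
    journalctl -k -b | grep -i oom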
I decided this had to be down to memory fragmentation, as there were tens of gigs of free RAM when this was occurring.
cat /proc/buddyinfo showed there were very few free contiguous regions of memory while the write part of CrystalDiskMark was running. I also observed the host was using transparent hugepages, so I added 'transparent_hugepage=never' to the kernel cmdline.
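Roughly the commands involved, for anyone who wants to check the same thing (the cmdline part depends on whether the host boots with GRUB or systemd-boot):

    # watch low-order vs high-order free pages while the benchmark runs
    watch -n 1 cat /proc/buddyinfo

    # GRUB-booted hosts: add transparent_hugepage=never to GRUB_CMDLINE_LINUX_DEFAULT
    nano /etc/default/grub
    update-grub

    # systemd-boot hosts (e.g. ZFS root): append it to /etc/kernel/cmdline instead
    nano /etc/kernel/cmdline
    proxmox-boot-tool refresh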
After the reboot I retested with both OS type "other" and OS type "Windows", and the problem is gone.
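You can confirm THP is actually off after the reboot with:

    cat /sys/kernel/mm/transparent_hugepage/enabled
    # expected output: always madvise [never]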
I also discovered that capping the maximum size of the ZFS dirty data cache (which drives txg sizing) also prevents the problem, even with transparent hugepages enabled. By default on this machine it was set to 4 gig; I was able to go as big as 2 gig without issues.
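For reference, the knob behind that 4 gig default is zfs_dirty_data_max; a 2 gig cap looks roughly like this (value in bytes):

    # runtime change
    echo 2147483648 > /sys/module/zfs/parameters/zfs_dirty_data_max

    # persist across reboots
    echo "options zfs zfs_dirty_data_max=2147483648" >> /etc/modprobe.d/zfs.conf
    update-initramfs -u -k all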
There is serious write amplification, as has been shown in the SSD benchmark thread, and that amplification may well be contributing to the memory fragmentation. The volblocksize on the guest drive is 4k, since NTFS uses 4k clusters. I don't know whether a bigger volblocksize would have eased the issue; I may do more testing in the future.
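If anyone wants to experiment with that, volblocksize can only be set at zvol creation time, so it means creating a new disk and migrating the data onto it; the pool/dataset names below are just placeholders:

    # check the current block size of the guest disk
    zfs get volblocksize rpool/data/vm-100-disk-0

    # volblocksize is fixed at creation, so a larger one needs a new zvol
    zfs create -s -V 40G -o volblocksize=16k rpool/data/vm-100-disk-1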
Note the kernel documentation says the following, i.e. when a hugepage allocation fails it is supposed to fall back to standard pages. However, something isn't right given the issues I diagnosed, so I might raise this as a kernel bug later today. I might also raise it on the OpenZFS GitHub.
if a hugepage allocation fails because of memory fragmentation,
regular pages should be gracefully allocated instead and mixed in
the same vma without any failure or significant delay and without
userland noticing