VMs crashing with Out of Memory (OOM) on ZFS

philippt

Active Member
Hi,

I have a single PVE 7.3-4 machine that runs on the (default) ZFS setup. It is equipped with 32 GB of RAM and 512 GB + 1 TB disk space.
On this machine, I have three VMs:
- Windows #1 with 4 GB
- Windows #2 with 4 GB
- Linux with 256 MB
Ballooning is enabled, but I did set fixed RAM limits (min/max) for all VMs.

So all VMs together should be using at most a little more than 25 % of the available RAM.

Nevertheless, it has already happened several times that one of the VMs crashed because the system ran out of memory, i.e. an OOM kill of the kvm process, usually for Windows #1. This is obviously really annoying and is the reason why I already reduced the RAM assigned to the VMs; the Windows VMs would actually need more than 4 GB.

Could this behavior be related to ZFS' caching functionality? If so, is it safe to limit the usage to, let's say, 4 GB, so that I can assign more RAM to my VMs without taking the risk that they crash because of OOM?
If I understand https://pve.proxmox.com/wiki/ZFS_on_Linux#sysadmin_zfs_limit_memory_usage correctly (2 GiB base + 1 GiB per TiB of storage), 2 GB + 1.5 GB = 3.5 GB should be sufficient for ZFS in my setup.

Thanks
Philipp


Code:
root@XXX:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Jan 28 20:07:49 2023
Linux 5.15.83-1-pve                                           2.1.7-pve1
Machine: XXX (x86_64)                              2.1.7-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    99.7 %   15.5 GiB
        Target size (adaptive):                       100.0 %   15.6 GiB
        Min size (hard limit):                          6.2 %  995.5 MiB
        Max size (high water):                           16:1   15.6 GiB
        Most Frequently Used (MFU) cache size:         50.4 %    7.3 GiB
        Most Recently Used (MRU) cache size:           49.6 %    7.2 GiB
        Metadata cache size (hard limit):              75.0 %   11.7 GiB
        Metadata cache size (current):                 11.4 %    1.3 GiB
        Dnode cache size (hard limit):                 10.0 %    1.2 GiB
        Dnode cache size (current):                     0.7 %    8.4 MiB

Code:
root@XXX:~# pveversion -v
proxmox-ve: 7.3-1 (running kernel: 5.15.83-1-pve)
pve-manager: 7.3-4 (running version: 7.3-4/d69b70d4)
pve-kernel-helper: 7.3-2
pve-kernel-5.15: 7.3-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.15.60-2-pve: 5.15.60-2
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 15.2.16-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-1
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.2-1
proxmox-backup-file-restore: 2.3.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.0-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-2
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-2
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.7-pve3

Code:
root@XXX:~# cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

zfspool: hdpool
        pool hdpool
        content images,rootdir
        mountpoint /hdpool
        nodes XXX
        sparse 1

nfs: nas-XXX
        export /volume1/srv-XXX
        path /mnt/pve/nas-XXX
        server 192.168.XXX.XXX
        content backup

pbs: XXX
        datastore XXX
        server XXX
        content backup
        encryption-key XXX
        fingerprint XXX
        prune-backups keep-all=1
        username XXX@pbs
 
I mean, besides adding more RAM (if that is possible at all), there are not many options here.
So yes, I would try it with a reduced ARC max of 4 GB in this case.
But even with this, I would not assign more than an additional 8 GB* in total to the VMs (even though you gain around 11.5 GB of available memory); otherwise you might end up in a similar OOM situation again.
* The exact amount has to be tested by you, I guess.
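If you want to try that, here is a minimal sketch following the ZFS_on_Linux wiki page linked above (4 GiB is just the value discussed here, adjust to taste):
Code:
# apply the new limit at runtime (not persistent across reboots)
echo "$((4 * 1024*1024*1024))" > /sys/module/zfs/parameters/zfs_arc_max

# make it persistent: add the option to /etc/modprobe.d/zfs.conf ...
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf

# ... and, because the root filesystem is on ZFS, refresh the initramfs
update-initramfs -u -k all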
 
Options

1 - Disable transparent huge pages (add 'transparent_hugepage=never' to the kernel command line).
2 - Reduce the ZFS dirty data cache. This is independent of the ARC; it defaults to 10 % of RAM, capped at 4 GiB (whichever is smaller), so even at the default a few gigabytes of RAM can be tied up by in-flight writes alone.
https://openzfs.github.io/openzfs-docs/Performance and Tuning/Module Parameters.html#zfs-dirty-data-max
The documentation states 10 % for 'zfs_dirty_data_max_percent', but once 10 % of your RAM would exceed 4 gigs (i.e. with more than roughly 40 gigs of RAM), the 4 gig cap from 'zfs_dirty_data_max_max' takes over.
3 - Throttle your guest writes to limit how much of the host's dirty cache they can fill.
4 - Add swap; even a very small swap can prevent OOMs.

I found that a combination of setting zfs_dirty_data_max to 2 gigs and disabling huge pages was enough to allow unthrottled ZFS writes on a 32 gig system; a rough sketch of both changes is below.
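To be clear, the following is only a sketch assuming a stock PVE install; check whether your host boots via GRUB or systemd-boot before touching the kernel command line.
Code:
# 1) transparent huge pages: disable at runtime to test the effect (lost on reboot)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# to make it permanent, append transparent_hugepage=never to the kernel command line:
#   GRUB:         add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run: update-grub
#   systemd-boot: add it to the single line in /etc/kernel/cmdline, then run: proxmox-boot-tool refresh

# 2) dirty data cache: limit it to 2 GiB at runtime
echo "$((2 * 1024*1024*1024))" > /sys/module/zfs/parameters/zfs_dirty_data_max

# and persist it across reboots (refresh the initramfs, since root is on ZFS)
echo "options zfs zfs_dirty_data_max=2147483648" >> /etc/modprobe.d/zfs.conf
update-initramfs -u -k all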
 
Thanks for the quick responses! For now, I changed the ARC cache size to 4 GB:
Code:
root@XXX:~# cat /sys/module/zfs/parameters/zfs_arc_max
4294967296
and increased the Windows VMs' RAM to 8 GB. I am now at 70 % RAM consumption on the host, so I am happy for the moment ;-)

Will definitely look into chrcoluk's points as well and obviously observe the system further!

Thanks again!
 
I know this is an old thread, but I am having similar issues and am hoping for some additional information. The host has 40 GB of RAM and a total of 4.25 TB of raw storage in two pools: a single 256 GB OS disk and a 2 TB + 2 TB mirror pool. I'm trying to run a single VM (I know that's unusual, but there are reasons, not worth getting into here, why I did it this way) and would like it to have access to at least 8 GB of RAM, and preferably as much as possible.

1. I just reduced zfs_arc_max to 8 GB. Can I reduce zfs_dirty_data_max to less than 4 GB, and if so, what is a good number to start with?

2. I'm not sure how to do "1 - Disable transparent huge pages (add 'transparent_hugepage=never' to the kernel command line)". Would someone be willing to explain that in more detail? This will likely be my first kernel edit...

3. How can I throttle the guest writes?

4. If I set the guest VM's minimum RAM to 8 GB, how high can I safely set the maximum and just let the hypervisor manage it dynamically?

Thank you!
 