QEMU (KVM) high SWAP Usage

bogo22

Renowned Member
Nov 4, 2016
Hi,
I know there are a lot of threads about high SWAP usage, but I haven't found a good explanation or solution, so I'm asking for help to understand the high SWAP usage and how to fix it. I gathered some information below and can't see any over-allocation by the QEMU VMs or the PVE host, so I don't know why SWAP is used so aggressively.

Many thanks in advance for any help!

1) Current situation: SWAP is over 20GB
pve dashboard.png

2) The RAM usage of the PVE host for "Week (maximum)" does not show an over-allocation of RAM
Max-RAM of PVE.png

3) SWAP is a mirror
Code:
root@pve:~# vgdisplay swap0
  --- Volume group ---
  VG Name               swap0
  System ID        
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               59.99 GiB
  PE Size               4.00 MiB
  Total PE              15358
  Alloc PE / Size       15358 / 59.99 GiB
  Free  PE / Size       0 / 0

root@pve:~# swapon
NAME      TYPE      SIZE  USED PRIO
/dev/dm-4 partition  30G 20.7G   -2
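
For completeness: vgdisplay only shows that the VG has two PVs. The actual mirror layout of the swap LV can be confirmed with something like the command below (lvs ships with the lvm2 package already installed here); lv_layout should report a raid1/mirror layout.
Code:
root@pve:~# lvs -a -o lv_name,vg_name,lv_layout,devices swap0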

4) swappiness is set to 1 at startup via /etc/sysctl.d
Code:
root@pve:~# cat /proc/sys/vm/swappiness
1
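
For reference, the persistent setting is just a one-line drop-in roughly like the following (the filename is only an example), applied at boot or with sysctl --system:
Code:
# /etc/sysctl.d/85-swappiness.conf  (example filename)
vm.swappiness = 1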

5) ZFS ARC is limited at startup via /etc/sysctl.d
Code:
root@pve:~# cat /sys/module/zfs/parameters/zfs_arc_max
17179869184

root@pve:~# arcstat
    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
22:28:13     4       4   100       0     0      0    0    16G    16G   116G
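
(17179869184 bytes is exactly 16 GiB, which matches the 16G size/target that arcstat reports:)
Code:
root@pve:~# echo $(( 17179869184 / 1024 / 1024 / 1024 ))
16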

6) Identifying the processes with high SWAP usage: it's /usr/bin/kvm, mainly QEMU VM 151
Code:
root@pve:~# /root/show_swap_usage.sh
SWAP            PID             COMMAND
17.31 GB PID: 268197     /usr/bin/kvm -id 151 -name...
1.13 GB  PID: 1435137    /usr/bin/kvm -id 161 -name...
0.46 GB  PID: 265915     /usr/bin/kvm -id 100 -name...
0.39 GB  PID: 269074     /usr/bin/kvm -id 999 -name...
0.34 GB  PID: 266530     /usr/bin/kvm -id 110 -name...
0.13 GB  PID: 752520     /usr/bin/kvm -id 152 -name...
0.13 GB  PID: 1631307    /usr/bin/kvm -id 150 -name...
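
The script above is a local helper; a rough equivalent (not necessarily the original) simply sums the VmSwap field from /proc/<pid>/status, for example:
Code:
#!/bin/bash
# swap usage per process in kB, largest first (rough equivalent of the helper above)
for status in /proc/[0-9]*/status; do
    pid=${status#/proc/}; pid=${pid%/status}
    swap=$(awk '/^VmSwap:/ {print $2}' "$status" 2>/dev/null)
    if [ -n "$swap" ] && [ "$swap" -gt 0 ]; then
        cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline" | cut -c1-60)
        printf '%10s kB  PID: %-10s %s\n' "$swap" "$pid" "$cmd"
    fi
done | sort -rn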

7) The RAM usage of QEMU VM 151 for "Week (maximum)" does not show an over-allocation of RAM
Max-RAM QEMU 151.png

8) I'm not sure if this is relevant, because I am not familiar with cgroups
Code:
root@pve:/sys/fs/cgroup/system.slice# ll memory.swap.*
-r--r--r-- 1 root root 0 Jun 20 22:06 memory.swap.current
-r--r--r-- 1 root root 0 Jun 20 22:06 memory.swap.events
-rw-r--r-- 1 root root 0 Jun 20 22:06 memory.swap.high
-rw-r--r-- 1 root root 0 Jun 13 11:27 memory.swap.max
-r--r--r-- 1 root root 0 Jun 20 22:06 memory.swap.peak

root@pve:/sys/fs/cgroup/system.slice# cat memory.swap.*
572837888
high 0
max 0
fail 0
max
max
575606784

root@pve:/sys/fs/cgroup/qemu.slice/151.scope# ll memory.swap.*
-r--r--r-- 1 root root 0 Jun 20 22:04 memory.swap.current
-r--r--r-- 1 root root 0 Jun 20 22:04 memory.swap.events
-rw-r--r-- 1 root root 0 Jun 20 22:04 memory.swap.high
-rw-r--r-- 1 root root 0 Jun 16 00:32 memory.swap.max
-r--r--r-- 1 root root 0 Jun 20 22:04 memory.swap.peak

root@pve:/sys/fs/cgroup/qemu.slice/151.scope# cat memory.swap.*
18597539840
high 0
max 0
fail 0
max
max
18597543936
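
For what it's worth, these cgroup v2 counters are the easiest way to attribute swap to individual VMs; a quick loop over the qemu.slice scopes (paths as shown above) gives the same picture as the helper script in 6):
Code:
# per-VM swap usage read from cgroup v2
for scope in /sys/fs/cgroup/qemu.slice/*.scope; do
    vmid=$(basename "$scope" .scope)
    bytes=$(cat "$scope/memory.swap.current")
    printf 'VM %-6s %8d MiB in swap\n' "$vmid" "$((bytes / 1024 / 1024))"
done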

Code:
proxmox-ve: 8.4.0 (running kernel: 6.8.12-11-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-11-pve-signed: 6.8.12-11
proxmox-kernel-6.8: 6.8.12-11
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
ceph-fuse: 17.2.8-pve2
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
frr-pythontools: 10.2.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20250512.1~deb12u1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.11
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-4
pve-ha-manager: 4.0.7
pve-i18n: 3.4.4
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2
 
A lot going on there and I see you've done some of your homework. My take:

Why
1. From the PVE wiki: "ZFS uses 50 % of the host memory for the Adaptive Replacement Cache (ARC) by default. For new installations starting with Proxmox VE 8.1, the ARC usage limit will be set to 10 % of the installed physical memory, clamped to a maximum of 16 GiB. This value is written to /etc/modprobe.d/zfs.conf."
2. Paraphrased from the cgroup docs: once Linux swaps memory out under pressure, it stays in swap until it is actively accessed again, so your 17 GB of swap usage reflects historical pressure, not current pressure (see the quick check below).
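
A quick way to confirm that is the kernel's pressure-stall information (PSI), which should be available on the 6.8 PVE kernel; near-zero avg10/avg60/avg300 values mean the host is not currently struggling for memory:
Code:
root@pve:~# cat /proc/pressure/memory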


Quick fix to test
1. Confirm the limit really is 16 GB # you already did this; kept here for anyone reading this later
cat /sys/module/zfs/parameters/zfs_arc_max
2. Increase the ZFS ARC limit from 16 GB to 60 GB # based on your memory; size accordingly otherwise
echo 64424509440 > /sys/module/zfs/parameters/zfs_arc_max
3. If everything works as intended, make it permanent in /etc/modprobe.d/zfs.conf (see the sketch below)
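
A sketch of step 3, assuming the 60 GiB value from step 2 (60 * 1024^3 = 64424509440 bytes; adjust to your sizing). The module option is read at boot, so rebuild the initramfs afterwards:
Code:
# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=64424509440

root@pve:~# update-initramfs -u -k all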

We could go pretty deep into cgroups, swappiness, ballooning, and digging for regressions, but I would start here first.
 
Thank you for the reply.

Regarding "Why"
1. We limit the ARC to a much lower value, see below.
2. Good to know that swapped-out memory stays in SWAP even though it's not "used" anymore. Is there a way to empty the SWAP regularly? Or, even better, to avoid the high SWAP allocation in the first place? Since the graphs show that the system doesn't need to swap because it has plenty of resources, I don't understand why it swaps anyway. The VM that triggers the swapping is an SQL Server, so maybe it has something to do with VSS / MS-SQL backups / etc.? Of course the QEMU guest agent is installed and running.

Regarding the "Quick Fixes"
1. Yes, the ARC is limited to 16 GB via /etc/modprobe.d/zfs.conf (I confused it with the path /etc/sysctl.d in my post above; I also ran update-initramfs -u -k all after changing it, of course).

2. I increased it temporarily, but that doesn't "empty" the already-allocated SWAP (a sketch for draining it manually is below).

3. Already done as described in 1. (but with a much lower value than 60 GB, because we need the RAM for the VMs).
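
For reference, the usual way to drain swap back into RAM is to cycle it off and on; this is only safe when free RAM comfortably exceeds the roughly 21 GB currently in swap, and it can take a while:
Code:
root@pve:~# swapoff -a && swapon -a
root@pve:~# swapon --show    # USED should be back to (near) zero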