kvm invoked oom-killer

alpha754293

So here are the last 150 lines from /var/log/syslog:

(see link)
(Apparently, I can't post a comment with more than 16384 chars here, and it didn't give me the option to upload the text file with the last 150 lines here directly, so pastebin.com it is then.)

Here are my hardware specs:
Dual Xeon E5-2697A v4
Supermicro X10DRi-T4+ motherboard
256 GB DDR4-2400 ECC Reg RAM
16x HGST 10 TB + 8x HGST 6 TB ZFS pool for bulk storage (three raidz2 vdevs in one ZFS pool)
8x HGST 6 TB for VM disk storage (also raidz2)
4x HGST 3 TB in RAID6 on a Broadcom MegaRAID 12 Gbps SAS hardware RAID HBA for the Proxmox 7.3-3 OS

Here is the output of free -g:
Code:
root@pve:/var/log# free -g
               total        used        free      shared  buff/cache   available
Mem:             251          93         125          31          32         124
Swap:              7           7           0

And here is the output of ZFS arc_summary:

Code:
root@pve:/var/log# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Mon Apr 24 23:44:24 2023
Linux 5.15.74-1-pve                                           2.1.6-pve1
Machine: pve (x86_64)                                         2.1.6-pve1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                     9.1 %   11.4 GiB
        Target size (adaptive):                         9.2 %   11.5 GiB
        Min size (hard limit):                          6.2 %    7.9 GiB
        Max size (high water):                           16:1  125.9 GiB
        Most Frequently Used (MFU) cache size:         44.5 %    4.9 GiB
        Most Recently Used (MRU) cache size:           55.5 %    6.2 GiB
        Metadata cache size (hard limit):              75.0 %   94.4 GiB
        Metadata cache size (current):                  0.9 %  875.0 MiB
        Dnode cache size (hard limit):                 10.0 %    9.4 GiB
        Dnode cache size (current):                     1.9 %  181.9 MiB

ARC hash breakdown:
        Elements max:                                               1.6M
        Elements current:                              13.7 %     220.4k
        Collisions:                                                 7.6M
        Chain max:                                                     4
        Chains:                                                      778

ARC misc:
        Deleted:                                                  242.7M
        Mutex misses:                                              28.8M
        Eviction skips:                                             2.6G
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                  24.2 TiB
        L2 eligible MFU evictions:                     26.3 %    6.4 TiB
        L2 eligible MRU evictions:                     73.7 %   17.8 TiB
        L2 ineligible evictions:                                 6.1 TiB

ARC total accesses (hits + misses):                                 2.8G
        Cache hit ratio:                               89.4 %       2.5G
        Cache miss ratio:                              10.6 %     292.8M
        Actual hit ratio (MFU + MRU hits):             88.4 %       2.4G
        Data demand efficiency:                        69.6 %     487.5M
        Data prefetch efficiency:                       4.4 %      89.6M

Cache hits by cache type:
        Most frequently used (MFU):                    80.3 %       2.0G
        Most recently used (MRU):                      18.7 %     460.2M
        Most frequently used (MFU) ghost:               0.7 %      17.1M
        Most recently used (MRU) ghost:                 1.8 %      44.6M

Cache hits by data type:
        Demand data:                                   13.8 %     339.5M
        Demand prefetch data:                           0.2 %       4.0M
        Demand metadata:                               84.2 %       2.1G
        Demand prefetch metadata:                       1.9 %      46.5M

Cache misses by data type:
        Demand data:                                   50.5 %     148.0M
        Demand prefetch data:                          29.3 %      85.7M
        Demand metadata:                                7.4 %      21.6M
        Demand prefetch metadata:                      12.8 %      37.5M

DMU prefetch efficiency:                                          359.8M
        Hit ratio:                                     19.7 %      70.9M
        Miss ratio:                                    80.3 %     288.8M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

I am not sure why the system appears to be invoking the oom-killer when I have plenty of free RAM available.

Any help in this regard would be greatly appreciated.

Thank you.
 
I am not sure why the system appears to be invoking the oom-killer when I have plenty of free RAM available.
Well, I'd guess that the outputs from free/arcstat are from after the OOM kill triggered, so they are not really relevant.

Checking the situation at time of the OOM kill, i.e., looking at:
Code:
Mem-Info:
active_anon:13470119 inactive_anon:47822592 isolated_anon:795
 active_file:202 inactive_file:0 isolated_file:5
 unevictable:7733 dirty:0 writeback:2
 slab_reclaimable:141499 slab_unreclaimable:982494
 mapped:41725281 shmem:41729991 pagetables:159697 bounce:0
 kernel_misc_reclaimable:0
 free:489683 free_pcp:217 free_cma:0
Node 0 active_anon:43079172kB inactive_anon:76148660kB active_file:504kB inactive_file:0kB unevictable:30908kB isolated(anon):1816kB isolated(file):0kB mapped:82317904kB dirty:0kB writeback:8kB shmem:82332012kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1067008kB writeback_tmp:0kB kernel_stack:20376kB pagetables:365480kB all_unreclaimable? no
Node 1 active_anon:10801304kB inactive_anon:115141708kB active_file:304kB inactive_file:180kB unevictable:24kB isolated(anon):1364kB isolated(file):20kB mapped:84583220kB dirty:0kB writeback:0kB shmem:84587952kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 34064384kB writeback_tmp:0kB kernel_stack:8776kB pagetables:273308kB all_unreclaimable? no
Node 0 DMA free:11264kB min:4kB low:16kB high:28kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Node 0 DMA32 free:508344kB min:628kB low:2468kB high:4308kB reserved_highatomic:0KB active_anon:665260kB inactive_anon:652200kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1966908kB managed:1901372kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 126926 126926 126926
Node 0 Normal free:1246488kB min:140628kB low:270600kB high:400572kB reserved_highatomic:1120256KB active_anon:42413912kB inactive_anon:75496460kB active_file:504kB inactive_file:0kB unevictable:30908kB writepending:0kB present:132120576kB managed:129979764kB mlocked:30908kB bounce:0kB free_pcp:768kB local_pcp:648kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 Normal: 103955*4kB (UMH) 5561*8kB (UH) 4432*16kB (UH) 3357*32kB (UH) 2433*64kB (UH) 1655*128kB (UH) 556*256kB (UH) 106*512kB (H) 37*1024kB (H) 4*2048kB (H) 0*4096kB = 1248884kB

We see that there are only 489,683 free pages (roughly 1.9 GiB with 4 KiB pages), which indicates that the system might be running out of memory and the kernel has a hard time reclaiming pages for itself. This is also a NUMA system FWICT, so it might be worth telling the VM about that, so that QEMU & the guest OS can better match the actual HW.

What's the config of VM 1161?

It also seems that either the whole system was overloaded, or at least that this VM was already in trouble, as there are quite a few timeouts logged when Proxmox VE tries to query the VM status.

It also seems that you use the relatively new virtiofsd tech, which can add some extra memory pressure since the VM and the host are more tightly coupled.

Finally, what QEMU version are you running? At least your booted kernel (5.15.74-1-pve) is a bit dated.
So what's your pveversion -v output?
 
What's the config of VM 1161?

Here is the 1161.conf:

Code:
affinity: 20-27
agent: 1
args: -cpu host,hv_vapic,+invtsc,-hypervisor -chardev socket,id=char0,path=/var/run/shared-fs.sock -device vhost-user-fs-pci,chardev=char0,tag=myfs -object memory-backend-memfd,id=mem,size=64G,share=on -numa node,memdev=mem
bios: ovmf
boot: order=sata0;ide2
cores: 8
cpu: host,hidden=1
efidisk0: export_pve:1161/vm-1161-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
hookscript: local:snippets/gpu-hookscript-virtiofs.pl
hostpci0: 0000:81:00,pcie=1,x-vga=1,romfile=NVIDIA.RTXA2000.6144.210720.rom
ide2: none,media=cdrom
machine: pc-q35-7.1
memory: 65536
meta: creation-qemu=7.1.0,ctime=1679081283
name: win10pve
net0: virtio=92:B3:C4:6C:55:7F,bridge=vmbr2,firewall=1
net1: virtio=02:31:84:C3:78:CF,bridge=vmbr0,firewall=1
net2: virtio=8E:A1:05:DE:9A:EC,bridge=vmbr1,firewall=1
numa: 0
ostype: win10
sata0: export_pve:1161/vm-1161-disk-1.qcow2,size=1T
scsihw: virtio-scsi-single
smbios1: uuid=4f042484-6fcd-4f45-ac64-222b7aab4c5b
sockets: 1
vga: memory=128
vmgenid: f865b20b-4918-499f-b4e2-5a47bc337155

It also seems that either the whole system was overloaded, or at least that this VM was already in trouble, as there are quite a few timeouts logged when Proxmox VE tries to query the VM status.
Load average typically is < 16.00 (out of a 64 thread system), so that is generally unlikely.

I only noticed this because the remote desktop connection to the VM would be killed, so now I am trying to figure out why.

Yes, I'm using virtiofsd. It's FANTASTIC!

(I'm not sure how much in the way of system resources it consumes, but as far as I can tell -- not much. There's not a ton of traffic over virtiofsd.)
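
If you want to put a number on it, something like this should show the resident memory of the virtiofsd daemons on the host (assuming they run under the process name virtiofsd):

Code:
root@pve:~# ps -o pid,rss,vsz,cmd -C virtiofsd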

Finally, what QEMU version are you running?

Output of apt show qemu-system-x86 | grep Version:
Code:
Version: 1:5.2+dfsg-11+deb11u2

At least your booted kernel (5.15.74-1-pve) is a bit dated.

This system is my PROD system, so if it isn't broken, I am not going to try and fix it.

So what's your pveversion -v output?
Code:
proxmox-ve: 7.3-1 (running kernel: 5.15.74-1-pve)
pve-manager: 7.3-3 (running version: 7.3-3/c3928077)
pve-kernel-5.15: 7.2-14
pve-kernel-helper: 7.2-14
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-8
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.2-12
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.7-1
proxmox-backup-file-restore: 2.2.7-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.3
pve-cluster: 7.3-1
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.5-6
pve-ha-manager: 3.5.1
pve-i18n: 2.8-1
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-1
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.6-pve1

Thank you.
 
Load average typically is < 16.00 (out of a 64 thread system), so that is generally unlikely.
Load is a fairly limited metric that mainly accounts for CPU demand.
I'd rather recommend checking the pressure stall information (PSI) to also get memory and IO pressure.
I normally use head /proc/pressure/* for a quick overview; it shows the average stalls of some or all (full) processes due to waiting on CPU (scheduling), IO, or memory, averaged over the last 10s, 60s, or 300s. Once the values are >>0 you know that some (or all) processes were stalled for a bit in that time frame.

This system is my PROD system, so if it isn't broken, I am not going to try and fix it.
The constant stream of security fixes in the kernel is there to fix some kind of brokenness, though. But sure, if everything works and the hosts/services are not publicly accessible without any protection, the risk is relatively low.
Output of apt show qemu-system-x86 | grep Version:
That's Debian's upstream package, not ours, but as you posted the full pveversion -v output, we got the info anyway:
pve-qemu-kvm: 7.1.0-4
So that's one version behind our currently supported 7.2, but 7.1 isn't known for any excessive memory-usage bugs, so it's probably not the cause (at least not on its own).

affinity: 20-27
Is that pinning all to cores from the same NUMA node?
memory: 65536
-object memory-backend-memfd,id=mem,size=64G,share=on
Max size (high water): 16:1 125.9 GiB
So you give the VM 64 GiB of host memory as guest memory, plus another 64 GiB of host memory as a memfd backend, totaling 128 GiB (plus some QEMU & virtiofs overhead) for that VM alone. On top of that, the ARC is allowed to grow up to ~126 GiB, which together with the VM memory is already close to the 256 GiB installed in total.
If there's a lot of IO happening, the ARC will grow over time and the kernel will have a hard time reclaiming memory. If other VMs are running at the same time as well, the OOM isn't really that surprising.
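
To see how much host memory that VM actually pins at any given moment, you could check the resident set size of its QEMU process, for example (assuming the usual qemu-server PID file location and that the VM is running):

Code:
root@pve:~# grep -E 'VmRSS|VmSwap' /proc/$(cat /var/run/qemu-server/1161.pid)/status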

See our docs on how to limit the ZFS ARC to avoid this:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_limit_memory_usage
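
As a rough sketch of what those docs describe (the 16 GiB value here is just an example -- pick whatever fits your workload; the value is in bytes):

Code:
root@pve:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184
root@pve:~# update-initramfs -u -k all   # only strictly needed if the root filesystem is on ZFS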

sata0: export_pve:1161/vm-1161-disk-1.qcow2,size=1T
FYI: this is bound to be slow, and SATA is not as well maintained as (virtio) SCSI or VirtioBlk. Switching over to one of those and enabling IO threads on the guest disk would make it faster and less buggy, and would avoid stalls of the QEMU VM main thread (which can result in guest stalls in the worst case).
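
As a sketch, the relevant lines in 1161.conf could then look something like this (same qcow2 image, just attached as SCSI with an IO thread; the guest needs the virtio/vioscsi drivers installed before the boot disk is switched over):

Code:
scsihw: virtio-scsi-single
scsi0: export_pve:1161/vm-1161-disk-1.qcow2,iothread=1,size=1T
boot: order=scsi0;ide2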
 
Load is a fairly limited metric that mainly accounts for CPU demand.
I'd rather recommend checking the pressure stall information (PSI) to also get memory and IO pressure.
I normally use head /proc/pressure/* for a quick overview; it shows the average stalls of some or all (full) processes due to waiting on CPU (scheduling), IO, or memory, averaged over the last 10s, 60s, or 300s. Once the values are >>0 you know that some (or all) processes were stalled for a bit in that time frame.
I can check that when I get home from work later on today.

I know that my %iowait usually hovers around the 10% mark (nominal), but when one of my QNAP NAS units, which also runs Plex Media Server, is scanning the media libraries for updates, there is more network traffic and the %iowait goes up significantly.

Also, interestingly enough, load and %CPU usage on the dashboard are not necessarily related to each other: when there is high network activity (media scanning), the load goes up, but the %CPU usage still only hovers around the 5-6% mark.

The constant stream of security fixes in the kernel is there to fix some kind of brokenness, though. But sure, if everything works and the hosts/services are not publicly accessible without any protection, the risk is relatively low.
I am usually wary of pushing updates to the system, because I've had it happen on other Linux-based systems where I update something and, as a result, it breaks something else, and trying to "undo" said update is not always easy nor a trivial task. Sometimes it can be quite involved.

And this problem can potentially be compounded by the fact that, in pursuit of efficiency, newer releases of packages may not necessarily carry (nor maintain) everything that was in prior releases. So there is a risk that a new version of something breaks something else, because whatever used to be present in the package has been removed, and the other thing that depended on it no longer works.

I've had that happen to me before. (And then I have to go to rpmfind.net to try and dig up the older version of the package so that I can revert back to it.)

But I digress...

That's Debian's upstream package, not ours, but as you posted the full pveversion -v output, we got the info anyway:
Yeah, I had to google how to find out which QEMU version I'm running, and that was the command that I found.

(Sorry -- for n00bs such as myself, I take what you say very literally, because I don't know enough to infer what you might have meant with the questions you were asking.)


Is that pinning all to cores from the same NUMA node?
Yes, it should be. My server is a dual Xeon E5-2697A v4, so CPUs 0-31 should be on socket 1 and CPUs 32-63 on socket 2 (with the implication that socket = NUMA node).
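
A quick way to verify that mapping on the host (the exact numbering depends on how the kernel enumerates the cores and hyper-threads; numactl may need to be installed first):

Code:
root@pve:~# lscpu | grep -i numa
root@pve:~# numactl --hardware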

So you give the VM 64 GiB of host memory as guest memory, plus another 64 GiB of host memory as a memfd backend, totaling 128 GiB (plus some QEMU & virtiofs overhead) for that VM alone. On top of that, the ARC is allowed to grow up to ~126 GiB, which together with the VM memory is already close to the 256 GiB installed in total.
I'm not sure if that's how that works.

Whenever I have tried to make that size LESS than the VM's RAM, the VM will complain on start that the size doesn't match the amount of RAM, so I always have to change it back so that it matches.

As to whether it actually uses, or doubles, the amount of memory -- I am not so sure about that, because nothing obviously apparent suggests that that's what it is doing. (I think I have maybe 11 or 12 VMs running now -- at least 10.) And despite the RAM provisioned for each of those VMs, the dashboard was showing ~145 GiB out of 251 GiB of RAM used, of which I think some 45 GiB is ZFS ARC.

I would think that if the statement were true (that those objects double the memory usage), the system would have run out of memory a LONG time ago, but that isn't the behaviour I have observed while running the system. (It's been about a month-and-a-half that I've been using it, and I don't have any indication that this has been a problem -- though I also don't really know how to check it.)


If there's a lot of IO happening, the ARC will grow over time and the kernel will have a hard time reclaiming memory. If other VMs are running at the same time as well, the OOM isn't really that surprising.

See our docs on how to limit the ZFS ARC to avoid this:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_zfs_limit_memory_usage
Yeah, so funny story about this one - I had originally set the zfs_arc_min to be 4 GiB and zfs_arc_max to be 32 GiB.

But when I saw that the system was only using just south of 90 GiB of RAM, and that there was all this extra RAM that the dashboard didn't show as used, I removed both of those lines from the ZFS conf file so that it would go back to the default of using up to 1/2 of the installed RAM as ARC. (As shown above.)

I DID originally cap it. But seeing that the dashboard showed the system wasn't using as much as I thought it would, I removed the cap.
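
(If I ever want to re-cap it without rebooting, something like this should work at runtime -- 0 apparently means the default of half the installed RAM, and the 32 GiB value below is just an example:)

Code:
root@pve:~# cat /sys/module/zfs/parameters/zfs_arc_max
0
root@pve:~# echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max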

FYI: this is bound to be slow, and SATA is not as well maintained as (virtio) SCSI or VirtioBlk. Switching over to one of those and enabling IO threads on the guest disk would make it faster and less buggy, and would avoid stalls of the QEMU VM main thread (which can result in guest stalls in the worst case).
Thank you for your feedback.

My thought process when I set this up came down to two things:

1) The hard drives that the VM disk images sit on are spinning rust (8x HGST 6 TB SATA 3 Gbps 7200 rpm HDDs in a raidz2 zpool), so it's not going to be breaking any records any time soon anyway.


2) Thus, since it was going to be on slow storage, I figured that the speed of the virtual disk was unlikely to be critical, because the slowest link in the chain would be the HDDs themselves.

(For the record, that ZFS pool is set for full allocation when the VMs are created. Unlike thin provisioning on ZFS, which can eventually result in the data being "shotgunned out", thick provisioning (full allocation at VM creation) keeps the disk image relatively and reasonably sequential/contiguous -- at least in theory.)

And as always, the downside with thick provisioning is that there is a high probability of wasted space, but I would rather take the space penalty than the performance penalty that would've likely arisen from thin provisioning the VM disks on SATA 3 Gbps HDDs.

Code:
scsihw: virtio-scsi-single

I didn't want to use virtio block storage because I read the instructions on how to connect a tape drive to the Proxmox system, and it seemed significantly more complicated than just keeping the VM disk image as a file, which I can then send to my LTO-8 tapes as a file.

Backing up block devices to LTO-8 tape is significantly more challenging. (I posted this question on the Level1Tech forums; last I checked, nobody had replied, which would suggest that I'm not the only one who finds it a challenge.)

With LTO-8 tape and qcow2 files, I can back those up to tape, and if the entire system "takes a dump", I can always build another server and restore the qcow2 files from said LTO-8 tapes (and/or back up the VZdump files).

Files are easier to work with than block devices, for this reason.

And I don't have a PBS server set up where it can write the data straight to LTO-8 tape.
 
I normally use head /proc/pressure/* for a quick overview; it shows the average stalls of some or all (full) processes due to waiting on CPU (scheduling), IO, or memory, averaged over the last 10s, 60s, or 300s. Once the values are >>0 you know that some (or all) processes were stalled for a bit in that time frame.
Here is the output from that command:

Code:
root@pve:~# head /proc/pressure/*
==> /proc/pressure/cpu <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=4221496022
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

==> /proc/pressure/io <==
some avg10=36.82 avg60=29.20 avg300=24.39 total=100677828579
full avg10=36.06 avg60=28.55 avg300=23.82 total=97982537046

==> /proc/pressure/memory <==
some avg10=0.00 avg60=0.00 avg300=0.00 total=7731950056
full avg10=0.00 avg60=0.00 avg300=0.00 total=7588543275

(sidebar: Plex Media Server on the QNAP NAS was scanning the system when this command was executed, so the IO pressure is higher than normal)

System load average, however, is only at ~17.00, IO delay is around 16%, and CPU usage is about 6%, according to the Proxmox dashboard.
 