Memory leak(?) on 6.8.4-2/3 PVE 8.2

jnkraft

Node1, kernel 6.8.4-3
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: residual config
ksmtuned: 4.20150326+b1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

Node2, kernel 6.5.13-5 pinned
proxmox-ve: 8.2.0 (running kernel: 6.5.13-5-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.5.13-1-pve-signed: 6.5.13-1
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: residual config
ksmtuned: 4.20150326+b1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.0.11
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

Both nodes are used as NFS storage servers for the compute nodes, so there is no load on them other than the NFS server. PVE is used as the OS to keep the infrastructure uniform; basically only its Debian part is used. The hardware of the two nodes differs only in RAM amount and SSD models; all OS and network settings are identical (networking is done with OVS).
Until PVE 8.2 / kernel 6.8 everything was fine. For testing purposes I upgraded Node1 from PVE 8.1 to PVE 8.2, and it crashed after a few days: all RAM was eaten by nothing (a full green bar in htop's Mem), and the OOM killer took out almost every service, starting with OVS. I hard-reset Node1 via KVM (only the serial console was still alive), pinned kernel 6.5.13-5, and for a few days there was no memory consumption. Before and after the crash there was no load of any kind on Node1: no VM disks, nothing.
Then I upgraded Node2 from PVE 8.1 to PVE 8.2 but kept the kernel pinned to 6.5.13-5; it holds about 300 qcow2 disks (20 TB) and is fully healthy.
After a few days I unpinned the kernel on the empty Node1 and went back to 6.8.4-3. 1.5 days later, 48 of 64 GB are consumed by... nothing, again. I'm not much of a bash-fu guy, so here are the free outputs and htop screenshots from the sick and the healthy node:
Node1 (sick, kernel 6.8.4-3):
└─# free -m
Bash:
               total        used        free      shared  buff/cache   available
Mem:           64187       50711       13188          35         968       13475
Swap:          65535           0       65535

Node2 (healthy, kernel 6.5.13-5):
└─# free -m
Bash:
               total        used        free      shared  buff/cache   available
Mem:           48050        5774        1868           9       41017       42275
Swap:           8191         691        7500
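(A rough way to check whether such "invisible" usage is unreclaimable kernel slab memory rather than process memory - just a generic sketch, not output from these nodes:)
Bash:
# how much of the used memory is kernel slab, and how much of that is unreclaimable
grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
# the largest slab caches (run as root); growing sunrpc/nfsd-related caches would point at the NFS side
slabtop -o -s c | head -n 20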
Node1.jpg
Node2.jpg
All time ranges are approximate, not exact.
The forum did not allow me to paste the part of Node1's dmesg from before the hard reset via KVM, as it is too long, so it is attached as a file.
Any help with suggestions and diagnostics would be appreciated...
 

Attachments

  • dmesg.txt (314.3 KB)
14 hours after the first message I lost the connection to Node1; this time even the local console died, leaving only a blinking cursor.
 
Same story. After upgrading to PVE 8.2 (kernel 6.8.4-3) I'm facing the same memory leak.
ZFS pool with NFS shares and high IO load
(the ZFS ARC size is limited and does not grow)
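(For reference, such an ARC cap is typically applied like this on a ZFS box; the 8 GiB value below is just an example, not the actual setting here:)
Bash:
# show the current ARC size and its configured maximum, in bytes
awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
# a persistent cap, e.g. 8 GiB, goes into /etc/modprobe.d/zfs.conf:
#   options zfs zfs_arc_max=8589934592
# then: update-initramfs -u -k all   (and reboot)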
 
@Whatever It might be useful to downgrade the kernel to 6.5 for now, until they get the issue resolved?

The release notes for PVE 8.2 mention a way of doing that, which works. I'm presently using the 6.5 kernel on a system that has Nvidia hardware in it, which also doesn't seem to like kernel 6.8 (for now). ;)
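(For reference, the pinning described there boils down to roughly this; the version string is an example, use whatever `proxmox-boot-tool kernel list` shows:)
Bash:
# list the installed kernels, then pin the 6.5 one so it stays the boot default
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-5-pve
# later, to return to the normal boot order:
proxmox-boot-tool kernel unpin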
 
Some images
First server (active) - noticeable memory leak (I had to drop the ARC cache at the end)

1715670484601.png


Second server (exactly the same, but without NFS activity - redundant storage with ZFS replicas)

1715670619153.png
 
Hmm, for some reason I didn't receive any notifications about new messages in this topic.
First, thank you, guys, for paying attention to my post.
Second, after 3 days of testing and reinstalling PVE on test nodes, I found the cause of the memory leaks in my case. It's as silly as it is strange, to me at least.

Before the leaks started, during maintenance work on the sick Node1 I commented out the main share in /etc/exports (with an nfs-server restart) without disabling the corresponding storage at the PVE cluster level, so 8 compute nodes were bombarding it every n seconds with useless attempts to mount the non-existent share. And that was the cause.
Somehow kernel 6.5 handled these countless requests safely, but 6.8 did not.
The leak stops growing if I disable the storage at the datacenter level and unmount it by hand on all compute nodes, or if I simply stop the nfs-server on Node1.
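(In commands, that workaround amounts to roughly the following; 'mystorage' is a placeholder for the affected storage ID:)
Bash:
# on the PVE cluster: stop the nodes from trying to activate the storage
pvesm set mystorage --disable 1
# on each compute node: get rid of any stale mount by hand
umount -f -l /mnt/pve/mystorage
# or, on the storage node, simply stop the NFS server
systemctl stop nfs-server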
I had a (false) suspicion that, since Node1 had been upgraded from 7.4 to 8.1 in its time, some mysterious leftovers from the updates and upgrades were to blame. I took a test server with similar hardware, did a clean install of PVE 8.2, and ran my Ansible scripts with the network tunings for the 10G network. Same result: 6.5 is OK, 6.8 is not OK in the scenario described above.
If the NFS share exists, there are no leaks on the 6.8 kernel.
One thing I could not resolve: the memory "corrupted" by the leak is not freed until reboot. Restarting everything NFS- and network-related does not help.
To wrap up, my case was very specific, so I doubt anyone competent would investigate it.
 
Before the leaks started, during maintenance work on the sick Node1 I commented out the main share in /etc/exports (with an nfs-server restart) without disabling the corresponding storage at the PVE cluster level, so 8 compute nodes were bombarding it every n seconds with useless attempts to mount the non-existent share. And that was the cause.

In my case I don't have non-existent exported shares, only a dozen active ones exported from different ZFS pools/datasets.
 
Hi,
Before the leaks started, during maintenance work on the sick Node1 I commented out the main share in /etc/exports (with an nfs-server restart) without disabling the corresponding storage at the PVE cluster level, so 8 compute nodes were bombarding it every n seconds with useless attempts to mount the non-existent share. And that was the cause.
Somehow kernel 6.5 handled these countless requests safely, but 6.8 did not.
The leak stops growing if I disable the storage at the datacenter level and unmount it by hand on all compute nodes, or if I simply stop the nfs-server on Node1.
Thanks! With this, I can reproduce the issue and we'll investigate. Newer upstream kernels seem to contain a fix for this which we might find and be able to backport.
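(For anyone who wants to try the same, the scenario can be simulated roughly like this; the export path and host name are placeholders, not the exact setup used here:)
Bash:
# on the NFS server: comment out the export, then reload the export table and restart the NFS server
sed -i 's|^/srv/nfs/share|#&|' /etc/exports
exportfs -ra && systemctl restart nfs-server
# on a client: keep retrying the now missing export, like the compute nodes did
mkdir -p /mnt/test
while true; do mount -t nfs server:/srv/nfs/share /mnt/test; sleep 10; done
# on the server: watch unreclaimable kernel memory grow over time
watch -n 60 'grep SUnreclaim /proc/meminfo'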

To wrap up, my case was very specific, so I doubt anyone competent would investigate it.
Luckily, I'm incompetent ;)
 
I have nvidia vgpu working fine on kernel 6.8
Cool. The GPU in the system I'm using is a Geforce 3070 though, which apparently doesn't work for that approach. ;)

There are a lot of problems being reported with the Proxmox 6.8 kernel at the moment, so sticking with the 6.5 one for now seems like a good move anyway.
 
Luckily, I'm incompetent ;)
Please forgive me and forget my words, I was a little bit upset that no one from the Proxmox staff had been interested in my problem before this moment. :) I'll be glad to be useful in any way.
I updated one of the compute nodes with non-critical VMs to 8.2/6.8 and after a few hours got this in dmesg:
Bash:
[Wed May 15 01:10:44 2024] RPC: Could not send backchannel reply error: -110
[Wed May 15 01:29:56 2024] RPC: Could not send backchannel reply error: -110
As of now, the storage mounts on this node are healthy. I found this report after googling that RPC error: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568 - could that bug be somehow related to the memory leak situation?
For safety reasons I'll roll everything back to the 6.5 kernel for now.

In my case I don't have non-existent exported shares, only a dozen active ones exported from different ZFS pools/datasets.
Who knows, maybe the root cause of these problems is common to both our situations.
 
FYI, a fix for the issue was applied in git and will be part of the next kernel version (but it's not packaged yet at the time of this writing).


As of now, the storage mounts on this node are healthy. I found this report after googling that RPC error: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568 - could that bug be somehow related to the memory leak situation?
Was your NFS server already close to running out of memory there?
Otherwise, I'd guess it's not directly related, because the fix for the memory leak issue is just adding a missing call to release a certain buffer.
 
Hi,
Hi, I can see that a new version of the pve-kernel has been published. Does anyone know how to update? It doesn't seem to come up as a version when I do `apt install pve-kernel`, but I'm not sure whether I'm pointed at the right repositories?

https://git.proxmox.com/?p=pve-kernel.git;a=shortlog
Please note that the kernel packages are called proxmox-kernel-... nowadays, since the same kernels are used for Proxmox Backup Server and Proxmox Mail Gateway too.

And the answer is:
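(presumably something along these lines, since the 6.8 kernel meta-package is called proxmox-kernel-6.8:)
Bash:
apt update
apt install proxmox-kernel-6.8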
 
Have a good day, everyone. I have the same problem as described in this topic. I updated the system from kernel 6.5.13-5-pve to 6.8.8-2-pve and my system started crashing. It works for about an hour or an hour and a half, and then the load average rises sharply to 1400-2000 (but memory does not leak), whereas on the old kernel it did not rise above 10. And the most interesting thing is that on the test server running Proxmox 8.1.10 everything works great and the server does not disconnect from the NFS server.
 
Please tell me, is it possible to somehow roll PVE back to 8.1.10 without a clean reinstall?
 
