Kernels 6.8.5.2 and 6.8.5.3 consume MORE memory than 6.5

SimonR

I've now read about many problems with the new 6.8.5.2 and 6.8.5.3 kernels. In my case the server falls over and reboots during a simple backup job that has run for years through all other kernels. I think something must be wrong with the memory management of the new kernel. On some hosts I also had the problem that they couldn't boot at all; I solved that with intel_iommu=off on one host. I was happy to have a remote console to start an old kernel; sad for the people who don't have that.
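
For anyone hitting the boot problem: on a GRUB-booted host, adding the parameter looks roughly like this (just a sketch; systemd-boot installs edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

Code:
# /etc/default/grub: add intel_iommu=off to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"
# then regenerate the boot configuration and reboot
update-grub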

But here is what puzzles me: below you see the host's MAX memory usage for one month. Until about the 5th it was the old 6.5 kernel, all fine, as it has been for years, with the ZFS ARC set to 5 GB of RAM. After the 5th came the upgrade to kernel 6.8.5.2, and immediately 2 GB more usage. Why? I lowered the ZFS ARC size to 4 GB, one GB less. And what do you see? The same elevated memory usage, sometimes killing the host, mostly during backup jobs with the integrated Proxmox backup.
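
For reference, lowering the ARC cap works both at runtime and persistently; a minimal sketch (4 GB = 4294967296 bytes):

Code:
# Apply the new 4 GiB ARC limit immediately
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# Persist it across reboots (append, so existing options are kept)
echo "options zfs zfs_arc_max=4294967296" >> /etc/modprobe.d/zfs.conf
update-initramfs -u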

I hope you solve all this quickly. This evening is the third this week where I've had to restart a server, because my monitoring fires an evening-killing alert and I have to reboot the server again.

It's absolutely atypical, as you can see in my screenshot of the RAM usage. In all my years with Proxmox I have never seen graphs like the ones after the 5th. And looking at the yearly maximum, RAM usage was never as high in the last 12 months as it has been in the last two weeks on kernels 6.8.5.2 and 6.8.5.3.

For now I'll stay on the 6.5 kernel until this is solved.
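
For anyone wanting to do the same, the old kernel can be pinned so it stays the boot default (the version string below is from my system; check the list output for yours):

Code:
# Show installed kernels, then pin the 6.5 kernel as the boot default
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 6.5.13-5-pve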

Or is it a new feature that kernel 6.8 consumes 2 GB more memory? Not nice, especially if you have tuned the ZFS ARC precisely on small systems with only 16 GB of RAM.

[Screenshot: monthly MAX host memory usage]

[Screenshot: yearly MAX host memory usage]
 
I'm getting crashes on kernel 6.8.5.3, so I'm also sticking with kernel 6.5.
 
We have the same issue:
We have upgraded our hosts from Proxmox 7.4 to 8.2.
We use the same hardware, but we were surprised to notice that the total RAM usage is higher on 8.2 than on 7.4.

To be more specific: on a 512 GB node with 50 virtual machines, we had 400 GB of RAM in use when running Proxmox 7.4.
After migrating all these VMs to a Proxmox 8.2 host with the same hardware, we see ~440 GB of RAM usage.
(Total RAM allocated to the VMs is around 500 GB.)
We are on kernel 6.8.4.3.
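
A rough way to see how much of that is the QEMU processes themselves (PVE names them kvm) is to sum their resident memory; this is just a measurement sketch:

Code:
# Sum the resident set size (RSS) of all running QEMU/KVM processes, in GiB
ps -C kvm -o rss= | awk '{sum+=$1} END {printf "%.1f GiB\n", sum/1024/1024}'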
 
The problem is still there, and I think it has something to do with QEMU. Every day a bit more RAM is consumed, and after a week or so the whole PVE crashes and reboots. The RAM screenshot shows it. We have two PVE nodes in a small cluster; one of them runs only a small Win2019 domain controller with a fixed 6 GB of RAM. That one shows this behaviour; the other is only a ZFS replication host and its RAM usage has not increased for weeks on kernel 6.8.4.3. The ZFS ARC usage is OK and is not increasing. The red lines show the PVE crashes.
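
To make the daily growth visible outside the GUI graphs, the relevant /proc/meminfo counters can be logged periodically, e.g. from cron; a minimal sketch:

Code:
# Append a timestamped memory snapshot to a log file (run e.g. hourly via cron)
date >> /var/log/mem-trend.log
grep -E 'MemFree|MemAvailable|Slab|SUnreclaim' /proc/meminfo >> /var/log/mem-trend.log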

[Screenshot: daily increasing RAM usage; red lines mark the PVE crashes]
 
Hi,
are you running an NFS server on these nodes? Kernels >= 6.8.4-4 contain a fix for a memory leak there.
 
No, it's a simple ZFS volume. I'm running the older 6.5.13-5 kernel now, and without changing anything else there is no problem; the RAM usage is not increasing there. I'm now waiting for further official kernel updates after 6.8.4-4. At first I thought it might have something to do with a backup USB drive attached as ZFS, but that can't be the problem, because another server without any USB shows this increasing RAM usage too, about 250 MB more every day. It's also difficult to find the process that causes this behavior. I will take a few screenshots of htop every day now; maybe I'll find it.
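
Instead of screenshots, a daily text snapshot of the biggest consumers is easier to compare; a small sketch:

Code:
# Save the 15 largest processes by resident memory, one file per day
ps -eo pid,rss,vsz,comm --sort=-rss | head -n 15 > /root/ps-$(date +%F).txt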
 
Can you share the system journal from around the time the crashes happen? If the whole host crashes, it might be a leak in the kernel itself rather than in a user process. For user-process leaks, the OOM killer should kick in at some point, but it still won't hurt to monitor the usage with htop, to be sure. How often do you run replication? Please also share the output of zpool status -v and cat /etc/modprobe.d/zfs.conf.
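
For reference, the journal around a crash can be pulled like this (the times are placeholders):

Code:
# Everything from the previous boot, i.e. the one that ended in the crash
journalctl -b -1 > journal-crash.txt
# Or a specific window around the crash
journalctl --since "2024-07-04 22:00" --until "2024-07-05 01:00" >> journal-crash.txt
# Check for OOM-killer activity in the kernel log
journalctl -k | grep -iE 'oom|out of memory'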
 
It mostly crashes during a normal backup job to a Windows SMB share. After every backup job, about 300 MB more memory is consumed in total.
But with the older 6.5 kernel there was never any problem with the SMB shares during backups. And if I switch the same PVE back to the 6.5 kernel, everything runs fine with no memory growth. Replication runs on weekdays from 9:00 to 18:00, outside the crash time window, but a backup starts at 22:30.

Maybe there is a problem with SMB in the new kernel?

I cannot see any process consuming 300 MB more every day, so I think you're right that I won't find it by comparing htop screenshots. The RAM increase happens on every backup job to the SMB share. In the log you can see the PVE losing the SMB connection to the share. But: it only happens after a few days, once the RAM has grown and little free memory is left. I have checked the SMB connection in the logs for a few days now, and it is stable on the SMB side, not dropping any connections; the backup jobs run fine. So it's not a network or Windows SMB problem.
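
Since no user process grows, the memory is probably held inside the kernel; the slab caches are the place to look, and with the SMB suspicion the cifs-related caches in particular. A quick sketch:

Code:
# Kernel slab caches, largest first; a leaking cache keeps growing across backups
slabtop -o -s c | head -n 20
# Only the CIFS-related caches
grep -i cifs /proc/slabinfo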

Code:
root@pve3:~# zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 03:18:50 with 0 errors on Sun Jun  9 03:42:52 2024
config:

        NAME                                 STATE     READ WRITE CKSUM
        rpool                                ONLINE       0     0     0
          mirror-0                           ONLINE       0     0     0
            ata-ST2000NX0253_W462RKHW-part3  ONLINE       0     0     0
            ata-ST2000NX0253_W462RL0D-part3  ONLINE       0     0     0

errors: No known data errors

Code:
root@pve3:~# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_min=1073741824
options zfs zfs_arc_max=3221225472
 

Attachments

  • pve_syslog.txt
 
Maybe there is a problem with SMB in the new kernel?
That sounds plausible.
But: it only happens after a few days, once the RAM has grown and little free memory is left.
That sounds interesting and would support the suspicion that it's the SMB client code in the kernel.

This is just from a quick search, so it's a long shot, but it sounds like it could be the same issue (or rather a fix for it): https://lore.kernel.org/linux-cifs/20240625034332.750312-1-yangerkun@huawei.com/ It's tagged for stable kernels 6.6+, so it should come in via the Ubuntu 6.8 kernel at some point too.
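
Whether a given Proxmox kernel package already carries such a fix can usually be checked in its package changelog, e.g.:

Code:
# Inspect the changelog of the installed 6.8 kernel series for the CIFS fix
apt changelog proxmox-kernel-6.8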
 
O.K., I will wait for the next kernel update and post here whether the new kernel solves this problem.
 
O.K. Thank you. For your info: the problem still exists in kernel 6.8.8.2. The RAM increase started again after switching to the new kernel.

[Screenshot: RAM usage increasing again on kernel 6.8.8.2]
The patch is not yet applied to, or part of, any Proxmox VE kernel package, so it's not surprising if it is the same issue.
 
The problem is solved in kernel 6.8.8.4: no more CIFS memory leaks, and the RAM usage is stable now. But how could that kernel make it into the production (hardly tested?) repo? Was someone asleep while testing one to three runs of a simple backup job to an SMB/CIFS share? ;)
 
Can you give actual numbers, comparing with a previous kernel? I haven't heard any other reports of CIFS behaving badly with the new kernel.
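
A simple way to produce such numbers would be to note MemAvailable before and after each backup run; a sketch (the VM ID and storage name are examples):

Code:
# Note available memory, run one backup to the SMB storage, then compare
grep MemAvailable /proc/meminfo
vzdump 100 --storage smb-backup
grep MemAvailable /proc/meminfo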
 
Hi,
that link is missing an l at the end ;)
Can you confirm that the patch will be applied starting with kernel 6.9?
There won't be a 6.9 kernel on Proxmox VE anytime soon, as we are based on the Ubuntu kernel.
6.8.12-1-pve does not contain the patch
It does: https://git.proxmox.com/?p=pve-kern...5;hb=12593f0f92a4407c2d5c752fd602a44701af98e2
 
