Suspected memory leak in Proxmox CE 9.0.6

The graph is a weekly average, so the dip is the kernel update and reboot I did yesterday. Afterwards I migrated back the VMs that were running there before the kernel change, to have an apples-to-apples comparison. So it is 24 hours of runtime data with the new kernel.
This is exactly why I did not upgrade the second node today, to get a longer timeframe :-)
 
Hi,
Another update/report.
The second day is also good!
1758725300117.png

However, I am a little concerned that the "CPU Pressure Stall" graph is starting to show constant pressure. It is not much, but it is still noticeable compared to what it was with the previous kernel:
1758725393906.png
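
For anyone who wants to look at the raw numbers behind that graph: it presumably reflects the kernel's pressure stall information (PSI), which Linux exposes in /proc/pressure/cpu. Below is a minimal Python sketch for reading it, assuming PSI is enabled on the node. The function name parse_psi is only for illustration and is not part of any Proxmox tooling.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the text of a /proc/pressure/* file into {kind: {field: value}}."""
    result = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


if __name__ == "__main__":
    # Assumption: this runs on a Linux host with PSI enabled.
    psi = parse_psi(Path("/proc/pressure/cpu").read_text())
    print(psi["some"])
```

The avg10, avg60 and avg300 values are the percentage of time over the last 10, 60 and 300 seconds in which at least some tasks were stalled waiting for CPU; total is the accumulated stall time in microseconds. Comparing these between a node on the old kernel and one on the new kernel would show whether the difference in the graph is real.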

Meanwhile, I updated the second node. I will continue monitoring and report the results.
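
In case it is useful for the monitoring: a rough Python sketch for sampling memory usage from /proc/meminfo, so that numbers from different days or nodes can be compared directly instead of reading them off the graph. This is only an illustration; the helper names are made up, and which meminfo field best shows this particular leak is an open question.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            # First token is the number; an optional "kB" unit follows.
            values[key.strip()] = int(parts[0])
    return values


def used_fraction(meminfo):
    """Fraction of memory not available to new workloads (0.0 to 1.0)."""
    return 1 - meminfo["MemAvailable"] / meminfo["MemTotal"]


if __name__ == "__main__":
    # Assumption: this runs on a Linux host.
    info = parse_meminfo(Path("/proc/meminfo").read_text())
    print(f"used: {used_fraction(info):.1%}")
```

Run from cron or a loop and logged with a timestamp, this gives a simple series to check whether usage keeps climbing with the same set of VMs.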
 
there are two test kernels here:

http://download.proxmox.com/temp/kernel-6.8-ice-memleak-fix-1/ (6.8 for Bookworm)
http://download.proxmox.com/temp/kernel-6.14-ice-memleak-fix-1/ (6.14 for Trixie)

with a potential upstream fix. feedback would be appreciated!
Hello Fabian, when will the fix be added to the repositories? I would like to update additional clusters without having to manually update the kernel.
This version is currently the only way to operate/patch servers with E810 NICs.
 
we haven't yet managed to reproduce this issue in our test lab (as far as I know), and the feedback regarding the patch was not 100% good, so we need more testing. if you are on 9.x, you can also try the 6.17 opt-in kernel.
 
seems like the patch got into 6.17 and 6.16.9, so it should be included in our kernels one way or another soon.
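
A small Python sketch for checking a node's running kernel against those versions. Note the assumption: it only compares the upstream version number in the release string (as printed by uname -r), so a kernel with the fix backported, such as the test builds linked earlier in the thread, would still be reported as not having it. The function names are hypothetical.

```python
import platform
import re

# Per the post above: the patch seems to be in 6.16.9 and in 6.17.
FIXED_SINCE = (6, 16, 9)


def parse_kernel_release(release):
    """Turn a release string like '6.14.11-2-pve' into (6, 14, 11)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_upstream_fix(release):
    """True if the upstream version is at least 6.16.9 (6.17 compares greater)."""
    return parse_kernel_release(release) >= FIXED_SINCE


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, has_upstream_fix(running))
```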
 
@Falk R.: Maybe you are already aware, but you can check this post for a summary of the current status of the issue. Feel free to share your experience with the different kernel versions with us. As @fabian wrote, there were multiple kernel versions with different fixes for different memory leaks, but mixed results were reported (helped in some cases, didn't help in others, or helped but only partially).
 
Thanks for the info. I can't test the 6.17 kernel at the moment because I only manage production clusters with E810 NICs. This might help to reproduce the issue: they all run Ceph with MTU 9000 on the E810 NICs, bonded via LACP with the layer3+4 hash policy.
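
For anyone trying to reproduce this in a lab, here is a minimal Python sketch that checks whether a bond matches the setup described above (LACP, layer3+4 hash policy, MTU 9000). It parses the text the Linux bonding driver exposes under /proc/net/bonding/. The bond name bond0 and the function names are assumptions for illustration.

```python
from pathlib import Path


def parse_bonding(text):
    """Extract bonding mode and transmit hash policy from /proc/net/bonding/<bond>."""
    info = {}
    for line in text.splitlines():
        if line.startswith("Bonding Mode:"):
            info["mode"] = line.split(":", 1)[1].strip()
        elif line.startswith("Transmit Hash Policy:"):
            # Value looks like "layer3+4 (1)"; keep only the policy name.
            info["hash_policy"] = line.split(":", 1)[1].split()[0]
    return info


def matches_reported_setup(info, mtu):
    """True for an 802.3ad (LACP) bond with layer3+4 hashing and MTU 9000."""
    return (
        "802.3ad" in info.get("mode", "")
        and info.get("hash_policy") == "layer3+4"
        and mtu == 9000
    )


if __name__ == "__main__":
    bond = "bond0"  # assumption: adjust to the actual bond interface name
    info = parse_bonding(Path(f"/proc/net/bonding/{bond}").read_text())
    mtu = int(Path(f"/sys/class/net/{bond}/mtu").read_text())
    print(info, mtu, matches_reported_setup(info, mtu))
```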