We do have the pdpe1gb feature enabled for all of our guests. Ours are VMs as well.
Definitely something major changed in the kernel for these environments.
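In case it helps with comparing setups: a quick way to double-check that the flag actually reaches a guest, and what the VM's CPU line looks like on the host. This is only a sketch; the VMID 100 below is just an example.

# inside a Linux guest: pdpe1gb should show up in the CPU flags
grep -o pdpe1gb /proc/cpuinfo | sort -u

# on the Proxmox host: show the CPU line of the VM config
qm config 100 | grep '^cpu'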
A bit hacky but it is absolutely possible to download the last 5.15 kernel from the bullseye no-subscription repo and install it manually. E.g. http://download.proxmox.com/debian/...ve-kernel-5.15.116-1-pve_5.15.116-1_amd64.deb
You will then have to add it via the proxmox-boot-tool and pin it to make booting from it the default for the time being.

@aaron any news?
By the way, is there any chance to build a 5.15 kernel for PVE 8 for testing, or to build 6.2 with the patch applied?

So, over the last 2 days we tested this with the new dual-socket hardware and were able to reproduce it. With a working kernel (5.15, 5.19) we see ~120 GiB of KSM sharing about an hour after booting up the system and all the test VMs. On the problematic kernels (6.1, 6.2) we only see about 45 GiB of KSM, even after letting it settle overnight.
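For anyone who wants to compare numbers on their own hosts: the figures above can be read straight from the KSM sysfs counters; a small sketch, assuming the usual 4 KiB page size on x86_64:

# pages currently shared via KSM
cat /sys/kernel/mm/ksm/pages_sharing

# rough memory saving in GiB (pages_sharing * page size)
awk '{printf "%.1f GiB\n", $1 * 4096 / (1024*1024*1024)}' /sys/kernel/mm/ksm/pages_sharing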
The commit found by @spirit seems to be the fix. It got introduced into the Ubuntu mainline kernel with 6.4.13. Tests with it and the previous version, 6.4.12, show that 6.4.12 still has the problematic behavior, while 6.4.13 behaves like the older 5.15 and 5.19 kernels, resulting in good KSM usage.
Unfortunately, it seems that backporting the patch won't be easy. It is possible that the fix will be available only in a newer major kernel version in the (near) future.
Do you mind to compile it with the latest ZFS? And release it as a separate deb package?

A bit hacky but it is absolutely possible to download the last 5.15 kernel from the bullseye no-subscription repo and install it manually. E.g. http://download.proxmox.com/debian/...ve-kernel-5.15.116-1-pve_5.15.116-1_amd64.deb
You will then have to add it via the proxmox-boot-tool and pin it to make booting from it the default for the time being.
That's how I got it onto the PVE 8 test machines, for example.
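For reference, a rough sketch of those steps on the shell. The package filename and pinned version are assumptions based on the (truncated) link above, so adjust them to what you actually downloaded; the pin subcommand needs a reasonably recent proxmox-boot-tool.

# install the downloaded kernel package
apt install ./pve-kernel-5.15.116-1-pve_5.15.116-1_amd64.deb

# check which kernels proxmox-boot-tool knows about, then pin the 5.15 one as default
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.15.116-1-pve

# update the boot entries (harmless if the pin already triggered it)
proxmox-boot-tool refresh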
May I ask why you would need it? With the kernel I linked to in my previous answer, I was able to boot a ZFS-based Proxmox VE installation that was installed with the 8.0 ISO and did not notice any issues.
So why don't you want to release one new update for 7.2 with the latest ZFS (2.1.13) integrated?

Because Proxmox VE 7 is in maintenance mode, we focus on more targeted fixes. This kernel update contained an important fix for recent AMD microcode updates, and throwing in a ZFS update, which may well surface some kernel release-specific regressions of its own, is far from ideal if one wants to roll out another fix relatively fast.
ZFS 2.1.13 will get built and released over the next few weeks, but as the majority of the changes are for compatibility with newer kernels, BSD-specific bits that don't affect Proxmox projects, or the ZFS Test Suite (ZTS), the impact will be relatively low for most systems.
Anyway, back on topic: I would also like to remind everyone here that overcommitting memory is not always ideal, to put it lightly, KSM working or not. Especially on shared hosts, one opens up side channels between the different VMs and weakens the trust boundary between them, and it isn't out of the question that it can be breached completely (extracting secrets, in a similar vein to the Spectre attacks on CPU caches).
So in general, I'd only use KSM for applications where all VMs trust each other fully, and still leave good headroom to avoid sudden spikes in memory pressure when the KSM scale tips due to changes in the guests (updates, some abnormality in the workload, ...).
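If someone wants to act on that advice on an existing host, a minimal sketch of turning KSM off, assuming the stock ksmtuned service from the ksm-control-daemon package is what drives KSM on the node:

# keep ksmtuned from starting/steering KSM on this host
systemctl disable --now ksmtuned

# stop ksmd and un-merge all currently shared pages
echo 2 > /sys/kernel/mm/ksm/run

# instead of disabling it outright, ksmtuned's activation behaviour can also be
# adjusted in /etc/ksmtuned.conf (e.g. KSM_THRES_COEF)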
Don't get me wrong, that it went unnoticed is certainly not ideal, and we'll improve on that; but that it was broken for almost a year shows that, while some of you might depend heavily on this feature, the majority doesn't depend on such overcommitment of memory, at least not with multi-socket setups.
While we normally can take on most backports and even help to upstream them to stable kernels, this one is a bit too deep in the specifics of the memory subsystem, which itself is somewhat arcane territory where seemingly innocent changes can have a huge impact, so the time needed to ensure a specific backport can be made without other regressions is relatively large. As we do not see enough reports of affected setups via enterprise channels to justify moving lots of resources to this topic, we currently recommend keeping Proxmox VE 7 with its 5.15-based kernel until we either manage to find a targeted fix or provide a 6.5 kernel; currently the latter is more likely to happen sooner.
We also added a note for this issue to the known-issues section of the Proxmox VE 7 to 8 upgrade guide, and to the known-issues section of the release notes, so that others have a heads-up. Thanks again to everyone in the community who brought this to our attention; I naturally would also have preferred if a manageable fix were available.