KSM Memory sharing not working as expected on 6.2.x kernel

Hi,

The commit referenced in its Fixes: tag has been present since kernel 6.1, so 6.1 might be a candidate as well.

@Shomo @spirit One of the mails linked in the commit says that it affects huge pages. The commit message doesn't sound like it's limited to that, but for completeness' sake: are you using huge pages?

We do have the pdpe1gb feature enabled for all of our guests. Ours are VMs as well.

Definitely something major changed in the kernel for these environments.
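To answer the huge-pages question for your own hosts, here is a minimal Python sketch. It assumes the standard /proc/meminfo fields (HugePages_Total for explicit huge pages, AnonHugePages for transparent huge pages) and the usual Proxmox VM config directory /etc/pve/qemu-server/; the script itself is only an illustration, not something from this thread.

```python
#!/usr/bin/env python3
"""Rough check whether huge pages are in play on a PVE host (illustrative)."""
import glob
import re

def meminfo():
    """Parse /proc/meminfo into a dict of field name -> numeric value."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.split()[0])
    return fields

info = meminfo()
# HugePages_Total is a page count; AnonHugePages is reported in kB.
print("Explicit huge pages reserved:", info.get("HugePages_Total", 0))
print("Transparent huge pages (kB):", info.get("AnonHugePages", 0))

# An explicit "hugepages:" option in a VM config means that guest uses them.
for conf in sorted(glob.glob("/etc/pve/qemu-server/*.conf")):
    with open(conf) as f:
        match = re.search(r"^hugepages:\s*(\S+)", f.read(), re.MULTILINE)
    if match:
        print(f"{conf}: hugepages={match.group(1)}")
```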
 
Before leaving for vacation, @aaron told me that he still wasn't able to reproduce the issue. Unfortunately, we'll only receive dual socket hardware to test around the end of September. If we can reproduce the issue with that, we can try backporting the commit @spirit found to see if it helps.
 

Back to 5.13.x I go. Between 5.15.x live migration issues and 6.2.x KSM issues, this is rough for production.
 
They should arrive within the next few days :)
 
@aaron any news?

By the way, is there any chance to build a 5.15 kernel for PVE 8 for testing, or to build 6.2 with the patch applied?
 
So, over the last two days we tested this with the new dual-socket hardware and were able to reproduce it. With a working kernel (5.15, 5.19) we see ~120 GiB of KSM sharing about an hour after booting the system and all the test VMs. On the problematic kernels (6.1, 6.2) we only see about 45 GiB of KSM, even after letting it settle overnight.
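For anyone who wants to compare their own numbers: figures like these can be read from what the kernel exports under /sys/kernel/mm/ksm/. A minimal Python sketch, assuming the usual meaning of pages_sharing (base pages currently deduplicated) and the system page size; this is an illustration, not part of the test setup described above.

```python
#!/usr/bin/env python3
"""Minimal sketch: report KSM sharing in GiB from /sys/kernel/mm/ksm/."""
import os

def ksm_pages(name):
    """Read one of the KSM counters exported by the kernel."""
    with open(f"/sys/kernel/mm/ksm/{name}") as f:
        return int(f.read())

page_size = os.sysconf("SC_PAGE_SIZE")          # 4096 on x86_64
sharing_gib = ksm_pages("pages_sharing") * page_size / 2**30
shared_gib = ksm_pages("pages_shared") * page_size / 2**30

# pages_sharing: pages currently deduplicated; pages_shared: the KSM pages
# they are backed by. The first figure is roughly what shows up as
# "KSM sharing" in the host summary.
print(f"KSM sharing: {sharing_gib:.1f} GiB (backed by {shared_gib:.1f} GiB)")
```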

The commit found by @spirit seems to be the fix. It was introduced in the Ubuntu mainline kernel with 6.4.13. Tests with it and the previous version, 6.4.12, show that 6.4.12 still has the problematic behavior, while 6.4.13 behaves like the older 5.15 and 5.19 kernels, resulting in good KSM usage.

Unfortunately, it seems that backporting the patch won't be easy. It is possible that the fix will be available only in a newer major kernel version in the (near) future.
 
@aaron any news?

By the way, is there any chance to build a 5.15 kernel for PVE 8 for testing, or to build 6.2 with the patch applied?
A bit hacky, but it is absolutely possible to download the last 5.15 kernel from the bullseye no-subscription repo and install it manually. E.g. http://download.proxmox.com/debian/...ve-kernel-5.15.116-1-pve_5.15.116-1_amd64.deb
You will then have to add it via proxmox-boot-tool and pin it to make booting from it the default for the time being.

That's how I got it onto the PVE 8 test machines, for example.
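To spell out those steps, here is a rough Python sketch that just wraps the dpkg and proxmox-boot-tool calls. The file and ABI names are taken from the linked package and would need to match whatever you actually download; treat this as an illustration of the procedure described above, not an official script.

```python
#!/usr/bin/env python3
"""Sketch of the manual kernel install and pin steps described above."""
import subprocess

# Example names based on the .deb linked above; adjust to the package you
# actually downloaded.
DEB = "pve-kernel-5.15.116-1-pve_5.15.116-1_amd64.deb"
ABI = "5.15.116-1-pve"

def run(cmd):
    """Run a command, echoing it first, and fail loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["dpkg", "-i", DEB])                          # install the kernel package
run(["proxmox-boot-tool", "kernel", "list"])      # check that it is picked up
run(["proxmox-boot-tool", "kernel", "pin", ABI])  # boot it by default from now on
# If the boot entries are not refreshed automatically, running
# "proxmox-boot-tool refresh" afterwards should take care of it.
```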
 

What can be done to prevent this in the future? KSM is critical to production for so many environments.

IMO PVE 8 isn't production-ready with this issue.
 
Would you mind compiling it with the latest ZFS?
And releasing it as a separate deb package?
 
There will always be issues that will only occur under very specific combinations of hardware and software. And while we try to keep a very diverse set of hardware around for this exact reason, we cannot cover every possible combination.
This time it was KSM on multi-socket systems, and it got noticed on systems where KSM was heavily used. The next time it will be something else.
The more people report an issue and help to narrow down the root cause, the better; then we can hopefully provide a fix quickly.

We will definitely take a closer look at KSM on single- and multi-socket systems in the future. Luckily, backporting fixes to older kernels is usually not as hard as it is in this particular case. Kernel 6.5, which will have the fix, will be released in the upcoming weeks.
Would you mind compiling it with the latest ZFS?
And releasing it as a separate deb package?
May I ask why you would need it? With the kernel I linked to in my previous answer, I was able to boot a ZFS-based Proxmox VE installation that was installed from the 8.0 ISO and did not notice any issues.
 
Why? Quite simple: I used to use KSM in my environment, and there are lots of improvements in ZFS since version 2.1.11 (the version you mentioned).
As was mentioned above, KSM is one of the key features, and from my perspective 8.x cannot be production-ready without it fully working.
What's more, this is not only about the amount of memory available for VMs to start. Like many other users, I'm facing a significant performance drop with KSM enabled on kernel 6.2.

By the way, there was an update to the PVE 7.x kernel dated October 3:

pve-kernel (5.15.126-1) bullseye; urgency=medium

  * update to Ubuntu-5.15.0-88.98, based on 5.15.126

 -- Proxmox Support Team <support@proxmox.com>  Tue, 03 Oct 2023 19:24:13 +0

So why don't you want to release one more update for 7.x with the latest ZFS (2.1.13) integrated?


P.S. @aaron, I just want to remind you that, for now, the only working workaround to somewhat compensate for the performance drop in guests is disabling mitigations as well as the KSM service. I hope you would agree that this is not the best solution.
Just one link for example: https://forum.proxmox.com/threads/p...ows-server-2019-vms.130727/page-5#post-595613
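For completeness, a rough sketch of the KSM half of that workaround. The ksmtuned/ksm service names are assumptions based on the ksm-control-daemon package shipped with PVE, and disabling CPU mitigations is a separate step (kernel command line: mitigations=off, plus a reboot) that is intentionally not scripted here.

```python
#!/usr/bin/env python3
"""Sketch of disabling the KSM service, as mentioned in the workaround above."""
import subprocess

# Stop and disable the KSM tuning daemons so merging is not re-enabled on boot.
# check=False: depending on the setup, one of the units may not exist.
subprocess.run(
    ["systemctl", "disable", "--now", "ksmtuned.service", "ksm.service"],
    check=False,
)

# Writing 2 asks the kernel to unmerge everything KSM has deduplicated so far,
# so expect host memory usage to rise accordingly.
with open("/sys/kernel/mm/ksm/run", "w") as f:
    f.write("2\n")
```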
 
So why don't you want to release one more update for 7.x with the latest ZFS (2.1.13) integrated?
Because Proxmox VE 7 is in maintenance mode, we focus on more targeted fixes. This kernel update contained an important fix for recent AMD microcode updates, and throwing in a ZFS update, which may well surface some kernel-release-specific regressions, is far from ideal if one wants to roll out another fix relatively quickly.
ZFS 2.1.13 will get built and released over the next few weeks, but as the majority of the changes are for compatibility with newer kernels, BSD stuff that doesn't affect Proxmox projects, or the ZFS test suite (ZTS), the impact will be relatively low for most systems.


Anyway, back on topic: I would also like to remind everyone here that overcommitting memory is not always ideal, to put it lightly, KSM working or not. Especially on shared hosts, one opens up side channels between the different VMs and weakens the trust boundary between them, and it isn't out of the question that it can be breached completely (extracting secrets, a bit in a similar vein to the Spectre attacks on CPU caches).
So in general, I'd only use KSM for applications where all VMs fully trust each other, and still leave good headroom so that sudden spikes in memory pressure can be absorbed when the KSM scales tip due to changes in the guests (updates, some abnormality in the workload, ...).

Don't get me wrong, that it went unnoticed is certainly not ideal, and we'll improve on that. But the fact that it was broken for almost a year shows that, while some of you might depend heavily on this feature, the majority doesn't depend on such overcommitment of memory, at least not with multi-socket setups.
While we can normally take on most backports and even help to upstream them to stable kernels, this one goes a bit too deep into the specifics of the memory subsystem, which is itself somewhat arcane and where seemingly innocent changes can have a huge impact, so the time needed to ensure a specific backport can be made without causing other regressions is relatively large. As we do not see enough reports of affected setups via the enterprise channels to justify moving lots of resources to this topic, we currently recommend staying on Proxmox VE 7 with its 5.15-based kernel until we either manage to find a targeted fix or provide a 6.5 kernel; currently, the latter is more likely to happen sooner.

We also added a note about this issue in the known issues section of the Proxmox VE 7 to 8 upgrade guide, and in the known issues section of the release notes, so that others have a heads-up. Thanks again to everyone in the community who brought this to our attention; naturally, I too would have preferred if a manageable fix were available.
 

IMO most of us running large, real production clusters are having too many issues on any of the 5.15.x and 6.2.x kernels. It's been a mess.

KSM is and always has been a solid tool. I would have found out about the KSM issue far sooner if I hadn't moved to 5.15.x and realized that live migration was completely broken between CPU generations. Right back to 5.13.x I went.

I have been around here long enough to know that things were rock-solid from Proxmox 3 -> Proxmox 7.1 in our environment. We are talking about almost a decade here.

If you guys want to be an enterprise solution, these excuses need to stop. Just my 2 cents.
 
