Hi,
Maybe you can try this version (with kernel headers installed):
https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-586756
Which kernel version are they running? The patch fixing the issue is commit
82d811ff566594de3676f35808e8a9e19c5c864c
which is included in stable v6.1.51. The commit that introduced the issue was
a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
I did have the kernel-headers and kernel-devel packages installed for AlmaLinux.
The key... at least I think... was changing mmu_invalidate_seq to mmu_notifier_seq in the bpftrace script.
The AlmaLinux 8 kernel is based on the 4.18 kernel... but AlmaLinux (and RHEL before it) is known to backport a lot of stuff from other packages while keeping the version number the same, so it's somewhat difficult to know what may be included in this kernel.
The kernel-devel package for the latest kernel - kernel-devel-4.18.0-477.27.2.el8_8.x86_64 - does not seem to have an is_page_fault_stale() function. That's why I'm not sure whether all of this discussion really relates to my AlmaLinux 8 issue - except for the fact that this whole thread describes my symptoms to a tee.
There is a mention of mmu_notifier_seq in the /usr/src/kernels/4.18.0-477.27.2.el8_8.x86_64/include/linux/kvm_host.h file, but it declares mmu_notifier_seq as an unsigned long.
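For anyone wanting to repeat that check on another kernel-devel tree, a small Python sketch along these lines can report which of the two field names a kvm_host.h declares, and with what type. The function name and regular expression are my own for illustration, and it only handles simple one-line declarations.

```python
import os
import re

# Matches simple one-line declarations such as "unsigned long mmu_notifier_seq;"
SEQ_DECL = re.compile(
    r"^[ \t]*([A-Za-z_][\w \t]*?)[ \t]+(mmu_(?:notifier|invalidate)_seq)[ \t]*;",
    re.MULTILINE,
)

def find_seq_fields(header_text):
    """Return {field_name: declared_type} for the two counter names of interest."""
    return {name: ctype.strip() for ctype, name in SEQ_DECL.findall(header_text)}

if __name__ == "__main__":
    # Path taken from the kernel-devel package mentioned above.
    path = "/usr/src/kernels/4.18.0-477.27.2.el8_8.x86_64/include/linux/kvm_host.h"
    if os.path.exists(path):
        with open(path) as header:
            print(find_seq_fields(header.read()))
```

Whichever name shows up in the result is the one the bpftrace script would need to reference on that kernel.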
And in fact, I have run the modified bpftrace script to look at mmu_notifier_seq counts, and one server is showing a value of 3,405,720,771, which is north of the 2,147,483,647 maximum for a signed 32-bit int. But this particular server has never had this VM freezing issue.
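To make that comparison concrete, here is a minimal Python sketch of what would happen to that counter value if it were ever squeezed through a signed 32-bit int and then compared against the original unsigned long. It only illustrates the arithmetic, assuming a 64-bit unsigned long. The helper names are made up for the example and nothing here is taken from the kernel source.

```python
INT_MAX = 2**31 - 1  # 2,147,483,647

def as_int32(value):
    """Truncate to 32 bits and reinterpret as a signed C int (hypothetical helper)."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT_MAX else value

def as_uint64(value):
    """Reinterpret a possibly negative integer as unsigned 64-bit (sign extension)."""
    return value & 0xFFFFFFFFFFFFFFFF

def survives_int_round_trip(seq):
    """True if the counter still equals itself after passing through an int."""
    return as_uint64(as_int32(seq)) == seq

seq = 3_405_720_771                      # value reported by the bpftrace script
print(seq > INT_MAX)                     # True
print(as_int32(seq))                     # -889246525
print(survives_int_round_trip(seq))      # False
print(survives_int_round_trip(INT_MAX))  # True
```

If a comparison like that were the cause, it would only start failing once the counter passes 2,147,483,647, which would fit freezes appearing after anywhere from days to months of uptime. It would not, by itself, explain why the server already above that value has never frozen.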
I suppose one question would be whether mmu_notifier_seq corresponds directly to mmu_invalidate_seq.
The other question would be if the CPU in use plays a role in this somehow.
The server that has never had this VM freezing issue (the one with an mmu_notifier_seq count of 3,405,720,771) is using an AMD Ryzen 9 3900X CPU.
The other servers that are experiencing this VM freezing issue are using these CPUs:
Intel Xeon E3-1230v2
Intel Xeon E3-1270v2
Intel Core i9-11900
The VM freezes happen randomly and I've never been able to find any cause. The last VM froze up after 2 days of uptime. Another froze up after 130+ days of uptime.
When the freeze-ups happen, the qemu-kvm process is running at 100% CPU. All of the CPUs dedicated to that VM (these are all single-tenant node servers - only one VM running on the server) are showing 0% idle and just get stuck.
Again, sorry for muddying up this thread - as I said, I'm not using Proxmox - but I've been pulling my hair out for months trying to figure this out. I found this thread through a Google search, and other than my setup being AlmaLinux and not Proxmox, everything this thread describes seems to be happening during my freeze-ups. A Google search doesn't seem to reveal any other AlmaLinux users experiencing this issue, which is puzzling in itself.