Thanks for all the reports and discussions. We have tried to reproduce the intermittent freezes (CPU spikes / lost pings) reported in this thread in our test environment, but have not succeeded so far.
Hence, the root cause of the intermittent freezes is unfortunately still unclear. Let me try to summarize the reports from this thread:
- All freezes were reported on host kernel 6.2; no freezes were reported on kernel 5.15
- There are no reports whether kernels 5.19 or 6.1 are affected or not
- On kernel 6.2, disabling KSM and mitigations fixes the issue. Obviously, disabling mitigations is not advisable in the general case.
- Most reported freezes mention a dual-socket CPU
- The issue primarily affects Windows guests, more specifically Win2019
- Freezes become more likely with higher amounts (> 128G) of configured guest memory
- The intermittent freezes seem unrelated to the (permanent) 100% CPU freeze issue that was discussed over at [1], as that issue is fixed in kernels >= 6.2.16-12, but e.g. @Whatever and @Sebi-S still report intermittent freezes here with 6.2.16-15 and 6.2.16-12.
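To tell the two issues apart on your own host, it can help to check first whether the running kernel already contains the fix for the permanent 100% CPU freezes from [1]. The following is only a rough sketch: the helper name has_cpu_freeze_fix is made up for illustration, and it assumes that `uname -r` reports something like 6.2.16-15-pve and that GNU `sort -V` is available.

```shell
#!/bin/bash
# Hypothetical helper, not part of PVE: succeeds if the given kernel release
# is >= 6.2.16-12, the version mentioned above as fixing the permanent freezes.
FIXED_IN="6.2.16-12"

has_cpu_freeze_fix() {
    # Assumption: PVE kernel releases carry a "-pve" suffix, which is dropped here.
    local release="${1%-pve}"
    # sort -V orders version strings; if FIXED_IN sorts first (or both are
    # equal), the given release is at least FIXED_IN.
    [ "$(printf '%s\n%s\n' "$FIXED_IN" "$release" | sort -V | head -n1)" = "$FIXED_IN" ]
}

if has_cpu_freeze_fix "$(uname -r)"; then
    echo "Running kernel $(uname -r) should contain the fix from [1]"
else
    echo "Running kernel $(uname -r) is older than $FIXED_IN"
fi
```

If the kernel is already at or above that version and you still see short freezes, your report most likely belongs to the intermittent freezes discussed here.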
Since @Neobin @Whatever @mygeeknc mentioned the KSM regressions on dual-socket machines with kernel 6.2 discussed at [2], we also tried to reproduce the intermittent freezes on the dual-socket test machine which does exhibit the KSM regressions, but no luck so far.
To everyone who can easily reproduce the freezes on a test machine:
- Could you check whether there is anything in the (host) journal during the freezes?
- Please fill in $YOUR_VMID in the following script, save it and run it during an intermittent freeze.
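Code:
#!/bin/bash
VMID=$YOUR_VMID
# PID of the QEMU process that runs the VM
PID=$(cat /var/run/qemu-server/$VMID.pid)
# Summarize the syscalls of the QEMU process for 5 seconds
timeout 5 strace -c -p $PID
# Print the cgroup pressure information of the VM
grep '' /sys/fs/cgroup/qemu.slice/$VMID.scope/*.pressure
# Print the KSM information of the QEMU process once per second
for _ in {1..5}; do
    grep '' /proc/$PID/ksm*
    sleep 1
done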
- If you do *not* use ZFS, there may be one interesting (but very hacky, so beware!) thing you could try to see if there is any connection to the KSM regressions [2]:
- 1) Install and boot Ubuntu mainline kernel 6.4.12 on the host, check whether the intermittent freezes are reproducible
- 2) Install and boot Ubuntu mainline kernel 6.4.13 on the host, check whether they are still reproducible.
If you do want to try this: To install an Ubuntu mainline kernel, download the linux-image-unsigned-[...].deb and linux-modules-[...].deb packages from https://kernel.ubuntu.com/mainline/ (for 6.4.12 see [4], for 6.4.13 see [5]), and install them with one apt command, i.e., apt install ./linux-image-unsigned-[...].deb ./linux-modules-[...].deb. Note that running PVE with an Ubuntu mainline kernel 6.4 is definitely not a supported setup, but this experiment could potentially be helpful in debugging this issue.
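For the journal check mentioned above, here is a minimal sketch of how one could limit the host journal to a window around a freeze. The function names journal_window and show_freeze_journal and the default margin of 60 seconds are assumptions made for illustration; the sketch relies on journalctl accepting timestamps in the @<seconds since epoch> form.

```shell
#!/bin/bash
# journal_window EPOCH [MARGIN]
# Hypothetical helper: prints --since/--until options that cover MARGIN seconds
# (default 60) before and after the given freeze time (seconds since epoch).
journal_window() {
    local t="$1" margin="${2:-60}"
    printf -- '--since=@%d --until=@%d\n' "$((t - margin))" "$((t + margin))"
}

# show_freeze_journal EPOCH [MARGIN]
# Prints the host journal for that window; run this on the PVE host as root.
show_freeze_journal() {
    # Word splitting of the helper output is intended here, it yields two options.
    # shellcheck disable=SC2046
    journalctl --no-pager $(journal_window "$@")
}
```

For example, for a freeze observed around 14:03 (a made-up timestamp), one could run show_freeze_journal "$(date -d '2023-09-20 14:03:00' +%s)" 120 and attach the output to the report.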
[1] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/
[2] https://forum.proxmox.com/threads/ksm-memory-sharing-not-working-as-expected-on-6-2-x-kernel.131082/
[3] https://forum.proxmox.com/threads/k...ted-on-6-2-x-kernel.131082/page-3#post-595600
[4] https://kernel.ubuntu.com/mainline/v6.4.12/amd64/
[5] https://kernel.ubuntu.com/mainline/v6.4.13/amd64/