VM time jump when migrating

Andreas Pflug

Today I live-migrated several VMs from a PVE 7.1 machine to a 7.2 machine (shared storage). Most VMs behaved fine, but two showed a strange issue and needed a reboot (all VMs run a 4.19 kernel and are NTP-synced using systemd-timesyncd).

The migration took place at 11:25 and completed within a few seconds with no problems reported, but the VM's kernel.log shows
11:26:53 vm01 kernel: [3196062.316169] INFO: rcu_sched self-detected stall on CPU
....
May 5 12:14:26 vm10 kernel: [3196062.330071] INFO: rcu_sched detected stalls on CPUs/tasks:
May 5 12:14:26 vm10 kernel: [3198927.720881] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 2669s! [systemd-journal:447]

The machine was then rebooted at 11:36. Apparently the migration made some CPU counter jump roughly 40 minutes into the future (on both machines).
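Just to make that visible from the quoted lines themselves, here is a tiny back-of-the-envelope sketch (it only assumes that the bracketed printk timestamps are seconds since guest boot; the numbers are the ones quoted above):

#!/usr/bin/env python3
# Rough arithmetic on the printk timestamps quoted above (assumed to be
# seconds since guest boot); purely illustrative.
stall  = 3196062.330071   # "rcu_sched detected stalls" line
lockup = 3198927.720881   # "soft lockup" line, logged in the same wall-clock second
jump_s = lockup - stall
print(f"printk delta: {jump_s:.0f} s (~{jump_s / 60:.0f} min)")
# The soft lockup itself reports the CPU stuck for 2669 s (~44 min),
# which is in the same ballpark.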

A third machine showed "task kworker/0:1:14690 blocked for more than 120 seconds." six times (over 10 minutes), then made a time jump of 65 minutes, showed
"rcu: INFO: rcu_sched self-detected stall on CPU", and went back to normal operation without admin intervention.

There were no storage or networking performance issues. Some QEMU problem?
 
Hey Andreas,

that sounds awfully like a problem we encountered on RHEL while upgrading from 8.5 to 8.6. We saw a lot of running VMs (all different kinds of OSes and kernels) make huge time jumps into the future (sometimes even by years) whenever a new VM was started on, or migrated onto, a host running the new EL8.6 kernel.

Since we fixed it by downgrading to an older kernel, I would like to know whether you were able to collect some more details over the last months, or how you solved it.

Reading the release notes from PVE 7.1 to 7.2, it seems the kernel was upgraded from 5.13 to 5.15. So if we share the same problem, maybe there was a regression between those two releases.

BTW: since I could not find another report of that problem, I opened a bug at Red Hat (https://bugzilla.redhat.com/show_bug.cgi?id=2121795), but there has been no feedback there yet.
 
Many thanks Andreas,
I already found that thread and bug report after going through your post history :)

From that I gather the Proxmox team has a patched test kernel which seems to fix the issue for most users. I passed that Proxmox bug and the origin of the patch on to the Red Hat developers. Hopefully they can sort some things out too.

At least I am quite happy that other users are seeing our problem too, and that it is indeed kernel related. That gives me a better feeling about our production VMs currently running on the old kernel version, so we can wait for a proper fix.
 
Hmmm, thanks for that pointer above. From what I gather, that user talks about Proxmox kernel 5.15.39-4-pve, while the "test-patched" one was 5.15.39-3-pve-guest-fpu.
So either that newer kernel doesn't carry the patch (which other users said fixes the issue), or that user has another problem not related to the original bug, or the patch is indeed not good enough. Unfortunately I am coming from the RHEL side and do not use Proxmox myself, so I cannot test that kernel.

But regarding your last sentence: yes, we are seeing absolutely the same thing. On a host running the problematic kernel it is enough to start a new VM from the CLI (without any migration) to trigger time jumps in other guest VMs running on that same host. We were able to reproduce this even with guests that were started on that host, had no migration history, and were still affected by a freshly started VM.
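In case anyone wants to try reproducing this: a minimal sketch of a jump detector that could be run inside an already-running guest while a fresh VM is started on the same host. The interval and threshold are arbitrary choices, and it only catches forward jumps big enough to dwarf the 1 s sleep.

#!/usr/bin/env python3
# Minimal in-guest time-jump detector (sketch). Samples CLOCK_MONOTONIC
# and CLOCK_REALTIME once per second and flags any step that is much
# larger than the sleep interval.
import time
from datetime import datetime

INTERVAL = 1.0    # seconds between samples
THRESHOLD = 5.0   # anything much larger than INTERVAL counts as a jump

prev_mono = time.monotonic()
prev_wall = time.time()
while True:
    time.sleep(INTERVAL)
    mono, wall = time.monotonic(), time.time()
    d_mono, d_wall = mono - prev_mono, wall - prev_wall
    if d_mono > THRESHOLD or d_wall > THRESHOLD:
        print(f"{datetime.now().isoformat()} possible time jump: "
              f"monotonic +{d_mono:.1f}s, realtime +{d_wall:.1f}s", flush=True)
    prev_mono, prev_wall = mono, wall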

As I stated in that Red Hat bug, the affected guests made some serious time jumps into the future, sometimes by a couple of hours or days, sometimes even by several hundred years.

My guesstimate is that starting a new KVM instance triggers some kind of corruption in the state of a guest that happens to do a context switch at that same moment.
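One way to poke at that hypothesis from the host side would be to dump the per-vCPU TSC offsets before and after starting a new VM and see whether the offsets of already-running guests change. A sketch (it assumes the usual KVM debugfs layout, which may differ between kernel versions, and needs root plus a mounted debugfs):

#!/usr/bin/env python3
# Dump the per-vCPU TSC offsets KVM exposes via debugfs (sketch; the
# path /sys/kernel/debug/kvm/<pid>-<fd>/vcpu*/tsc-offset is an
# assumption and may vary by kernel version).
import glob

for path in sorted(glob.glob("/sys/kernel/debug/kvm/*/vcpu*/tsc-offset")):
    with open(path) as f:
        print(f"{path}: {f.read().strip()}")

Running that once before and once after starting the new VM and diffing the output should show whether anything moved for the existing guests.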
 
Thanks, I forwarded that posting and patchset to the Red Hat developers as well. Hopefully they can see something similar on their end.

The biggest problem for me is actually going through the kernel changes that Red Hat made between EL8.5 and 8.6, since there are so many backports to all kinds of stuff. But I am pretty sure that your problem and mine are actually the same, which makes me believe it results from some upstream change in the vanilla kernel.
 
And just to confirm: you also only see the issue on AMD? Red Hat just mentioned a similar patch they prepared, "KVM: SVM: fix tsc scaling cache logic", which from what I understand only affects AMD.
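If anyone wants to quickly check whether their hosts even fall into that category, here is a small sketch that looks at /proc/cpuinfo for the AMD vendor string and the TSC ratio feature flag (on current kernels X86_FEATURE_TSCRATEMSR shows up as "tsc_scale"; treat that flag name as an assumption on older kernels):

#!/usr/bin/env python3
# Quick host check for the AMD / SVM TSC-scaling angle (sketch).
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

print("AMD host:      ", "AuthenticAMD" in cpuinfo)
print("tsc_scale flag:", "tsc_scale" in cpuinfo.split())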
 
