VM time jump when migrating

Andreas Pflug

Today I live-migrated several VMs from a PVE 7.1 machine to a 7.2 machine (shared storage). Most VMs behaved fine, but two showed a strange issue and needed a reboot (all VMs run a 4.19 kernel and are NTP-synced using systemd-timesyncd).

The migration took place at 11:25 and completed within a few seconds with no problems reported, but the VM's kernel.log shows
11:26:53 vm01 kernel: [3196062.316169] INFO: rcu_sched self-detected stall on CPU
....
May 5 12:14:26 vm10 kernel: [3196062.330071] INFO: rcu_sched detected stalls on CPUs/tasks:
May 5 12:14:26 vm10 kernel: [3198927.720881] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 2669s! [systemd-journal:447]

The machine was then rebooted at 11:36. Apparently the migration made some CPU counter jump roughly 40 minutes into the future (on both machines).
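Just to make that visible from the quoted lines themselves, here is a tiny back-of-the-envelope sketch (it only assumes that the bracketed printk timestamps are seconds since guest boot; the numbers are the ones quoted above):

#!/usr/bin/env python3
# Rough arithmetic on the printk timestamps quoted above (assumed to be
# seconds since guest boot); purely illustrative.
stall  = 3196062.330071   # "rcu_sched detected stalls" line
lockup = 3198927.720881   # "soft lockup" line, logged in the same wall-clock second
jump_s = lockup - stall
print(f"printk delta: {jump_s:.0f} s (~{jump_s / 60:.0f} min)")
# The soft lockup itself reports the CPU stuck for 2669 s (~44 min),
# which is in the same ballpark.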

A third machine showed "task kworker/0:1:14690 blocked for more than 120 seconds." six times (over 10 minutes), then made a time jump of 65 minutes, showed
"rcu: INFO: rcu_sched self-detected stall on CPU", and went back to normal operation without admin intervention.

There were no storage or networking performance issues. Some QEMU problem?
 
Hey Andreas,

that sounds awfully like a problem we encountered on RHEL while upgrading from 8.5 to 8.6. We saw a lot of running VMs (all different kinds of OSes and kernels) make huge time jumps into the future (sometimes even by years) whenever a new VM was started on, or migrated onto, a host running the new EL8.6 kernel.

Since we fixed it by downgrading to an older kernel, I would like to know whether you were able to collect some more details over the last months, or how you solved it.

Reading the release notes from PVE 7.1 to 7.2, it seems the kernel was upgraded from 5.13 to 5.15. So if we share the same problem, maybe there was a regression between those two releases.

BTW: since I could not find another report of that problem, I opened a bug at Red Hat (https://bugzilla.redhat.com/show_bug.cgi?id=2121795), but there has been no feedback there yet.
 
Many thanks Andreas,
I already found that thread and bug report after going through your post history :)

From that I gather the Proxmox team has a patched test kernel which seems to fix the issue for most users. I passed that Proxmox bug and the origin of the patch on to the Red Hat developers. Hopefully they can sort some things out too.

At least I am quite happy that other users are seeing our problem too, and that it is indeed kernel related. That gives me a better feeling about our production VMs currently running on the old kernel version, so we can wait for a proper fix.
 
Hmmm, thanks for that pointer above. From what I gather, that user talks about Proxmox kernel 5.15.39-4-pve, while the "test-patched" one was 5.15.39-3-pve-guest-fpu.
So either that newer kernel doesn't carry the patch (which other users said fixes the issue), or that user has another problem not related to the original bug, or the patch is indeed not good enough. Unfortunately I am coming from the RHEL side and do not use Proxmox myself, so I cannot test that kernel.

But regarding your last sentence: yes, we are seeing absolutely the same thing. On a host running the problematic kernel it is enough to start a new VM from the CLI (without any migration) to trigger time jumps in other guest VMs running on that same host. We were able to reproduce this even with guests that were started on that host, had no migration history, and were still affected by a freshly started VM.
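In case anyone wants to try reproducing this: a minimal sketch of a jump detector that could be run inside an already-running guest while a fresh VM is started on the same host. The interval and threshold are arbitrary choices, and it only catches forward jumps big enough to dwarf the 1 s sleep.

#!/usr/bin/env python3
# Minimal in-guest time-jump detector (sketch). Samples CLOCK_MONOTONIC
# and CLOCK_REALTIME once per second and flags any step that is much
# larger than the sleep interval.
import time
from datetime import datetime

INTERVAL = 1.0    # seconds between samples
THRESHOLD = 5.0   # anything much larger than INTERVAL counts as a jump

prev_mono = time.monotonic()
prev_wall = time.time()
while True:
    time.sleep(INTERVAL)
    mono, wall = time.monotonic(), time.time()
    d_mono, d_wall = mono - prev_mono, wall - prev_wall
    if d_mono > THRESHOLD or d_wall > THRESHOLD:
        print(f"{datetime.now().isoformat()} possible time jump: "
              f"monotonic +{d_mono:.1f}s, realtime +{d_wall:.1f}s", flush=True)
    prev_mono, prev_wall = mono, wall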

As I stated in that Red Hat bug, the affected guests made some serious time jumps into the future, sometimes by a couple of hours or days, sometimes even by several hundred years.

My guesstimate is that starting a new KVM instance triggers some kind of corruption in the state of a guest that happens to do a context switch at that same moment.
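One way to poke at that hypothesis from the host side would be to dump the per-vCPU TSC offsets before and after starting a new VM and see whether the offsets of already-running guests change. A sketch (it assumes the usual KVM debugfs layout, which may differ between kernel versions, and needs root plus a mounted debugfs):

#!/usr/bin/env python3
# Dump the per-vCPU TSC offsets KVM exposes via debugfs (sketch; the
# path /sys/kernel/debug/kvm/<pid>-<fd>/vcpu*/tsc-offset is an
# assumption and may vary by kernel version).
import glob

for path in sorted(glob.glob("/sys/kernel/debug/kvm/*/vcpu*/tsc-offset")):
    with open(path) as f:
        print(f"{path}: {f.read().strip()}")

Running that once before and once after starting the new VM and diffing the output should show whether anything moved for the existing guests.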
 
Thanks, I forwarded that posting and patchset to the Red Hat developers as well. Hopefully they can see something similar on their end.

The biggest problem for me is actually going through the kernel changes that Red Hat made between EL8.5 and 8.6, since there are so many backports to all kinds of stuff. But I am pretty sure that your problem and mine are actually the same, which makes me believe it results from some upstream change in the vanilla kernel.
 
And just to confirm: you also only see the issue on AMD? Red Hat just mentioned a similar patch they prepared, "KVM: SVM: fix tsc scaling cache logic", which from what I understand only affects AMD.
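If anyone wants to quickly check whether their hosts even fall into that category, here is a small sketch that looks at /proc/cpuinfo for the AMD vendor string and the TSC ratio feature flag (on current kernels X86_FEATURE_TSCRATEMSR shows up as "tsc_scale"; treat that flag name as an assumption on older kernels):

#!/usr/bin/env python3
# Quick host check for the AMD / SVM TSC-scaling angle (sketch).
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

print("AMD host:      ", "AuthenticAMD" in cpuinfo)
print("tsc_scale flag:", "tsc_scale" in cpuinfo.split())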
 
