Today I live-migrated several VMs from a PVE 7.1 host to a 7.2 host (shared storage). Most VMs behaved fine, but two showed a strange issue and needed a reboot (all VMs run a 4.19 kernel and are NTP-synced via systemd-timesyncd).
The migration took place at 11:25 and completed within a few seconds with no problems reported, but the VM's kernel.log shows:
11:26:53 vm01 kernel: [3196062.316169] INFO: rcu_sched self-detected stall on CPU
....
May 5 12:14:26 vm10 kernel: [3196062.330071] INFO: rcu_sched detected stalls on CPUs/tasks:
May 5 12:14:26 vm10 kernel: [3198927.720881] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 2669s! [systemd-journal:447]
The machine was then rebooted at 11:36. Apparently, the migration made some CPU counter jump roughly 40 minutes into the future (on both affected machines); the soft lockup message itself reports CPU#3 stuck for 2669 s, i.e. about 44 minutes.
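If it helps with diagnosis, this is a quick way to see which clocksource the guests run on (kvm-clock vs. tsc makes a difference in how a guest experiences a migration). A minimal sketch reading the standard sysfs paths (plain Python, nothing Proxmox-specific):

```python
#!/usr/bin/env python3
# Print the guest's current and available clocksources from sysfs.
from pathlib import Path

base = Path("/sys/devices/system/clocksource/clocksource0")

current = (base / "current_clocksource").read_text().strip()
available = (base / "available_clocksource").read_text().split()

print(f"current clocksource:    {current}")
print(f"available clocksources: {', '.join(available)}")
```

A 4.19 guest on KVM will typically report kvm-clock as the current source.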
A third machine logged "task kworker/0:1:14690 blocked for more than 120 seconds." six times (i.e. over about 10 minutes), then made a time jump of 65 minutes, logged
"rcu: INFO: rcu_sched self-detected stall on CPU", and returned to normal operation without admin intervention.
There were no storage or networking performance issues at the time. Could this be a QEMU problem?