VMs unreachable after migration

afede

Hi,
I'm running the latest Proxmox VE (8.1.3) on two Dell servers (the first node is an R630, the second an R640). They are configured as a cluster, sharing an NFS storage that holds the virtual disks of the VMs. HA is enabled on every VM.
Since upgrading to this version, I'm experiencing a problem when I migrate a VM from one node to the other.
The VM seems to migrate correctly and the console stays responsive, but the guest OS shows this:

[attached screenshot: errore dns.png]

and the VM is unreachable from outside; it doesn't react to reboot/shutdown commands either (I need to reset it forcefully).

Note 1: Both nodes are running the same kernel, Linux 6.5.11-4-pve
Note 2: This seems to happen ONLY with Ubuntu VMs.

If you need further details, I'll be happy to provide them.
 
This is the full trace I can see on the guest OS:

Nov 27 12:22:02 nxw-dns1 kernel: [2159560.631953] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632067] rcu: 0-...!: (0 ticks this GP) idle=894/0/0x0 softirq=38964753/38964753 fqs=0 (false positive?)
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632143] (detected by 1, t=15002 jiffies, g=57462465, q=7749)
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632150] Sending NMI from CPU 1 to CPUs 0:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632197] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xb/0x10
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633154] rcu: rcu_sched kthread timer wakeup didn't happen for 15001 jiffies! g57462465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633209] rcu: Possible timer handling issue on cpu=0 timer-softirq=25773615
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633246] rcu: rcu_sched kthread starved for 15002 jiffies! g57462465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633289] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633392] rcu: RCU grace-period kthread stack dump:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633555] task:rcu_sched state:I stack: 0 pid: 14 ppid: 2 flags:0x00004000
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633563] Call Trace:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633571] <TASK>
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633587] __schedule+0x24e/0x590
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633596] schedule+0x69/0x110
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633599] schedule_timeout+0x87/0x140
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633603] ? __bpf_trace_tick_stop+0x20/0x20
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633609] rcu_gp_fqs_loop+0xe5/0x330
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633615] rcu_gp_kthread+0xa7/0x130
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633619] ? rcu_gp_init+0x5f0/0x5f0
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633622] kthread+0x12a/0x150
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633628] ? set_kthread_struct+0x50/0x50
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633632] ret_from_fork+0x22/0x30
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633639] </TASK>
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633642] rcu: Stack dump where RCU GP kthread last ran:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633811] Sending NMI from CPU 1 to CPUs 0:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633823] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xb/0x10
 
Are you using "host" as the CPU type? The R630 has Xeon E5 v3/v4 CPUs while the R640 has Xeon Scalable 1st/2nd gen; those are different microarchitectures, and such a mismatch can produce exactly this kind of error pattern. If that's the case, try whether it behaves better with kvm64 as the CPU type, or with the newer x86-64-v2-AES type with the pcid flag disabled.
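For reference, here is a minimal sketch of how that change could be applied from the node shell with qm (VMID 100 is only a placeholder, adjust to your VM; the flag syntax follows the documented format of the cpu option):

# quick test: force a generic CPU model
qm set 100 --cpu kvm64

# or use the newer baseline model with AES, with the pcid flag explicitly disabled
qm set 100 --cpu 'x86-64-v2-AES,flags=-pcid'

# CPU changes only take effect once the VM has been fully powered off and started again
qm stop 100 && qm start 100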

Otherwise, search for "rcu"; there are already a few threads about it here.
 
As mentioned by @sb-jw, in heterogeneous environments there is always a chance that things will not be compatible unless you force the lowest common denominator, specifically for the CPU. A new PVE release brings a new QEMU and a new host kernel, and with them new and improved support for CPU features and flags.
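If you want to see how far apart the two hosts actually are, one quick way is to compare the CPU feature flags each kernel reports; this is just a sketch, and the node names r630/r640 below are placeholders:

# on each node: show the CPU model and feature flags
lscpu | grep -E 'Model name|Flags'

# or diff the flag sets of both nodes from a single shell
ssh r630 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/r630.flags
ssh r640 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/r640.flags
diff /tmp/r630.flags /tmp/r640.flags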


 
I actually just noticed this a couple of days ago and found a Reddit post where someone else had stumbled across it as well; they mentioned noticing it with an Ubuntu VM after migration.

I did some testing with fresh installs of Debian 12 and Windows 11 using a VirtIO NIC. Windows had no issues with migrations on any of the CPU types I tested.

Debian lost its guest agent info and locked up with the following CPU types: kvm64, qemu64 and x86-64-v2-AES.
With x86-64-v2, x86-64-v3 and host, migrations were fine.

Oddly, when I swapped the Debian VM over to an e1000 NIC, I had issues with migrations using the qemu64, x86-64-v2-AES and x86-64-v3 types.

This was tested on fresh installs from the Proxmox 8.0-2 ISO, with all nodes then upgraded to 8.1.3.
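For anyone who wants to reproduce a similar matrix, the per-VM settings can be switched from the CLI between migration attempts. This is only a sketch; VMID 100, bridge vmbr0 and the target node name pve2 are placeholders:

qm set 100 --cpu x86-64-v2-AES           # CPU type under test
qm set 100 --net0 virtio,bridge=vmbr0    # or e1000,bridge=vmbr0 for the second round
qm stop 100 && qm start 100              # cold start so the new virtual hardware is actually used
qm migrate 100 pve2 --online             # live-migrate to the other node
qm agent 100 ping                        # run on the target node afterwards; hangs/fails when the guest has locked up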
 
@sb-jw I just did some more testing with PCID turned off. I mostly had no issues on Debian 12 with a VirtIO NIC; I did have one failed migration using x86-64-v3, but on a second attempt it went fine. I forgot to mention: the host CPU is an i5-12450H.
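A quick way to confirm whether the flag actually reached (or was masked from) the guest is to check inside the VM itself:

# prints "pcid" if the flag is still exposed to the guest, nothing if it was masked
grep -ow pcid /proc/cpuinfo | sort -u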
 
Thank you all for the useful comments. So we can say that the factors at play here are the mismatch between the host CPU types and the virtual hardware configuration of the VMs, right?
Can we hope for a bugfix in the next kernel/PVE release?
 
The patched kernel (6.5.11-5-pve) is already available on pvetest.
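For anyone who wants to test it before it reaches the regular repositories, it should be installable by temporarily enabling the pvetest repository. The exact kernel package name below is an assumption, so check with apt search if it differs on your system:

# add the test repository (PVE 8 is based on Debian bookworm)
echo "deb http://download.proxmox.com/debian/pve bookworm pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update

# install the newer 6.5 kernel and reboot the node into it
apt install proxmox-kernel-6.5
reboot

# consider disabling the pvetest repository again once testing is done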
 
