VMs unreachable after migration

afede

Hi,
I'm running the latest Proxmox VE (8.1.3) on two Dell servers (the first node is an R630, the second an R640). They are configured as a cluster, sharing an NFS storage that holds the virtual disks of the VMs. HA is enabled on every VM.
Since upgrading to this version, I'm experiencing a problem when I migrate a VM from one node to the other.
The VM seems to migrate correctly and the console stays responsive, but the guest OS shows this:

[attached screenshot: errore dns.png]

and the VM is unreachable from outside; it doesn't react to reboot/shutdown commands either (I need to reset it forcefully).

Note 1: Both nodes are running the same kernel, Linux 6.5.11-4-pve
Note 2: This seems to happen ONLY with Ubuntu VMs.

If you need further details, I'll be happy to provide them.
 
This is the full trace I can see on the guest OS:

Nov 27 12:22:02 nxw-dns1 kernel: [2159560.631953] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632067] rcu: 0-...!: (0 ticks this GP) idle=894/0/0x0 softirq=38964753/38964753 fqs=0 (false positive?)
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632143] (detected by 1, t=15002 jiffies, g=57462465, q=7749)
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632150] Sending NMI from CPU 1 to CPUs 0:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.632197] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xb/0x10
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633154] rcu: rcu_sched kthread timer wakeup didn't happen for 15001 jiffies! g57462465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633209] rcu: Possible timer handling issue on cpu=0 timer-softirq=25773615
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633246] rcu: rcu_sched kthread starved for 15002 jiffies! g57462465 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633289] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633392] rcu: RCU grace-period kthread stack dump:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633555] task:rcu_sched state:I stack: 0 pid: 14 ppid: 2 flags:0x00004000
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633563] Call Trace:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633571] <TASK>
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633587] __schedule+0x24e/0x590
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633596] schedule+0x69/0x110
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633599] schedule_timeout+0x87/0x140
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633603] ? __bpf_trace_tick_stop+0x20/0x20
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633609] rcu_gp_fqs_loop+0xe5/0x330
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633615] rcu_gp_kthread+0xa7/0x130
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633619] ? rcu_gp_init+0x5f0/0x5f0
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633622] kthread+0x12a/0x150
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633628] ? set_kthread_struct+0x50/0x50
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633632] ret_from_fork+0x22/0x30
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633639] </TASK>
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633642] rcu: Stack dump where RCU GP kthread last ran:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633811] Sending NMI from CPU 1 to CPUs 0:
Nov 27 12:22:02 nxw-dns1 kernel: [2159560.633823] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xb/0x10
 
Are you using "host" as the CPU type? The R630 has Xeon E5 v3/v4 CPUs while the R640 has Xeon Scalable 1st/2nd gen; those are different microarchitectures, and such a mismatch can produce exactly this kind of error pattern. If that's the case, try whether it behaves better with kvm64 as the CPU type, or with the newer x86-64-v2-AES type with the pcid flag disabled.
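For reference, here is a minimal sketch of how that change could be applied from the node shell with qm (VMID 100 is only a placeholder, adjust to your VM; the flag syntax follows the documented format of the cpu option):

# quick test: force a generic CPU model
qm set 100 --cpu kvm64

# or use the newer baseline model with AES, with the pcid flag explicitly disabled
qm set 100 --cpu 'x86-64-v2-AES,flags=-pcid'

# CPU changes only take effect once the VM has been fully powered off and started again
qm stop 100 && qm start 100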

Otherwise, search for "rcu"; there are already a few threads about it here.
 
As mentioned by @sb-jw, in heterogeneous environments there is always a chance that things will not be compatible unless you force the lowest common denominator, specifically for the CPU. A new PVE release brings a new QEMU and a new host kernel, and with them new and improved support for CPU features and flags.
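If you want to see how far apart the two hosts actually are, one quick way is to compare the CPU feature flags each kernel reports; this is just a sketch, and the node names r630/r640 below are placeholders:

# on each node: show the CPU model and feature flags
lscpu | grep -E 'Model name|Flags'

# or diff the flag sets of both nodes from a single shell
ssh r630 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/r630.flags
ssh r640 "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort" > /tmp/r640.flags
diff /tmp/r630.flags /tmp/r640.flags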


 
I actually just noticed this a couple of days ago and found a Reddit post where someone else had stumbled across it as well; they mentioned noticing it with an Ubuntu VM after migration.

I did some testing with fresh installs of Debian 12 and Windows 11 using a VirtIO NIC. Windows had no issues with migrations on any of the CPU types I tested.

Debian lost its guest agent info and locked up with the following CPU types: kvm64, qemu64 and x86-64-v2-AES.
With x86-64-v2, x86-64-v3 and host, migrations were fine.

Oddly, when I swapped the Debian VM over to an e1000 NIC, I had issues with migrations using the qemu64, x86-64-v2-AES and x86-64-v3 types.

This was tested on fresh installs from the Proxmox 8.0-2 ISO, with all nodes then upgraded to 8.1.3.
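For anyone who wants to reproduce a similar matrix, the per-VM settings can be switched from the CLI between migration attempts. This is only a sketch; VMID 100, bridge vmbr0 and the target node name pve2 are placeholders:

qm set 100 --cpu x86-64-v2-AES           # CPU type under test
qm set 100 --net0 virtio,bridge=vmbr0    # or e1000,bridge=vmbr0 for the second round
qm stop 100 && qm start 100              # cold start so the new virtual hardware is actually used
qm migrate 100 pve2 --online             # live-migrate to the other node
qm agent 100 ping                        # run on the target node afterwards; hangs/fails when the guest has locked up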
 
@sb-jw I just did some more testing with PCID turned off. I mostly had no issues on Debian 12 with a VirtIO NIC; I did have one failed migration using x86-64-v3, but on a second attempt it went fine. I forgot to mention: the host CPU is an i5-12450H.
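A quick way to confirm whether the flag actually reached (or was masked from) the guest is to check inside the VM itself:

# prints "pcid" if the flag is still exposed to the guest, nothing if it was masked
grep -ow pcid /proc/cpuinfo | sort -u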
 
Thank you all for the useful comments. So we can say that the factors at play here are the mismatch between the host CPU types and the virtual hardware configuration of the VMs, right?
Can we hope for a bugfix in the next kernel/PVE release?
 
The patched kernel (6.5.11-5-pve) is already available on pvetest.
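For anyone who wants to test it before it reaches the regular repositories, it should be installable by temporarily enabling the pvetest repository. The exact kernel package name below is an assumption, so check with apt search if it differs on your system:

# add the test repository (PVE 8 is based on Debian bookworm)
echo "deb http://download.proxmox.com/debian/pve bookworm pvetest" > /etc/apt/sources.list.d/pvetest.list
apt update

# install the newer 6.5 kernel and reboot the node into it
apt install proxmox-kernel-6.5
reboot

# consider disabling the pvetest repository again once testing is done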
 
