Ubuntu 24 VM stuck after live migration - AMD EPYC

Henri44

New Member
Oct 29, 2024
Hi,

I did some experiments with an Ubuntu 24.04 LTS desktop VM (apt upgrade done today). After about 5 live migrations, the VM is now stuck and consuming a lot of CPU.

Any idea?

Thanks

Henri

root@pve2:/etc/pve/nodes/pve3/qemu-server# cat 136.conf
boot: order=scsi0;ide2;net0
cores: 4
cpu: x86-64-v3
ide2: dssNFS1:iso/ubuntu-24.04.2-desktop-amd64.iso,media=cdrom,size=6194550K
memory: 4096
meta: creation-qemu=9.0.2,ctime=1741074999
name: ZZZ
net0: virtio=BC:24:11:D2:77:9D,bridge=vmbr5,firewall=1
numa: 0
ostype: l26
scsi0: cephFS1:vm-136-disk-0,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=23aa1ed9-09c0-4685-b8f7-ace69df4134c
sockets: 1
vmgenid: d7223b43-9740-464f-b012-735e50c69f7c

pve-manager/8.3.4/65224a0f9cd294a3 (running kernel: 6.8.12-8-pve)

root@pve2:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
BIOS Vendor ID: AMD
Model name: AMD EPYC 7601 32-Core Processor
BIOS Model name: AMD EPYC 7601 32-Core Processor CPU @ 2.2GHz
BIOS CPU family: 107
CPU family: 23
Model: 1
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 118%
CPU max MHz: 2200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4391.64
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 2 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 64 MiB (8 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60
NUMA node1 CPU(s): 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
NUMA node2 CPU(s): 2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62
NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Mitigation; untrained return thunk; SMT vulnerable
Spec rstack overflow: Mitigation; Safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected

root@pve3:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
BIOS Vendor ID: AMD
Model name: AMD EPYC 7513 32-Core Processor
BIOS Model name: AMD EPYC 7513 32-Core Processor CPU @ 2.6GHz
BIOS CPU family: 107
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 77%
CPU max MHz: 3681.6399
CPU min MHz: 1500.0000
BogoMIPS: 5190.20
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca debug_swap
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 1 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 128 MiB (4 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; Safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
 
Hi,
please share the output of pveversion -v as well as the system logs/journal from around the time of the issue, from both the source and the target of the migration. Did you migrate the VM 5 times between those two nodes, or were other nodes involved too?

Is the latest CPU microcode installed on both hosts: https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysadmin_firmware_cpu ?
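For reference, a quick way to check which microcode revision is currently active on each host (generic Linux commands, not specific to Proxmox; the revisions on both nodes will differ since the CPUs differ, but each should be current for its model):

```shell
# Microcode revision as reported per CPU (should be identical across all cores of a host)
grep -m1 microcode /proc/cpuinfo

# Whether the kernel applied an early microcode update at boot
journalctl -k | grep -i microcode
```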

Do you see CPU usage inside the VM too or just for the QEMU process on the host?
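To separate the two cases from the host side, the PID file that qemu-server keeps per VM can be used (VM ID 136 as in this thread, same pidfile path as in the gdb command further down):

```shell
# CPU usage of the VM's QEMU process as seen by the host
ps -o pid,%cpu,etime,cmd -p "$(cat /var/run/qemu-server/136.pid)"

# Per-thread breakdown; a single spinning vCPU thread tends to point at
# a guest-side loop, while busy non-vCPU threads point at QEMU itself
top -H -b -n 1 -p "$(cat /var/run/qemu-server/136.pid)" | head -n 20
```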
 
What might also be interesting is getting a backtrace for the VM while it is in this state. You can run apt install pve-qemu-kvm-dbgsym gdb and then:
Code:
gdb --batch --ex 't a a bt' -p $(cat /var/run/qemu-server/136.pid) &> /tmp/vm-136-backtrace.txt
 
Hi,

thanks, I migrated only between the two nodes mentioned above (see the lscpu output). The VM was stuck; I only saw the CPU utilization in Proxmox, not inside the VM.

Regards

Henri

proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.4 (running version: 8.3.4/65224a0f9cd294a3)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 18.2.4-pve3
ceph-fuse: 18.2.4-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.0
libpve-rs-perl: 0.9.1
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.3-1
proxmox-backup-file-restore: 3.3.3-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.6
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.0
pve-qemu-kvm: 9.0.2-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1
 
Other than providing the requested logs and ensuring the latest CPU microcode is installed, you could also try the newer opt-in 6.11 kernel and QEMU 9.2:
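A sketch of what that upgrade could look like, assuming the opt-in kernel package and the newer pve-qemu-kvm are available in your configured Proxmox repositories (package availability depends on which repository, e.g. no-subscription or test, the node uses):

```shell
# Opt-in 6.11 kernel; it is installed alongside the default 6.8 kernel,
# so you can still boot back into 6.8 if needed
apt update
apt install proxmox-kernel-6.11

# QEMU 9.2 arrives as a regular pve-qemu-kvm package upgrade
apt install pve-qemu-kvm

# Reboot the node into the new kernel, then verify kernel and QEMU versions
uname -r
kvm --version
```

Note that running VMs keep using the QEMU binary they were started with; a VM has to be restarted (or migrated to an upgraded node) to pick up the new QEMU version.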