HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
HP ProLiant DL380 Gen9 CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
HP ProLiant DL380 Gen9 CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
HP ProLiant DL380 Gen9 CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Dell Inc. PowerEdge R430 CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Dell Inc. PowerEdge R910 CPU: Intel(R) Xeon(R) CPU E7- 4860 @ 2.27GHz
Updated regularly. VMs (various Windows and Linux flavors) run on KVM + Ceph.
We have observed, and can easily reproduce, VMs hanging right after live migration.
Main indications:
- 100% usage of CPU cores allocated to the virtual machine
- No response to any input in the console
- No response over the network (not even ICMP)
The only way out is to reset the virtual machine.
The hypervisor logs (systemd journal) only say that the VM's guest agent is no longer responding.
The chance of a hang is very high, but not 100%.
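For anyone hitting the same hang, a minimal triage sketch from the hypervisor side (VM ID 100 is a placeholder; adjust to your VM, and note the exact journal wording varies):

```shell
# Proxmox usually still reports the hung guest as running
qm status 100

# The guest agent ping is a quick liveness probe; it times out on a hung guest
qm agent 100 ping || echo "guest agent not responding"

# Search this boot's journal on the target node for migration/agent errors
journalctl -b --since "-30 min" | grep -i "VM 100"

# Only known recovery so far: hard-reset the VM
qm reset 100
```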
We tried pve-kernel-5.19 on all cluster nodes; the problem persists.
With pve-kernel-5.13 the problem is no longer observed.
But as far as I understand, that kernel is outdated; its last update was half a year ago.
As a result, we are now using:
proxmox-ve 7.2-1
ceph 17.2.1-pve1
pve-qemu-kvm 7.0.0-3
pve-kernel-5.13 7.1-9
pve-kernel-5.13.19-6-pve 5.13.19-15
qemu-server 7.2-4
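If you do stay on 5.13 for now, it may help to pin it so a routine package upgrade doesn't silently boot a newer kernel again. A sketch, assuming your proxmox-boot-tool version already has the `kernel pin` subcommand (newer 7.x tools do; check `proxmox-boot-tool help` first):

```shell
# List the kernels the bootloader knows about
proxmox-boot-tool kernel list

# Pin the known-good kernel as the default boot entry
proxmox-boot-tool kernel pin 5.13.19-6-pve

# Sanity check: 5.13.19 really is the older of the two versions
printf '5.13.19-6-pve\n5.15.83-1-pve\n' | sort -V | head -n 1

# After the next reboot, confirm:
uname -r
```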
On the forum and in the bug tracker I haven't found the problem described in quite the same way. Each report has its own specifics: for some, upgrading to pve-kernel-5.19 helped; others simply stopped updating their status, etc.
Is this a known, documented problem, or is it unconfirmed and not being investigated?
Yes, this is known. If you live-migrate from one CPU type to another, the result is undefined: it may work, or it may not. Try to keep the hardware homogeneous.
Have you set cpu=host or something similar?
No, we use the default CPU type (kvm64) everywhere, for all VMs. We also don't adjust any CPU flag settings when creating a VM.
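To double-check what CPU model each VM actually has configured, a sketch (VM ID 100 is a placeholder; an absent `cpu:` line in the config means the kvm64 default):

```shell
# One VM: an explicit override shows up as a "cpu:" line
qm config 100 | grep '^cpu:' || echo "cpu: kvm64 (default)"

# All local VMs: list any explicit cpu overrides
grep -H '^cpu:' /etc/pve/qemu-server/*.conf

# If needed, force the conservative model explicitly
qm set 100 --cpu kvm64
```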
Our infrastructure grows gradually, and unfortunately our budget does not allow large purchases of exactly identical hardware.
Unfortunately we don't have detailed statistics, because our only cluster is the production one.
In short: today, migrations (10+ tests) from the Dell R430 to the Dell R910 succeeded, while migrations from the R910 to the R430 always (10+ tests) ended in hangs. After downgrading those nodes to 5.13, the problem disappeared.
I can run tests between nodes with different CPUs on kernel 5.15, but that takes time.
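The back-and-forth runs can be scripted; a rough sketch (the node name, VM ID, and guest address are placeholders; `qm migrate` has to be issued from the node currently hosting the VM, hence the ssh for the return trip):

```shell
#!/bin/bash
VMID=100                 # dedicated test VM
DST=pve403               # destination node
GUEST=vm100.example.net  # guest address for the ICMP liveness check
RUNS=5

for i in $(seq 1 "$RUNS"); do
    qm migrate "$VMID" "$DST" --online
    sleep 30    # give the guest time to settle (or to hang)
    # a hung guest stops answering ICMP, so ping is a cheap health probe
    if ping -c 3 -W 2 "$GUEST" >/dev/null 2>&1; then
        echo "run $i: VM OK"
    else
        echo "run $i: VM STUCK"
        break
    fi
    # migrate back from the destination for the next round
    ssh "$DST" qm migrate "$VMID" "$(hostname)" --online
    sleep 30
done
```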
Each migration test was performed five times; the results were identical across all five runs.
pve402 -> pve403 : VM STUCK, MANUALLY RESET TO RECOVER
pve402 -> pve406 : VM OK
pve402 -> pve408(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve402 -> pve409(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve403 -> pve402 : VM OK
pve403 -> pve406 : VM OK
pve403 -> pve408(5.13) : VM OK
pve403 -> pve409(5.13) : VM OK
pve406 -> pve402 : VM OK
pve406 -> pve403 : VM STUCK, MANUALLY RESET TO RECOVER
pve406 -> pve408(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve406 -> pve409(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve408(5.13) -> pve402 : VM OK
pve408(5.13) -> pve403 : VM OK
pve408(5.13) -> pve406 : VM OK
pve408(5.13) -> pve409 : VM OK
pve409(5.13) -> pve402 : VM OK
pve409(5.13) -> pve403 : VM OK
pve409(5.13) -> pve406 : VM OK
pve409(5.13) -> pve408(5.13) : VM OK
Each migration test was performed five times; the results were identical across all five runs.
Every node is running pve-kernel-5.15.
All CPUs are Intel Xeon.
Silver 4210 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> Silver 4114 : VM OK
Silver 4210 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> Silver 4114 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
Silver 4114 -> Silver 4210 : VM OK
Silver 4114 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> Silver 4114 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM OK
E7- 4860 -> E5-2650 v4 : VM OK
E7- 4860 -> Silver 4114 : VM OK
E7- 4860 -> E5-2620 v3 : VM OK
I have to admit I was wrong in my first post, due to a lack of tests and test hardware. pve-kernel-5.19 behaves differently: it partly solves the problem, and partly creates a new one on a CPU that previously didn't have it.
Yes, that server is old, but it is powerful and stable enough for its job. There were no complaints about it on the Proxmox 6 cluster, nor here before installing kernel 5.19.
All firmware has been updated to the latest available versions. The intel-microcode package (non-free) was installed; no newer microcode was found.
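For reference, the checks I'd use to confirm the microcode state on each node (a sketch; the revision shown in the test below is an example value, not ours):

```shell
# Microcode revision currently loaded (first CPU is enough)
awk '/^microcode/ { print $3; exit }' /proc/cpuinfo

# Did the kernel apply an early microcode update at boot?
dmesg | grep -i 'microcode'

# Debian/Proxmox: the updates ship in the non-free intel-microcode package
apt install intel-microcode
```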
pve-kernel-5.13
Code:
Silver 4210 -> E5-2650 v4 : VM OK
Silver 4210 -> E5-2620 v3 : VM OK
Silver 4210 -> E7- 4860 : VM OK
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM OK
E7- 4860 -> E5-2650 v4 : VM OK
E7- 4860 -> E5-2620 v3 : VM OK
pve-kernel-5.15
Code:
Silver 4210 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> Silver 4114 : VM OK
Silver 4210 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> Silver 4114 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
Silver 4114 -> Silver 4210 : VM OK
Silver 4114 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> Silver 4114 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM OK
E7- 4860 -> E5-2650 v4 : VM OK
E7- 4860 -> Silver 4114 : VM OK
E7- 4860 -> E5-2620 v3 : VM OK
pve-kernel-5.19
Code:
Silver 4210 -> E5-2650 v4 : VM OK
Silver 4210 -> E5-2620 v3 : VM OK
Silver 4210 -> E7- 4860 : VM OK
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM STUCK, MANUALLY RESET TO RECOVER
E7- 4860 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
E7- 4860 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
What can you advise, other than downgrading all nodes of the production cluster to pve-kernel-5.13?
I have the same issue. Both nodes are on the same kernel, 5.15.83-1-pve.
Both are Intel systems, one with an i7-12700 (node 2) and the other an i5-9500T (node 1). When I move a VM from node 1 to node 2, it works fine with no freeze. When I move the same VM back to node 1, it freezes. I tried a different VM just to be sure; same issue.
It doesn't seem like it. I recently did a complete fresh install on two machines and I'm seeing the same issue. Both are Intel-based (i7-12700 and i5-10500T), latest kernel.