HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
HP ProLiant DL380 Gen9 CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
HP ProLiant DL380 Gen9 CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
HP ProLiant DL380 Gen9 CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
HPE ProLiant DL360 Gen10 CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Dell Inc. PowerEdge R430 CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Dell Inc. PowerEdge R910 CPU: Intel(R) Xeon(R) CPU E7- 4860 @ 2.27GHz
Updated regularly. VMs (various Windows and Linux flavors) run on KVM + Ceph.
We have observed, and can easily reproduce, VMs hanging right after live migration.
Main indications:
- 100% usage of CPU cores allocated to the virtual machine
- No response to any input in the console
- No response over the network (not even ICMP)
The only way out is to reset the virtual machine.
The hypervisor logs (systemd journal) only say that the VM's guest agent is no longer responding.
The chance of a hang is very high, but not 100%.
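For anyone hitting the same hang, a minimal triage sketch from the hypervisor side (VM ID 100 is a placeholder; adjust to your VM, and note the exact journal wording varies):

```shell
# Proxmox usually still reports the hung guest as running
qm status 100

# The guest agent ping is a quick liveness probe; it times out on a hung guest
qm agent 100 ping || echo "guest agent not responding"

# Search this boot's journal on the target node for migration/agent errors
journalctl -b --since "-30 min" | grep -i "VM 100"

# Only known recovery so far: hard-reset the VM
qm reset 100
```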
We tried pve-kernel-5.19 on all cluster nodes; the problem persists.
With pve-kernel-5.13 the problem is no longer observed.
But as far as I understand, that kernel is outdated; its last update was half a year ago.
As a result, we are now using:
proxmox-ve 7.2-1
ceph 17.2.1-pve1
pve-qemu-kvm 7.0.0-3
pve-kernel-5.13 7.1-9
pve-kernel-5.13.19-6-pve 5.13.19-15
qemu-server 7.2-4
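If you do stay on 5.13 for now, it may help to pin it so a routine package upgrade doesn't silently boot a newer kernel again. A sketch, assuming your proxmox-boot-tool version already has the `kernel pin` subcommand (newer 7.x tools do; check `proxmox-boot-tool help` first):

```shell
# List the kernels the bootloader knows about
proxmox-boot-tool kernel list

# Pin the known-good kernel as the default boot entry
proxmox-boot-tool kernel pin 5.13.19-6-pve

# Sanity check: 5.13.19 really is the older of the two versions
printf '5.13.19-6-pve\n5.15.83-1-pve\n' | sort -V | head -n 1

# After the next reboot, confirm:
uname -r
```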
On the forum and in the bug tracker I haven't found the problem described in quite the same way. Each report has its own specifics: for some, upgrading to pve-kernel-5.19 helped; others simply stopped updating their status, etc.
Is this a known, documented problem, or is it unconfirmed and not being investigated?
Yes, this is known. If you live-migrate from one CPU type to another, the result is undefined: it may work, or it may not. Try to keep the hardware homogeneous.
Have you set cpu=host or something similar?
No, we use the default CPU type (kvm64) everywhere, for all VMs. We also don't adjust any CPU flag settings when creating a VM.
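To double-check what CPU model each VM actually has configured, a sketch (VM ID 100 is a placeholder; an absent `cpu:` line in the config means the kvm64 default):

```shell
# One VM: an explicit override shows up as a "cpu:" line
qm config 100 | grep '^cpu:' || echo "cpu: kvm64 (default)"

# All local VMs: list any explicit cpu overrides
grep -H '^cpu:' /etc/pve/qemu-server/*.conf

# If needed, force the conservative model explicitly
qm set 100 --cpu kvm64
```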
Our infrastructure grows gradually, and unfortunately our budget does not allow large purchases of exactly identical hardware.
Unfortunately we don't have detailed statistics, because our only cluster is the production one.
In short: today, migrations (10+ tests) from the Dell R430 to the Dell R910 succeeded, while migrations from the R910 to the R430 always (10+ tests) ended in hangs. After downgrading those nodes to 5.13, the problem disappeared.
I can run tests between nodes with different CPUs on kernel 5.15, but that takes time.
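The back-and-forth runs can be scripted; a rough sketch (the node name, VM ID, and guest address are placeholders; `qm migrate` has to be issued from the node currently hosting the VM, hence the ssh for the return trip):

```shell
#!/bin/bash
VMID=100                 # dedicated test VM
DST=pve403               # destination node
GUEST=vm100.example.net  # guest address for the ICMP liveness check
RUNS=5

for i in $(seq 1 "$RUNS"); do
    qm migrate "$VMID" "$DST" --online
    sleep 30    # give the guest time to settle (or to hang)
    # a hung guest stops answering ICMP, so ping is a cheap health probe
    if ping -c 3 -W 2 "$GUEST" >/dev/null 2>&1; then
        echo "run $i: VM OK"
    else
        echo "run $i: VM STUCK"
        break
    fi
    # migrate back from the destination for the next round
    ssh "$DST" qm migrate "$VMID" "$(hostname)" --online
    sleep 30
done
```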
Each migration test was performed five times; the results were identical across all five runs.
pve402 -> pve403 : VM STUCK, MANUALLY RESET TO RECOVER
pve402 -> pve406 : VM OK
pve402 -> pve408(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve402 -> pve409(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve403 -> pve402 : VM OK
pve403 -> pve406 : VM OK
pve403 -> pve408(5.13) : VM OK
pve403 -> pve409(5.13) : VM OK
pve406 -> pve402 : VM OK
pve406 -> pve403 : VM STUCK, MANUALLY RESET TO RECOVER
pve406 -> pve408(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve406 -> pve409(5.13) : VM STUCK, MANUALLY RESET TO RECOVER
pve408(5.13) -> pve402 : VM OK
pve408(5.13) -> pve403 : VM OK
pve408(5.13) -> pve406 : VM OK
pve408(5.13) -> pve409 : VM OK
pve409(5.13) -> pve402 : VM OK
pve409(5.13) -> pve403 : VM OK
pve409(5.13) -> pve406 : VM OK
pve409(5.13) -> pve408(5.13) : VM OK
Each migration test was performed five times; the results were identical across all five runs.
Every node is running pve-kernel-5.15.
All CPUs are Intel Xeon.
Silver 4210 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> Silver 4114 : VM OK
Silver 4210 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> Silver 4114 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
Silver 4114 -> Silver 4210 : VM OK
Silver 4114 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> Silver 4114 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM OK
E7- 4860 -> E5-2650 v4 : VM OK
E7- 4860 -> Silver 4114 : VM OK
E7- 4860 -> E5-2620 v3 : VM OK
I have to admit I was wrong in my first post, due to a lack of tests and test hardware. pve-kernel-5.19 behaves differently: it partly solves the problem, and partly creates a new one on a CPU that previously didn't have it.
Yes, that server is old, but it is powerful and stable enough for its job. There were no complaints about it on the Proxmox 6 cluster, nor here before installing kernel 5.19.
All firmware has been updated to the latest available versions. The intel-microcode package (non-free) was installed; no newer microcode was found.
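For reference, the checks I'd use to confirm the microcode state on each node (a sketch; the revision shown in the test below is an example value, not ours):

```shell
# Microcode revision currently loaded (first CPU is enough)
awk '/^microcode/ { print $3; exit }' /proc/cpuinfo

# Did the kernel apply an early microcode update at boot?
dmesg | grep -i 'microcode'

# Debian/Proxmox: the updates ship in the non-free intel-microcode package
apt install intel-microcode
```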
pve-kernel-5.13
Code:
Silver 4210 -> E5-2650 v4 : VM OK
Silver 4210 -> E5-2620 v3 : VM OK
Silver 4210 -> E7- 4860 : VM OK
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM OK
E7- 4860 -> E5-2650 v4 : VM OK
E7- 4860 -> E5-2620 v3 : VM OK
pve-kernel-5.15
Code:
Silver 4210 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> Silver 4114 : VM OK
Silver 4210 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4210 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> Silver 4114 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
Silver 4114 -> Silver 4210 : VM OK
Silver 4114 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
Silver 4114 -> E7- 4860 : VM STUCK, MANUALLY RESET TO RECOVER
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> Silver 4114 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM OK
E7- 4860 -> E5-2650 v4 : VM OK
E7- 4860 -> Silver 4114 : VM OK
E7- 4860 -> E5-2620 v3 : VM OK
pve-kernel-5.19
Code:
Silver 4210 -> E5-2650 v4 : VM OK
Silver 4210 -> E5-2620 v3 : VM OK
Silver 4210 -> E7- 4860 : VM OK
E5-2650 v4 -> Silver 4210 : VM OK
E5-2650 v4 -> E5-2620 v3 : VM OK
E5-2650 v4 -> E7- 4860 : VM OK
E5-2620 v3 -> Silver 4210 : VM OK
E5-2620 v3 -> E5-2650 v4 : VM OK
E5-2620 v3 -> E7- 4860 : VM OK
E7- 4860 -> Silver 4210 : VM STUCK, MANUALLY RESET TO RECOVER
E7- 4860 -> E5-2650 v4 : VM STUCK, MANUALLY RESET TO RECOVER
E7- 4860 -> E5-2620 v3 : VM STUCK, MANUALLY RESET TO RECOVER
What can you advise, other than downgrading all nodes of the production cluster to pve-kernel-5.13?
I have the same issue. Both nodes are on the same kernel, 5.15.83-1-pve.
Both are Intel systems, one with an i7-12700 (node 2) and the other an i5-9500T (node 1). When I move a VM from node 1 to node 2, it works fine with no freeze. When I move the same VM back to node 1, it freezes. I tried a different VM just to be sure; same issue.
It doesn't seem like it. I recently did a complete fresh install on two machines and I'm seeing the same issue. Both are Intel-based (i7-12700 and i5-10500T), latest kernel.