As for the Intel vs AMD side of this issue: we are seeing it on both, but AMD is far more impacted.
We have a 10-node cluster of rather old hardware - the CPUs are Intel Xeon E5 (dual socket), plus one node with AMD Opteron 6344 (quad socket).
Nodes were running stable.
We're using PBS.
We updated PVE from 6.2-11 to 6.3-4 on 2-3 nodes per day, observed them, then moved on to the next nodes the following day.
All was good; the node with the AMD Opteron was the last to be updated.
After updating the Opteron node, all VMs on it crashed during the night:
- no response to ICMP or ssh from the VM
- couldn't connect to the VM's console in the node GUI
- qm stop managed to stop the VM, but only by killing the process (not gracefully)
- qm start started the VM normally
Load on the node was about 1-1.5 per VM, with lots of "qmp command failed" and "qmp socket timeout" messages in the node's syslog.
The situation repeated the next morning, same scenario - VMs dead, high load on the node, qmp errors in syslog.
We updated the Opteron node to 6.3-6 (kernel 5.4.103-1) and rebooted.
Next morning - same scenario, plus one of the Intel Xeon E5 nodes got the same issue (some 5-6 days after it was updated to 6.3-4).
Then we did the downgrade advised here to
pve-qemu-kvm=5.1.0-8 libproxmox-backup-qemu0=1.0.2-1
on the Opteron node and on the one affected Intel node.
One morning later all VMs are fine on the AMD node (and on the one affected Intel).
8 Intels - no issues.
One Intel had the issue once.
The AMD had the issue multiple times.
I've noticed that the mass qmp errors in syslog tend to start after midnight. We don't have anything specific scheduled then; PBS backups start at 22:00 and are done by about 23:30.
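
If someone wants to check the "starts after midnight" pattern on their own node, here is a rough sketch that counts qmp-related syslog lines per hour of day. It assumes the classic syslog timestamp prefix ("Mar 18 00:05:01 host ...") and matches on the substrings "qmp command" and "qmp socket", which are taken from the description above, not from exact log lines; the function names are made up for illustration, so adjust the patterns to whatever your syslog actually shows.

```python
import re
from collections import Counter

# Classic syslog prefix: month, day, HH:MM:SS (e.g. "Mar 18 00:05:01 host ...").
# Assumption: the node logs in this format; the hour is captured in group 1.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s")

# Substrings based on the errors described in the post; adjust as needed.
PATTERNS = ("qmp command", "qmp socket")


def qmp_errors_per_hour(lines, patterns=PATTERNS):
    """Count lines containing any of `patterns`, grouped by hour of day."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and any(p in line for p in patterns):
            counts[int(match.group(1))] += 1
    return counts


def report(path):
    """Print a simple 24-row text histogram for the given log file."""
    with open(path, errors="replace") as handle:
        counts = qmp_errors_per_hour(handle)
    for hour in range(24):
        bar = "#" * min(counts[hour], 60)
        print(f"{hour:02d}:00  {counts[hour]:5d}  {bar}")
```

Calling report("/var/log/syslog") on the affected node should show whether the errors really cluster after 00:00 or already begin while the PBS backup window (22:00 to about 23:30) is still open.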