Hello,
We have a 4-node Proxmox cluster of HPE DL385 Gen10 servers with AMD EPYC 7002 series CPUs, 100G Mellanox cards, a 100G switch, WD SN640 enterprise NVMe SSDs, and 1-2 TB of RAM per host.
The hardware should be solid and robust, and we keep the Proxmox hosts updated.
For the past month or so we have been seeing timeouts whenever we start a VM or try to snapshot one. We need to start it two or three times before it works.
When we start a VM, we get errors like:
Task viewer: VM 286 - Start
TASK ERROR: got timeout
----------------------------------------------------------
Or, if we try to snapshot it:
VM 286 qmp command 'savevm-end' failed - VM snapshot not started
snapshot create failed: starting cleanup
Removing image: 1% complete...
Removing image: 2% complete...
Removing image: 3% complete...
Trying a second or third time then succeeds.
Apart from this, we have also seen many Linux VMs (Debian, Ubuntu, and CentOS alike) freeze repeatedly, sometimes with high I/O. Windows machines work without any issues.
The Ceph cluster reports no errors, and all 26 drives are healthy.
Any suggestions?