Timeout while starting VM / Snapshotting disk etc...

deepcloud

Feb 12, 2021
Hello,

We have a 4-node Proxmox cluster of HPE DL385 Gen10 servers with AMD EPYC 7002 series CPUs, 100G Mellanox cards, a 100G switch, WD SN640 enterprise NVMe SSDs and 1-2 TB of RAM per host.
So the hardware should be solid and robust, and we keep the Proxmox hosts updated.

For the past month or so we have been seeing timeouts whenever we start a VM or try to snapshot one. We have to retry two or three times before it works.

If we start a VM, we get errors like:

Task viewer: VM 286 - Start

TASK ERROR: got timeout

----------------------------------------------------------
Or, if we try to snapshot it:

VM 286 qmp command 'savevm-end' failed - VM snapshot not started
snapshot create failed: starting cleanup
Removing image: 1% complete...
Removing image: 2% complete...
Removing image: 3% complete...

When we try a 2nd or 3rd time, it succeeds.

Apart from this, we have also seen many Linux VMs (Debian, Ubuntu, CentOS, all of them) freeze repeatedly, sometimes under high IO. Windows machines work without any issues.

The Ceph cluster reports no errors, and all 26 drives are healthy.

Any suggestions?
 
Before suggesting an "older kernel", make sure that you are on the current, up-to-date kernel.

So please check whether you are running the latest version and post the output of:

> pveversion -v
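
If it helps, here is a minimal sketch for collecting that output together with a bit of context in one go. Only `pveversion -v` was actually asked for; adding `uname -r` (running kernel) and `ceph -s` (Ceph status) is my own assumption about what might be useful here, and `collect_info` is just a hypothetical helper name.

```shell
#!/bin/sh
# Collect basic version and health information to paste into a forum reply.
# Only "pveversion -v" was requested; "uname -r" and "ceph -s" are assumed extras.
# Commands that are not installed on this host are skipped with a note.
collect_info() {
    for cmd in "pveversion -v" "uname -r" "ceph -s"; do
        tool=${cmd%% *}          # first word of the command line
        echo "== $cmd =="
        if command -v "$tool" >/dev/null 2>&1; then
            $cmd 2>&1            # intentionally unquoted so arguments are split
        else
            echo "($tool not found, skipped)"
        fi
    done
}

collect_info
```

Run it on each node and redirect the output to a file (for example `sh collect.sh > node1.txt`, where the file names are placeholders) so the versions can be compared across the cluster.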