Timeout while starting VM / Snapshotting disk etc...

deepcloud

Member
Feb 12, 2021
India
deepcloud.in
Hello,

We have a 4-node Proxmox cluster on HPE DL385 Gen10 servers: AMD EPYC 7002 series CPUs, 100G Mellanox cards, a 100G switch, WD SN640 enterprise NVMe SSDs, and 1-2 TB of RAM per host.
So the hardware seems pretty solid and robust. We keep the Proxmox hosts updated.

For the past month or so we have been seeing timeouts whenever we start a VM or try to snapshot one. We have to start it two or three times before it works.

When we start a VM, we get errors like:

Task viewer: VM 286 - Start

TASK ERROR: got timeout

----------------------------------------------------------
Or, if we try to snapshot it:

VM 286 qmp command 'savevm-end' failed - VM snapshot not started
snapshot create failed: starting cleanup
Removing image: 1% complete...
Removing image: 2% complete...
Removing image: 3% complete...

Trying a second or third time, it succeeds.

Apart from this, we have also seen many Linux VMs (Debian, Ubuntu, CentOS, all of them) freeze repeatedly, at times with high IO. Windows machines work without any issues.

The Ceph cluster reports no errors, and all 26 drives are healthy.
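
For anyone wanting to reproduce that kind of check, here is a minimal sketch, assuming the standard `ceph` CLI is available on a node. The helper name `classify_ceph_health` is hypothetical; it only classifies the one-line output of `ceph health` and is not how the status above was necessarily obtained.

```shell
#!/bin/sh
# Hypothetical helper: classify the one-line output of `ceph health`
# read from stdin. Prints ok/warn/error and returns 0/1/2 accordingly.
classify_ceph_health() {
    # First word is the status; anything after it is the summary text.
    read -r status _rest
    case "$status" in
        HEALTH_OK)   echo "ok";    return 0 ;;
        HEALTH_WARN) echo "warn";  return 1 ;;
        *)           echo "error"; return 2 ;;
    esac
}

# Usage on a cluster node (not executed here):
#   ceph health | classify_ceph_health
#   ceph health detail   # lists individual warnings, if any
#   ceph osd perf        # per-OSD commit/apply latency
```

A clean `HEALTH_OK` does not rule out latency problems, so the per-OSD latency from `ceph osd perf` may be worth watching while a VM start or snapshot is hanging.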

Any suggestions?
 
Before anyone suggests an older kernel, make sure that you are running the current, up-to-date kernel.

So please check whether you are running the latest version and post the output of:

> pveversion -v
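
When comparing several nodes, it can help to pull just the running kernel out of that output. A minimal sketch, assuming the first line of `pveversion -v` has the form `proxmox-ve: <version> (running kernel: <kernel>)`; the function name `running_kernel` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the running kernel found in `pveversion -v`
# output read from stdin. Prints nothing if no matching line is present.
running_kernel() {
    # Keep only the text between "(running kernel: " and the closing ")".
    sed -n 's/^proxmox-ve:.*(running kernel: \([^)]*\)).*$/\1/p' | head -n 1
}

# Usage on each node (not executed here):
#   pveversion -v | running_kernel
```

If the value differs between nodes, or is older than the newest kernel package listed in the same output, a node has probably not been rebooted since the last kernel update.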
 
