VMs become unresponsive when backups run or a disk is moved

birdflewza

Member
Mar 18, 2021
Hi there,

I have a 10-node cluster running v6.3-3. Each node has access to an SSD Ceph pool (pool_ssd), and each node has separate Ceph cluster, Proxmox cluster, and Proxmox data networks. I have recently set up a new node with local SSDs in RAID 10. The local-lvm storage shows up and is sufficiently large (4.67 TiB of 6.83 TiB). The new node has 2 x Xeon E5-2630 v3 CPUs and 188 GB of RAM. Disk tests with fio in a VM running on this node yielded speeds in excess of 3000 MB/s.
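
For reference, the fio test was roughly the following, run inside a test VM; the file path, size, and job count are just what I happened to pick, so adjust to taste:

    # sequential write test, 4 jobs, direct I/O, run inside the VM
    fio --name=seqwrite --rw=write --bs=1M --size=10G --numjobs=4 \
        --direct=1 --ioengine=libaio --group_reporting \
        --filename=/root/fio-testfile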

Currently I am trying to move a VM disk (1 TB) from pool_ssd to local-lvm, but after about 10 seconds all VMs on that node become unresponsive. There is nothing apparent in /var/log/syslog. The same issue occurs when backups of the VMs run. The local-lvm disk is definitely fast enough at 3000 MB/s, yet IO delay ends up quite high, at 14%+.
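
While the move runs I have been keeping an eye on the array with iostat from the sysstat package; if the local disks really are the bottleneck, %util should sit near 100 during the stall (command from memory):

    # per-device extended stats, refreshed every second
    iostat -x 1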

Can someone perhaps help me shed some light on the matter?

Cheers,
Curt
 
Hi @Dominic, thanks for your response. I will give this a try. Do you think that, even with the high read and write speeds on the node, moving disks and running backups would max out the bandwidth?
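
In case it helps anyone else, the first thing I will try is capping the move itself; if I am reading the docs right, qm move_disk takes a --bwlimit in KiB/s (the VM ID and disk name below are only examples):

    # retry the move capped at ~100 MiB/s (102400 KiB/s)
    qm move_disk 101 scsi0 local-lvm --bwlimit 102400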
 
Experiencing the same. Moving a disk from a NAS to the local SSD rpool. The Proxmox node's web interface, SSH, and its VMs no longer respond. The data transfer is still happening, though; I just can't monitor its progress anymore.

It looks like the local filesystem on the Proxmox node is fully saturated by the move, which is what causes it to become unresponsive. Shouldn't there be some kind of throttling mechanism to prevent this from happening?
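
As far as I can tell there is one, it just isn't applied by default: /etc/pve/datacenter.cfg accepts per-operation bandwidth limits in KiB/s. Something along these lines, where the numbers are only a guess at what a local SSD array can absorb without starving the VMs:

    # /etc/pve/datacenter.cfg - cluster-wide I/O caps (KiB/s)
    bwlimit: move=102400,restore=102400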
 
I have had similar problems with Debian in general. If load and memory utilization get too high in the VM, the VM becomes unresponsive. I assume that Linux is doing its out-of-memory thing, but it's poorly designed, in my opinion. It can take hours to stabilize, and I find it's quicker to just kill the VM and do any data recovery that is needed. My attempts at avoiding this issue are to limit throughput (bwlimit) and to provide plenty of memory. Cgroup configuration could probably fix the problem too, but if you have the patience to figure out that mess, you're a better person than I.
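
For the backup side, the cap can go in /etc/vzdump.conf so it applies to every scheduled job; a minimal sketch, with the value in KiB/s and the number itself just a starting point to tune:

    # /etc/vzdump.conf
    bwlimit: 102400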
 
