VM's become unresponsive when backups run or a disk is moved

birdflewza · Mar 18, 2021

Hi there,

I have a 10 node cluster running v6.3-3. Each node has access to an SSD Ceph pool (pool_ssd). Each node has a separate ceph cluster network, proxmox cluster network, and proxmox data network. I have recently setup a new node with local SSDs in raid 10. The local-lvm storage shows and is sufficient (4.67 TiB of 6.83 TiB). The new node has 2 x Xeon(R) CPU E5-2630 v3 with 188GB RAM. Disk tests with fio in a VM running on this node yielded speeds in excess of 3000 MBps.

Currently I am trying to move a VM disk (1TB) from the pool_ssd to the local-lvm, but after about 10 seconds all VM's on that node become unresponsive. There is nothing apparent in the /var/log/syslog. The same issue occurs when backups of the VMs run. The local-lvm disk is definitely fast enough at 3000 MBps. IO delay ends up being quite high at 14+.

Can someone perhaps help me shed some light on the matter?

Cheers,
Curt

Dominic · Mar 22, 2021

Hi,

you could try to use the --bwlimit parameter

Code:

qm move_disk <vmid> <disk> <storage> --bwlimit 100

Maybe with some really low numbers first.

qm man page: https://pve.proxmox.com/pve-docs/qm.1.html

birdflewza · Mar 22, 2021

Hi @Dominic, thanks for your response. I will give this a try. Do you think that even with the high read and write speeds on the node that moving of disks and backups would max out the bandwidth?

JamesT · Jul 28, 2021

Hi @curtsahd , wondering if you ever got to the bottom of this? I'm trying to track down similar issues in my cluster.

mgiammarco · Jul 28, 2021

Is bwlimit really implemented? I have read in another post it is not working.

hapklaar · Nov 18, 2023

Experiencing the same. Moving a disk from NAS to local SSD rpool. The proxmox node web interface, ssh and its vm's are no longer responding. Data transfer is still happening though, just can't monitor its progress anymore.

Looks like the local filesystem on the proxmox node is fully saturated by the move which is causing it to become unresponsive. Shouldn't here be some kind of throttling mechanism to prevent this from happening?

EllyMae · Nov 18, 2023

I have had similar problems with Debian in general. If load and memory utilization get too high in the VM, the VM becomes unresponsive. I assume that Linux is doing its out-of-memory thing. But it's poorly designed, in my opinion. It can take hours to stabilize. I find it's quicker to just kill the VM and do any data recovery that is needed. My attempts at avoiding this issue are to limit throughput (bwlimit) and try to provide plenty of memory. Cgroup configuration could probably fix the problem, too. But if you have the patience to figure out that mess, you're a better person that I.

Search

Search

VM's become unresponsive when backups run or a disk is moved

birdflewza

Member

Dominic

Proxmox Retired Staff

birdflewza

Member

JamesT

New Member

mgiammarco

Renowned Member

hapklaar

Member

EllyMae

Member

We value your privacy