Move Disk Soft Locks All Storage

Cmdrd

I have been struggling with an issue for the past few weeks where using either "Move Disk" in the GUI or "qm move disk" in the CLI to move a disk from one ZFS pool to another soft locks all storage pools on the host, and VMs become entirely unresponsive. This only seems to happen with VMs >30GB; for disks <=30GB the move completes incredibly quickly. After about 31GB have been moved, the move progress slows to an absolute crawl and "zpool iostat" for all pools drops to 0, despite running workloads that end up freezing.
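For reference, this is roughly the command I'm running; the VM ID, disk, and target pool name below are just placeholders:

Code:
# move a VM disk from one ZFS-backed storage to another, deleting the source afterwards
qm disk move 100 scsi0 tank2 --delete 1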

After canceling the task and monitoring the processes waiting on disk I/O with the following command (watch -n 1 "(ps aux | awk '$8 ~ /D/ { print $0 }')"), I can see a bunch of ZFS operations queued up along with the "zfs create -s -V" command that "qm disk move" likely triggered. It takes about 10 minutes for the pools to start writing data again, and then everything returns to normal as if nothing happened. Both pools are ashift=12 with an 8k blocksize. Stopping all load on both pools makes no difference; the soft lock still occurs.
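For anyone wanting to reproduce the monitoring, here is a quoting-safe variant of that watch command (nothing in it is specific to my setup):

Code:
# list processes stuck in uninterruptible sleep (D state), refreshed every second
watch -n 1 "ps aux | awk '\$8 ~ /D/'"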

pveversion and package versions are below; the latest updates are all applied and I'm still experiencing this issue.

Code:
# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.8-2-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-14
proxmox-kernel-6.8: 6.8.8-2
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
pve-kernel-5.15.158-1-pve: 5.15.158-1
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.3
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.4.2
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.12-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.0-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
 
Maybe your pools are just overloaded. (Buffer bloat? Slow non-enterprise SSDs? Rotating rust? A bad topology like RAIDZx?) To debug this, you could limit the transfer speed to some realistic, unproblematic value. See "Datacenter --> Options --> Bandwidth Limits --> Disk Move". To find an acceptable value, watch "Node --> Summary --> CPU usage / IO delay"; it should stay low...
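If I remember correctly, the same limit can also be set outside the GUI, either per job or cluster-wide in /etc/pve/datacenter.cfg (the values should be in KiB/s, so ~50 MiB/s would be 51200; VM ID, disk, and storage name are only examples):

Code:
# per-job override of the disk move bandwidth limit
qm disk move 100 scsi0 target-zfs --bwlimit 51200

# cluster-wide default for disk moves, set in /etc/pve/datacenter.cfg
bwlimit: move=51200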

This would be an unwelcome workaround - but to stabilize a cluster it may be useful.
 
Thanks for the reply! Guess I should have posted the hardware specs and configs.

The two pools are mirrors of a pair of Samsung PM1725s and a pair of Samsung 883 DCT SSDs. Wear is low on both, they're trimmed, and they benchmark at their peak rates for long periods without issue, so the drives themselves do not appear to be the problem. I can max out I/O on both pools for tens of minutes without this issue occurring; the weird part is that the only scenario in which I can replicate it is moving disks.

I've disabled compression and atime on both pools; next I'll also test disabling dedup to rule that out. I have been observing this with no load on the pools at all, on a fresh boot with nothing powered up. iostat shows no load on any of the SSDs, which has been the most confusing part, as that would be the first indication of the storage hardware being overloaded. It feels like a race condition in software.
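For reference, roughly the commands I'm using to check and toggle those properties (the pool names here are just placeholders for my two mirrors):

Code:
# show the current settings on both pools
zfs get compression,atime,dedup nvmepool satapool

# disable compression and atime; dedup is next on the list
zfs set compression=off nvmepool
zfs set atime=off nvmepool
zfs set dedup=off nvmepool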

I'll test with those settings; I did not know they were there, thanks for the suggestion! I'll post back here with more testing results. If I see similar results, I'll try mdadm-configured mirrors with ext4 as the filesystem to see whether it's specific to ZFS or lies somewhere else in the software stack.
 
Setting the bandwidth limit for Disk Move appears to have solved the problem, thank you so much for the suggestion! I set it at 50 MiB/s and got a successful migration of a 100GB disk that I was not able to move before. I'll play around with it a bit to find a balance that maintains stability, but disk moves aren't a super common task, so I'm not worried about the lower speed adding much operational overhead; reliability is far more important given the issues that were occurring.

Thank you again!
 
Issue appears to be back, despite setting lower bandwidth limits.

I/O delay is 0 and there is no activity according to iostat, but I can't start or move any VMs hours after cancelling the process. I get the feeling that there's potentially an issue with the I/O scheduler getting stuck on something.

After another failed move disk at low speeds I have been getting dmesg warnings that the following tasks are blocked, so there is definitely something going wrong:
dbuf_evict, z_zvol, kworker, kvm
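Nothing exotic about how I'm spotting these, just the standard hung-task warnings and the kernel stack of a stuck process (the PID below is a placeholder):

Code:
# hung-task warnings with readable timestamps
dmesg -T | grep -i "blocked for more than"

# kernel stack of one of the stuck tasks (run as root, substitute the real PID)
cat /proc/1234/stack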
 
I have been doing more testing, using zfs send/receive to move the zvols with the VMs powered off, and I've been able to move hundreds of GB at a sustained 600 MB/s without a single freeze, so it is absolutely tied to the "qm move disk" command and the processes it executes.
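For anyone curious, this is roughly the manual process that works without issue (the VM ID, pool names, and snapshot name are placeholders, and the VM config still has to be pointed at the new zvol afterwards):

Code:
# snapshot the source zvol while the VM is powered off
zfs snapshot nvmepool/vm-100-disk-0@move

# replicate it to the destination pool
zfs send nvmepool/vm-100-disk-0@move | zfs recv satapool/vm-100-disk-0

# once the VM config references the new storage, clean up
zfs destroy satapool/vm-100-disk-0@move
zfs destroy -r nvmepool/vm-100-disk-0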
 
