TL;DR: When I run a live 'Move Storage' operation in Proxmox, the SOURCE image appears to get rewritten far more aggressively than the destination, and I think that probably shouldn't happen!
Here's the (simplified) setup - three Proxmox nodes, two storage servers, one VM. The configuration of the VM or the disk appears to be irrelevant - discard, cache, etc. make no difference.
Both storage servers are ZFS servers, with the exported dataset set to sharenfs=async,rw,crossmnt,no_subtree_check,no_root_squash and sync=disabled. The storage is configured in the cluster as NFS mounts of /store1/datastore and /store2/datastore, with the default NFS mount version.
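For reference, this is roughly how each storage server and the cluster are set up (the dataset names, storage IDs and IP below are placeholders, not my real ones):

    # on each storage server: ZFS dataset exported over NFS
    zfs set sharenfs='async,rw,crossmnt,no_subtree_check,no_root_squash' store1/datastore
    zfs set sync=disabled store1/datastore

    # on the cluster: one NFS storage entry per storage server
    pvesm add nfs store1 --server 192.0.2.11 --export /store1/datastore --content images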
When I do a live 'move storage', QEMU for some reason goes NUTS and sends thousands and thousands of tiny writes to the SOURCE image (qcow2), which ends up causing IOPS starvation, and the source host gets even MORE unhappy. The DESTINATION storage server is fine, sitting at basically 0% disk utilization, because the source server is at 100% utilization writing to the source qcow2 image.
Before I go digging into this further, I'm wondering: is this expected? My current hypothesis is that it's something to do with internal snapshot and dirty bitmap tracking, but I haven't looked into it very far yet - I'm just wondering if I've made some fundamental error in the storage setup that's causing this.
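For anyone who wants to see what I'm seeing, this is roughly how I've been watching it (the VMID, storage ID and path are just examples from my lab):

    # on the source storage server: per-vdev IOPS while the move runs
    zpool iostat -v 1

    # on the Proxmox node: inspect the mirror block job that 'Move disk' starts
    qm monitor 100
    # then at the monitor prompt:
    info block-jobs

    # source image details (internal snapshots / bitmaps, if any);
    # -U is needed because the running VM holds the image lock
    qemu-img info -U /mnt/pve/store1/images/100/vm-100-disk-0.qcow2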
This only happens on an ONLINE move. When the VM is turned off, the move runs (as expected) at pretty much wire speed. I've been experimenting with a little VyOS VM with 32 GB of storage - which *does not write to its disk* - and a live storage move takes 2 minutes. With the VM off, it takes less than a second.
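If it helps anyone reproduce it, this is essentially the test I've been running (VMID 100, disk scsi0 and the storage IDs are placeholders):

    # live move, VM running: ~2 minutes
    time qm move-disk 100 scsi0 store2

    # offline move, VM stopped: under a second
    qm shutdown 100
    time qm move-disk 100 scsi0 store1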
Those numbers feel WILDLY wrong to me, which makes me think that I've done something silly, as I'm sure other people would have noticed this before now!
(Running the latest version of everything as of right now - that was the first thing I tried!)