When performing a live VM migration (latest Proxmox VE 7, enterprise repository) from one server to another, where both servers use local disks on LVM-thin (ext4 on hardware SSD RAID-10), we find that if the VM's hard disk has "Discard" enabled, the migration hammers the I/O of the target node until the first copy phase is over (the "drive mirror is starting for drive" portion). Once that first full copy of the VM completes, I/O settles down for the remainder of the copy and sync. The live migration copies the full provisioned VM size (not the actual thin-provisioned size) but then trims after the migration is complete.
If we disable discard on the hard drive and then run a live VM migration, I/O stays very low the entire time, as we'd expect. The migration still copies the full provisioned size (not the thin size), but I/O remains low throughout. Of course, we cannot reclaim space with fstrim until we re-enable discard after the migration, which requires a reboot.
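For clarity, this is roughly the post-migration sequence we mean; the VM ID, bus (scsi0), and storage/disk names below are just placeholders for our setup, not exact copies of our config:

qm set 100 --scsi0 local-lvm:vm-100-disk-0,discard=on   # re-enable discard on the virtual disk (placeholder IDs; only takes effect after the guest restarts)
# then, inside the guest, release the freed blocks back to the thin pool:
fstrim -av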
We've tested with bandwidth limits, which dampen the I/O spike somewhat with discard on, but even on SSD RAID-10 the spike is still very high (20-40% I/O delay) and impacts the other VMs running on the target node.
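For example, we've tried limiting bandwidth both per-migration and cluster-wide; the VM ID, node name, and limit here are illustrative (limits are in KiB/s, so 102400 is roughly 100 MiB/s):

qm migrate 100 target-node --online --with-local-disks --bwlimit 102400   # per-migration limit in KiB/s
# or a cluster-wide default in /etc/pve/datacenter.cfg:
# bwlimit: migration=102400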
Is this a bug in the way live migrations are done, or just an inherent limitation of live migrations with discard enabled? Ideally we'd like to keep discard on and not have to reboot a VM twice (once to disable discard beforehand, once to re-enable it afterwards) in order to live migrate it without impacting other VMs' performance on the target node.