Cannot Live Migrate with "Discard" Set

stevensedory

Member
Oct 26, 2019
We are running Proxmox 6:

root@kvm02-a-lax:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.21-4-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-11
pve-kernel-5.0: 6.0-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.15-1-pve: 5.0.15-1

Our OS lives on a mirrored ZFS pool, and our VM volumes are on a separate local striped-mirror ZFS pool. We have two nodes in our cluster.

Live migration works fine, but if a disk has the Discard flag set, the migration freezes.

We then have to stop the migration, usually run "qm unlock" on the VM, stop the leftover VM process on the receiving node, and delete the disk that was created on the receiving node.
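For reference, the cleanup we end up doing looks roughly like this (the VM ID, process pattern and dataset name are just examples from our setup, adjust to yours):

# on the source node, after cancelling the stuck migration task
qm unlock 101

# on the receiving node: the VM config still lives on the source,
# so kill the leftover incoming kvm process directly
pkill -f 'kvm -id 101'

# on the receiving node: remove the half-created target zvol
zfs destroy VMs/vm-101-disk-0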

Is Live Migration with Discard on ZFS not supported? I cannot find anything that explicitly says this.
 
Hi,

This should work.
The problem may be the qemu-agent "guest-trim" flag.
Try disabling it in the Options section.
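If the GUI option is hard to find, the same thing can probably be done from the CLI; assuming the "guest-trim" checkbox corresponds to the agent's fstrim_cloned_disks sub-option, something like this (the VM ID is just an example):

qm set 101 --agent enabled=1,fstrim_cloned_disks=0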
 
The problem may be the qemu-agent "guest-trim" flag.

I tried enabling the agent and installed it on the guest, but it's still the same issue. It gets this far, then the disk I/O gets really high on the receiving host, but there's no network traffic showing in the host summary, so it seems the disk isn't actually transferring. Normally I see the disk copy progress at this point, all the way to "OK".

2019-11-13 06:54:45 starting migration of VM 101 to node 'kvm02-a-lax' (10.32.1.86)
2019-11-13 06:54:45 found local disk 'VMs:vm-101-disk-0' (in current VM config)
2019-11-13 06:54:45 copying disk images
2019-11-13 06:54:45 starting VM 101 on remote node 'kvm02-a-lax'
2019-11-13 06:54:49 start remote tunnel
2019-11-13 06:54:51 ssh tunnel ver 1
2019-11-13 06:54:51 starting storage migration
2019-11-13 06:54:51 scsi0: start migration to nbd:10.32.1.86:60000:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
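
For anyone hitting the same hang: one way to check whether the drive mirror is actually making progress is to query the QEMU monitor on the source node (the VM ID is again just our example):

# qm monitor 101
qm> info block-jobs

If the mirror job's completed-bytes figure doesn't move between runs, the copy really is stalled rather than just slow.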
 
Hello, thanks for this thread. I'm not certain that I'm having the same issue, since I'm not using ZFS but (local) LVM; however, the behavior is exactly the same. I am also using the 'discard' flag. The qemu-agent is disabled.

I'm using the following command:
#qm migrate 110 node1 -online -with-local-disks -bwlimit 100000


Then nothing shows up even after waiting a very long time (I have waited longer than in this example) while the receiving node is under full I/O load, and I ultimately have to cancel the job:

2020-03-02 17:51:39 starting migration of VM 110 to node 'node1' (10.7.17.18)
2020-03-02 17:51:40 found local disk 'ssdimages:vm-110-disk-0' (in current VM config)
2020-03-02 17:51:40 copying local disk images
2020-03-02 17:51:40 starting VM 110 on remote node 'node1'
2020-03-02 17:51:42 start remote tunnel
2020-03-02 17:51:43 ssh tunnel ver 1
2020-03-02 17:51:43 starting storage migration
2020-03-02 17:51:43 scsi0: start migration to nbd:10.7.17.18:60000:exportname=drive-scsi0
drive mirror is starting for drive-scsi0 with bandwidth limit: 100000 KB/s
drive-scsi0: Cancelling block job
2020-03-02 18:06:39 ERROR: online migrate failure - mirroring error: interrupted by signal

#pveversion -v:
proxmox-ve: 6.1-2 (running kernel: 5.0.21-5-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.13-1-pve: 5.3.13-1

I have used the same command successfully before on the same node. I'm wondering whether it has something to do with PVE version differences between the nodes, or perhaps a regression related to the 'discard' feature, but that's only a hunch...

Anything I could try or look at?
 
Hey, yeah, I have just been disabling the discard flag (which of course requires a reboot) whenever I need to live migrate. I'm not sure what the fix is; perhaps it has been fixed in a newer release. I suppose one of us should submit a bug ;)
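For reference, the flag lives on the disk line in the VM config, e.g. in /etc/pve/qemu-server/101.conf (the volume and size here are just an example from our setup):

scsi0: VMs:vm-101-disk-0,discard=on,size=100G

Removing the ",discard=on" part and doing a full stop/start of the VM is what we've been doing before migrating.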
 
@stevensedory: It's a tricky thing to triage/reproduce. How exactly is your storage set up (I know you use ZFS, but not the details)? I'm using thin volumes in LVM; how does that compare to your setup?
 
@stevensedory: It's a tricky thing to triage/reproduce. How exactly is your storage set up (I know you use ZFS, but not the details)? I'm using thin volumes in LVM; how does that compare to your setup?

We have ZFS for the Proxmox OS, and then a separate pool for the VMs. We're not using LVM, just ZFS, which I believe supports thin provisioning (otherwise we wouldn't see the storage usage shrink when we run "fstrim -a" with discard set).
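A quick way to double-check that on the ZFS side (the dataset name is just our example) is to compare the zvol's allocated space before and after a trim:

# on the host
zfs get volsize,used,refreservation VMs/vm-101-disk-0
# inside the guest, with discard set on the virtual disk
fstrim -av
# on the host again - "used" should have dropped if discards are passed through
zfs get used VMs/vm-101-disk-0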
 
If we have the same problem, it doesn't really fail.
You can live migrate with VirtIO SCSI single + lvm-thin + discard, but for a large disk it takes a very long time to start!

With a small VM (10 GB) the problem is not easy to see; you only have to wait a few seconds.
With a large VM (1 TB), you have to wait several minutes before the disk migration starts.

When you look at the LV on the destination host, you can see the "Mapped size" growing:
$ lvdisplay
LV Path /dev/local-disk/vm-5000-disk-0
LV Size 1.56 TiB
Mapped size 27.55%

With discard=false, the migration starts immediately (while "Mapped size" is still 0%).
With discard=true, the migration starts only once "Mapped size" reaches 100%.
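
A slightly more convenient way to watch this on the destination node is the data percentage column of lvs (VG and LV names as in the lvdisplay example above):

watch -n 5 'lvs -o lv_name,lv_size,data_percent local-disk/vm-5000-disk-0'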
 
So we had this problem again, with the same setup as described at the start of this thread, but while live migrating a VM that didn't have discard set :/

The only thing I could find worth noting was that this VM did not have "format=raw", as mentioned in the bug referenced above.

What happened was, we live migrated another VM without the discard flag set, with no problem. Then a few minutes later, we started migrating this other VM. It got about 50% of the way through, then the I/O on the receiving server shot up, effectively pausing all of the production VMs on it.

We're thinking there's something to that format=raw setting, but we can't afford another outage, so we can't test it right now.

So this is still an issue, and contrary to what the bug thread says, it isn't affecting just Alucard1 (the bug reporter).
 
Brand-new forum member here... in fact, I joined because of this issue. I'm currently running Proxmox 7, as of right now the latest version of 7. I started seeing this issue when trying to move a VM's hard drive to an ESOS array. If you're not familiar with ESOS, it's an SCST-based Linux distro for building an open-source storage array on standard hardware; at its core it's running SCST. My ESOS system is configured with both Fibre Channel and iSCSI. I also have an EqualLogic array running iSCSI, and I was playing around with my first Proxmox host when I ran into this issue.

In an attempt to narrow this down, I've tried both Fibre Channel and iSCSI on my ESOS array, as well as iSCSI on my EqualLogic array. With the discard flag set, I have been running into the exact same issues as described: moving a disk from local lvm-thin to the ESOS array, whether over iSCSI or Fibre Channel, hits a long pause after the initial drive creation, with high I/O on the drive but low traffic from the initiators to the targets. On a Fibre Channel connection this causes the initiators in the Proxmox system to time out and eventually go into a reset cycle, after which the controllers hard reset; eventually the drive copy starts and completes over both Fibre Channel and iSCSI. However, doing the EXACT SAME OPERATION but moving from local lvm-thin to the EqualLogic array, the move kicks off and starts copying the disk immediately. No large pause, either with or without the discard flag set.

I have not seen the issue moving back to local lvm-thin from either array; the move happens as fast as I expect it to. One thing I've been researching is whether the SCST implementation on ESOS supports thin provisioning, and only one mode appears to: vdisk_fileio. However, there is no setting in the TUI to configure this, so I would need to drop to the CLI and add the setting to the scst.conf file manually, or use scstadmin to add it to the conf file. It's on my short list of next steps, but I have not tried it yet.
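If vdisk_fileio does turn out to be the way to go, my understanding is that the relevant bit of scst.conf would look something like the snippet below; the device name and backing file path are made up, and I haven't verified the thin_provisioned attribute against my SCST version yet:

HANDLER vdisk_fileio {
        DEVICE disk01 {
                # backing file on the array (hypothetical path)
                filename /mnt/array/disk01.img
                # advertise thin provisioning / UNMAP support to initiators
                thin_provisioned 1
        }
}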

Just thought I'd share my experience on this as I have two arrays to "compare" with each other where one works great and the other doesn't when using the discard flag. Happy for any input or thoughts from anyone on this situation!
 