Live migration failing once after reboot

raregtp

Member
Oct 30, 2021
Greetings! First post here so take it easy on me....

I've been working tirelessly over the last several weeks to get my homelab cluster up and running, complete with shared FC storage (ESOS) running a GFS2 cluster filesystem. After many starts, stops, and reconfigurations I believe I finally have this working the way I'd like. Today I was reconfiguring the iSCSI NICs in my hosts to use them (that subnet/VLAN) for migrations as well, since those NICs/bonds/bridges and the physical switches are all configured with jumbo frames. Since switching to this setup, I've run into a curious issue when I try to live migrate machines that are on the local datastores from one host to another. The first time after a host is rebooted, live migrating a machine to either of the other two hosts fails with the info below:

2022-01-11 22:55:29 use dedicated network address for sending migration traffic (192.168.10.211)
2022-01-11 22:55:29 starting migration of VM 107 to node 'moros' (192.168.10.211)
2022-01-11 22:55:29 found local disk 'local-data:107/vm-107-disk-0.qcow2' (in current VM config)
2022-01-11 22:55:29 starting VM 107 on remote node 'moros'
2022-01-11 22:55:33 volume 'local-data:107/vm-107-disk-0.qcow2' is 'local-data:107/vm-107-disk-0.qcow2' on the target
2022-01-11 22:55:33 start remote tunnel
2022-01-11 22:55:34 ssh tunnel ver 1
2022-01-11 22:55:34 starting storage migration
2022-01-11 22:55:34 scsi0: start migration to nbd:unix:/run/qemu-server/107_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 49.0 MiB of 35.0 GiB (0.14%) in 2s
drive-scsi0: transferred 167.0 MiB of 35.0 GiB (0.47%) in 3s
client_loop: send disconnect: Broken pipe
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2022-01-11 22:55:38 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled
2022-01-11 22:55:38 aborting phase 2 - cleanup resources
2022-01-11 22:55:38 migrate_cancel
2022-01-11 22:55:41 ERROR: writing to tunnel failed: broken pipe
2022-01-11 22:55:41 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems
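
For context, the dedicated migration network is set in /etc/pve/datacenter.cfg and the hosts carry MTU 9000 on the migration bond/bridge. A rough sketch of that setup (interface/bridge names and addresses here are placeholders, not my exact config):

# /etc/pve/datacenter.cfg
migration: secure,network=192.168.10.0/24

# /etc/network/interfaces (per host, placeholder names)
auto bond1
iface bond1 inet manual
    bond-slaves eno3 eno4
    bond-mode 802.3ad
    mtu 9000

auto vmbr1
iface vmbr1 inet static
    address 192.168.10.210/24    # each host has its own address on this subnet
    bridge-ports bond1
    bridge-stp off
    bridge-fd 0
    mtu 9000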

If I immediately restart the live migration, it succeeds, and every live migration after that also succeeds until I reboot the host again. Once the host is rebooted, the cycle starts over. I am seeing this on all three of my hosts, but was NOT seeing it when I was using another network/NIC for the live migrations. Both the previous network and the current network that sees the errors are 1 Gb networks...the working network runs across HP switches, the network with the issues runs across Dell PowerConnect switches. The only other difference is that the problem network has jumbo frames enabled on the physical switch and in the networking configuration on the host. All other networking configuration on the host is nearly identical between the two networks.
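
One thing I still need to verify is whether jumbo frames actually pass end-to-end across the Dell switches. Something like this from the source host should show it (8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers; the target address is just an example from my setup):

ping -M do -s 8972 -c 4 192.168.10.211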

I can get more detailed as needed, but wanted to get this out there first to get the conversation going. Let me know what questions you have and what data you need to attempt to help me solve this.
 
Well, go figure.....after composing the previous post I tried the "second" migration of said VM....and it failed as before. But.....I immediately restarted the migration with the same machine and it succeeded. Not sure if this is somehow time-related, but I will let it sit overnight and try another migration in the AM now that I've had one successful migration.
 
Ok....decided to try one of my other hosts that had not been rebooted all evening, but also hadn't done any live migrations, and it had a failure as well; retrying immediately afterwards succeeded, as laid out above. The only info I can find related to the error (I searched for 'ssh writing to tunnel failed broken pipe') deals with SSH settings (ServerAliveInterval, etc.). Will post more info as I come across it. Thanks!!
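
For reference, the SSH settings those threads mention would go in the SSH client config on the migration source node. Just a sketch of what they suggest (I haven't confirmed it actually helps here):

# /root/.ssh/config on the source node (host pattern is an example)
Host 192.168.10.*
    ServerAliveInterval 15
    ServerAliveCountMax 4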
 
2022-01-11 22:55:29 found local disk 'local-data:107/vm-107-disk-0.qcow2' (in current VM config)
2022-01-11 22:55:29 starting VM 107 on remote node 'moros'
2022-01-11 22:55:33 volume 'local-data:107/vm-107-disk-0.qcow2' is 'local-data:107/vm-107-disk-0.qcow2' on the target
...
2022-01-11 22:55:38 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled

is you "local-data" storage, your gfs2 cluster filesytem ?
if yes, do you have enable "shared" option on the storage ? because proxmox it's try to copy/mirror the disk to the target host.
(and if it's the same storage, it could hang as it's the same file. (or even destroy it if they are no lock).
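
You can check this quickly in /etc/pve/storage.cfg; a shared storage entry carries the "shared" flag, roughly like this (storage names and paths are just examples):

dir: local-data
    path /data/images
    content images,rootdir
    shared 0

dir: gfs2-shared
    path /mnt/gfs2
    content images
    shared 1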
 
No, local-data is LVM formatted with XFS, and is not marked as shared. My GFS2 storage is on the SAN and is marked as shared, using DLM as the lock manager. DLM is set to use SCTP since I configured redundant rings in corosync, and once the .mount unit dependencies were set correctly I've had no issues using the shared storage.

That said, I also didn't have any issues that I recall until I switched the migration network to use the iSCSI network.
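
For completeness, the DLM-over-SCTP bit I mentioned is just the protocol setting in /etc/dlm/dlm.conf, roughly:

# /etc/dlm/dlm.conf
# use SCTP so DLM can follow the redundant corosync rings
protocol=sctp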
 
So after leaving the hosts sit for almost 24 hours, I'm not seeing any migration failures. I'm sure that if I reboot one of them it will fail on its first migration....just not able to get to the reboots tonight.

Suggestions/thoughts are very welcome!! Thanks!
 
