Greetings! First post here so take it easy on me....
I've been working tirelessly over the last several weeks to get my homelab cluster up and running, complete with shared FC storage (ESOS) running the GFS2 cluster filesystem. After many starts, stops, and reconfigurations, I believe I finally have this working the way I'd like. Today I reconfigured the iSCSI NICs in my hosts to use that subnet/VLAN for migrations as well, since those NICs/bonds/bridges and the physical switches are all configured with jumbo frames. Since switching to this setup, I've run into a curious issue when I try to live migrate machines that are on the local datastores from one host to another. The first time after a host is rebooted, live migrating a machine to either of the other two hosts fails with the info below:
2022-01-11 22:55:29 use dedicated network address for sending migration traffic (192.168.10.211)
2022-01-11 22:55:29 starting migration of VM 107 to node 'moros' (192.168.10.211)
2022-01-11 22:55:29 found local disk 'local-data:107/vm-107-disk-0.qcow2' (in current VM config)
2022-01-11 22:55:29 starting VM 107 on remote node 'moros'
2022-01-11 22:55:33 volume 'local-data:107/vm-107-disk-0.qcow2' is 'local-data:107/vm-107-disk-0.qcow2' on the target
2022-01-11 22:55:33 start remote tunnel
2022-01-11 22:55:34 ssh tunnel ver 1
2022-01-11 22:55:34 starting storage migration
2022-01-11 22:55:34 scsi0: start migration to nbd:unix:/run/qemu-server/107_nbd.migrate:exportname=drive-scsi0
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 49.0 MiB of 35.0 GiB (0.14%) in 2s
drive-scsi0: transferred 167.0 MiB of 35.0 GiB (0.47%) in 3s
client_loop: send disconnect: Broken pipe
drive-scsi0: Cancelling block job
drive-scsi0: Done.
2022-01-11 22:55:38 ERROR: online migrate failure - block job (mirror) error: drive-scsi0: 'mirror' has been cancelled
2022-01-11 22:55:38 aborting phase 2 - cleanup resources
2022-01-11 22:55:38 migrate_cancel
2022-01-11 22:55:41 ERROR: writing to tunnel failed: broken pipe
2022-01-11 22:55:41 ERROR: migration finished with problems (duration 00:00:12)
TASK ERROR: migration problems
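For context on the "dedicated network address" line at the top of that log: the migration network is pinned in /etc/pve/datacenter.cfg, which on my cluster looks roughly like this (the subnet is mine; the type/network syntax is as I understand it from the Proxmox docs):

# /etc/pve/datacenter.cfg - route migration traffic over the jumbo-frame subnet
migration: secure,network=192.168.10.0/24

As far as I can tell from the log, the NBD/drive-mirror traffic for the local disk is riding over that same network through the SSH tunnel.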
If I immediately restart the live migration, it succeeds, and every live migration after that also succeeds until I reboot the host again. Once the host is rebooted, the cycle starts over. I am seeing this on all three of my hosts, but I was NOT seeing it when I was using another network/NIC for live migrations. Both the previous network and the current network that sees the errors are 1 Gb networks; the working network runs across HP switches, while the problem network runs across Dell PowerConnect switches. The only other difference is that the problem network has jumbo frames enabled on the physical switches and in the networking configuration on the hosts. All other networking configuration on the hosts is nearly identical between the two networks.
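Since jumbo frames are the main difference, I can gather output from checks like these after a reboot if it helps (bond1/vmbr1 are placeholders for my actual bond/bridge names):

# confirm the MTU is actually applied on the bond and bridge after a reboot
ip link show bond1
ip link show vmbr1

# confirm 9000-byte frames pass end to end without fragmentation
# (8972 = 9000 minus 28 bytes of IP + ICMP headers; -M do sets the don't-fragment flag)
ping -M do -s 8972 -c 3 192.168.10.211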
I can get more detailed as needed, but I wanted to get this out there first to get the conversation going. Let me know what questions you have and what data you need to help me solve this.
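In the meantime, here's a rough sketch of the migration-side stanzas from /etc/network/interfaces (this would be the target host 'moros' from the log above; interface names, bond members, and the bond mode are simplified placeholders, and I can post the real files if that helps):

# bond carrying the iSCSI/migration subnet (names, slaves, and mode are placeholders)
auto bond1
iface bond1 inet manual
        bond-slaves eno3 eno4
        bond-miimon 100
        bond-mode 802.3ad
        mtu 9000

# bridge on top of the bond; each host has its own address in 192.168.10.0/24
auto vmbr1
iface vmbr1 inet static
        address 192.168.10.211/24
        bridge-ports bond1
        bridge-stp off
        bridge-fd 0
        mtu 9000

The mtu 9000 is set on both the bond and the bridge, and the equivalent settings exist on the other two hosts with their own addresses in that subnet.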