Live Migration and Versions

Lymond

We added a 6th node to our cluster and its version is one (or so) ahead of the rest. We hadn't live migrated anything in a while, but live migrations are now failing (they've always been fine). We're wondering whether, for live migrations to work between any two nodes in the cluster, ALL the nodes in the cluster need to be running the same version of the pve packages. With some verbose debugging, we can see the multiple SSH connections happening as the live migration runs, and it fails at the beginning of the memory migration. We can SSH between all the nodes.

We have scheduled a summer project to get everything on the same version; I'm just curious whether this is a must for live migration.
 
Live migrations from older to newer PVE should work, and we actively keep an eye on this. Live migrating from newer to older versions is not tested and thus might not work.

What versions do you have installed in your cluster currently?
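
As a quick sketch for collecting that from every node (the node names are placeholders for your own):

Code:
# run from any node that has SSH access to the others;
# replace the node names with your own
for node in node1 node2 node3; do
    echo "== $node =="
    ssh "$node" pveversion
done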
 
Also, please include the pveversion -v output from both source and target node, as well as the VM config and the full migration log. Ideally, include the system logs from both nodes for the period around the failure as well.
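
For reference, something like the following gathers most of that (the VM ID and the time window are placeholders to adjust):

Code:
# on both source and target node
pveversion -v

# VM config (replace 100 with the VM ID being migrated)
qm config 100

# system logs around the failure window (adjust the timestamps)
journalctl --since "2021-05-11 15:00" --until "2021-05-11 16:00"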
 
Finally getting back to this. Attached are the pveversion outputs from both host nodes. We have shared storage over NFS to both. All systems seem to be fine chatting over SSH on the backend network. We also have a longer trace file for the migration attempt I can send -- it's 8 MB so it won't attach here. We can rerun tests for more info as well.
 


Updating this thread: live migrations are still failing. We have 6 front-end nodes (vm5-vm10) with shared backend storage over NFS (zstor3). Attached are the pveversion -v outputs from the nodes involved in the transfer -- vm8 sending VM 120 to vm9 -- as well as verbose logs from each node.
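
For context, the attempt that produced these logs was an online migration along the lines of the following (a sketch -- the same thing can be triggered from the GUI):

Code:
# run on the source node vm8; --online requests a live migration
qm migrate 120 vm9 --online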
 


The log indicates that plain unix socket forwarding fails:

Code:
2021-05-11 15:24:43 start migrate command to unix:/run/qemu-server/120.migrate
debug1: Connection to port -2 forwarding to /run/qemu-server/120.migrate port -2 requested.
debug2: fd 9 setting O_NONBLOCK
debug3: fd 9 is O_NONBLOCK
debug1: channel 2: new [direct-streamlocal@openssh.com]
debug3: send packet: type 90
debug3: receive packet: type 92
channel 2: open failed: connect failed: open failed
debug2: channel 2: zombie
debug2: channel 2: garbage collecting
debug1: channel 2: free: direct-streamlocal@openssh.com: listening port -2 for /run/qemu-server/120.migrate port -2, connect from  port 0 to /run/qemu-server/120.migrate port 0, nchannels 3
debug3: channel 2: status: The following connections are open:
  #1 client-session (t4 r0 i0/0 o0/0 e[write]/0 fd 6/7/8 sock -1 cc -1)

2021-05-11 15:24:44 migration status error: failed
2021-05-11 15:24:44 ERROR: online migrate failure - aborting
2021-05-11 15:24:44 aborting phase 2 - cleanup resources

Does that unix socket exist on either end even when no migration is running? Could you try removing it and then attempting another migration?
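
To check that, something along these lines should work (the socket path is taken from the log above; the manual forwarding test is a sketch and assumes a netcat with unix socket support):

Code:
# on both source and target node: look for a stale migration socket
ls -l /run/qemu-server/120.migrate

# if it exists while no migration is running, remove it
rm /run/qemu-server/120.migrate

# optional: test SSH unix socket forwarding by hand
# on the target node, listen on a scratch socket:
nc -lU /tmp/fwd-test.sock
# on the source node, forward a local socket to it and connect:
ssh -N -L /tmp/fwd-test-local.sock:/tmp/fwd-test.sock root@TARGET &
nc -U /tmp/fwd-test-local.sock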
 
This turned out to be our fault. We'd made a change in Puppet to restrict users from signing in via SSH -- turning off TCP forwarding -- which breaks live migration. We've reverted that and now things work. The clue was the type 92 packet (SSH_MSG_CHANNEL_OPEN_FAILURE), which points specifically at sshd refusing the forwarded channel and made us stare harder at our config. Thank you for the support.
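
For anyone hitting the same thing, here is a minimal sshd_config sketch, assuming you want forwarding restricted for ordinary users while keeping it available to root (which the PVE migration tunnel connects as). AllowStreamLocalForwarding is the option that specifically governs unix-domain socket forwarding; check sshd_config(5) and your Puppet module against your own setup before rolling this out:

Code:
# /etc/ssh/sshd_config
# the migration tunnel forwards a unix socket over SSH as root,
# so forwarding must stay enabled for that account
AllowTcpForwarding yes
AllowStreamLocalForwarding yes

# restrict everyone except root instead of disabling forwarding
# globally (the "!root,*" pattern excludes root from the Match)
Match User !root,*
    AllowTcpForwarding no
    AllowStreamLocalForwarding no

# then reload: systemctl reload sshd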
 
