We skipped Proxmox 4.1, but recently decided to upgrade our 4.0 Proxmox servers (a five-machine cluster with iSCSI shared storage) to 4.2, mainly because of the GUI changes. It should be noted that 4.0 -> 4.0 live migration has always worked perfectly for us between any combination of the 4.0 servers.
The upgrade of one of the 4.0 servers to 4.2 seemed to go smoothly (well, once we uninstalled Dell's OMSA and re-installed it afterwards), and starting/stopping a test VM on the upgraded 4.2 server was fine.
However, testing live migration of a CentOS 6 VM between a 4.2 server and a 4.0 server (critical to avoid downtime during upgrades) was a total failure in both directions. Here are the messages from a live migration (4.0 -> 4.2), with the IP obscured as A.B.C.D:
Jun 03 14:06:00 starting migration of VM 143 to node 'proxmox04' (A.B.C.D)
Jun 03 14:06:00 copying disk images
Jun 03 14:06:00 starting VM 143 on remote node 'proxmox04'
Jun 03 14:06:03 ERROR: online migrate failure - unable to detect remote migration address
Jun 03 14:06:03 aborting phase 2 - cleanup resources
Jun 03 14:06:03 migrate_cancel
Jun 03 14:06:04 ERROR: migration finished with problems (duration 00:00:05)
migration problems
And there's a different error message going from 4.2 -> 4.0:
Jun 03 14:20:54 starting migration of VM 143 to node 'proxmox02' (A.B.C.D)
Jun 03 14:20:54 copying disk images
Jun 03 14:20:54 starting VM 143 on remote node 'proxmox02'
Jun 03 14:20:55 start failed: command '/usr/bin/systemd-run --scope --slice qemu --unit 143 -p 'CPUShares=1000' /usr/bin/kvm -id 143 [stuff deleted] -machine 'type=pc-i440fx-2.5' -incoming 'tcp:[localhost]:60000' -S' failed: exit code 1
Jun 03 14:20:55 ERROR: online migrate failure - command '/usr/bin/ssh -o 'BatchMode=yes' root@A.B.C.D qm start 143 --stateuri tcp --skiplock --migratedfrom proxmox04 --machine pc-i440fx-2.5' failed: exit code 255
Jun 03 14:20:55 aborting phase 2 - cleanup resources
Jun 03 14:20:55 migrate_cancel
Jun 03 14:20:55 ERROR: migration finished with problems (duration 00:00:01)
migration problems
This isn't the first time we've had issues live migrating during an upgrade procedure (other users have too), and I suspect it won't be the last. Does anyone else have issues with live migration between different minor 4.X releases? Do the devs test such combinations before a minor update is released? It looks like we're going to have to do offline migrations for dozens of VMs because of this issue - downtime for each and every migration, which is very annoying indeed.
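If we do end up forced onto the offline path, at least the per-VM shutdown/migrate/start cycle can be scripted rather than clicked through dozens of times in the GUI. A minimal sketch of that loop, assuming `qm` on the source node and passwordless root SSH to the target; the node name `proxmox04`, the VM ID, and the 120-second timeout are placeholders, and `DRY_RUN=1` only prints the commands:

```shell
#!/bin/sh
# Hedged sketch, not tested against a real cluster: node name, VM IDs,
# and the shutdown timeout below are example values.

# run CMD... : print the command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# offline_migrate VMID TARGET_NODE : graceful guest shutdown, offline
# migration (no --online flag), then start the VM on the target node.
offline_migrate() {
    vmid=$1
    node=$2
    run qm shutdown "$vmid" --timeout 120   # wait for the guest to power off
    run qm migrate "$vmid" "$node"          # offline migration to the 4.2 node
    run ssh "root@$node" qm start "$vmid"   # bring the VM back up over there
}

# Example (dry run): offline_migrate 143 proxmox04
```

Run with `DRY_RUN=0` only after checking the printed commands; wrapping `offline_migrate` in a loop over VM IDs handles the whole batch, one VM's downtime at a time.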