Rolling update of 4.1 cluster

Rudo

Hello,

I have a 3-node cluster which I tried to patch from 4.1-2 to 4.1-22.

The plan was to migrate the VMs, online, from node 1 to node 2, patch node 1, reboot and migrate the VMs from node 2 back to 1. Rinse and repeat for the other 2 nodes.
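
Roughly, that migration step corresponds to the following on the CLI (the VMIDs and node names here are placeholders; "qm list" shows the real ones):

# live-migrate every VM off node 1 onto node 2
for vmid in 101 102 103; do
    qm migrate $vmid node02 --online
done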

After patching (apt-get update && apt-get dist-upgrade) and rebooting node 1, the cluster was quorate and showed all 3 nodes as active.
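
Step by step, that came down to roughly the following on node 1 (the quorum check at the end can also be done with pvecm):

apt-get update && apt-get dist-upgrade
reboot
# after the node is back up, verify cluster membership and quorum
pvecm status
pvecm nodes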

However, it was not possible to migrate the VMs online back to node 1. The error message does not really indicate what the actual problem is, though:

Mar 26 23:30:09 starting migration of VM 102 to node 'node01' (10.20.30.41)
Mar 26 23:30:09 copying disk images
Mar 26 23:30:09 starting VM 102 on remote node 'node01'
Mar 26 23:30:12 starting ssh migration tunnel
Mar 26 23:30:13 starting online/live migration on localhost:60000
Mar 26 23:30:13 migrate_set_speed: 8589934592
Mar 26 23:30:13 migrate_set_downtime: 0.1
Mar 26 23:30:15 ERROR: online migrate failure - aborting
Mar 26 23:30:15 aborting phase 2 - cleanup resources
Mar 26 23:30:15 migrate_cancel
Mar 26 23:30:16 ERROR: migration finished with problems (duration 00:00:07)


The VMs all have RBD disks, which are located on a separate, dedicated Ceph cluster.

Shutting down the VMs, migrating offline and starting them back up on the patched node worked fine. But of course I would prefer being able to patch the nodes without having to shut down the VMs.
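
For the record, the offline workaround boiled down to roughly this per VM (102 used as an example):

qm shutdown 102
qm migrate 102 node01
qm start 102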

Any feedback on what I did wrong and how to do a rolling update without downtime would be greatly appreciated.
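
Also, is there a more detailed migration log on disk that I should be quoting here? The following is where I would look, but the task log path is from memory and may well be wrong:

# on the source node: the individual task logs (path is a guess on my part)
ls -lt /var/log/pve/tasks/*/ | head
# on the target node: the syslog around the time of the failed start
tail -n 100 /var/log/syslog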

Best regards,
Rudo
 
Hi Udo, thank you for your reply.

Unfortunately, this question is after the fact. As I didn't want to leave the cluster in a half-upgraded state with different versions between nodes (albeit only minor versions), I ended up moving the VMs offline and upgrading all nodes. I am asking as a postmortem and in preparation for the next update.

Do you mean I should have done an "apt-get upgrade" instead of an "apt-get dist-upgrade" on the source node?

From what I understand, both would have upgraded all installed packages to their newest versions. The "upgrade" would not have removed any packages, even if they were no longer required, while the "dist-upgrade" could also have installed additional dependencies or removed obsolete packages. Both would have required a reboot.
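
To double-check my understanding, a dry run shows what each command would have done without changing anything (-s only simulates):

apt-get update
apt-get -s upgrade        # never removes packages or installs new ones
apt-get -s dist-upgrade   # may also install new dependencies or remove obsolete packages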

I don't really understand how that would have affected the issue I encountered, though.

Best regards,
Rudo
 
Hi,

You can try to migrate via node 3; this should work.
 
Hi Wolfgang,

I think I tried what you suggested; I didn't explain that properly in the initial question.

After patching node 1, I was unable to migrate the VMs back to node 1. So I migrated all VMs from node 2 to node 3, then patched and rebooted node 2 (so that two of the three nodes were on the same version).

After rebooting, the cluster was quorate and all three nodes showed as active. However, I was still not able to migrate the VMs online from node 3 back to node 1 or node 2.
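
For reference, comparing the exact package versions across the nodes can be done with pveversion -v; a rough sketch, assuming root ssh between the nodes (node names are placeholders):

for n in node01 node02 node03; do
    echo "== $n =="
    ssh root@$n pveversion -v | head -n 5
done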

Do you mean that should have worked?

Kind regards,
Rudo
 
Hi Wolfgang,

I am happy to hear that. At least my assumptions and actions at the time were correct.

Do you have any suggestions on how to best diagnose this issue if it happens again in the future?
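
For my own notes, the rough checklist I would start with next time (corrections welcome):

# on the target node, watch the logs live while retrying the migration
journalctl -f
# on the source node, retry a single small VM to reproduce the error
qm migrate 102 node01 --online
# and confirm quorum/cluster state before and after
pvecm status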

Kind regards,
Rudo
 
