Rolling update of 4.1 cluster

Rudo

Hello,

I have a 3-node cluster which I tried to patch from 4.1-2 to 4.1-22.

The plan was to migrate the VMs, online, from node 1 to node 2, patch node 1, reboot and migrate the VMs from node 2 back to 1. Rinse and repeat for the other 2 nodes.
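
Roughly, that migration step corresponds to the following on the CLI (the VMIDs and node names here are placeholders; "qm list" shows the real ones):

# live-migrate every VM off node 1 onto node 2
for vmid in 101 102 103; do
    qm migrate $vmid node02 --online
done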

After patching (apt-get update && apt-get dist-upgrade) and rebooting node 1, the cluster was quorate and showed all 3 nodes as active.
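
Step by step, that came down to roughly the following on node 1 (the quorum check at the end can also be done with pvecm):

apt-get update && apt-get dist-upgrade
reboot
# after the node is back up, verify cluster membership and quorum
pvecm status
pvecm nodes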

However, it was not possible to migrate the VMs online back to node 1. The error message does not really indicate what the actual problem is, though:

Mar 26 23:30:09 starting migration of VM 102 to node 'node01' (10.20.30.41)
Mar 26 23:30:09 copying disk images
Mar 26 23:30:09 starting VM 102 on remote node 'node01'
Mar 26 23:30:12 starting ssh migration tunnel
Mar 26 23:30:13 starting online/live migration on localhost:60000
Mar 26 23:30:13 migrate_set_speed: 8589934592
Mar 26 23:30:13 migrate_set_downtime: 0.1
Mar 26 23:30:15 ERROR: online migrate failure - aborting
Mar 26 23:30:15 aborting phase 2 - cleanup resources
Mar 26 23:30:15 migrate_cancel
Mar 26 23:30:16 ERROR: migration finished with problems (duration 00:00:07)


The VMs all have RBD disks, which are located on a separate, dedicated Ceph cluster.

Shutting down the VMs, migrating offline and starting them back up on the patched node worked fine. But of course I would prefer being able to patch the nodes without having to shut down the VMs.
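
For the record, the offline workaround boiled down to roughly this per VM (102 used as an example):

qm shutdown 102
qm migrate 102 node01
qm start 102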

Any feedback on what I did wrong and how to do a rolling update without downtime would be greatly appreciated.
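
Also, is there a more detailed migration log on disk that I should be quoting here? The following is where I would look, but the task log path is from memory and may well be wrong:

# on the source node: the individual task logs (path is a guess on my part)
ls -lt /var/log/pve/tasks/*/ | head
# on the target node: the syslog around the time of the failed start
tail -n 100 /var/log/syslog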

Best regards,
Rudo
 
Hi Udo, thank you for your reply.

Unfortunately, this question is after the fact. As I didn't want to leave the cluster in a half-upgraded state with different versions between nodes (albeit only minor versions), I ended up moving the VMs offline and upgrading all nodes. I am asking as a postmortem and in preparation for the next update.

Do you mean I should have done an "apt-get upgrade" instead of an "apt-get dist-upgrade" on the source node?

From what I understand, both would have upgraded all installed packages to their newest versions. The "upgrade" would not have removed any packages, even if they were no longer required, while the "dist-upgrade" could also have installed additional dependencies or removed obsolete packages. Both would have required a reboot.
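
To double-check my understanding, a dry run shows what each command would have done without changing anything (-s only simulates):

apt-get update
apt-get -s upgrade        # never removes packages or installs new ones
apt-get -s dist-upgrade   # may also install new dependencies or remove obsolete packages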

I don't really understand how that would have affected the issue I encountered, though.

Best regards,
Rudo
 
Hi,

You can try to migrate via node 3; this should work.
 
Hi Wolfgang,

I think I tried what you suggested; I didn't explain that properly in the initial question.

After patching node 1, I was unable to migrate the VMs back to node 1. So I migrated all VMs from node 2 to node 3, then patched and rebooted node 2 (so that two of the three nodes were on the same version).

After rebooting, the cluster was quorate and all three nodes showed as active. However, I was still not able to migrate the VMs online from node 3 back to node 1 or node 2.
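
For reference, comparing the exact package versions across the nodes can be done with pveversion -v; a rough sketch, assuming root ssh between the nodes (node names are placeholders):

for n in node01 node02 node03; do
    echo "== $n =="
    ssh root@$n pveversion -v | head -n 5
done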

Do you mean that should have worked?

Kind regards,
Rudo
 
Hi Wolfgang,

I am happy to hear that. At least my assumptions and actions at the time were correct.

Do you have any suggestions on how to best diagnose this issue if it happens again in the future?
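
For my own notes, the rough checklist I would start with next time (corrections welcome):

# on the target node, watch the logs live while retrying the migration
journalctl -f
# on the source node, retry a single small VM to reproduce the error
qm migrate 102 node01 --online
# and confirm quorum/cluster state before and after
pvecm status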

Kind regards,
Rudo
 
