Hello Everyone!
I have a 4-server cluster that had been running fine for 700+ days, but eventually we needed to update it, mostly because we want to start using PBS, and since we are on version 5.4.x we want to roll the servers up to version 6.
I set up an offline lab with the same configuration as the production nodes to evaluate whether I could update them one by one, following the upgrade steps from the Proxmox wiki (both 5to6 and 6to7 were tested this way), but only updating a single node at a time.
Anyhow, the docs include the step of updating corosync to version 3 on all nodes, even the ones that won't go from 5 to 6, so that the cluster stays healthy. The update was done this way (see the command sketch right after the list):
- all nodes updated with every package available within version 5.4 itself, ending up on 5.4-15
- corosync updated first on node4, the one where I had all VMs and configs backed up
- corosync then updated on the other 3 nodes in parallel, as suggested
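Roughly, the corosync 3 step looked like this on each node (I am reproducing the repo line from memory of the 5-to-6 wiki, so please treat this as an approximation rather than the exact commands we ran):

echo "deb http://download.proxmox.com/debian/corosync-3/ stretch main" > /etc/apt/sources.list.d/corosync3.list
apt update
apt dist-upgrade --download-only
apt dist-upgrade    # pulls in corosync 3.x plus the matching lib* packages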
After this the cluster went wild: lots of errors, and machines and nodes showing question marks, each node only reachable by logging into its GUI separately. The other 3 nodes all have crucial machines up and running, so I could not take them down or reboot the whole setup, apart from node4, which was already updated to version 6.
After some tests and manipulation of the services (stop / restart of pve-cluster, corosync, pve-ha-lrm, pve-ha-crm, pveproxy, pvedaemon, pvestatd and pve-firewall) while realigning the cluster, some nodes came back up, but in the process one would eventually come up with different ring IDs, or think it was part of a new cluster. After a lot of back and forth on this, we narrowed it down to node2 in particular being the problem.
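For reference, the stop/restart sequence I was using on each node during those attempts looked more or less like this (the exact order is from memory, so take it as a sketch):

# stop the HA services first (to avoid any watchdog fencing)
systemctl stop pve-ha-crm pve-ha-lrm
# stop the rest of the stack
systemctl stop pve-firewall pvestatd pvedaemon pveproxy
systemctl stop pve-cluster corosync
# then bring everything back up in reverse order
systemctl start corosync pve-cluster
systemctl start pveproxy pvedaemon pvestatd pve-firewall
systemctl start pve-ha-crm pve-ha-lrm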
The node updated from 5 to 6 (node4) was running corosync version 3.1.1, so as a first change we made sure all nodes had the same version of all related packages:
apt install libcpg4=3.0.4-pve1
apt install libcmap4=3.0.4-pve1
apt install libquorum5=3.0.4-pve1
apt install libvotequorum8=3.0.4-pve1
apt install libcorosync-common4=3.0.4-pve1
apt install corosync=3.0.4-pve1
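To confirm that every node really ended up on the same corosync stack, I compared the installed versions on each node:

pveversion -v | grep -i corosync
dpkg -l | grep -E 'corosync|libcpg|libcmap|libquorum|libvotequorum'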
After this, if we take node2 out of the picture (stopping corosync and pve-cluster on it), we can manage the other 3 nodes, which are on different versions (node1 = 5.4-15, node3 = 5.4-15, node4 = 6.4-13), and management works again.
One procedure we tried, to replicate the state of the "good" nodes onto the problematic node2, was:
# on node2: stop the cluster stack
systemctl stop corosync pve-cluster
# copy the pmxcfs database from a known-good node
scp node1:/var/lib/pve-cluster/* /var/lib/pve-cluster
# bring the stack back up
systemctl restart corosync.service
systemctl restart pve-cluster.service
After this procedure they almost got synchronized (checking with watch pvecm status), but at some point we got a whole bunch of [TOTEM] replication messages in syslog, and the nodes all lost communication.
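These are the checks I was running while node2 tried to sync (the corosync-cfgtool / corosync-quorumtool calls are ones I only added afterwards to gather more detail; I can attach their output and the syslog excerpts if that helps):

# quorum / membership as seen by PVE
watch pvecm status
# ring and quorum state as corosync itself reports it
corosync-cfgtool -s
corosync-quorumtool -s
# follow the cluster stack logs for the [TOTEM] messages
journalctl -u corosync -u pve-cluster -f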
Is there any way to force node2 to rejoin the cluster while the machines are still running, without breaking the entire cluster?
And where, on any given node, could I look to find out why they cannot replicate the cluster state (corosync / pmxcfs) properly?
Please let me know if I can provide any other information or logs. Thanks in advance!