Cluster broken after update to PVE 8.2?

proxwolfe

Hi,

I have a three node cluster in my home lab that was running on PVE 8.1.5.

I have now upgraded the first node to 8.2.2 and, after rebooting, it no longer connects to the cluster.

The remaining cluster can't see the upgraded node and the upgraded node can't see the remaining cluster.

Did I miss anything regarding the upgrade? What can I do to repair my cluster?

Thanks!
 
Hmm, all nodes can ping each other.

So networking doesn't seem to be the issue.
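I guess the next thing to check is whether the cluster services are actually running on the upgraded node and what corosync is logging there. I assume that would be something along these lines (standard service names on PVE):

Code:
# on the upgraded node: are the cluster services up?
systemctl status corosync pve-cluster

# corosync log since the last boot (look for link/token/auth errors)
journalctl -u corosync -b --no-pager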
 
Would it make sense to also upgrade the other two nodes to PVE 8.2.2?

I am a bit reluctant to try, in case afterwards none of the nodes can see each other anymore...
 
I can also SSH from each of the nodes into all of the other nodes. So it doesn't seem to be an SSH key issue (if there is such a thing).
 
Hmmm, it might be useful to paste the contents of /etc/pve/corosync.conf as well, just for good measure. :)
 
This is from the upgraded node:
Code:
# pvecm status
Cluster information
-------------------
Name:             pvecluster
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  9 11:10:00 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.6c7
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 192.168.123.6 (local)


This is from the first of the other two nodes:
Code:
# pvecm status
Cluster information
-------------------
Name:             pvecluster
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  9 11:10:36 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000003
Ring ID:          1.6c1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.123.7
0x00000003          1 192.168.123.1 (local)

This is from the second of the other two nodes:
Code:
# pvecm status
Cluster information
-------------------
Name:             pvecluster
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  9 11:14:25 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.6c1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.123.7 (local)
0x00000003          1 192.168.123.1
 
Hmmm, it might be useful to paste the contents of /etc/pve/corosync.conf as well, just for good measure. :)

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.123.6
  }
  node {
    name: node2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.123.1
  }
  node {
    name: node3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.123.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pvecluster
  config_version: 15
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

node1 (nodeid 4) is the upgraded node
 
Hmmm, that config file looks completely fine to me. Nothing weird in there.

From the pvecm status output... I'm kind of thinking that maybe there was some (unmentioned) protocol change in corosync, leading to the nodes partitioning themselves like that.
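If that's what happened, it should show up as a version difference between the nodes. Comparing something roughly like this on all three nodes would tell you (pveversion -v lists the corosync and knet package versions, among others):

Code:
# run on each node and compare the output
pveversion -v | grep -E 'corosync|knet'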

When I upgraded a test cluster from 8.1 to 8.2 recently, I did all of the nodes at once and things just all magically worked.

I guess your choice is "do I upgrade the other two and hope for the best"... or not... :eek:
 
Oh. If your Proxmox boot disks are running ZFS, you could snapshot them prior to the upgrade.

That way if the upgrade just breaks everything you can roll them back to the older version.
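On a default PVE-on-ZFS install the root dataset is usually rpool/ROOT/pve-1 (worth confirming with zfs list first), so it would be roughly:

Code:
# before the upgrade (dataset name assumed, check with `zfs list`)
zfs snapshot rpool/ROOT/pve-1@pre-8.2-upgrade

# if the upgrade breaks things: roll back
# (for the root dataset this needs to be done from a rescue/live boot)
zfs rollback rpool/ROOT/pve-1@pre-8.2-upgrade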
 
Oh. If your Proxmox boot disks are running ZFS, you could snapshot them prior to the upgrade.

That way if the upgrade just breaks everything you can roll them back to the older version.
Ext4 on all of them.

(I think I did install PVE on ZFS a few years ago, prior to my current cluster, and it gave me some other issues, with booting I believe, so I have refrained from using ZFS on the boot drive since.)

So, unfortunately, upgrading is a one way ticket for me...
 
Guessing you've also tried just stopping the corosync service on node1, waiting a few seconds, then starting corosync back up again and seeing if it decides to play ball?
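That would be something like this on node1:

Code:
# stop corosync, give it a moment, start it again
systemctl stop corosync
sleep 10
systemctl start corosync

# then check whether it rejoined the cluster
pvecm status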
 
Getting a bit more experimental, you could maybe try just upgrading corosync on both of the other nodes (i.e. apt install corosync, pulling from the Proxmox repos), so that all three nodes are definitely running the same version.
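Roughly this on each of the other two nodes, assuming their package sources already point at the Proxmox repos (which they should on a working PVE install):

Code:
apt update
apt install corosync

# afterwards, confirm all three nodes report the same version
corosync -v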

Clearly I'm just spitballing though, and anything could happen. (!)
 
