Cluster broken after update to PVE 8.2?

proxwolfe

Hi,

I have a three node cluster in my home lab that was running on PVE 8.1.5.

I have now upgraded the first node to 8.2.2 and, after rebooting, it no longer connects to the cluster.

The remaining cluster can't see the upgraded node and the upgraded node can't see the remaining cluster.

Did I miss anything regarding the upgrade? What can I do to repair my cluster?

Thanks!
 
Hmm, all nodes can ping each other.

So networking doesn't seem to be the issue.
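I guess the next thing to check is whether the cluster services are actually running on the upgraded node and what corosync is logging there. I assume that would be something along these lines (standard service names on PVE):

Code:
# on the upgraded node: are the cluster services up?
systemctl status corosync pve-cluster

# corosync log since the last boot (look for link/token/auth errors)
journalctl -u corosync -b --no-pager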
 
Would it make sense to also upgrade the other two nodes to PVE 8.2.2?

I am a bit reluctant to try, in case afterwards none of the nodes can see each other anymore...
 
I can also SSH from each of the nodes into all of the other nodes. So it doesn't seem to be an SSH key issue (if there is such a thing).
 
Hmmm, it might be useful to paste the contents of /etc/pve/corosync.conf as well, just for good measure. :)
 
This is from the upgraded node:
Code:
# pvecm status
Cluster information
-------------------
Name:             pvecluster
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  9 11:10:00 2024
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000004
Ring ID:          4.6c7
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000004          1 192.168.123.6 (local)


This is from the first of the other two nodes:
Code:
# pvecm status
Cluster information
-------------------
Name:             pvecluster
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  9 11:10:36 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000003
Ring ID:          1.6c1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.123.7
0x00000003          1 192.168.123.1 (local)

This is from the second of the other two nodes:
Code:
# pvecm status
Cluster information
-------------------
Name:             pvecluster
Config Version:   15
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sun Jun  9 11:14:25 2024
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.6c1
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.123.7 (local)
0x00000003          1 192.168.123.1
 
Hmmm, it might be useful to paste the contents of /etc/pve/corosync.conf as well, just for good measure. :)

Code:
# cat /etc/pve/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: node1
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.123.6
  }
  node {
    name: node2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.123.1
  }
  node {
    name: node3
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.123.7
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pvecluster
  config_version: 15
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

node1 (nodeid 4) is the upgraded node
 
Hmmm, that config file looks completely fine to me. Nothing weird in there.

From the pvecm status output... I'm kind of thinking that maybe there was some (unmentioned) protocol change in corosync, leading to the nodes partitioning themselves like that.
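If that's what happened, it should show up as a version difference between the nodes. Comparing something roughly like this on all three nodes would tell you (pveversion -v lists the corosync and knet package versions, among others):

Code:
# run on each node and compare the output
pveversion -v | grep -E 'corosync|knet'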

When I upgraded a test cluster from 8.1 to 8.2 recently, I did all of the nodes at once and things just all magically worked.

I guess your choice is "do I upgrade the other two and hope for the best"... or not... :eek:
 
Oh. If your Proxmox boot disks are running ZFS, you could snapshot them prior to the upgrade.

That way if the upgrade just breaks everything you can roll them back to the older version.
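On a default PVE-on-ZFS install the root dataset is usually rpool/ROOT/pve-1 (worth confirming with zfs list first), so it would be roughly:

Code:
# before the upgrade (dataset name assumed, check with `zfs list`)
zfs snapshot rpool/ROOT/pve-1@pre-8.2-upgrade

# if the upgrade breaks things: roll back
# (for the root dataset this needs to be done from a rescue/live boot)
zfs rollback rpool/ROOT/pve-1@pre-8.2-upgrade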
 
Oh. If your Proxmox boot disks are running ZFS, you could snapshot them prior to the upgrade.

That way if the upgrade just breaks everything you can roll them back to the older version.
Ext4 on all of them.

(I think I did install PVE on ZFS a few years ago, prior to my current cluster, and it gave me some other issues, with booting I believe, so I have refrained from using ZFS on the boot drive since.)

So, unfortunately, upgrading is a one way ticket for me...
 
Guessing you've also tried just stopping the corosync service on node1, waiting a few seconds, then starting corosync back up again and seeing if it decides to play ball?
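That would be something like this on node1:

Code:
# stop corosync, give it a moment, start it again
systemctl stop corosync
sleep 10
systemctl start corosync

# then check whether it rejoined the cluster
pvecm status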
 
Getting a bit more experimental, you could maybe try just upgrading corosync on both of the other nodes (i.e. apt install corosync, pulling from the Proxmox repos), so that all three nodes are definitely running the same version.
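Roughly this on each of the other two nodes, assuming their package sources already point at the Proxmox repos (which they should on a working PVE install):

Code:
apt update
apt install corosync

# afterwards, confirm all three nodes report the same version
corosync -v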

Clearly I'm just spitballing though, and anything could happen. (!)
 
