I have a cluster of six servers running Proxmox VE 7.4.16.
Three of these servers run Ceph and provide the storage for the VMs hosted by the other three. The Ceph version is 17.2.6.
This configuration has been running without problems for half a year or more.
Today, when I opened the GUI, there were no nodes visible.
I can ping and ssh into each of the nodes without problems, and each node
can ping and ssh into the others.
If I run pvecm status on the different nodes, I get different
results.
For example, ceph node 0 gives me:
root@ceph-00:~# pvecm status
Cluster information
-------------------
Name: Sanford
Config Version: 10
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Sep 29 10:10:10 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000001
Ring ID: 1.646c
Quorate: No
Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.29.134.20 (local)
but node kvm-01 gives me:
root@skvm-01:/etc/ssh# pvecm status
Cluster information
-------------------
Name: Sanford
Config Version: 10
Transport: knet
Secure auth: on
Quorum information
------------------
Date: Fri Sep 29 10:10:47 2023
Quorum provider: corosync_votequorum
Nodes: 1
Node ID: 0x00000002
Ring ID: 1.647c
Quorate: No
Votequorum information
----------------------
Expected votes: 6
Highest expected: 6
Total votes: 1
Quorum: 4 Activity blocked
Flags:
Membership information
----------------------
Nodeid Votes Name
0x00000002 1 172.29.134.11 (local)
As far as I can determine, none of the nodes are in agreement. It is
also worth noting that pvecm status takes quite a long time to run, on
the order of 45 seconds to a minute.
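I haven't dug into the cluster services themselves yet. For what it's worth, my plan was to start with something along these lines on each node (standard corosync / pve-cluster checks, nothing exotic) to see whether the services are actually running and whether the logs show any link or retransmit trouble:

systemctl status corosync pve-cluster
journalctl -u corosync --since "1 hour ago"
journalctl -u pve-cluster --since "1 hour ago"

If there are better places to look first, I'm happy to be pointed there.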
Is there a way to bring these wayward nodes back together?
It appears that the VMs are still running, and I'd prefer not to
reboot nodes willy-nilly unless I have to.
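In case it changes the advice: from what I've read elsewhere, the usual suggestion for a split like this is to restart the cluster services on one node at a time and see whether it rejoins, roughly:

systemctl restart corosync
systemctl restart pve-cluster
pvecm status

My understanding is that this only touches corosync and pmxcfs on that one node, so the running VMs should be unaffected, but I'm not certain of that, which is why I'm asking before doing anything.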