[SOLVED] Cluster no longer exists after node failure

rekahsoft

Member
Dec 28, 2019
I have a 3-node hobby cluster running on Lenovo RD330s, with Ceph across all three nodes. I recently had disk failures in one of the nodes that required it to be completely replaced. I reinstalled the node and went to add it back to the cluster, but the old node was still being shown in the cluster. I tried to remove it by deleting the relevant node section from /etc/corosync/corosync.conf, removing /etc/pve/nodes/<node-name>, and restarting pve-cluster, pveproxy and corosync. This has worked for me in the past, but in this case I cannot get the old node removed from the GUI on the existing cluster nodes.
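Concretely, the removal attempt looked roughly like this (pve-0 and the editor are just examples, adjust for your own setup):

Code:
# delete the stale node { ... } block from the local corosync config
nano /etc/corosync/corosync.conf

# remove the leftover node directory from the cluster filesystem
rm -r /etc/pve/nodes/pve-0

# restart the cluster stack so the change is picked up
systemctl restart pve-cluster pveproxy corosync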

Upon further inspection, I found that the datacenter view in the GUI reports no cluster, though it still shows all nodes (even the one that failed). How do I get Proxmox to find my cluster again? How did this happen, and how can it be avoided in the future?


pvecm status reports the following from a working cluster node:
Code:
root@pve-1:~# pvecm status

Cluster information
------------------
Name:             pve
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Dec 28 10:32:39 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2.3280
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
Nodeid  Votes Name
0x00000002          1 172.16.0.21 (local)
0x00000003          1 172.16.0.22

/etc/corosync/corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.0.21
  }
  node {
    name: pve-2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.0.22
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Note: the `pve-0` node section, when removed, reappears after a reboot. It is as follows:

Code:
  node {
    name: pve-0
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.0.20
  }
 

Attachments

  • 2019-12-28-103227_1906x583_scrot.png
I found the answer to my issue on my own, but I thought I'd post it for others. I was editing the wrong corosync.conf file: I needed to edit the cluster-synced version at /etc/pve/corosync.conf. Once I did this and restarted corosync, pveproxy, and pve-cluster, the cluster appeared again in the UI and I could add nodes once again.
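For anyone running into the same thing, the fix boiled down to roughly the following (pve-0 is just the example node name; the usual advice is to also bump config_version in the totem section whenever you edit this file, since that is the copy distributed to /etc/corosync/corosync.conf on the nodes):

Code:
# edit the cluster-synced config instead of the local one
nano /etc/pve/corosync.conf
#   - remove the stale node { ... } block (pve-0 in my case)
#   - increment config_version in the totem section

# restart the stack so the change takes effect
systemctl restart corosync pveproxy pve-cluster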
 