[SOLVED] Cluster no longer exists after node failure

rekahsoft

Member
Dec 28, 2019
I have a 3-node hobby cluster running on Lenovo RD330s with Ceph across all three nodes. I recently had disk failures in one of the nodes that required it to be completely replaced. I reinstalled the node and went to add it back to the cluster, but the old node was still shown as part of the cluster. I removed it by deleting the relevant node section from /etc/corosync/corosync.conf, removing /etc/pve/nodes/<node-name>, and restarting pve-cluster, pveproxy, and corosync. This approach has worked for me in the past, but this time I cannot get the old node removed from the GUI on the remaining cluster nodes.
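For concreteness, the manual removal looked roughly like this (a sketch of my own steps, with <node-name> standing in for the failed node; not the official removal procedure):
Code:
# delete the node { ... } block for <node-name> from the local corosync config
nano /etc/corosync/corosync.conf

# drop the failed node's directory from the cluster filesystem
rm -r /etc/pve/nodes/<node-name>

# restart the cluster stack so the change takes effect
systemctl restart pve-cluster pveproxy corosync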

Upon further inspection, I found that in the GUI the datacenter reports no cluster, though it still shows all nodes (even the one that failed). How do I get Proxmox to find my cluster again? How did this happen, and how can it be avoided in the future?


pvecm status reports the following from a working cluster node:
Code:
root@pve-1:~# pvecm status

Cluster information

------------------
Name:             pve
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Sat Dec 28 10:32:39 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          2.3280
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
Nodeid  Votes Name
0x00000002          1 172.16.0.21 (local)
0x00000003          1 172.16.0.22

/etc/corosync/corosync.conf:
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve-1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.0.21
  }
  node {
    name: pve-2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.0.22
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pve
  config_version: 6
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  secauth: on
  version: 2
}

Note: the `pve-0` node section, when removed, reappears after a reboot. It is as follows:

Code:
  node {
    name: pve-0
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.0.20
  }
 
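Comparing the local file with the cluster-synced copy is a quick way to see where the stale section keeps coming from (just a diagnostic sketch; both paths exist on a standard PVE install):
Code:
diff /etc/corosync/corosync.conf /etc/pve/corosync.conf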

Attachments

  • 2019-12-28-103227_1906x583_scrot.png
I found the answer to my issue on my own, but I thought I'd post it for others. I was editing the wrong corosync.conf file. I needed to edit the cluster-synced version at /etc/pve/corosync.conf. Once I did this and restarted corosync, pveproxy, and pve-cluster, the cluster appeared in the UI again and I could add nodes once more.
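For anyone else doing this, a minimal sketch of the edit (working on a copy first is just a precaution, and incrementing config_version is what lets the new file win over the old one):
Code:
# edit a working copy rather than the live cluster-synced file
cp /etc/pve/corosync.conf /root/corosync.conf.new
nano /root/corosync.conf.new        # remove the stale node block, bump config_version

# put the new version into the cluster filesystem, then restart services
cp /root/corosync.conf.new /etc/pve/corosync.conf
systemctl restart corosync pveproxy pve-cluster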
 
