Node refuses to reconnect to Cluster

MrSprint

New Member
Aug 30, 2021
Hi All

So I've been having a rejig of my environment. I added an extra node to my cluster (I now have 4), but one of the nodes refuses to connect to the cluster after a reboot. I've done heaps of googling, but with no joy. I want to avoid having to remove the node if possible.

My intent is to power down 2 of the nodes, as I want to use the two power-efficient machines for day-to-day use, and power up the other two only when needed, if I'm doing some work/testing in the lab. I've weighted one of the nodes with more votes than normal to prevent the whole cluster falling over when only two are operational.

At the moment even with node 2 (Mercury) powered on, the cluster can't see it.


[Screenshot: cluster view showing Mercury as offline]

As far as I can see, the hosts files on the nodes are correct.

The only difference I did find was that the corosync.conf file was out of date on Mercury. I suspect I updated the copy on node 1 (Hydrogen) while Mercury was offline (adding some extra votes to the node I want to be master when I power down the old master, node 1), and I suspect I forgot to power up Mercury before doing so. After updating it to match the others, corosync.service does start, but the node still doesn't play with the others.


Code:
root@Titanium:~# more /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Hydrogen
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.70.202
  }
  node {
    name: Mercury
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.70.203
  }
  node {
    name: Tin
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.70.204
  }
  node {
    name: Titanium
    nodeid: 4
    quorum_votes: 3
    ring0_addr: 172.16.70.205
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Elemental
  config_version: 9
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@Titanium:~#


Code:
root@Mercury:~# more /etc/corosync/corosync.conf
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: Hydrogen
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 172.16.70.202
  }
  node {
    name: Mercury
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 172.16.70.203
  }
  node {
    name: Tin
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 172.16.70.204
  }
  node {
    name: Titanium
    nodeid: 4
    quorum_votes: 3
    ring0_addr: 172.16.70.205
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: Elemental
  config_version: 9
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}

root@Mercury:~#
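For reference, a quick way to confirm the config_version lines now match across all four nodes (just a sketch; it assumes root SSH access between the nodes and that the hostnames above resolve):

```shell
# Compare the corosync config_version on every node; all four
# should report the same number (9 in this case).
for host in Hydrogen Mercury Tin Titanium; do
  echo -n "$host: "
  ssh root@"$host" grep config_version /etc/corosync/corosync.conf
done
```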

Here are some outputs which I found while looking at other people's threads with cluster issues, which hopefully provide some clues. Nothing is leaping out at me, hence seeking some expert advice :)

Code:
root@Mercury:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2022-12-04 14:54:56 GMT; 1h 43min ago
    Process: 1558 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
        CPU: 11ms

Dec 04 14:54:56 Mercury systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Dec 04 14:54:56 Mercury systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 04 14:54:56 Mercury systemd[1]: pve-cluster.service: Start request repeated too quickly.
Dec 04 14:54:56 Mercury systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 04 14:54:56 Mercury systemd[1]: Failed to start The Proxmox VE cluster filesystem.



Code:
root@Titanium:~# systemctl status pve-cluster.service
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-12-01 23:09:52 GMT; 2 days ago
    Process: 9949 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
   Main PID: 9965 (pmxcfs)
      Tasks: 7 (limit: 309308)
     Memory: 68.5M
        CPU: 5min 32.992s
     CGroup: /system.slice/pve-cluster.service
             └─9965 /usr/bin/pmxcfs

Dec 04 14:46:26 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 14:46:52 Titanium pmxcfs[9965]: [dcdb] notice: data verification successful
Dec 04 14:49:43 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 15:01:27 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 15:16:27 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 15:31:27 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 15:46:27 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 15:46:52 Titanium pmxcfs[9965]: [dcdb] notice: data verification successful
Dec 04 16:01:27 Titanium pmxcfs[9965]: [status] notice: received log
Dec 04 16:16:27 Titanium pmxcfs[9965]: [status] notice: received log
root@Titanium:~#

Code:
root@Mercury:~# journalctl -b -u pve-cluster
-- Journal begins at Mon 2021-12-06 22:00:47 GMT, ends at Sun 2022-12-04 16:29:55 GMT. --
Dec 04 14:54:54 Mercury systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 04 14:54:54 Mercury pmxcfs[1373]: fuse: mountpoint is not empty
Dec 04 14:54:54 Mercury pmxcfs[1373]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Dec 04 14:54:54 Mercury pmxcfs[1373]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:54 Mercury pmxcfs[1373]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:54 Mercury pmxcfs[1373]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:54 Mercury pmxcfs[1373]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 04 14:54:54 Mercury systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 1.
Dec 04 14:54:54 Mercury systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 04 14:54:54 Mercury systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 04 14:54:54 Mercury pmxcfs[1553]: fuse: mountpoint is not empty
Dec 04 14:54:54 Mercury pmxcfs[1553]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Dec 04 14:54:54 Mercury pmxcfs[1553]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:54 Mercury pmxcfs[1553]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:54 Mercury pmxcfs[1553]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:54 Mercury pmxcfs[1553]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 04 14:54:54 Mercury systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 2.
Dec 04 14:54:54 Mercury systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 04 14:54:54 Mercury systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 04 14:54:54 Mercury pmxcfs[1554]: fuse: mountpoint is not empty
Dec 04 14:54:54 Mercury pmxcfs[1554]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Dec 04 14:54:54 Mercury pmxcfs[1554]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:54 Mercury pmxcfs[1554]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:54 Mercury pmxcfs[1554]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:54 Mercury pmxcfs[1554]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Dec 04 14:54:54 Mercury systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 04 14:54:54 Mercury systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Dec 04 14:54:55 Mercury systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 3.
Dec 04 14:54:55 Mercury systemd[1]: Stopped The Proxmox VE cluster filesystem.
Dec 04 14:54:55 Mercury systemd[1]: Starting The Proxmox VE cluster filesystem...
Dec 04 14:54:55 Mercury pmxcfs[1557]: fuse: mountpoint is not empty
Dec 04 14:54:55 Mercury pmxcfs[1557]: fuse: if you are sure this is safe, use the 'nonempty' mount option
Dec 04 14:54:55 Mercury pmxcfs[1557]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:55 Mercury pmxcfs[1557]: [main] crit: fuse_mount error: File exists
Dec 04 14:54:55 Mercury pmxcfs[1557]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:55 Mercury pmxcfs[1557]: [main] notice: exit proxmox configuration filesystem (-1)
Dec 04 14:54:55 Mercury systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Dec 04 14:54:55 Mercury systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Dec 04 14:54:55 Mercury systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Dec 04 14:54:55 Mercury systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 4.
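The repeated `fuse: mountpoint is not empty` lines look like the telling part: pmxcfs can't mount its FUSE filesystem because /etc/pve already contains files while it is unmounted. A recovery sketch (back everything up first; the paths and steps are my assumption of the usual fix, not something verified on this box):

```shell
# Stop the service so nothing races the mount
systemctl stop pve-cluster

# Only proceed if /etc/pve is really NOT a mounted filesystem right now
if ! findmnt /etc/pve >/dev/null; then
  mkdir -p /root/etc-pve-backup
  # Move the stray files aside so the FUSE mount point is empty again
  mv /etc/pve/* /root/etc-pve-backup/
fi

systemctl start pve-cluster
systemctl status pve-cluster
```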

Please let me know what other logs/outputs I can provide. Hopefully I can avoid having to reinstall/re-add the node from scratch, and I promise I'll be more careful in future :(
 
Please post the pvecm status of Mercury and of one of the other nodes. If you edit corosync.conf while a node is offline, you might need to recover it, because the node is separated from the other nodes.
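It may also be worth checking whether corosync itself can see the other nodes from Mercury, independently of pmxcfs (a sketch; both tools ship with corosync):

```shell
# Show knet link status from this node's point of view
corosync-cfgtool -s

# Show quorum and membership as corosync sees it
corosync-quorumtool
```

If corosync shows the links up but pve-cluster still fails, the problem is likely local to pmxcfs rather than the network.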
 
Thanks for the reply jsterr

Outputs are below
Code:
root@Titanium:~# pvecm status
Cluster information
-------------------
Name:             Elemental
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Dec  5 19:25:32 2022
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000004
Ring ID:          1.3d18
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   6
Highest expected: 6
Total votes:      6
Quorum:           4
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 172.16.70.202
0x00000002          1 172.16.70.203
0x00000003          1 172.16.70.204
0x00000004          3 172.16.70.205 (local)
root@Titanium:~#

And from the orphaned node

Code:
root@Mercury:~# pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
root@Mercury:~#

Interestingly, Mercury's IP (172.16.70.203) didn't show up in Titanium's output the first time I ran the command, as I'd forgotten I powered it down overnight since it wasn't doing anything. Powered it up and reran the command, and now it shows :confused: Not sure if that's good or bad?
 
... how much work is involved in deleting the node and re-adding it? Can that be done, or once a node is deleted, can it not be re-joined? (I don't want to have to rename/re-IP it, as it'll break my IP schema/naming scheme.)

Thanks in advance
Rich
 
What happens if you edit corosync.conf on Mercury? Is there an error message showing up? For example: nano /etc/pve/corosync.conf
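Note that /etc/pve is backed by pmxcfs, which isn't running on Mercury, so editing /etc/pve/corosync.conf there will probably just fail. If the node stays separated, one approach described in the Proxmox cluster documentation (a hedged sketch; double-check the steps against the docs and keep backups before running anything) is to start pmxcfs in local mode, fix the config, and then restart the services:

```shell
systemctl stop pve-cluster corosync

# Start the cluster filesystem in local mode so /etc/pve becomes writable
pmxcfs -l

# Bring /etc/corosync/corosync.conf in line with the rest of the cluster
# (bump config_version if you change anything), then copy it into pmxcfs
cp /etc/corosync/corosync.conf /etc/pve/corosync.conf

# Stop the local-mode instance and start the services normally
killall pmxcfs
systemctl start corosync pve-cluster
```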