[SOLVED] Cluster Node cannot Join

Jospeh Huber

Renowned Member
Apr 18, 2016
Hi,

we have a three-node Proxmox cluster on Proxmox 6.
We added new network cards and disks to one system.
After installing the hardware, the network was unreachable on the first boot:
the name of the network card had changed from "enp3s0" to "enp4s0".
We have one bridge device for the VMs and one network device for Ceph.
I changed the names in /etc/network/interfaces (see the sketch below) ... the network is now reachable on all interfaces as it should be.
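A minimal sketch of the kind of change, assuming the renamed port backs the bridge; the names and addresses here are placeholders, not our exact config:

Code:
# /etc/network/interfaces (sketch with placeholder names/addresses)
auto vmbr0
iface vmbr0 inet static
        address XXX.YYY.ZZZ.211/24
        gateway XXX.YYY.ZZZ.1          # hypothetical gateway
        bridge-ports enp4s0            # was enp3s0 before the new card
        bridge-stp off
        bridge-fd 0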
But the web interface on the changed node does not work, and the cluster is not working either.
The curious thing is that the other members see the defective node, but the defective node does not see the other nodes:

Output on the defective node:
Code:
 pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
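As far as I know, pvecm talks to the local pmxcfs (pve-cluster) daemon over a unix socket, so the "Connection refused" lines should mean that daemon is not running on this node; that is easy to verify:

Code:
>systemctl status pve-cluster    # pmxcfs is managed by this unit
>ps aux | grep [p]mxcfs          # is the process running at all?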

Output on the other nodes:
Code:
Cluster information
-------------------
Name:             XXXXXXXX
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jul 27 18:17:21 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.70
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 XXX.YYY.ZZZ.121 (local)
0x00000002          1 XXX.YYY.ZZZ.119
0x00000004          1 XXX.YYY.ZZZ.211


The Output of "
systemctl status pve-cluster pveproxy pvedaemon


Code:
pve-cluster.service
   Loaded: bad-setting (Reason: Unit pve-cluster.service has a bad unit file setting.)
   Active: inactive (dead)

Jul 27 17:57:56 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Assignment outside of section. Ignoring.
Jul 27 17:57:56 vmhost3 systemd[1]: pve-cluster.service: Service has no ExecStart=, ExecStop=, or SuccessAction=. Refusing.
Jul 27 17:57:56 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Assignment outside of section. Ignoring.
Jul 27 17:57:56 vmhost3 systemd[1]: pve-cluster.service: Service has no ExecStart=, ExecStop=, or SuccessAction=. Refusing.
Jul 27 17:57:56 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Assignment outside of section. Ignoring.
Jul 27 17:57:56 vmhost3 systemd[1]: pve-cluster.service: Service has no ExecStart=, ExecStop=, or SuccessAction=. Refusing.
Jul 27 18:00:31 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Missing '='.
Jul 27 18:06:10 vmhost3 systemd[1]: pve-cluster.service: Cannot add dependency job, ignoring: Unit pve-cluster.service has a bad unit file setting.

● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-27 17:58:03 CEST; 24min ago
  Process: 1408 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=111)
  Process: 1413 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 1417 (pveproxy)
    Tasks: 4 (limit: 4915)
   Memory: 132.8M
   CGroup: /system.slice/pveproxy.service
           ├─1417 pveproxy
           ├─3746 pveproxy worker
           ├─3747 pveproxy worker
           └─3748 pveproxy worker

Jul 27 18:22:17 vmhost3 pveproxy[3746]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jul 27 18:22:17 vmhost3 pveproxy[3744]: worker exit
Jul 27 18:22:17 vmhost3 pveproxy[3745]: worker exit
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3744 finished
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3745 finished
Jul 27 18:22:17 vmhost3 pveproxy[1417]: starting 2 worker(s)
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3747 started
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3748 started
Jul 27 18:22:17 vmhost3 pveproxy[3747]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jul 27 18:22:17 vmhost3 pveproxy[3748]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.

● pvedaemon.service - PVE API Daemon
   Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-27 17:57:57 CEST; 24min ago
Main PID: 1403 (pvedaemon)
    Tasks: 4 (limit: 4915)
   Memory: 133.9M
   CGroup: /system.slice/pvedaemon.service
           ├─1403 pvedaemon
           ├─1404 pvedaemon worker
           ├─1405 pvedaemon worker
           └─1406 pvedaemon worker

Jul 27 17:57:53 vmhost3 systemd[1]: Starting PVE API Daemon...
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: starting server
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: starting 3 worker(s)
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: worker 1404 started
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: worker 1405 started
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: worker 1406 started
Jul 27 17:57:57 vmhost3 systemd[1]: Started PVE API Daemon.
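Because of the "bad unit file setting" message I wanted to see which file systemd actually loads for pve-cluster; these standard commands print every fragment with its path and do a syntax check (no PVE-specific tooling assumed):

Code:
>systemctl cat pve-cluster
>systemd-analyze verify /etc/systemd/system/pve-cluster.service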
It seems that the proxy certificates are not present on the defective node, but a
"pvecm updatecerts --force" does not work because the cluster is not available to the node.
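If I understand it correctly, /etc/pve is the FUSE mount provided by pmxcfs, so without the cluster filesystem there are no certificates to update in the first place; a quick check:

Code:
>mount | grep /etc/pve     # the fuse mount only exists while pmxcfs runs
>ls -l /etc/pve/local/     # should contain pve-ssl.pem and pve-ssl.key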


The outout of "systemctl status corosync" is:
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-27 17:57:53 CEST; 34min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1019 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 131.2M
   CGroup: /system.slice/corosync.service
           └─1019 /usr/sbin/corosync -f

Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 27 18:16:13 vmhost3 corosync[1019]:   [KNET  ] link: host: 1 link: 0 is down
Jul 27 18:16:13 vmhost3 corosync[1019]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jul 27 18:16:13 vmhost3 corosync[1019]:   [TOTEM ] Retransmit List: 13f2
Jul 27 18:16:25 vmhost3 corosync[1019]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 27 18:16:25 vmhost3 corosync[1019]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

Any ideas?
 
Thx in advance - here it is:
Code:
>ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
    link/ether 00:22:4d:7b:dd:48 brd ff:ff:ff:ff:ff:ff
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 24:5e:be:50:c8:96 brd ff:ff:ff:ff:ff:ff
    inet 10.0.99.84/24 scope global enp1s0
       valid_lft forever preferred_lft forever
    inet6 fe80::265e:beff:fe50:c896/64 scope link
       valid_lft forever preferred_lft forever
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:22:4d:7b:dd:48 brd ff:ff:ff:ff:ff:ff
    inet XXX.YYY.ZZZ.211/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:4dff:fe7b:dd48/64 scope link
       valid_lft forever preferred_lft forever
eno1 is backing the bridge ... multicast is activated

And here is my corosync.conf ... we are using two networks for corosync so that HA is more fault-tolerant.
The config has been working for years.

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmhost2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: vmhost2
    ring1_addr: vmhost2pm
  }
  node {
    name: vmhost3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: vmhost3
    ring1_addr: vmhost3pm
  }
  node {
    name: vmhost5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: vmhost5
    ring1_addr: vmhost5pm
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycompany-proxmox
  config_version: 6
  interface {
    bindnetaddr: XXX.YYY.ZZZ.121
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.0.99.0
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
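For completeness, the ring status can also be checked with the standard corosync tools (nothing PVE-specific):

Code:
>corosync-cfgtool -s       # link status of both rings as corosync sees them
>corosync-quorumtool -s    # local quorum view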
 
Is there any way to increase the logging and check some log files?
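My naive idea would be the debug switch in the logging section above, edited in the local copy since /etc/pve is down on this node ... roughly:

Code:
# /etc/corosync/corosync.conf (local copy), logging section:
#   logging {
#     debug: on
#     to_syslog: yes
#   }
>systemctl restart corosync
>journalctl -u corosync -u pve-cluster -b    # everything since last boot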

I am thinking about removing the node from the cluster and reinstalling it, but I don't know whether this is possible in a stale cluster state.
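From what I have read, the removal would have to be run from one of the remaining, quorate nodes (never on the node being removed), roughly like this:

Code:
>pvecm delnode vmhost3    # run on vmhost2 or vmhost5, not on vmhost3 itself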
 
Hi again,

I think it is a systemd problem now.
If I start "/usr/bin/pmxcfs" from the command line, "pvecm status" works!

Code:
>systemctl | grep pve-cluster
(no output)
>systemctl status pve-cluster
● pve-cluster.service
   Loaded: bad-setting (Reason: Unit pve-cluster.service has a bad unit file setting.)
   Active: inactive (dead)
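To compare the manual start with the unit-managed one, pmxcfs can also be run in the foreground with debug output:

Code:
>/usr/bin/pmxcfs -f -d    # foreground + debug, stop with Ctrl-C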

The unit file is there, and it looks the same as the unit file on the other nodes ("/lib/systemd/system/pve-cluster.service").

I have also tried disabling and re-enabling the service:
Code:
>systemctl disable pve-cluster
Removed /etc/systemd/system/multi-user.target.wants/pve-cluster.service.
>systemctl enable /lib/systemd/system/pve-cluster.service
Created symlink /etc/systemd/system/multi-user.target.wants/pve-cluster.service → /lib/systemd/system/pve-cluster.service.
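After touching unit files, a daemon-reload is needed before systemd re-evaluates the bad-setting state:

Code:
>systemctl daemon-reload
>systemctl status pve-cluster    # still bad-setting?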


Any ideas now?
 
WTF ... there was a corrupt file located at
/etc/systemd/system/pve-cluster.service
After deleting it and reinstalling the service, everything is working again.
:D
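For anyone hitting the same thing: the fix boiled down to removing the corrupt override so systemd falls back to the packaged unit in /lib; roughly what I did (the reinstall step may not even be strictly necessary):

Code:
>rm /etc/systemd/system/pve-cluster.service   # corrupt copy shadowing the /lib unit
>systemctl daemon-reload
>apt install --reinstall pve-cluster          # restores the packaged files
>systemctl start pve-cluster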
 