[SOLVED] Cluster Node cannot Join

Jospeh Huber

Renowned Member
Apr 18, 2016
Hi,

we have a three-node Proxmox cluster on Proxmox 6.
We added new network cards and disks to one system.
After installing the hardware, the network was unreachable on the first boot:
the name of the network card had changed from "enp3s0" to "enp4s0".
We have one bridge device for the VMs and one network device for Ceph.
I changed the names in /etc/network/interfaces (see the sketch below) ... the network is now reachable on all interfaces as it should be.
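A minimal sketch of the kind of change, assuming the renamed port backs the bridge; the names and addresses here are placeholders, not our exact config:

Code:
# /etc/network/interfaces (sketch with placeholder names/addresses)
auto vmbr0
iface vmbr0 inet static
        address XXX.YYY.ZZZ.211/24
        gateway XXX.YYY.ZZZ.1          # hypothetical gateway
        bridge-ports enp4s0            # was enp3s0 before the new card
        bridge-stp off
        bridge-fd 0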
But the web interface on the changed node does not work, and the cluster is not working either.
The curious thing is that the other members see the defective node, but the defective node does not see the other nodes:

Output on the defective node:
Code:
 pvecm status
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused
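As far as I know, pvecm talks to the local pmxcfs (pve-cluster) daemon over a unix socket, so the "Connection refused" lines should mean that daemon is not running on this node; that is easy to verify:

Code:
>systemctl status pve-cluster    # pmxcfs is managed by this unit
>ps aux | grep [p]mxcfs          # is the process running at all?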

Output on the other nodes:
Code:
Cluster information
-------------------
Name:             XXXXXXXX
Config Version:   6
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jul 27 18:17:21 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.70
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 XXX.YYY.ZZZ.121 (local)
0x00000002          1 XXX.YYY.ZZZ.119
0x00000004          1 XXX.YYY.ZZZ.211


The Output of "
systemctl status pve-cluster pveproxy pvedaemon


Code:
pve-cluster.service
   Loaded: bad-setting (Reason: Unit pve-cluster.service has a bad unit file setting.)
   Active: inactive (dead)

Jul 27 17:57:56 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Assignment outside of section. Ignoring.
Jul 27 17:57:56 vmhost3 systemd[1]: pve-cluster.service: Service has no ExecStart=, ExecStop=, or SuccessAction=. Refusing.
Jul 27 17:57:56 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Assignment outside of section. Ignoring.
Jul 27 17:57:56 vmhost3 systemd[1]: pve-cluster.service: Service has no ExecStart=, ExecStop=, or SuccessAction=. Refusing.
Jul 27 17:57:56 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Assignment outside of section. Ignoring.
Jul 27 17:57:56 vmhost3 systemd[1]: pve-cluster.service: Service has no ExecStart=, ExecStop=, or SuccessAction=. Refusing.
Jul 27 18:00:31 vmhost3 systemd[1]: /etc/systemd/system/pve-cluster.service:1: Missing '='.
Jul 27 18:06:10 vmhost3 systemd[1]: pve-cluster.service: Cannot add dependency job, ignoring: Unit pve-cluster.service has a bad unit file setting.

● pveproxy.service - PVE API Proxy Server
   Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-27 17:58:03 CEST; 24min ago
  Process: 1408 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, status=111)
  Process: 1413 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
Main PID: 1417 (pveproxy)
    Tasks: 4 (limit: 4915)
   Memory: 132.8M
   CGroup: /system.slice/pveproxy.service
           ├─1417 pveproxy
           ├─3746 pveproxy worker
           ├─3747 pveproxy worker
           └─3748 pveproxy worker

Jul 27 18:22:17 vmhost3 pveproxy[3746]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jul 27 18:22:17 vmhost3 pveproxy[3744]: worker exit
Jul 27 18:22:17 vmhost3 pveproxy[3745]: worker exit
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3744 finished
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3745 finished
Jul 27 18:22:17 vmhost3 pveproxy[1417]: starting 2 worker(s)
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3747 started
Jul 27 18:22:17 vmhost3 pveproxy[1417]: worker 3748 started
Jul 27 18:22:17 vmhost3 pveproxy[3747]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.
Jul 27 18:22:17 vmhost3 pveproxy[3748]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1727.

● pvedaemon.service - PVE API Daemon
   Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-27 17:57:57 CEST; 24min ago
Main PID: 1403 (pvedaemon)
    Tasks: 4 (limit: 4915)
   Memory: 133.9M
   CGroup: /system.slice/pvedaemon.service
           ├─1403 pvedaemon
           ├─1404 pvedaemon worker
           ├─1405 pvedaemon worker
           └─1406 pvedaemon worker

Jul 27 17:57:53 vmhost3 systemd[1]: Starting PVE API Daemon...
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: starting server
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: starting 3 worker(s)
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: worker 1404 started
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: worker 1405 started
Jul 27 17:57:56 vmhost3 pvedaemon[1403]: worker 1406 started
Jul 27 17:57:57 vmhost3 systemd[1]: Started PVE API Daemon.
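Because of the "bad unit file setting" message I wanted to see which file systemd actually loads for pve-cluster; these standard commands print every fragment with its path and do a syntax check (no PVE-specific tooling assumed):

Code:
>systemctl cat pve-cluster
>systemd-analyze verify /etc/systemd/system/pve-cluster.service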
It seems that the proxy certificates are not present on the defective node, but a
"pvecm updatecerts --force" does not work because the cluster is not available to the node.
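If I understand it correctly, /etc/pve is the FUSE mount provided by pmxcfs, so without the cluster filesystem there are no certificates to update in the first place; a quick check:

Code:
>mount | grep /etc/pve     # the fuse mount only exists while pmxcfs runs
>ls -l /etc/pve/local/     # should contain pve-ssl.pem and pve-ssl.key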


The outout of "systemctl status corosync" is:
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-07-27 17:57:53 CEST; 34min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 1019 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 131.2M
   CGroup: /system.slice/corosync.service
           └─1019 /usr/sbin/corosync -f

Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 1 from 469 to 1397
Jul 27 17:58:09 vmhost3 corosync[1019]:   [KNET  ] pmtud: Global data MTU changed to: 1397
Jul 27 18:16:13 vmhost3 corosync[1019]:   [KNET  ] link: host: 1 link: 0 is down
Jul 27 18:16:13 vmhost3 corosync[1019]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jul 27 18:16:13 vmhost3 corosync[1019]:   [TOTEM ] Retransmit List: 13f2
Jul 27 18:16:25 vmhost3 corosync[1019]:   [KNET  ] rx: host: 1 link: 0 is up
Jul 27 18:16:25 vmhost3 corosync[1019]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

Any ideas?
 
Thx in advance - here it is:
Code:
>ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
    link/ether 00:22:4d:7b:dd:48 brd ff:ff:ff:ff:ff:ff
3: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 24:5e:be:50:c8:96 brd ff:ff:ff:ff:ff:ff
    inet 10.0.99.84/24 scope global enp1s0
       valid_lft forever preferred_lft forever
    inet6 fe80::265e:beff:fe50:c896/64 scope link
       valid_lft forever preferred_lft forever
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:22:4d:7b:dd:48 brd ff:ff:ff:ff:ff:ff
    inet XXX.YYY.ZZZ.211/24 scope global vmbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:4dff:fe7b:dd48/64 scope link
       valid_lft forever preferred_lft forever
eno1 is backing the bridge ... multicast is activated

And here is my corosync.conf ... we are using two networks for corosync so that HA is more fault-tolerant.
The config has been working for years.

Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: vmhost2
    nodeid: 1
    quorum_votes: 1
    ring0_addr: vmhost2
    ring1_addr: vmhost2pm
  }
  node {
    name: vmhost3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: vmhost3
    ring1_addr: vmhost3pm
  }
  node {
    name: vmhost5
    nodeid: 2
    quorum_votes: 1
    ring0_addr: vmhost5
    ring1_addr: vmhost5pm
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: mycompany-proxmox
  config_version: 6
  interface {
    bindnetaddr: XXX.YYY.ZZZ.121
    ringnumber: 0
  }
  interface {
    bindnetaddr: 10.0.99.0
    ringnumber: 1
  }
  ip_version: ipv4
  rrp_mode: passive
  secauth: on
  version: 2
}
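For completeness, the ring status can also be checked with the standard corosync tools (nothing PVE-specific):

Code:
>corosync-cfgtool -s       # link status of both rings as corosync sees them
>corosync-quorumtool -s    # local quorum view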
 
Is there any way to increase the logging and check some log files?
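My naive idea would be the debug switch in the logging section above, edited in the local copy since /etc/pve is down on this node ... roughly:

Code:
# /etc/corosync/corosync.conf (local copy), logging section:
#   logging {
#     debug: on
#     to_syslog: yes
#   }
>systemctl restart corosync
>journalctl -u corosync -u pve-cluster -b    # everything since last boot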

I am thinking about removing the node from the cluster and reinstalling it, but I don't know whether this is possible in a stale cluster state.
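From what I have read, the removal would have to be run from one of the remaining, quorate nodes (never on the node being removed), roughly like this:

Code:
>pvecm delnode vmhost3    # run on vmhost2 or vmhost5, not on vmhost3 itself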
 
Hi again,

I think it is a systemd problem now.
If I start "/usr/bin/pmxcfs" from the command line, "pvecm status" works!

Code:
>systemctl | grep pve-cluster
(no output)
>systemctl status pve-cluster
● pve-cluster.service
   Loaded: bad-setting (Reason: Unit pve-cluster.service has a bad unit file setting.)
   Active: inactive (dead)
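To compare the manual start with the unit-managed one, pmxcfs can also be run in the foreground with debug output:

Code:
>/usr/bin/pmxcfs -f -d    # foreground + debug, stop with Ctrl-C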

The unit file is there, and it looks the same as the unit file on the other nodes ("/lib/systemd/system/pve-cluster.service").

I have also tried disabling and re-enabling the service:
Code:
>systemctl disable pve-cluster
Removed /etc/systemd/system/multi-user.target.wants/pve-cluster.service.
>systemctl enable /lib/systemd/system/pve-cluster.service
Created symlink /etc/systemd/system/multi-user.target.wants/pve-cluster.service → /lib/systemd/system/pve-cluster.service.
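After touching unit files, a daemon-reload is needed before systemd re-evaluates the bad-setting state:

Code:
>systemctl daemon-reload
>systemctl status pve-cluster    # still bad-setting?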


Any ideas now?
 
WTF ... there was a corrupt file located at
/etc/systemd/system/pve-cluster.service
After deleting it and reinstalling the service, everything is working again.
:D
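For anyone hitting the same thing: the fix boiled down to removing the corrupt override so systemd falls back to the packaged unit in /lib; roughly what I did (the reinstall step may not even be strictly necessary):

Code:
>rm /etc/systemd/system/pve-cluster.service   # corrupt copy shadowing the /lib unit
>systemctl daemon-reload
>apt install --reinstall pve-cluster          # restores the packaged files
>systemctl start pve-cluster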
 