Trouble with Cluster

strehi
New Member, Jan 7, 2021, Karlsruhe
Hi,

I'm running a Proxmox 6.3-3 cluster with 4 nodes and wanted to add a 5th node. This caused some trouble,
as the new node was not correctly added. The "new" node hosted some VMs before, but all were shut down
and deleted before joining the cluster. There is no shared storage or anything like that.

The cluster became defunct, with hung processes on all 4 existing nodes:
[492382.111107] INFO: task pveproxy worker:14569 blocked for more than 120 seconds.
[492382.111809] INFO: task pvesr:18726 blocked for more than 120 seconds.
...
The web login on the added node no longer worked (login failed); only SSH access was possible.

After shutting down the added node, the cluster became stable again. So I decided to
remove the recently added node following https://pve.proxmox.com/wiki/Cluster_Manager
(Separate A Node Without Reinstalling). The node is now up standalone and the cluster
works normally.
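
If I remember the wiki correctly, the cluster-side part of that removal boils down to something like this (the node name is a placeholder):
Code:
# on one of the remaining cluster nodes, after powering off the node to be removed
pvecm delnode <nodename>
# optionally clean up leftover configuration for that node
rm -r /etc/pve/nodes/<nodename>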

Is it possible to say what happened? And can I give it another try, or should I reinstall
the node first?

TIA
Bernd
 
Jan 11, 2021, Bern
Hi Bernd

Interesting, I tried exactly the same thing and failed the same way you did!

I also have 4 nodes in a cluster and tried to add a 5th. I used the web GUI to join the cluster, and it failed the same way you described. Management was no longer possible as long as the 5th server was on the network. Turning it off fixed the issue in the existing cluster again.

I removed the 5th server again and reinstalled it. This morning I tried again using pvecm add on the command line, and it failed again with:
Code:
Please enter superuser (root) password for '192.168.101.31': **********

Establishing API connection with host '192.168.101.31'
The authenticity of host '192.168.101.31' can't be established.
X509 SHA256 key fingerprint is DE:35:03:20:6B:58:63:6A:C1:6E:FF:8A:C5:8A:45:DF:7E:8C:13:4F:AE:50:76:A0:64:0E:55:F8:D4:57:25:5E.
Are you sure you want to continue connecting (yes/no)? Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service

TASK ERROR: can't stop pve-cluster service: Job for pve-cluster.service canceled.
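
I'm not sure yet why pve-cluster cannot be stopped; my plan is to look at the service state and the pmxcfs process next, roughly like this:
Code:
# check the service state and the recent log of the cluster filesystem
systemctl status pve-cluster
journalctl -u pve-cluster -b --no-pager | tail -n 50
# see whether pmxcfs is still running or hanging
ps aux | grep '[p]mxcfs'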

Corosync on an existing node showed:
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-01-08 10:19:42 CET; 3 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 1850 (corosync)
    Tasks: 9 (limit: 6143)
   Memory: 256.4M
   CGroup: /system.slice/corosync.service
           └─1850 /usr/sbin/corosync -f

Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11

And the newly added node said:
Code:
root@proxmox05:/home/schm# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-01-11 10:27:47 CET; 5min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 28103 (corosync)
    Tasks: 9 (limit: 9830)
   Memory: 146.3M
   CGroup: /system.slice/corosync.service
           └─28103 /usr/sbin/corosync -f

Jan 11 10:32:52 proxmox05 corosync[28103]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4
Jan 11 10:32:52 proxmox05 corosync[28103]:   [QUORUM] Members[1]: 5
Jan 11 10:32:52 proxmox05 corosync[28103]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 11 10:32:52 proxmox05 corosync[28103]:   [TOTEM ] A new membership (1.7ca9) was formed. Members joined: 1 2 3 4
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] FAILED TO RECEIVE
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] A new membership (5.7cad) was formed. Members left: 1 2 3 4
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4
Jan 11 10:32:57 proxmox05 corosync[28103]:   [QUORUM] Members[1]: 5
Jan 11 10:32:57 proxmox05 corosync[28103]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] A new membership (1.7cb1) was formed. Members joined: 1 2 3 4
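
The retransmit lists and the FAILED TO RECEIVE look to me like the nodes cannot really talk to each other over corosync. My next step would be to compare the knet link status on all nodes and to make sure nothing filters the corosync traffic (UDP port 5405, assuming the default port):
Code:
# show the local node ID and the knet link status to the other nodes
corosync-cfgtool -s
# check that corosync is listening and that no firewall drops its UDP traffic
ss -ulpn | grep corosync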


After I rebooted the 5th server, pvecm status on an existing server now shows all 5 servers, but the 5th server still thinks it is on its own:
Code:
root@proxmox01:~# pvecm status
Cluster information
-------------------
Name:             vm-cluster-zhs
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jan 11 11:25:13 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.7d13
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.101.31 (local)
0x00000002          1 192.168.101.32
0x00000003          1 192.168.101.33
0x00000004          1 192.168.101.34
0x00000005          1 192.168.101.35
Code:
root@proxmox05:/home/schm# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?
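
So the node apparently never got the cluster-wide config. If I understand it right, the cluster-wide copy lives in /etc/pve (pmxcfs) and the local copy in /etc/corosync, so I would check both next:
Code:
# the cluster-wide copy (pmxcfs) and the local copy used by corosync itself
ls -l /etc/pve/corosync.conf /etc/corosync/corosync.conf
# state of the two involved services
systemctl status corosync pve-cluster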

After the reboot, the pve-cluster log on the 5th server shows:
Code:
root@proxmox05:/home/schm# journalctl -u pve-cluster
-- Logs begin at Mon 2021-01-11 10:54:00 CET, end at Mon 2021-01-11 11:33:00 CET. --
Jan 11 10:54:09 proxmox05 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 11 10:54:10 proxmox05 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-PZl3HN/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-gHdzO4/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-fHe9Ul/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-GUqQ1C/qb): Unknown error -1 (-1)
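
The UID 5004029 in the "connection from bad user" messages does not look like a local system account to me. I would first find out which account that is and what it is running (just a guess that this is related):
Code:
# resolve the UID and see what that account is running
getent passwd 5004029
ps -u 5004029 -o pid,user,cmd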


It would be great if anybody could help fix this broken setup!

Best regards
Mathias
 
Jan 11, 2021, Bern
Hi

I just found out that the existing cluster suddenly thinks it is no longer in a cluster, but everything seems to work fine.

Datacenter -> Cluster -> "Standalone node - no cluster defined"
But the cluster node list shows 5 nodes, and the 5th is still not working.

pvesh shows that the certificate of the 5th server is missing:
Code:
root@proxmox01:~# pvesh get /cluster/config/join --output-format json-pretty
'/etc/pve/nodes/proxmox05/pve-ssl.pem' does not exist!

I also tried to add the 5th node again using -force, but that failed as well with:
Code:
root@proxmox05:/etc/apt/sources.list.d# pvecm add proxmox01.<domain> -link0 192.168.101.35 -force
Please enter superuser (root) password for 'proxmox01.<domain>': **********
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* corosync is already running, is this node already in a cluster?!

WARNING : detected error but forced to continue!

Establishing API connection with host 'proxmox01.<domain>'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
can't stop pve-cluster service: Job for pve-cluster.service canceled.
ipcc_send_rec[4] failed: Connection refused

And the log says:
Code:
root@proxmox05:/etc/apt/sources.list.d# journalctl -u pve-cluster
-- Logs begin at Mon 2021-01-11 10:54:00 CET, end at Wed 2021-01-13 18:09:01 CET. --
Jan 11 10:54:09 proxmox05 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 11 10:54:10 proxmox05 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-PZl3HN/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-gHdzO4/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-fHe9Ul/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-GUqQ1C/qb): Unknown error -1 (-1)
Jan 13 18:02:58 proxmox05 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Jan 13 18:02:58 proxmox05 pmxcfs[1759]: [main] notice: teardown filesystem
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: Killing process 1759 (pmxcfs) with signal SIGKILL.
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Jan 13 18:03:08 proxmox05 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 13 18:03:09 proxmox05 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 13 18:03:09 proxmox05 pmxcfs[23728]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/proxmox05: -1
Jan 13 18:03:09 proxmox05 pmxcfs[23728]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox05/local-lvm: -1
Jan 13 18:03:09 proxmox05 pmxcfs[23728]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox05/local: -1

I would be very thankful if somebody could help. I also read that I could regenerate the cluster certificates, but I'm a bit afraid to do that in a running cluster without knowing exactly what might happen!
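
As far as I understand, the command for that would be pvecm updatecerts, but I don't know whether it is safe (or even useful) while the 5th node isn't properly joined:
Code:
# as far as I understand, run on the affected node
pvecm updatecerts
# only if the above is not enough; recreates existing certificate files
pvecm updatecerts --force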

Thanks
Mathias
 

strehi
New Member, Jan 7, 2021, Karlsruhe
I just found out that the existing cluster suddenly thinks it is no longer in a cluster, but everything seems to work fine.

Datacenter -> Cluster -> "Standalone node - no cluster defined"
But the cluster node list shows 5 nodes, and the 5th is still not working.

Did you clean everything in the cluster and on the node according to https://pve.proxmox.com/wiki/Cluster_Manager -
especially under Separate A Node Without Reinstalling?
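
From memory, the node-side part looks roughly like this, but please double-check against the wiki before running anything:
Code:
# on the node to be separated (roughly the wiki procedure, from memory)
systemctl stop pve-cluster corosync
pmxcfs -l                     # start the cluster filesystem in local mode
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
killall pmxcfs
systemctl start pve-cluster
# and on one of the remaining cluster nodes
pvecm delnode <nodename>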

Bye
Bernd
 
Jan 11, 2021, Bern
Hi Bernd

No, I didn't clean it up.

I tried to add the server (freshly installed) twice, and it failed the same way as it did now with the force option. It seems something is blocking the pve-cluster service from restarting, but the golden question is what...

Regards
Mathias
 
Jan 11, 2021, Bern
Dear Proxmox people

We really need help with this issue! Can someone PLEASE have a look at this post and maybe give some advice on what we could try?

Btw, I'm running the following software:
Code:
root@proxmox01:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Thanks
Mathias
 
