Trouble with Cluster

strehi

Hi,

I'm running a Proxmox cluster (6.3-3) with 4 nodes and wanted to add a 5th node. This caused some trouble,
as the new node was not added correctly. The "new" node had hosted some VMs before, but all were shut down
and deleted before joining the cluster. No shared storage or anything like that.

The cluster got defunct with hanging processes on all 4 existing nodes:
[492382.111107] INFO: task pveproxy worker:14569 blocked for more than 120 seconds.
[492382.111809] INFO: task pvesr:18726 blocked for more than 120 seconds.
...
Web login on the added node did not work any more (login failed); only SSH access was possible.

After shutting down the added node, things in the cluster got stable again. So I decided to
remove the recently added node following https://pve.proxmox.com/wiki/Cluster_Manager
(Separate A Node Without Reinstalling). Now the node is up standalone and the cluster
works normally.
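
For anyone who finds this later: the separation boils down to roughly the following (just my rough notes from the wiki, so double-check there before running anything; "nodename" is a placeholder for the node being removed):
Code:
# on the node being separated: stop the cluster services
systemctl stop pve-cluster corosync
# start pmxcfs in local mode so /etc/pve stays accessible without quorum
pmxcfs -l
# remove the corosync configuration
rm /etc/pve/corosync.conf
rm -r /etc/corosync/*
# stop the local-mode instance and bring the service back up normally
killall pmxcfs
systemctl start pve-cluster
# finally, on one of the remaining cluster nodes, drop the node from the cluster
pvecm delnode nodename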

Is it possible to say what happened? And can I give it another try, or should I first reinstall
the node?

TIA
Bernd
 
Hi Bernd

Interesting, I tried exactly the same thing and failed the same way you did!

I also have 4 nodes in a cluster and tried to add a 5th. I used the web GUI to join the cluster. It failed the same way as you described: managing was no longer possible as long as the 5th server was on the network. Turning it off fixed the issue in the existing cluster again.

I removed the 5th server again and reinstalled it. This morning I tried it again using the command-line pvecm add, and it failed again with:
Code:
Please enter superuser (root) password for '192.168.101.31': **********

Establishing API connection with host '192.168.101.31'
The authenticity of host '192.168.101.31' can't be established.
X509 SHA256 key fingerprint is DE:35:03:20:6B:58:63:6A:C1:6E:FF:8A:C5:8A:45:DF:7E:8C:13:4F:AE:50:76:A0:64:0E:55:F8:D4:57:25:5E.
Are you sure you want to continue connecting (yes/no)? Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service

TASK ERROR: can't stop pve-cluster service: Job for pve-cluster.service canceled.

Corosync on an existing node showed:
Code:
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-01-08 10:19:42 CET; 3 days ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 1850 (corosync)
    Tasks: 9 (limit: 6143)
   Memory: 256.4M
   CGroup: /system.slice/corosync.service
           └─1850 /usr/sbin/corosync -f

Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11
Jan 11 10:30:47 proxmox01 corosync[1850]:   [TOTEM ] Retransmit List: e f 10 11

And the newly added node said:
Code:
root@proxmox05:/home/schm# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-01-11 10:27:47 CET; 5min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
Main PID: 28103 (corosync)
    Tasks: 9 (limit: 9830)
   Memory: 146.3M
   CGroup: /system.slice/corosync.service
           └─28103 /usr/sbin/corosync -f

Jan 11 10:32:52 proxmox05 corosync[28103]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4
Jan 11 10:32:52 proxmox05 corosync[28103]:   [QUORUM] Members[1]: 5
Jan 11 10:32:52 proxmox05 corosync[28103]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 11 10:32:52 proxmox05 corosync[28103]:   [TOTEM ] A new membership (1.7ca9) was formed. Members joined: 1 2 3 4
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] FAILED TO RECEIVE
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] A new membership (5.7cad) was formed. Members left: 1 2 3 4
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3 4
Jan 11 10:32:57 proxmox05 corosync[28103]:   [QUORUM] Members[1]: 5
Jan 11 10:32:57 proxmox05 corosync[28103]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jan 11 10:32:57 proxmox05 corosync[28103]:   [TOTEM ] A new membership (1.7cb1) was formed. Members joined: 1 2 3 4


After I rebooted the 5th server, pvecm status on an existing server now shows all 5 servers, but the 5th server still thinks it is alone:
Code:
root@proxmox01:~# pvecm status
Cluster information
-------------------
Name:             vm-cluster-zhs
Config Version:   12
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Mon Jan 11 11:25:13 2021
Quorum provider:  corosync_votequorum
Nodes:            5
Node ID:          0x00000001
Ring ID:          1.7d13
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   5
Highest expected: 5
Total votes:      5
Quorum:           3
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.101.31 (local)
0x00000002          1 192.168.101.32
0x00000003          1 192.168.101.33
0x00000004          1 192.168.101.34
0x00000005          1 192.168.101.35
Code:
root@proxmox05:/home/schm# pvecm status
Error: Corosync config '/etc/pve/corosync.conf' does not exist - is this node part of a cluster?
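
For reference, one way to see that mismatch side by side is to compare the cluster-wide config with the node's local copy (just a sketch; the grep fields are merely the ones that seemed interesting):
Code:
# on an existing cluster node: the cluster-wide corosync config distributed via pmxcfs
grep -E 'config_version|name|ring0_addr' /etc/pve/corosync.conf
# on proxmox05: only the local copy under /etc/corosync exists, /etc/pve/corosync.conf is missing
grep -E 'config_version|name|ring0_addr' /etc/corosync/corosync.conf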

After the reboot, the pve-cluster log on the 5th server shows:
Code:
root@proxmox05:/home/schm# journalctl -u pve-cluster
-- Logs begin at Mon 2021-01-11 10:54:00 CET, end at Mon 2021-01-11 11:33:00 CET. --
Jan 11 10:54:09 proxmox05 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 11 10:54:10 proxmox05 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-PZl3HN/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-gHdzO4/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-fHe9Ul/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-GUqQ1C/qb): Unknown error -1 (-1)
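
The "connection from bad user 5004029" messages look unrelated to the join itself - as far as I understand, pmxcfs only accepts IPC connections from root and www-data, so some process running under another account seems to be poking it. Something like this should identify the culprit (just a sketch):
Code:
# which account is UID 5004029, and what is it running?
getent passwd 5004029
ps -u 5004029 -o pid,user,comm,args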


It would be great if anybody could help fix the broken setup!

Best regards
Mathias
 
Hi

I just found out that the existing cluster suddenly thinks it is no longer in a cluster, but everything seems to work fine.

Datacenter -> Cluster -> "Standalone node - no cluster defined"
But the cluster-nodes list shows 5 nodes, of which the 5th is still not working.

pvesh shows that the certificate of the 5th server is missing:
Code:
root@proxmox01:~# pvesh get /cluster/config/join --output-format json-pretty
'/etc/pve/nodes/proxmox05/pve-ssl.pem' does not exist!

I also tried to add the 5th node again using -force, but that failed as well with:
Code:
root@proxmox05:/etc/apt/sources.list.d# pvecm add proxmox01.<domain> -link0 192.168.101.35 -force
Please enter superuser (root) password for 'proxmox01.<domain>': **********
detected the following error(s):
* authentication key '/etc/corosync/authkey' already exists
* corosync is already running, is this node already in a cluster?!

WARNING : detected error but forced to continue!

Establishing API connection with host 'proxmox01.<domain>'
Login succeeded.
check cluster join API version
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
can't stop pve-cluster service: Job for pve-cluster.service canceled.
ipcc_send_rec[4] failed: Connection refused

And the log says:
Code:
root@proxmox05:/etc/apt/sources.list.d# journalctl -u pve-cluster
-- Logs begin at Mon 2021-01-11 10:54:00 CET, end at Wed 2021-01-13 18:09:01 CET. --
Jan 11 10:54:09 proxmox05 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 11 10:54:10 proxmox05 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-PZl3HN/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-gHdzO4/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-fHe9Ul/qb): Unknown error -1 (-1)
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [ipcs] crit: connection from bad user 5004029! - rejected
Jan 11 10:58:45 proxmox05 pmxcfs[1759]: [libqb] error: Error in connection setup (/dev/shm/qb-1759-2715-25-GUqQ1C/qb): Unknown error -1 (-1)
Jan 13 18:02:58 proxmox05 systemd[1]: Stopping The Proxmox VE cluster filesystem...
Jan 13 18:02:58 proxmox05 pmxcfs[1759]: [main] notice: teardown filesystem
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: State 'stop-sigterm' timed out. Killing.
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: Killing process 1759 (pmxcfs) with signal SIGKILL.
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=9/KILL
Jan 13 18:03:08 proxmox05 systemd[1]: pve-cluster.service: Failed with result 'timeout'.
Jan 13 18:03:08 proxmox05 systemd[1]: Starting The Proxmox VE cluster filesystem...
Jan 13 18:03:09 proxmox05 systemd[1]: Started The Proxmox VE cluster filesystem.
Jan 13 18:03:09 proxmox05 pmxcfs[23728]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-node/proxmox05: -1
Jan 13 18:03:09 proxmox05 pmxcfs[23728]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox05/local-lvm: -1
Jan 13 18:03:09 proxmox05 pmxcfs[23728]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/proxmox05/local: -1

I would be very thankful if somebody could help. I also read that I could regenerate the cluster certificates, but I'm a bit afraid to do that in a running cluster without knowing exactly what might happen!
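
If regenerating the certificates turns out to be the way to go, my understanding is that it would be something along these lines (untested on my side, so treat it as a sketch):
Code:
# on the affected node: recreate the node certificate and refresh the cluster-wide known_hosts
# (--force regenerates the files even if they already exist)
pvecm updatecerts --force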

Thanks
Mathias
 
I just found out that the existing cluster suddenly thinks it is no longer in a cluster, but everything seems to work fine.

Datacenter -> Cluster -> "Standalone node - no cluster defined"
But the cluster-nodes list shows 5 nodes, of which the 5th is still not working.

Did you clean everything in the cluster and on the node according to https://pve.proxmox.com/wiki/Cluster_Manager -
especially under Separate A Node Without Reinstalling?

Bye
Bernd
 
Hi Bernd

No, I didn't clean it up.

I tried to add the server (freshly installed) two times, and it failed the same way as it did now with the force command. It seems something blocks the pve-cluster service from restarting, but the golden question is what...

Regards
Mathias
 
Dear Proxmox people

We really need help with this issue! Could someone PLEASE have a look at this post and maybe give some advice on what we could try?

Btw, I'm running the following software:
Code:
root@proxmox01:~# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Thanks
Mathias
 
Thanks for your reply.

I've been using Proxmox for years and have multiple clusters set up, but I've never had issues like this. I was really surprised.

Since nobody seems to know what's wrong here, I'm now going to separate one host from the existing cluster and build a new cluster with the new server. I will reinstall Proxmox on both servers to have a clean playground. If that works, I will then migrate all VMs and also the old hosts to the new cluster.
 
Have you made a first SSH connection from the new host to each cluster host, and from each cluster host to the new one, before starting to join the cluster?

Interesting thing - on the node to add, I have one wrong SSH public key recorded for one host of the existing cluster.
But the SSH host keys never changed on the cluster hosts.
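
For reference, this is roughly how such a mismatch can be spotted (a sketch; <cluster-node> stands for one of the existing cluster hosts):
Code:
# on the node to add: what is recorded locally for that cluster host
ssh-keygen -F <cluster-node> -f /root/.ssh/known_hosts
# what the cluster host actually presents right now
ssh-keyscan -t rsa <cluster-node>
# a first interactive login in both directions also surfaces key mismatches immediately
ssh root@<cluster-node> 'true'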

What happens during a join on the node to add and on the existing cluster nodes?

TIA
Bernd
 
Hi Bernd

I could finally fix my issue, and it was a stupid configuration error! I haven't re-set up the cluster yet, but I'm going to do so now, since my Corosync runs on the same interface as the storage and, as I read, those should be separated.

My problem was that the existing servers were using an MTU of 9000 for the storage/corosync interface, but the new server still used the default 1500. Interestingly, the interface configuration on the old servers still showed 1500, but the command line said 9000. I don't remember where or how I set that... Now I have set it in the GUI, and /etc/network/interfaces now also contains the mtu value.
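
In case someone wants to check for the same thing, this is roughly how the mismatch shows up (vmbr1 is just a stand-in for whichever bridge carries storage/corosync here, and 192.168.101.31 is one of my existing nodes):
Code:
# the running MTU can differ from what /etc/network/interfaces says, so check the live value
ip link show vmbr1 | grep mtu
# test whether jumbo frames really make it through end to end:
# 8972 = 9000 minus 20 bytes IP header and 8 bytes ICMP header; -M do sets Don't Fragment
ping -M do -s 8972 -c 3 192.168.101.31
# to make it persistent, the bridge stanza in /etc/network/interfaces needs an "mtu 9000" line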

Even though I doubt it, I hope this will also solve your issue.

Best regards
Mathias
 
Hi Mathias,

thanks for your update. MTU sizes are not a problem here. After getting no reply from Proxmox, I completely wiped the former host-to-add and set it up from scratch. I preserved the existing SSH host key - joining the cluster was no problem this time. Whatever the reason was, or whatever happens during a join, it's working now. But as far as I can see, there are other people with similar problems here in the forum.

Bye
Bernd
 
Hi Bernd

I could finally fix my issue, and it was a stupid configuration error! I haven't re-set up the cluster yet, but I'm going to do so now, since my Corosync runs on the same interface as the storage and, as I read, those should be separated.

My problem was that the existing servers were using an MTU of 9000 for the storage/corosync interface, but the new server still used the default 1500. Interestingly, the interface configuration on the old servers still showed 1500, but the command line said 9000. I don't remember where or how I set that... Now I have set it in the GUI, and /etc/network/interfaces now also contains the mtu value.

Even though I doubt it, I hope this will also solve your issue.

Best regards
Mathias
I just resolved this problem for myself - before reading this message of yours - but I thank you for it just the same. :)

grr...
 
