[Solved] New Cluster not working (seems to not be communicating)

robert.kuropkat

New Member
Mar 31, 2022
Part of this discussion appears in the following thread: "Login failed. Please try again". I'll try to gather the relevant information here.

I created three new Proxmox VE servers (on older hardware, IBM x3650 M2). Each server seemed to work just fine in standalone mode. The server names are pve01, pve02 and pve03.

On pve02, I uploaded several ISOs, mainly just to test how the storage works, etc. Otherwise, I did nothing on any of the three beyond the initial install.

Using the WebUI on pve01, I created a cluster. As soon as it was created, I used the join information to add pve02 and pve03. During the process (I believe during the cluster creation and the node joins) I got the following error in the WebUI: "Permission denied - Invalid PVE ticket (401)." Each step ultimately appeared to work, as the nodes showed up in the cluster list; however, while pve01 had a green checkmark, pve02 and pve03 had grey question marks next to their names.

At the time I did this, I had three browser windows open, which started to misfire as I clicked around. As I learned from the thread mentioned above, this was most likely because the local logins were now being ignored in favor of the cluster-pushed authentication, which, if my assumption in the title is correct, was not getting pushed to the other nodes. Thus, the only node I can log in on is pve01.

I tried all the things mentioned in the above thread, and the main problem indicator is that the /etc/pve directory does not respond properly and seems to hang on many (but not all) operations. In fact, /etc/pve/nodes/pve01 is completely non-responsive.
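A minimal way to probe this without the shell hanging indefinitely (the timeout value is arbitrary):

Code:
# confirm the cluster filesystem is actually mounted
findmnt /etc/pve
# bound the wait in case the fuse mount hangs
timeout 5 ls -la /etc/pve/nodes/pve01 || echo "listing timed out or failed"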

One additional snafu I noticed was that pve02 had an invalid entry for itself in /etc/hosts. I corrected it, but only after adding the node to the cluster, so there may be a bad IP in a config file somewhere that I can't find.
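For reference, the corrected entry looks something like this (the domain is just a placeholder, and 10.1.120.204 stands in for whatever address pve02 actually uses on the cluster network):

Code:
# /etc/hosts on pve02 -- the hostname must resolve to the node's real IP, not 127.0.1.1
10.1.120.204 pve02.example.com pve02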

In theory, there are almost no firewall restrictions on this particular network, so ports 5404 and 5405 should be open. However, I was not able to do anything that validated they were working.
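A couple of checks that should at least show corosync is listening and how it sees its links (knet defaults to UDP 5405):

Code:
# UDP sockets held open by corosync
ss -ulpn | grep corosync
# per-link status as corosync sees it
corosync-cfgtool -s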

Currently, each server has two NICs. Not knowing better, I foolishly used the "main" one when setting up the Cluster and Nodes.

corosync and pve-cluster appear to be working when I query their status via systemctl. pvecm status also looks good and shows the cluster as quorate.
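For reference, the checks I mean are roughly:

Code:
systemctl status corosync pve-cluster
pvecm status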

These are development machines and are still being set up, so I am willing to be told I just did something horrifically wrong and need to start over. However, I would like this thread to give me enough guidance that, if I do start over, I do it better. I would also very much like to pick up some troubleshooting and/or diagnostic skills so I know a little more about what to test, how to test it, and what to look for. Right now, the logs are just miles and miles of stuff that "looks good to me..."

Thanks much!
 
Anything visible in the logs of pve-cluster or corosync?

Code:
journalctl -b -u pve-cluster -u corosync

Which version are you on?

Code:
pveversion -v

Please run the commands on all nodes.
 
Results of the status commands are below. pveversion was run on all three nodes and the outputs match. The other two commands were run on pve01 only.

I've also got it borked now such that I cannot log in to the web GUI on any of them. I think this happened when I tested running pmxcfs -d -f, and now I can only log in if I stop pve-cluster and run pmxcfs instead. (Actually, I just tried this and it seems I cannot log in to the WebUI at all now...)
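If pmxcfs was started by hand like that, the manual instance has to go away before the regular service can take over again; a rough recovery sequence (assuming nothing else is holding /etc/pve open) would be:

Code:
# stop any manually started pmxcfs instance
pkill pmxcfs
# hand control back to the systemd unit
systemctl start pve-cluster
systemctl status pve-cluster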

pvecm status also seems odd now. I swear it previously showed the node names under Nodeid, not hex numbers.
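Side note: the hex values are the corosync node IDs; a name-based membership listing should still be available with:

Code:
pvecm nodes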

Finally, I tried looking through syslog and journalctl for corosync and pve-cluster, and nothing jumped out at me, but without a better idea of what to look for, it's pretty easy to miss something in 15k lines of log files...
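One way to narrow those 15k lines down to the usual corosync trouble spots is a rough keyword filter, e.g.:

Code:
journalctl -b -u corosync -u pve-cluster | grep -iE 'error|fail|retransmit|link down|token'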

proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-7 (running version: 7.1-7/df5740ad)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-2
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-1
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

corosync.service - Corosync Cluster Engine
Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-04-04 12:11:37 EDT; 46min ago
Docs: man:corosync
man:corosync.conf
man:corosync_overview
Main PID: 980 (corosync)
Tasks: 9 (limit: 33655)
Memory: 152.6M
CPU: 42.372s
CGroup: /system.slice/corosync.service
└─980 /usr/sbin/corosync -f

Apr 04 12:57:22 pve01 corosync[980]: [KNET ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Apr 04 12:57:42 pve01 corosync[980]: [KNET ] pmtud: Starting PMTUD for host: 3 link: 0
Apr 04 12:57:42 pve01 corosync[980]: [KNET ] udp: detected kernel MTU: 1500
Apr 04 12:57:42 pve01 corosync[980]: [KNET ] pmtud: PMTUD completed for host: 3 link: 0 current link mtu: 1397
Apr 04 12:57:52 pve01 corosync[980]: [KNET ] pmtud: Starting PMTUD for host: 2 link: 0
Apr 04 12:57:52 pve01 corosync[980]: [KNET ] udp: detected kernel MTU: 1500
Apr 04 12:57:52 pve01 corosync[980]: [KNET ] pmtud: PMTUD completed for host: 2 link: 0 current link mtu: 1397
Apr 04 12:58:12 pve01 corosync[980]: [KNET ] pmtud: Starting PMTUD for host: 3 link: 0
Apr 04 12:58:12 pve01 corosync[980]: [KNET ] udp: detected kernel MTU: 1500
Apr 04 12:58:12 pve01 corosync[980]: [KNET ] pmtud: PMTUD completed for host: 3 link: 0 current link mtu: 1397

pve-cluster.service - The Proxmox VE cluster filesystem
Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2022-04-04 12:11:35 EDT; 47min ago
Process: 959 ExecStart=/usr/bin/pmxcfs (code=exited, status=0/SUCCESS)
Main PID: 975 (pmxcfs)
Tasks: 9 (limit: 33655)
Memory: 54.7M
CPU: 1.671s
CGroup: /system.slice/pve-cluster.service
└─975 /usr/bin/pmxcfs

Apr 04 12:53:57 pve01 pmxcfs[975]: [dcdb] notice: cpg_send_message retried 1 times
Apr 04 12:53:57 pve01 pmxcfs[975]: [status] notice: cpg_send_message retried 1 times
Apr 04 12:53:57 pve01 pmxcfs[975]: [status] notice: node lost quorum
Apr 04 12:53:57 pve01 pmxcfs[975]: [status] notice: node has quorum
Apr 04 12:53:57 pve01 pmxcfs[975]: [dcdb] notice: members: 1/975, 2/954, 3/928
Apr 04 12:53:57 pve01 pmxcfs[975]: [dcdb] notice: starting data syncronisation
Apr 04 12:53:57 pve01 pmxcfs[975]: [status] notice: members: 1/975, 2/954, 3/928
Apr 04 12:53:57 pve01 pmxcfs[975]: [status] notice: starting data syncronisation
Apr 04 12:53:57 pve01 pmxcfs[975]: [dcdb] notice: received sync request (epoch 1/975/00000005)
Apr 04 12:53:57 pve01 pmxcfs[975]: [status] notice: received sync request (epoch 1/975/00000005)

Cluster information
-------------------
Name: Staging
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Mon Apr 4 13:14:02 2022
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.1b13
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 10.1.120.202 (local)
0x00000002 1 10.1.120.204
0x00000003 1 10.1.120.206
 
Please post the actual logs from all three nodes.
 
Sorry, I'm not sure if I am posting the correct thing. Did you mean the output of journalctl? I saw several other log files on the system, but none of the ones I found seemed of interest except for /var/log/pveam.log, which shows the machines constantly trying to update and failing because the servers do not have internet access.
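If it helps, the journal output can be dumped to a plain-text file per node for attaching, along these lines:

Code:
journalctl -b -u corosync -u pve-cluster > /tmp/$(hostname)-cluster.log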
 

Attachments

Okay, the corosync logs indicate some network issues (links being flaky, retransmit lists). What does the network between the nodes look like? Are the physical links dedicated to corosync traffic?
 
@fabian Thanks for your help, and sorry for taking so long to get back to you. I scrounged up another switch and set it up. It is only 100 Mbit, but the documentation here: https://pve.proxmox.com/wiki/Separate_Cluster_Network suggests that should be sufficient.

Steps taken:

  1. Set up a second, isolated switch. The switch is currently mostly unconfigured and is little more than a dumb switch, but only the corosync NICs are connected to it.
  2. Re-configured one NIC on each server to a new network (changed from 10.1.120.x to 10.1.123.x), which was probably only necessary for my own sanity.
  3. Stopped pve-cluster and restarted the cluster filesystem in local mode with pmxcfs -l (took me a while to figure that out).
  4. Modified the /etc/pve/corosync.conf file, mostly in accordance with the documentation mentioned above, with two exceptions: I used IP addresses instead of host names for the ring0_addr attributes, and I kept linknumber: 0 in the totem.interface section instead of the bindnetaddr and ringnumber attributes the wiki suggests. The file did not work with host names or with bindnetaddr/ringnumber. (A sketch of the resulting layout is shown after this list.)
  5. Rebooted the cluster master and got what seemed to be good status and log messages.
  6. Powered on the second node.
  7. For the second node, basically repeated steps 2-5 above and manually updated the /etc/pve/corosync.conf file.
  8. Rebooted the second node, got what seemed to be good status, and saw it join the cluster. Since I only had three nodes, this gave me quorum.
  9. Now that I had quorum, I was finally able to log in to the web interface.
  10. Powered on the third node, repeating steps 2-5 for this node as well.
  11. Rebooted the third node, got what seemed to be good status, and saw it join the cluster.
  12. At this point, I was also able to successfully join a fourth node I had created.
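A sketch of the layout described in step 4, with placeholder addresses and an illustrative config_version (not the literal file):

Code:
nodelist {
  node {
    name: pve01
    nodeid: 1
    quorum_votes: 1
    # IP address on the new corosync network instead of the host name
    ring0_addr: 10.1.123.1
  }
  # pve02 and pve03 entries follow the same pattern
}

totem {
  cluster_name: Staging
  # must be incremented on every edit so the change propagates
  config_version: 4
  interface {
    # kept linknumber instead of the bindnetaddr/ringnumber attributes
    linknumber: 0
  }
  secauth: on
  version: 2
  # remaining totem options, plus the quorum and logging sections, left as generated
}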
I think right now, everything is running. All nodes show up in the cluster with a green checkmark. I'm able to see a Summary for each node, except the one I just added. Not sure what is going on there.

I have more questions, but I'll post in new threads as needed.

Thanks again!
 

P.S. Figured out why node4 was misbehaving. Clocks were out of sync. Time to get NTP going on this network...
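For anyone finding this later, a quick way to get the clocks in sync, assuming the nodes can reach an NTP server from this isolated network:

Code:
# check the current clock and sync state on each node
timedatectl status
# enable the built-in NTP client...
timedatectl set-ntp true
# ...or install chrony and point it at a reachable server
apt install chrony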
 
