Cluster Issues

jodoherty02

New Member
Jan 3, 2025
Hi all,

I am running a 2-node cluster, both nodes on 8.3.2. After a power outage, one of my nodes (Node 2) reports "unable to resolve name ZYX" and Node 1 starts filling the logs with "Totem retransmit" messages. I noticed that adding the hostname and IP address of Node 2 to the /etc/hosts file on Node 1 gets it working again. However, I now have the issue that the Proxmox GUI won't load, whether I use the IP address or the Nginx Proxy Manager address. On the off chance it does load, it just displays a blank GUI with a few logos and buttons and no information about my setup.
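For anyone wanting to double-check the /etc/hosts workaround, here is a minimal sketch that reads hosts-file text and reports which expected node entries are missing or point at the wrong address. The hostnames `node1` and `node2` and the `example.lan` domain are placeholders I made up; the IPs are the ones from the pvecm output further down. It only parses text and does not touch the real resolver.

```python
def parse_hosts(text):
    """Map each hostname or alias found in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry for a name
    return mapping


def missing_entries(text, expected):
    """Return expected names that are absent or mapped to a different IP."""
    found = parse_hosts(text)
    return sorted(name for name, ip in expected.items() if found.get(name) != ip)


# Hypothetical hostnames; IPs taken from the pvecm status output in this thread.
EXPECTED = {"node1": "192.168.0.20", "node2": "192.168.0.169"}

SAMPLE_HOSTS = """
127.0.0.1 localhost
192.168.0.20 node1.example.lan node1   # entry for Node 2 is missing
"""

if __name__ == "__main__":
    print(missing_entries(SAMPLE_HOSTS, EXPECTED))  # ['node2']
```

To use it on a real node, pass the contents of /etc/hosts instead of `SAMPLE_HOSTS` and put the actual node names in `EXPECTED`.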

While checking further, I found that I still get the blank GUI on Node 1, but I can log into the node GUI using the IP address.

I am able to open the Node 1 shell from the Node 2 GUI and run "echo XYZ", but any other command just hangs. Options such as Summary, DNS, etc. just keep loading and then give me "Connection error 401: permission denied - invalid PVE ticket".

Running pvecm status gives me "Quorate: Yes" on both nodes.

Only Node 1 is giving the "Totem retransmit" errors, along with this error:

Code:
Jan 02 20:11:27 proxmox corosync[4443]:   [KNET  ] link: host: 2 link: 0 is down
Jan 02 20:11:27 proxmox corosync[4443]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 02 20:11:27 proxmox corosync[4443]:   [KNET  ] host: host: 2 has no active links
Jan 02 20:11:28 proxmox corosync[4443]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jan 02 20:11:28 proxmox corosync[4443]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 02 20:11:28 proxmox corosync[4443]:   [TOTEM ] Retransmit List: c d e 15 16 17 1a 1c 1d 1e 1f 21 22 23 24
Jan 02 20:11:28 proxmox corosync[4443]:   [KNET  ] pmtud: Global data MTU changed to: 1397
The only cause I can think of is that I have a 100Mb connection to Node 2 and a 1Gb connection to Node 1. I am also using an Omada, which has an MTU of 1492, but that seems to be on the WAN side.
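To see how often the link actually drops rather than eyeballing the journal, a rough sketch like the following could tally the corosync events shown above. The regular expressions are written against the three message shapes in this excerpt only (link down, Retransmit List, pmtud MTU change); other corosync versions may word these differently, so treat the patterns as assumptions.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
LINK_DOWN = re.compile(r"\[KNET\s*\] link: host: (\d+) link: (\d+) is down")
RETRANSMIT = re.compile(r"\[TOTEM\s*\] Retransmit List: (.+)$")
MTU_CHANGE = re.compile(r"\[KNET\s*\] pmtud: Global data MTU changed to: (\d+)")


def summarize(log_text):
    """Count link-down events per (host, link), retransmit entries, and MTU values."""
    link_down = Counter()
    retransmit_entries = 0
    mtu_values = []
    for line in log_text.splitlines():
        match = LINK_DOWN.search(line)
        if match:
            link_down[(int(match.group(1)), int(match.group(2)))] += 1
            continue
        match = RETRANSMIT.search(line)
        if match:
            # Each whitespace-separated token is one message ID being re-requested.
            retransmit_entries += len(match.group(1).split())
            continue
        match = MTU_CHANGE.search(line)
        if match:
            mtu_values.append(int(match.group(1)))
    return {
        "link_down": dict(link_down),
        "retransmit_entries": retransmit_entries,
        "mtu_values": mtu_values,
    }


SAMPLE_LOG = """\
Jan 02 20:11:27 proxmox corosync[4443]:   [KNET  ] link: host: 2 link: 0 is down
Jan 02 20:11:28 proxmox corosync[4443]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Jan 02 20:11:28 proxmox corosync[4443]:   [TOTEM ] Retransmit List: c d e 15 16 17 1a 1c 1d 1e 1f 21 22 23 24
Jan 02 20:11:28 proxmox corosync[4443]:   [KNET  ] pmtud: Global data MTU changed to: 1397
"""

if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Feeding it a longer journal export would show whether the link to host 2 flaps repeatedly or only dropped once around the outage.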

Code:
Cluster information
Name:             Project
Config Version:   4
Transport:        knet
Secure auth:      on
Quorum information
Date:             Thu Jan  2 22:55:20 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1.1a0
Quorate:          Yes
Votequorum information
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate
Membership information
    Nodeid      Votes Name
0x00000001          1 192.168.0.20
0x00000002          1 192.168.0.169 (local)


root@proxmox:~# pvecm status
Name:             Project
Config Version:   4
Transport:        knet
Secure auth:      on
Quorum information
Date:             Thu Jan  2 22:55:29 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000001
Ring ID:          1.1a0
Quorate:          Yes
Votequorum information
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2
Flags:            Quorate
Membership information
    Nodeid      Votes Name
0x00000001          1 192.168.0.20 (local)
0x00000002          1 192.168.0.169

It is a two-node setup with override.
After running those commands, both nodes seem to hang, with Node 1 saying:
got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?
Node 2 seems to be hanging and not working.
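For context on why a two-node cluster is so sensitive to one node dropping out, here is a small sketch of the majority arithmetic that matches the "Expected votes: 2" and "Quorum: 2" lines in the output above. It assumes plain majority voting with one vote per node and no special options (no two_node flag, no QDevice); the function names are mine, not part of any Proxmox or corosync API.

```python
def quorum_threshold(expected_votes):
    """Votes needed for a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True when the votes currently present reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    print(quorum_threshold(2))  # 2
    print(is_quorate(2, 2))     # True
    print(is_quorate(1, 2))     # False
    print(quorum_threshold(3))  # 2
```

With two expected votes the threshold is also two, so a single unreachable node leaves the survivor without quorum unless expected votes are overridden. A third vote keeps the threshold at two, so one node can drop out and the rest stay quorate.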

Pinging between the nodes using FQDN and IP works. Everything is inside one LAN with no delays or high latency.
I can still SSH into both nodes without any problem.

It seems like the CPU usage and memory usage charts are updating and are current on the Summary page of Node 2, which I find strange, because stats such as CPU type, Kernel, and Boot are greyed out with "Connection Refused (595)".

Any help would be appreciated, whether that is making sure this won't happen again, mitigating steps, or configuration changes.

TIA!
 
See if Node 2 can ping Node 1 or the internet gateway from the server console. If it cannot, check the NIC configs and see whether they were reset. Compare the configs between the two nodes. Mine were cleared once, before I had a UPS.
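If you copy the network config text from both nodes (on Proxmox that is normally /etc/network/interfaces), a quick diff makes a reset easier to spot. Below is a minimal sketch using Python's difflib. The sample contents, including the /24 mask, the gateway address and the bridge name, are invented for illustration and are not taken from the poster's machines.

```python
import difflib


def diff_configs(text_a, text_b, name_a="node1", name_b="node2"):
    """Return unified-diff lines between two config texts, ignoring trailing spaces."""
    lines_a = [line.rstrip() for line in text_a.splitlines()]
    lines_b = [line.rstrip() for line in text_b.splitlines()]
    return list(
        difflib.unified_diff(lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm="")
    )


# Hypothetical config snippets, for illustration only.
NODE1_CONFIG = """auto vmbr0
iface vmbr0 inet static
    address 192.168.0.20/24
    gateway 192.168.0.1
"""

NODE2_CONFIG = """auto vmbr0
iface vmbr0 inet dhcp
"""

if __name__ == "__main__":
    for line in diff_configs(NODE1_CONFIG, NODE2_CONFIG):
        print(line)
```

The address lines will always differ between nodes, so the interesting part is anything else that changed, such as a static stanza that has turned into DHCP or a missing gateway.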
 
I once had the same thing as you. I started all over again after a few days of fiddling. No more clusters for me! It's too abstract. You can't see what's happening. And once it breaks down, you find out it's all so complex that starting over is the quickest way.

Today I have an OUT OF MEMORY error on boot after updates. I can't find the cause. I was sure I could fix it with a reinstall. Unfortunately, that went wrong after the latest update too, so I know an update is the cause.

I think I will start looking for something else. I can't afford to be offline for a couple of days every once in a while, and with Proxmox this happens too often.
 
I don't mind fiddling with files and trying to get them to sync back up, but I think a "Remove Node from Cluster" option that goes through the process of deleting all the necessary files and resetting back to defaults would be so much handier than going in and deleting files using the CLI.
 
