Hi all,
I am running a 2-node cluster, both nodes on 8.3.2. After a power outage, one of my nodes (Node 2) reports "unable to resolve name ZYX" and Node 1 starts filling the logs with "Totem retransmit" messages. I noticed that after adding the hostname and IP address of Node 2 to the /etc/hosts file on Node 1, it begins to work. However, I now have the issue that the Proxmox GUI won't load, either via the IP address or via the Nginx Proxy Manager address. On the off chance it does load, it just displays a blank GUI with a few logos and buttons and no information about my setup.
While checking, I have now found that I still get the blank GUI on Node 1, but I can log into the Node 2 GUI with its IP address.
From the Node 2 GUI I can open the shell on Node 1 and run "echo XYZ", but any other command just hangs. Options such as Summary, DNS, etc. just keep loading and then give me "Connection error 401: permission denied - invalid PVE ticket".
Running pvecm status gives me "Quorate: Yes" on both nodes.
Only Node 1 is giving the "Totem retransmit" errors, along with the errors in the log excerpt below.
The only possible cause I can think of is that I have a 100Mb connection to Node 2 and a 1Gb connection to Node 1, and that I am also using an Omada, which has an MTU of 1492, although that seems to be on the WAN side.
Code:
Jan 02 20:11:27 proxmox corosync[4443]: [KNET ] link: host: 2 link: 0 is down
Jan 02 20:11:27 proxmox corosync[4443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 02 20:11:27 proxmox corosync[4443]: [KNET ] host: host: 2 has no active links
Jan 02 20:11:28 proxmox corosync[4443]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jan 02 20:11:28 proxmox corosync[4443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 02 20:11:28 proxmox corosync[4443]: [TOTEM ] Retransmit List: c d e 15 16 17 1a 1c 1d 1e 1f 21 22 23 24
Jan 02 20:11:28 proxmox corosync[4443]: [KNET ] pmtud: Global data MTU changed to: 1397
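To make it easier to see how often the link drops and how many packets are being retransmitted, an excerpt like the one above can be summarised with a small script. This is only a rough sketch: the function name `summarize` and the regular expressions are my own assumptions based on the lines shown here, not an official corosync log format.

```python
import re

# The log excerpt from this post, copied verbatim.
LOG = """\
Jan 02 20:11:27 proxmox corosync[4443]: [KNET ] link: host: 2 link: 0 is down
Jan 02 20:11:27 proxmox corosync[4443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 02 20:11:27 proxmox corosync[4443]: [KNET ] host: host: 2 has no active links
Jan 02 20:11:28 proxmox corosync[4443]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Jan 02 20:11:28 proxmox corosync[4443]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Jan 02 20:11:28 proxmox corosync[4443]: [TOTEM ] Retransmit List: c d e 15 16 17 1a 1c 1d 1e 1f 21 22 23 24
Jan 02 20:11:28 proxmox corosync[4443]: [KNET ] pmtud: Global data MTU changed to: 1397
"""


def summarize(log_text):
    """Count link-down events and retransmit entries, and note the last reported MTU."""
    link_down = 0
    retransmit_entries = 0
    last_mtu = None
    for line in log_text.splitlines():
        # A KNET line ending in "link: <n> is down" marks a dropped link.
        if re.search(r"link: \d+ is down", line):
            link_down += 1
        # Each whitespace-separated token after "Retransmit List:" is one entry.
        match = re.search(r"Retransmit List: (.*)$", line)
        if match:
            retransmit_entries += len(match.group(1).split())
        # KNET path-MTU discovery reports the data MTU it settled on.
        match = re.search(r"Global data MTU changed to: (\d+)", line)
        if match:
            last_mtu = int(match.group(1))
    return {
        "link_down": link_down,
        "retransmit_entries": retransmit_entries,
        "last_mtu": last_mtu,
    }


if __name__ == "__main__":
    print(summarize(LOG))
```

Running the same function over a longer journal export should show whether the link flaps are a one-off after the outage or keep recurring.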
Code:
Cluster information
Name: Project
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
Date: Thu Jan 2 22:55:20 2025
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000002
Ring ID: 1.1a0
Quorate: Yes
Votequorum information
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
Nodeid Votes Name
0x00000001 1 192.168.0.20
0x00000002 1 192.168.0.169 (local)
root@proxmox:~# pvecm status
Name: Project
Config Version: 4
Transport: knet
Secure auth: on
Quorum information
Date: Thu Jan 2 22:55:29 2025
Quorum provider: corosync_votequorum
Nodes: 2
Node ID: 0x00000001
Ring ID: 1.1a0
Quorate: Yes
Votequorum information
Expected votes: 2
Highest expected: 2
Total votes: 2
Quorum: 2
Flags: Quorate
Membership information
Nodeid Votes Name
0x00000001 1 192.168.0.20 (local)
0x00000002 1 192.168.0.169
It is a two-node setup with override.
After running those commands, both nodes seem to hang, with Node 1 saying:
got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?
Node 2 seems to be hanging and not working.
Pinging between the nodes using both FQDN and IP works. Everything is inside a LAN with no timing delays or high latency.
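Since the original symptom was a name resolution failure, it may also be worth confirming what each node actually resolves rather than relying on ping alone. A minimal sketch using only the Python standard library; the helper name `resolve` is my own, and the names in `NODE_NAMES` are placeholders to be replaced with the real node hostnames and FQDNs.

```python
import socket


def resolve(name):
    """Return the sorted unique addresses a name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # Resolution failed (no /etc/hosts entry and no DNS answer).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Placeholder names: replace with the actual hostnames/FQDNs of both nodes.
    NODE_NAMES = ["localhost"]
    for node_name in NODE_NAMES:
        addresses = resolve(node_name)
        print(node_name, "->", addresses if addresses else "DOES NOT RESOLVE")
```

Running this on both nodes and comparing the output against the addresses in the pvecm membership list should show whether either node still resolves the other one incorrectly.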
I can still SSH into both nodes no problem.
It seems like the CPU usage and memory usage charts on the Summary page of Node 2 are updating and current, which I find strange, but stats such as CPU type, Kernel and Boot are greyed out with "Connection Refused (595)".
Any help with making sure this won't happen again, or any mitigating steps and/or configuration changes, would be appreciated.
TIA!