Troubleshooting: Node Web GUI no longer accessible, even after standard steps from other threads

beama · Jan 14, 2024

I had a 2-node Proxmox cluster in my homelab (which I have since learned is sub-optimal), and one of the nodes stopped being accessible through the web interface. From the master node I could still see the other node listed, but I got 500 errors when trying to view its summary or start any of its VMs (red indicator on the node, question marks on the VMs). The error was: hostname lookup 'proxmox' failed - failed to get address info for: proxmox: Name or service not known (500). Yes, that node was creatively named "proxmox".

This occurred after moving the node physically from where I had it sitting to a proper rack, and its network interface is now plugged into a new switch. I can't understand why that might contribute, as the network interface didn't change, and it is reserving the static IP as expected.

I tried several things in an attempt to solve the problem and otherwise simplify the troubleshooting.

Confirmed the IP is correct and the network interface didn't change with ip a and ip r
Confirmed I am accessing the correct URL (with https and the correct port): https://192.168.0.136:8006
Confirmed the host is correctly serving the page (running curl -v https://192.168.0.136:8006 from the host and getting the HTML back)
Split the malfunctioning host from the cluster by following the steps on this page (section 5.5.1 specifically): https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
1. As part of this step, set the master node up as standalone
2. Set both nodes to expect only 1 vote
Checked the status of a few services (all up/active, no obvious errors) and restarted them several times after trying small tweaks from other troubleshooting threads: systemctl status pveproxy, systemctl status pve-cluster.service, systemctl status pvedaemon.service
Checked journalctl with some arguments I don't remember to look for any possibly related errors (none noticed, but I am by no means well versed in journalctl logs)
Edit: Forgot a step. Also checked the firewall rules, and they seemed to match the working master node perfectly.
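
For anyone retracing the service checks in the list above, here is a minimal sketch. The unit names are the ones mentioned in this post; the helper name report_units is made up for illustration, and the journalctl line in the comments is only one plausible invocation, not necessarily the one used above.

```shell
#!/bin/sh
# Sketch only: report the state of the units named in this thread.
# report_units is a hypothetical helper name, not a Proxmox tool.
report_units() {
    for unit in "$@"; do
        # `systemctl is-active` prints one word such as "active" or "inactive"
        # and exits non-zero for anything that is not active.
        state=$(systemctl is-active "$unit" 2>/dev/null) || true
        printf '%s: %s\n' "$unit" "${state:-unknown}"
    done
}

# Usage on the node:
#   report_units pveproxy pvedaemon pve-cluster
# Log lines from the current boot for the same units (one possible invocation):
#   journalctl -b -u pveproxy -u pvedaemon -u pve-cluster --no-pager
```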

At one point I was able to ssh into the stricken node, but at this exact moment I cannot access either the web console or ssh in. I am using ipmi to access the console and attempt changes.

This is not a production node, just has some templates and vms I never backed up (laziness/lack of importance) that I would like to recover, but if I have somehow severely broken things I can just accept that and start fresh.

Any advice for a relative newb to proxmox?

spirit · Jan 16, 2024

The proxmox node needs to be able to resolve the name "proxmox" to 192.168.0.136.

You should have an entry for it in /etc/hosts.
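
As a rough sketch of how this could be checked from the node's console: getent hosts asks the system resolver (which normally consults /etc/hosts), while the helper below reads a hosts-format file directly. The function name lookup_hosts_entry is an illustrative assumption, not an existing tool.

```shell
#!/bin/sh
# Sketch only: print the address that a hosts-format file maps a name to.
# lookup_hosts_entry is a hypothetical helper, not part of Proxmox.
lookup_hosts_entry() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }                      # drop comments
        {
            for (i = 2; i <= NF; i++)           # field 1 is the address
                if ($i == name) { print $1; exit }
        }
    ' "$file"
}

# Usage on the node (the answer wanted in this thread is 192.168.0.136):
#   lookup_hosts_entry proxmox
#   getent hosts proxmox      # same question, asked of the system resolver
```

If the helper prints nothing, the file has no entry for that name.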

beama · Jan 16, 2024

Edit 2: Tried to address this in a new reply. ~~Edit: I'm not sure how to handle the DNS portion. Any pointers on what to check to see if that is functioning?~~

Checking the hosts file, this is what I see. 136 is the misbehaving node, 135 is the working one. It looks like they are both the same.

Any other spots I should check out real quick? As mentioned, not super familiar with the inner workings of proxmox.

beama · Jan 16, 2024

Realized later what you were looking for. It seems to be resolving the proxmox name as expected.

beama · Jan 16, 2024

Posting here for completeness: it occurred to me that I don't know how the web GUI is served, so I figured I would compare the services running on the two nodes. I noticed that the healthy node had 33 services, while the afflicted one had 31.

The missing services were iscsid and rsyslog. Checking the state of each, the rsyslog service doesn't exist on the afflicted node (sounds sub-optimal), while iscsid just wasn't started. Starting iscsid didn't change the ability to access the web interface (which makes sense, since a quick Google search indicates it's related to network storage). It does explain the one or two log entries I saw where that node couldn't access my network share, though.
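
The comparison described above could be sketched as follows, assuming the counts were of running services (the post does not say which listing was used). Both helper names and both file names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# Print the names of running services on the current node, one per line.
list_running_services() {
    systemctl list-units --type=service --state=running --no-legend --plain |
        awk '{ print $1 }'
}

# Print lines present in the first list but absent from the second.
missing_services() {
    healthy_list=$1
    afflicted_list=$2
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -Fxv -f "$afflicted_list" "$healthy_list" | sort
}

# Usage: save a list on each node, copy both files to one machine, then compare:
#   list_running_services > healthy.txt      # on the working node
#   list_running_services > afflicted.txt    # on the broken node
#   missing_services healthy.txt afflicted.txt
```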

beama · Jan 16, 2024

It seems I have jumped the shark a bit. At one point I was able to ping and ssh into the node, but now all pings are timing out. I imagine any troubleshooting efforts at this point are going to be impossible to validate till I figure out what I broke on the network. Really strange that the node is still claiming the appropriate IP, but I can't reach it at all.

I am going to start from the basics - different cable, removing any extraneous switches between the node and router, checking interfaces again. Will return with results.

beama · Jan 18, 2024

Unsure if there is a way to close the thread, but I am officially throwing in the towel on this install. Losing a couple of things is inconvenient, but at this point I am just going to start fresh.

Thanks @spirit for the suggestions!
