Troubleshooting: Node Web GUI no longer accessible, even after standard steps from other threads

beama

New Member
Jan 14, 2024
I had a 2-node Proxmox cluster in my homelab (which I have since learned is sub-optimal), and one of the nodes stopped being accessible through the web interface. When accessing the master node, I could still see the listing for the other node (red indicator on the node, question marks on the VMs), but trying to view the node summary or start any of its VMs returned a 500 error: hostname lookup 'proxmox' failed - failed to get address info for: proxmox: Name or service not known (500). Yes, that node was creatively named "proxmox".

This occurred after moving the node physically from where I had it sitting to a proper rack, and its network interface is now plugged into a new switch. I can't understand why that might contribute, as the network interface didn't change, and it is reserving the static IP as expected.

I tried several things in an attempt to solve the problem and otherwise simplify the troubleshooting.

  1. Confirmed the IP is correct and the network interface didn't change with ip a, ip r
  2. Confirmed I am accessing the correct URL (with https and the correct port): https://192.168.0.136:8006
  3. Confirmed the host is correctly serving the page (running curl -v https://192.168.0.136:8006 from the host and getting the HTML back)
  4. Split the malfunctioning host from the cluster by following the steps on this page (section 5.5.1 specifically): https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_remove_a_cluster_node
    1. As part of this step - fixed the master node to be standalone
    2. set both nodes to only expect 1 vote
  5. Checked the statuses of a few services (all up/active, and no obvious errors) with systemctl status pveproxy, systemctl status pve-cluster.service, and systemctl status pvedaemon.service, and restarted those services several times after trying small tweaks from other troubleshooting threads
  6. Checked journalctl with some arguments I don't remember to look for any possibly related errors (none noticed, but I am by no means well versed in journalctl logs)
  7. Edit: Forgot a step. Also checked the firewall rules, and they seemed to match the working master node perfectly.
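
For anyone repeating steps 5 and 6, here is a minimal sketch of how the service and journal checks could be scripted. It assumes the three service names mentioned above and only uses systemctl is-active and journalctl filters (-b for the current boot, -p err for error priority, -u per unit). The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: the service list is taken from the steps above.
PVE_SERVICES="pve-cluster pvedaemon pveproxy"

# Print "<service>: <state>" for each service.
# Returns non-zero if any service is not reported as "active".
check_pve_services() {
    failed=0
    for svc in $PVE_SERVICES; do
        state=$(systemctl is-active "$svc" 2>/dev/null)
        [ -n "$state" ] || state="unknown"
        echo "$svc: $state"
        [ "$state" = "active" ] || failed=1
    done
    return $failed
}

# Show error-level journal entries from the current boot for those services.
recent_pve_errors() {
    journalctl -b -p err --no-pager -u pve-cluster -u pvedaemon -u pveproxy
}

# Usage on the node (as root):
#   check_pve_services || recent_pve_errors
```

Filtering the journal by unit and priority keeps the output short enough to read over an IPMI console, which may help if the logs are unfamiliar.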

At one point I was able to ssh into the stricken node, but at this exact moment I can neither reach the web console nor ssh in. I am using IPMI to access the console and attempt changes.

This is not a production node, just has some templates and vms I never backed up (laziness/lack of importance) that I would like to recover, but if I have somehow severely broken things I can just accept that and start fresh.

Any advice for a relative newb to proxmox?
 
Edit 2: Tried to address this in a new reply. Edit: I'm not sure how to handle the DNS portion. Any pointers on what to check to see whether it is functioning?

Checking the hosts file, this is what I see. 136 is the misbehaving node, 135 is the working one. It looks like they are both the same.
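
Since the 500 error complains about looking up the name "proxmox", one thing that can be checked mechanically is whether the hosts file maps the node's name to its real address rather than to nothing or to a loopback address. The sketch below is only an illustration under that assumption: hosts_ip and check_node_name are hypothetical helper names, and the default path /etc/hosts is the standard location.

```shell
#!/bin/sh
# hosts_ip NAME [FILE]: print the first IP that FILE maps NAME to.
hosts_ip() {
    awk -v n="$1" '
        { sub(/#.*/, "") }   # strip comments before matching
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# check_node_name NAME EXPECTED_IP [FILE]
# Reports whether NAME maps to EXPECTED_IP; returns non-zero on any mismatch.
check_node_name() {
    ip=$(hosts_ip "$1" "${3:-/etc/hosts}")
    if [ -z "$ip" ]; then
        echo "$1: no entry"
        return 1
    fi
    case $ip in
        127.*) echo "$1: maps to loopback $ip"; return 1 ;;
    esac
    if [ "$ip" != "$2" ]; then
        echo "$1: maps to $ip, expected $2"
        return 1
    fi
    echo "$1: ok ($ip)"
}

# Usage on the misbehaving node:
#   check_node_name proxmox 192.168.0.136
```

Running getent hosts proxmox on the node additionally shows what the system resolver itself returns for the name, which covers DNS as well as the hosts file.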

Any other spots I should check out real quick? As mentioned, not super familiar with the inner workings of proxmox.


[Screenshots: 1705370243169.png, 1705370171946.png]
 
Realized later what you were looking for: the proxmox name seems to be resolving as expected.
[Screenshot: 1705384101027.png]
 
Posting here for completeness: it occurred to me that I don't know how the web GUI is served, so I figured I would compare the services running on the two nodes. I noticed that the healthy node had 33 services, while the afflicted node had 31.

The missing services were iscsid and rsyslog. Checking the state of each, the rsyslog service doesn't exist on the afflicted node (sounds sub-optimal), while iscsid just wasn't started. Starting iscsid didn't change the ability to access the web interface (which makes sense, since a quick Google search indicates it's related to network storage). It does explain the one or two log entries I saw where that node couldn't access my network share, though.
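
The comparison described above can be reproduced with saved service lists. This is a rough sketch, assuming each node's running services were dumped to a text file with one unit name per line. missing_units is a hypothetical helper name and the file names are placeholders.

```shell
#!/bin/sh
# missing_units HEALTHY_FILE AFFLICTED_FILE
# Print the lines that appear in HEALTHY_FILE but not in AFFLICTED_FILE.
missing_units() {
    grep -Fxv -f "$2" "$1"
}

# One way to produce the lists (run on each node, then copy the files together):
#   systemctl list-units --type=service --state=running --no-legend --plain \
#       | awk '{print $1}' | sort > services-$(hostname).txt
#
# Then:
#   missing_units services-healthy.txt services-afflicted.txt
```

The -F, -x and -v options make grep treat each line of the second file as a fixed string that must match a whole line, and print only the lines that do not match.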
 
It seems I have jumped the shark a bit. At one point I was able to ping and ssh into the node, but now all pings are timing out. I imagine any troubleshooting efforts at this point are going to be impossible to validate till I figure out what I broke on the network. Really strange that the node is still claiming the appropriate IP, but I can't reach it at all.

I am going to start from the basics - different cable, removing any extraneous switches between the node and router, checking interfaces again. Will return with results.
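
For the "checking interfaces again" part, a small filter over the brief link listing can make a dead link stand out. This is a sketch only: it assumes the column layout of ip -br link (name, state, MAC, flags), and links_not_up is a hypothetical helper name.

```shell
#!/bin/sh
# links_not_up: read `ip -br link` output on stdin and print every
# interface (other than lo) whose state column is not UP.
links_not_up() {
    awk '$1 != "lo" && $2 != "UP" { print $1 ": " $2 }'
}

# Usage from the IPMI console:
#   ip -br link | links_not_up
```

Some virtual interfaces can legitimately report a state such as UNKNOWN, so the output is a list of things to look at rather than a list of faults. A physical port showing DOWN after a move to a new switch would point at the cable or the switch port.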
 
Unsure if there is a way to close the thread, but I am officially throwing in the towel on this install. The loss of the couple things is inconvenient, but at this point I am just gonna start fresh.

Thanks @spirit for the suggestions!
 
