Web UI and SSH not reachable after non-deterministic times

dan.ger

Well-Known Member
May 13, 2019
We have a 3-node cluster that worked perfectly with Proxmox 8. Since the last update cycles (the past 2-3 months), the web UI and SSH become unreachable after a non-deterministic time. A reboot of the affected machine solves the problem. So I gave Proxmox 9 (9.0.6) a chance, but it ends with the same result.

If I run:
Code:
systemctl stop pvestatd
systemctl start pvestatd
systemctl restart pveproxy

this fixes the problem. We checked for overcommitment of CPUs (264 vCores available in the cluster), memory (1.10 TB available), and storage (36 TB available). So we do not have an overcommitment of resources.

Here is the firewall configuration:

firewall.png

In journalctl I do not see any issues. The rules were working before. I can access the machines via IPMI and connect to each Proxmox node via ssh ip/fqdn.

So any ideas where to start?
 
Hi @dan.ger ,

Are you able to SSH/CURL directly from PVE host into itself at the time when the issue manifests? What about between PVE hosts?
I.e.: ssh root@localhost ; ssh root@LAN_IP ; ssh root@pve-02; curl -k https://localhost:8006 ; etc
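If it helps, the same checks can be scripted so they are quick to repeat each time the problem shows up. This is only a rough sketch: the function names are made up for illustration, it assumes bash and the coreutils `timeout` command are present, and the target names in the usage line are placeholders.

```shell
# Sketch: probe SSH (22) and the PVE web UI (8006) on a list of targets.
# Function names are illustrative; nothing here is part of Proxmox itself.

# check_port HOST PORT -> prints "HOST:PORT open" or "HOST:PORT closed"
check_port() {
    local host=$1 port=$2
    # /dev/tcp is a bash feature; timeout keeps a silently dropped
    # connection attempt from hanging the loop.
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed"
    fi
}

# check_targets HOST... -> probes both ports on every host given
check_targets() {
    local host port
    for host in "$@"; do
        for port in 22 8006; do
            check_port "$host" "$port"
        done
    done
}
```

For example, `check_targets localhost LAN_IP pve-02` run on an affected node, and again from a machine outside the subnet, shows at a glance which paths are affected.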

The subject of your post states that webui and SSH become unavailable, yet you ended the post with:
I can access from ipmi the machines and connect to each proxmox node via ssh ip/fqdn.
And you only mention that service restart fixes the UI. So is the only thing affected PVE UI?


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hello,

yes I'm able to connect to ipmi, which is like a physical connection through the server's remote management directly on the machine. I can ssh to all of the nodes from each host to the others. A netstat -al shows that port 8006 and ssh are listening internally and externally. I can also download things with a curl command.

But the nodes are not reachable from outside the subnet of the nodes. The VMs on the nodes are reachable. I cannot reach the hosts through SSH or the web UI. After I reboot a node, all machines are reachable through SSH and web UI. I have to do this if I cannot reach any of them and need access immediately. If I have time to wait, the problem goes away by itself after a couple of hours (then at least one node is reachable by SSH and web UI).

If I have access to at least one machine, I have to restart pvestatd and pveproxy on the affected nodes. After that all nodes are reachable by SSH and web UI. At first I thought it was something like fail2ban, so I stopped the firewall, but the nodes were still not reachable. So I started to investigate and restart services, until I found pvestatd and pveproxy.
 
In Proxmox 9.0.6, pve-firewall stop makes SSH and web UI work again on affected machines, regardless of whether fail2ban is running or not. I also checked banned IPs and iptables. Everything seems to be OK. No blocked IP.

If I start pve-firewall, SSH and web UI do not work.
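Since the behaviour flips with the firewall state, one way to compare things is to save the active ruleset once while the node is reachable and once while it is not, then diff the two. A minimal sketch, assuming `iptables-save` and `pve-firewall` are on the PATH; the function name and the default directory `/root/fw-snapshots` are arbitrary choices for illustration.

```shell
# Sketch: save the current firewall state under a label so that two states
# ("working" vs "broken") can be diffed afterwards.

# snapshot_fw LABEL [DIR] -> writes DIR/LABEL.* and prints the common prefix
snapshot_fw() {
    local label=$1 dir=${2:-/root/fw-snapshots}
    mkdir -p "$dir" || return 1
    # Each tool is optional: skip it quietly if it is not installed.
    if command -v iptables-save >/dev/null 2>&1; then
        # Drop the timestamp comments and the [packets:bytes] counters,
        # otherwise every diff is noisy even when the rules are identical.
        iptables-save 2>&1 | grep -v '^#' \
            | sed 's/\[[0-9]*:[0-9]*\]//' > "$dir/$label.iptables" || true
    fi
    if command -v pve-firewall >/dev/null 2>&1; then
        pve-firewall status  > "$dir/$label.status"  2>&1 || true
        pve-firewall compile > "$dir/$label.compile" 2>&1 || true
    fi
    date > "$dir/$label.timestamp"
    echo "$dir/$label"
}
```

Run `snapshot_fw working` right after a restart and `snapshot_fw broken` when the node is unreachable, then `diff -u /root/fw-snapshots/working.iptables /root/fw-snapshots/broken.iptables`. If the two files are identical, the difference is probably not in the generated rules themselves.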
 
I can ssh to all of the nodes from each host to the others.
But the nodes are not reachable from outside the subnet of the nodes.
What you are describing is very bizarre and unusual. This typically means that it's something in your environment. Given your statements above, it is likely switch misbehavior. Some sort of weird PnP issue? Aggressive ARP expiration? Strange AI host banning?

If the network connectivity works fine on the LAN side between the hosts, then it is not a PVE problem. It's up to you to find out: instead of rebooting PVE, reboot your switch next time. Do network traces at all points, including the switch.
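On the PVE side, a trace could look roughly like the following. It is only a sketch, assuming `tcpdump` is installed on the host; the function name, the packet count and the output directory are illustrative defaults.

```shell
# Sketch: capture SSH / web UI traffic on one interface into a pcap file.

# capture_mgmt IFACE [COUNT] [OUTDIR] -> prints the path of the capture file
capture_mgmt() {
    local iface=$1 count=${2:-200} outdir=${3:-/tmp}
    local file="$outdir/mgmt-$iface.pcap"
    # -n: no name lookups, -c: stop after COUNT packets, -w: write raw packets
    tcpdump -n -i "$iface" -c "$count" -w "$file" \
        'tcp port 22 or tcp port 8006' || return 1
    echo "$file"
}
```

Run it on the bridge (e.g. `capture_mgmt vmbr0`) and on the physical NIC underneath while someone connects from outside the subnet. If the SYN packets arrive but no reply leaves, the host is dropping them; if they never arrive, the problem is upstream.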

Cheers


 
In the end it has nothing to do with the switch, because if I stop pve-firewall everything works like it should. So I believe it is a PVE problem. Any other suggestions?
 
because if I stop pve-firewall everything works like it should.
The rules were working before.
Something changed.

Any other suggestions?
Yes, work on troubleshooting your firewall rules. The last one is DROP. Change it to ACCEPT: does it consistently work now? Change it back to DROP and enable logging: does it log your GUI/SSH attempts when broken? Which rule in your list do you think should be responsible for accepting non-LAN traffic? Enable logging on it: do you see it working as expected?

In short, if you believe the issue is entirely contained in PVE, keep methodically working on isolating the "misbehaving" (or missing) rule. Report your findings here or via your support subscription in an organized form. Preferably with reproduction steps and clearly identified, detailed logs.
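Once logging is enabled, the dropped management connections can be pulled out of the firewall log with a small filter. This is a sketch under assumptions: that the log is at `/var/log/pve-firewall.log` and that entries use the usual iptables-style `DPT=` field. The function name is made up.

```shell
# Sketch: keep only DROP entries that target SSH (22) or the web UI (8006).

# mgmt_drops -> reads firewall log lines on stdin, prints the matching ones
mgmt_drops() {
    # The trailing "( |$)" stops port 22 from also matching 2222 and similar.
    grep -E 'DROP' | grep -E 'DPT=(22|8006)( |$)'
}
```

Usage would be `mgmt_drops < /var/log/pve-firewall.log`. The matching lines should show the chain that dropped the packet and the incoming interface, which narrows down which rule did not match.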

Best


 
The rules have been working for over 3 years since Proxmox 6.x without any issues. The problem started with Proxmox 8 and 9. If I turn on the log I see that SSH and web UI are dropped. But in iptables/ebtables I do not see any blocking rules, nor in fail2ban.

The thing is, the ruleset works after a pvestatd and pveproxy restart, without any changes, for a couple of hours. So I believe the rules are right. But I will check by disabling all rules and enabling them one by one for a node to see if it has any effect.
 
They changed the syntax (for predefined IPs?) in the firewall configuration at some point. Maybe remove the rules and create them again for the web GUI, SPICE and SSH?
 
I recreated the rules. But in the end it was the interface vmbr0. If I remove that interface from the rules, it runs without any issues.

I do not understand the magic behind it, because I only want to apply the rule on vmbr0. If I remove it, it works instantly: SSH and web UI are accessible. Maybe the Proxmox guys have an answer.

The behavior is that the machines are not reachable for a while, then they become reachable for a short time, and after that everything repeats with no pattern.
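For anyone who wants to check their own setup for the same thing, rules that are bound to a specific interface can be listed with a small grep helper. It is a sketch that assumes rules carry the interface as `-i IFACE` and that the files are in the standard locations; the function name is made up.

```shell
# Sketch: list firewall rules that are restricted to one interface.

# rules_on_iface IFACE FILE... -> prints "file:line:rule" for each match
rules_on_iface() {
    local iface=$1
    shift
    # -H: always print the file name, -n: print the line number.
    # The pattern matches "-i IFACE" as a whole word, so vmbr0 does not
    # also match vmbr01.
    grep -HnE -- "(^|[[:space:]])-i[[:space:]]+$iface([[:space:]]|\$)" "$@"
}
```

For example: `rules_on_iface vmbr0 /etc/pve/firewall/cluster.fw /etc/pve/nodes/*/host.fw`.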
 
I had a similar problem, with the same symptoms, but it was caused by an intel_iommu kernel module. When I changed the settings in grub everything was fine. Have you checked in dmesg for errors from the above module?
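A quick way to check this could be something like the following. It is a sketch only: the function names are made up, and it assumes the relevant kernel messages mention DMAR or IOMMU.

```shell
# Sketch: show IOMMU-related boot parameters and kernel messages.

# iommu_params [CMDLINE_FILE] -> prints IOMMU kernel parameters, one per line
iommu_params() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" \
        | grep -E '^(intel_iommu|amd_iommu|iommu)='
}

# iommu_messages -> reads kernel log text on stdin, prints DMAR/IOMMU lines
iommu_messages() {
    grep -iE 'DMAR|IOMMU'
}
```

Usage: `iommu_params` shows what the kernel was booted with, and `dmesg | iommu_messages` shows the related messages.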