Proxmox VE Cluster Node Unresponsive

Feynt

New Member
Feb 8, 2022
Morning folks, bit of a conundrum here. Straight to the point: I've set up Proxmox on a second computer with the intention to use it as part of a cluster here at home. Post installation, all was well. Once I made it join the cluster however, the web UI on that second server stopped allowing me to log in, and I get "communication errors" from the working (main) node when attempting to review anything on the second node through the main node's web UI.


Setup


I've got two computers going:
  • A rack server acting as a main server
    • Its services include TrueNAS, Pi-Hole, and a variety of game servers
    • There's a VM running docker here as well
    • All's well on its functionality and it has been stable for a week after setup
  • A workstation (HP Z600 I believe) which was my old homelab server
    • It is to be repurposed for rendering and transcoding since it has a graphics card
    • Also has spinning rust for mass storage, will be a backup server for VM snapshots and my desktop
    • Old hardware so it doesn't do UEFI normally, I had to install PVE using Ventoy
    • After installation it, too, was stable for a full day and operated normally as expected up until the fault
Network wise it's nothing special. I have a Netgear R7000 which has been working well and both servers are able to access the internet and each other with full speed (as best as I can figure, storage speeds being a limiting factor). There are no firewall rules set up on the servers at the moment, and both have been updated recently. The second server has been rebooted, since it has been problematic and doesn't have anything on it yet.

The main server has several certificates for my domain, but it's not like I use them. A docker container running Nginx Proxy Manager maintains all my certificates now. I can confirm there are no issues accessing my main server from outside, as people are connected to my game servers even as I type.


The Problem

So around lunchtime, when I had some spare time, I wanted to set up a cluster between the two servers so they could communicate, share VMs, etc. I set up the cluster on my main server without any fanfare and got the connection string (something I can't seem to retrieve again through the web UI). Then, logging into the web UI for the second server, I was able to enter the connection string when clicking on the Join Cluster option. This is where things went south. Once I had got it to begin joining, things stopped responding properly through the second server's web UI. Refreshing the page, it wanted me to log in again, and no matter what I do it rejects my password. However, when I log in via ssh or the direct terminal, everything is fine.

Back on the main server, it shows my datacenter has both servers included. However, the second server is showing incorrect storage volumes (it's a mirror of the storage from my main server instead of its own three volumes) and they show question marks. The server itself is showing a green checkmark, oddly. When I click on it and attempt to view the summary, though, it refuses to update the Status panel and shows a spinning loading graphic with the error "communication failure (0)". After some time, a dialogue pops up showing "Connection error 596: Connection timed out". I'm not able to access any other sections of the second server from the first one's web UI. They are either empty or produce connection errors (communication failure (0)).

Looking at the terminal of the second server, I am seeing messages appear unbidden:
Code:
INFO:  task pvecm:3172 blocked for more than 120 seconds.
       Tainted: P          IO      5.13.19-4-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
(with 10 total instances of the above, each with increasing amounts of time waited, up to 604 seconds)
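
For anyone landing here with the same symptoms, here is a minimal sketch of how the hung-task messages and cluster services could be inspected over ssh. The function names `hung_tasks` and `cluster_health` are made up for illustration; `pvecm status`, `systemctl`, and `journalctl` are the standard tools, but note that `pvecm status` may itself block if the cluster filesystem is stuck.

```shell
#!/bin/sh
# Summarise hung-task warnings from kernel log text on stdin,
# counting how often each task was reported as blocked.
hung_tasks() {
  grep -oE 'task [^ ]+ blocked for more than [0-9]+ seconds' | sort | uniq -c
}

# Print quorum state, service state, and recent cluster logs.
# Only meaningful on a Proxmox VE node; not called automatically.
cluster_health() {
  pvecm status
  systemctl --no-pager status corosync pve-cluster
  journalctl --no-pager -b -u corosync -u pve-cluster | tail -n 50
}
```

Usage would be something like `dmesg | hung_tasks` followed by `cluster_health` on the node that stopped responding.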




At this point I'm wondering if I should try to break the cluster and retry once the second server is stable again. I've seen plenty of "warning: don't do that" messages in the Proxmox wiki with regards to removing nodes from clusters, so there's slight trepidation, but it's not like I've got anything to lose on the second server. It was literally imaged a day or two ago and nothing has been put onto it yet. I welcome any solution to the problem though.
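
For reference, a rough sketch of the "separate a node without reinstalling" procedure from the Proxmox VE cluster manager documentation, as best I recall it; check it against the current docs before running anything. It is wrapped in a function (the name `separate_node` is made up) so that nothing runs on paste, and it is destructive: it wipes the node's cluster configuration. The node name `pve2` in the comments is a placeholder.

```shell
#!/bin/sh
# Run on the node being removed from the cluster (here: the second server).
# Assumption: the node holds nothing that needs preserving.
separate_node() {
  systemctl stop pve-cluster corosync   # stop the cluster filesystem and corosync
  pmxcfs -l                             # restart the cluster filesystem in local mode
  rm -f /etc/pve/corosync.conf          # drop the cluster config from pmxcfs
  rm -rf /etc/corosync/*                # drop the local corosync config and keys
  killall pmxcfs                        # stop the local-mode instance
  systemctl start pve-cluster           # start again as a standalone node
}

# Afterwards, on the remaining (main) node, something like:
#   pvecm expected 1     # regain quorum with a single vote
#   pvecm delnode pve2   # "pve2" is a placeholder for the removed node's name
```

Calling `separate_node` as root on the second server would then leave it standalone, after which the join can be retried.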
 
Hi,

I saw something similar. I added five nodes OK, but when I added the sixth node (a different model from the first five), things went bad as you describe. With the sixth node powered off, everything was OK with the rest of the cluster. If the sixth node was on, the web interface would hang and time out like you describe. SSH still worked. I found out I could just do an `ifdown vmbr1` on the sixth node, which was the corosync interface, and it would also "fix" the rest of the cluster.

The setup I had there was just temporary, before all the NICs and switches had arrived, so it was a bit ad hoc. I think the issue may have been due to either multicast not being configured properly on the switches, or the ethernet cards having different MTUs (as the Proxmox folks suggested).
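
A quick way to test the MTU theory might look like the sketch below. It assumes Linux `ping` from iputils (for `-M do`, which forbids fragmentation) and IPv4; the function names and the peer address `192.0.2.10` are placeholders, and `vmbr1` is the corosync bridge from my setup.

```shell
#!/bin/sh
# ICMP payload size that exactly fills a given MTU:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Send non-fragmentable pings sized to the MTU (default 1500).
# If these fail while small pings succeed, something on the path has a smaller MTU.
check_path_mtu() {
  peer="$1"
  mtu="${2:-1500}"
  ping -c 3 -M do -s "$(icmp_payload_for_mtu "$mtu")" "$peer"
}

# Example (not run automatically):
#   ip -o link show vmbr1          # shows the configured MTU of the corosync bridge
#   check_path_mtu 192.0.2.10 1500
```

Comparing the MTU reported by `ip -o link show` on every node is the other half of the check; they should all match on the corosync network.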
 
Did you find a solution to your problem? I am facing a very similar problem after joining a node to my cluster.
I got strange behavior, then my main node, where I created the cluster, stopped working, while the second node seemed to be working fine.
Then I logged out of both nodes, and now I am locked out of the web UI and only have SSH access.
 
Unfortunately not. I haven't plugged in the other computer that was acting as the second node in the cluster since then for a few reasons.
 
