Morning folks, bit of a conundrum here. Straight to the point: I've set up Proxmox on a second computer, intending to use it as part of a cluster here at home. Post-installation, all was well. Once I made it join the cluster, however, the web UI on that second server stopped letting me log in, and I get "communication errors" from the working (main) node when I try to view anything on the second node through the main node's web UI.
Setup
I've got two computers going:
- A rack server acting as the main server
  - Its services include TrueNAS, Pi-Hole, and a variety of game servers
  - There's a VM running Docker here as well
  - It's working well and has been stable for a week since setup
- A workstation (HP Z600, I believe) which was my old homelab server
  - It is to be repurposed for rendering and transcoding since it has a graphics card
  - Also has spinning rust for mass storage, so it will be a backup server for VM snapshots and my desktop
  - Old hardware, so it doesn't normally do UEFI; I had to install PVE using Ventoy
  - After installation it, too, was stable for a full day and operated as expected up until the fault
The main server has several certificates for my domain, but it's not like I use them. A docker container running Nginx Proxy Manager maintains all my certificates now. I can confirm there are no issues accessing my main server from outside, as people are connected to my game servers even as I type.
The Problem
Around lunchtime, when I had some spare time, I wanted to set up a cluster between the two servers so they could communicate, share VMs, etc. I set up the cluster on my main server without any fanfare and got the connection string (something I can't seem to retrieve again through the web UI). Then, logging into the web UI for the second server, I was able to enter the connection string after clicking the Join Cluster option. This is where things went south. Once the join had started, the second server's web UI stopped responding properly. When I refreshed the page it wanted me to log in again, and no matter what I do it rejects my password. However, when I log in via ssh or the direct terminal, everything is fine.
Back on the main server, the web UI shows both servers in my datacenter. However, the second server is showing incorrect storage volumes (a mirror of the storage from my main server instead of its own three volumes), and they show question marks. The server itself is showing a green checkmark, oddly. When I click on it and attempt to view the summary, though, the Status panel refuses to update and shows a spinning loading graphic with the error "communication failure (0)". After some time, a dialog pops up showing "Connection error 596: Connection timed out". I'm not able to access any other sections of the second server from the first one's web UI. They are either empty or produce connection errors (communication failure (0)).
Looking at the terminal of the second server, I am seeing messages appear unbidden:
Code:
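INFO: task pvecm:3172 blocked for more than 120 seconds.
Tainted: P IO 5.13.19-4-pve #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
(with 10 total instances of the above, each with increasing amounts of time waited, up to 604 seconds)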
At this point I'm wondering if I should try to break the cluster and retry once the second server is stable again. I've seen plenty of "warning: don't do that" messages in the Proxmox wiki with regard to removing nodes from clusters, so there's slight trepidation, but it's not like I've got anything to lose on the second server. It was literally imaged a day or two ago and nothing has been put onto it yet. I welcome any solution to the problem though.
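For reference, breaking the cluster would roughly follow the "separate a node without reinstalling" steps in the Proxmox cluster manager documentation. The sketch below is untested on this setup and only prints the commands by default. `DRY_RUN`, `run`, `separate_node`, and `forget_node` are hypothetical helper names made up for illustration, not part of Proxmox, and the node name passed to `forget_node` is a placeholder. With the `pvecm` task hung, stopping the services may itself block, so a reboot of the second server might be needed first; that part is a guess.

```shell
#!/bin/sh
# Sketch only: cluster separation steps wrapped in a dry-run helper.
# DRY_RUN and run() are hypothetical helpers, not Proxmox tooling.
DRY_RUN="${DRY_RUN:-1}"   # default: print the commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run on the node being removed (here, the second server).
separate_node() {
  run systemctl stop pve-cluster corosync
  run pmxcfs -l                    # start the cluster filesystem in local mode
  run rm /etc/pve/corosync.conf
  run rm -r /etc/corosync/*
  run killall pmxcfs
  run systemctl start pve-cluster
}

# Run on the remaining node (here, the main server).
# $1 is the name of the removed node as shown in the datacenter view.
forget_node() {
  # A two-node cluster loses quorum when one node leaves, so the
  # expected vote count may have to be lowered before delnode works.
  run pvecm expected 1
  run pvecm delnode "$1"
}
```

Calling `separate_node` on the second server and `forget_node <nodename>` on the main server prints the command list for review; setting `DRY_RUN=0` would execute it for real, which is only sensible when the node holds nothing worth keeping, as is the case here.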