Problem with GUI HTTP connection timeout one way in a 2-node cluster

ABaum

Active Member
Nov 2, 2018
Hello, I need assistance from someone.
I can log into node 1 and see its GUI.
I can log into node 2 and see its GUI.

On node 1's GUI page I can do everything I expect.
On node 2's GUI page I can do all tasks for node 2.
But, and here is the problem, from node 2 I cannot see or work with node 1; I only get a timeout.
The pveproxy syslog on node2 shows "proxy detected vanished client connection" after trying to view node1 pages.

Both units are on 8.2.2, installed recently.
Where should I look for the one thing that is different between the two?

I also cannot ssh from node2 to node1; however, I can ssh from node1 to node2.

Code:
root@node2:~# ssh -vvv root@192.168.2.3
OpenSSH_9.2p1 Debian-2+deb12u3, OpenSSL 3.0.13 30 Jan 2024
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 192.168.2.3 is address
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/root/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/root/.ssh/known_hosts2'
debug3: ssh_connect_direct: entering
debug1: Connecting to 192.168.2.3 [192.168.2.3] port 22.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 192.168.2.3 port 22: Connection timed out
ssh: connect to host 192.168.2.3 port 22: Connection timed out
 
I have no problems with corosync.
I have been carefully checking for firewall rules that could block it,
but I still get a timeout on ssh.
 
I have no problems with corosync
I am not sure this helps. Maybe your corosync is on a different network? Maybe the corosync packets are small enough to fit through a broken MTU? You did not provide any details about your network setup or system state, just a user-level application error.
But I still get a timeout on ssh.
PVE is based on Debian with an Ubuntu-derived kernel, and SSH is a basic part of the Linux userland. Start checking ports with "nc", enable debug logging on the sshd side, add more verbosity on the "ssh" client side, and get some network captures.
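As a concrete starting point, a rough sketch of those checks; the peer address and interface name are taken from this thread and may differ on your system:

```shell
# Peer address from the thread; substitute your own.
TARGET=192.168.2.3

# 1. Is anything reachable on port 22 at all? (-z: scan only, -w 2: 2s timeout)
nc -zv -w 2 "$TARGET" 22

# 2. Client-side verbosity, with a short connect timeout so it fails fast:
ssh -vvv -o ConnectTimeout=5 root@"$TARGET"

# 3. On the server side, run a one-off debug sshd on a spare port so the
#    production daemon keeps running:
/usr/sbin/sshd -d -p 2222

# 4. Capture traffic on the tunnel interface on both ends to see where
#    the packets stop:
tcpdump -ni wg0 'tcp port 22'
```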

The MTU and/or a duplicate IP are the most likely culprits based on the limited amount of information you provided. But I could be completely wrong; it's just a guess.
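For completeness, a sketch of how one might test both guesses; the address and interface names (vmbr0, wg0) are assumptions based on this thread:

```shell
# Duplicate IP: ask who answers for the address on the local segment
# (arping is in the iputils-arping package on Debian):
arping -I vmbr0 -c 3 192.168.2.3

# Broken MTU: send a large packet with the Don't Fragment bit set.
# If small pings work but this fails, the path MTU is the problem.
# 1392 bytes of payload + 28 bytes of ICMP/IP headers = 1420, the
# usual WireGuard interface MTU.
ping -M do -s 1392 -W 1 -c 3 192.168.2.3
```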

IMHO, if you can't reliably ssh between the nodes, there is no point in troubleshooting anything above it.

Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
This setup is clustered across a WireGuard VPN, no high availability, and it was working.
When I upgraded the last node, I removed it from the cluster, then reinstalled and rejoined. That is when the trouble started.

I have wg configured like this and will play with the MTU for now.

Code:
root@pve24:~# wg
interface: wg0
  public key: PdxhRSbIVaqfGMzQz+mmhCOti+Le4kveZ1geE7yDUWw=
  private key: (hidden)
  listening port: 51826

peer: zr43aPHwf5HSML+wHjiuMMbdR935gkSoP3twPabiyXE=
  endpoint: 129.222.136.67:52175
  allowed ips: 192.168.2.4/32
  latest handshake: 16 seconds ago
  transfer: 1.19 GiB received, 32.07 GiB sent
  persistent keepalive: every 25 seconds

peer: 8WIcHZfCGbKvDS5dkxzZdleApW1i52se6NwzKPRmDx8=
  endpoint: 69.41.195.50:51824
  allowed ips: 192.168.2.3/32
  latest handshake: 1 minute, 33 seconds ago
  transfer: 1.65 GiB received, 2.33 GiB sent
  persistent keepalive: every 25 seconds

Code:
root@pve21:~# wg
interface: wg0
  public key: 8WIcHZfCGbKvDS5dkxzZdleApW1i52se6NwzKPRmDx8=
  private key: (hidden)
  listening port: 51824

peer: PdxhRSbIVaqfGMzQz+mmhCOti+Le4kveZ1geE7yDUWw=
  endpoint: 216.110.250.179:51826
  allowed ips: 192.168.2.5/32
  latest handshake: 6 seconds ago
  transfer: 7.51 GiB received, 5.07 GiB sent
  persistent keepalive: every 25 seconds

peer: zr43aPHwf5HSML+wHjiuMMbdR935gkSoP3twPabiyXE=
  endpoint: 129.222.136.67:52175
  allowed ips: 192.168.2.4/32
  latest handshake: 51 seconds ago
  transfer: 39.17 GiB received, 14.08 GiB sent
  persistent keepalive: every 25 seconds
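Since the plan is to play with the MTU: WireGuard's usual interface MTU is 1420, i.e. 1500 minus up to 80 bytes of worst-case outer IPv6 + UDP + WireGuard headers. A sketch of lowering it for testing, assuming wg0 as above; 1380 is an arbitrary trial value:

```shell
# Default: 1500 - 80 bytes worst-case tunnel overhead = 1420.
# If the WAN path itself is smaller (PPPoE, nested tunnels), go lower:
ip link set dev wg0 mtu 1380

# To make it persistent, set the MTU in the interface configuration
# instead (e.g. an "mtu 1380" line in /etc/network/interfaces, or
# "MTU = 1380" in the wg-quick config).
```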
 
This setup is clustered across a WireGuard VPN, no high availability, and it was working.
When I upgraded the last node, I removed it from the cluster, then reinstalled and rejoined. That is when the trouble started.
The complexity of the situation just increased 20x.

I can only recommend starting from the basics. SSH is a critical part of the PVE inner workings; getting it to work reliably is the first step.


 
The complexity of the situation just increased 20x.

I can only recommend starting from the basics. SSH is a critical part of the PVE inner workings; getting it to work reliably is the first step.

Understood.
Any basic things for me to check?
I just finished proving that ssh works from the WAN to my home desktop (also on the WAN), after I fixed the IP allow list to include my current address.
So the wg tunnel is the problem for now.
 
OK, this nmap finally shows something different: (filtered)

The ssh connections that work report "open" with this command, and the ones that don't all point to the one server I added last to the cluster.
I now need to find the difference in the setup that causes the ssh port to be filtered.

Code:
@pve25:~# nmap 192.168.2.3 -PN -p ssh
Starting Nmap 7.93 ( https://nmap.org ) at 2024-07-17 17:17 EDT
Nmap scan report for 192.168.2.3
Host is up.

PORT   STATE    SERVICE
22/tcp filtered ssh

Nmap done: 1 IP address (1 host up) scanned in 2.12 seconds
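"filtered" in nmap usually means a firewall is silently dropping the packets, rather than sshd being down (which would show "closed"). A rough sketch of where to look on the unreachable node, assuming the stock PVE firewall:

```shell
# Is the PVE firewall active on this node?
pve-firewall status

# Inspect the compiled iptables rules for anything touching port 22
# or the tunnel interface:
iptables -L INPUT -n -v | grep -E 'dpt:22|wg0'

# Blunt but quick confirmation: stop the firewall and retry the scan.
pve-firewall stop
```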
 
Well, it is working now.
I am not certain exactly what I did to make it work.
The last step before ssh suddenly connected was restarting pve-firewall.
Earlier I fixed up the hosts file, because it did not resolve the names for the wg0 connection IPs,
then ran the commands in cant-connect-to-destination-address-using-public-key-task-error-migration-aborted

I don't recall whether I restarted the firewall after doing that, before I restarted it again while trying to enable logging to show where the packets were dropped.
I did use tcpdump to see the ssh traffic arriving at the wg interface.
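For anyone landing here later, the verification steps described above look roughly like this; wg0 and the restart command are as used in this thread:

```shell
# Watch for SSH SYNs arriving on the tunnel while connecting from the
# other node; SYNs with no replies point at a drop on this host:
tcpdump -ni wg0 'tcp port 22 and tcp[tcpflags] & tcp-syn != 0'

# Recompile and reload the firewall rules after fixing /etc/hosts:
pve-firewall restart
```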
 
