Cluster-based VNC / console error

I have a four-node cluster. For clarity, in this example the hosts are host1, host2, host3, and host4.

All running 9.0, which has been pretty stable so far. Suddenly today, essentially out of nowhere (which I'm sure won't turn out to be true), I can only log in to the shell of a host, VM, or LXC through the GUI of the node I'm actually connected to.

I can see all the other hosts in the cluster, I can tab through them and see stats etc., but I cannot get a shell or console on any of the hosts, VMs, or LXCs that don't belong to the node whose UI I came in through.

If I go to https://host1...:8006 I can get a shell/console on host1 and the three VMs running there. No shells on host2-host4, nor any of their VMs or LXCs.

I must have screwed something up? The SSHD config perhaps?

Any idea what I did? Wasn't even sure how to search for this...

Thanks in advance for any assistance you can provide.

---
# pveversion -v
proxmox-ve: 9.0.0 (running kernel: 6.14.8-2-pve)
pve-manager: 9.0.3 (running version: 9.0.3/025864202ebb6109)
proxmox-kernel-helper: 9.0.3
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
proxmox-kernel-6.14: 6.14.8-2
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8: 6.8.12-13
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx9
intel-microcode: 3.20250512.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.3
libpve-network-perl: 1.1.6
libpve-rs-perl: 0.10.7
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2
lxc-pve: 6.0.4-2
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.9-1
proxmox-backup-file-restore: 4.0.9-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.1
proxmox-kernel-helper: 9.0.3
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.0
proxmox-widget-toolkit: 5.0.4
pve-cluster: 9.0.6
pve-container: 6.0.9
pve-docs: 9.0.7
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.16-3
pve-ha-manager: 5.0.4
pve-i18n: 3.5.2
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.16
smartmontools: 7.4-pve1
spiceterm: 3.4.0
swtpm: 0.8.0+pve2
vncterm: 1.9.0
zfsutils-linux: 2.3.3-pve1
 
Hi,
Do you receive an error message when trying to connect to the shell/console of the other hosts or VMs? What exactly do you get?
Does the SSH connection work between the cluster nodes?
Please provide the output of the following commands from one of the cluster nodes:
Bash:
 pvecm status
 corosync-cfgtool -n
And the output of ip a from all cluster nodes.

Are you connected as the root user? If not, make sure the user you are using has the correct permissions.
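For example, something like this, run from each node, confirms that key-based root SSH works non-interactively (just a sketch; replace <other-node> with each of the other cluster nodes in turn):
Bash:
 # should print the remote hostname without prompting for a password
 ssh -o BatchMode=yes root@<other-node> hostname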
 
Thank you for your assistance.

Here's the info you asked for:

# pvecm status
corosync-cfgtool -n
Cluster information
-------------------
Name: production
Config Version: 6
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Fri Aug 22 09:47:39 2025
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000002
Ring ID: 1.4e7
Quorate: Yes

Votequorum information
----------------------
Expected votes: 4
Highest expected: 4
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.50.2
0x00000002 1 192.168.50.51 (local)
0x00000003 1 192.168.50.52
0x00000004 1 192.168.50.53
Local node ID 2, transport knet
nodeid: 1 reachable
LINK: 0 udp (192.168.50.51->192.168.50.2) enabled connected mtu: 1397

nodeid: 3 reachable
LINK: 0 udp (192.168.50.51->192.168.50.52) enabled connected mtu: 1397

nodeid: 4 reachable
LINK: 0 udp (192.168.50.51->192.168.50.53) enabled connected mtu: 1397

I've continued to debug (with Gemini) and learned the following:

Here's the current status:

Initial Symptom

In a four-node Proxmox 9.0 cluster, the web UI console (noVNC, SPICE, xterm.js) fails to connect to any guest (VM or LXC) that is on a different node from where the UI is being accessed. The issue is cluster-wide; for example, accessing the UI on pve02 allows connections to guests on pve02, but not to guests on pve01, pve03, or bee.

Investigation Chronology & Key Findings

  1. Theory: Inter-Node SSH Failure
    • Steps: Verified and corrected SSH configurations to ensure passwordless, key-based root login was functional between all four nodes.
    • Result: SSH was confirmed to be working correctly, but the console issue persisted.
  2. Theory: Stale Service or Cluster State
    • Steps: Restarted core Proxmox services (pveproxy, pvedaemon) on all nodes. When that failed, a full, sequential reboot of every node in the cluster was performed.
    • Result: No change. The cluster returned to a healthy, quorate state, but the console problem remained.
  3. Breakthrough: SPICE File Reveals a Proxy
    • Steps: Attempting to open a SPICE console downloaded a .vv file. Inspecting this file revealed a critical line: proxy=http://pve01.westmaxx.com:3128.
    • Result: This proved that the cluster was forcing console connections through a proxy service it believed was running on node pve01 at port 3128.
  4. Theory: Proxy on the Server
    • Steps: An exhaustive search for a proxy configuration was conducted on all nodes (see the sketch after this list). This included checking:
      • System-wide environment files (/etc/environment).
      • APT configuration files (/etc/apt/apt.conf.d/).
      • systemd service overrides and the main pveproxy.service file.
      • The Proxmox datacenter configuration (/etc/pve/datacenter.cfg).
      • The live environment variables of the running pveproxy process.
    • Result: All searches were negative. No proxy was configured in any standard location.
  5. Theory: Proxy on the Client (User's PC)
    • Steps: Investigated the user's PC, which had a fresh Windows installation. The Bitdefender antivirus suite was a prime suspect.
    • Result: The user completely uninstalled Bitdefender and tested with multiple browsers. This had no effect, proving the issue was not on the client side.
  6. Confirmation: The Proxy is Real and on the Server
    • Steps: From the Windows client, Test-NetConnection pve01.westmaxx.com -Port 3128 was run.
    • Result: The test returned TcpTestSucceeded: True, providing definitive proof that a service was actively listening on port 3128 on pve01.
    • Steps: On pve01, sudo ss -tlpn was run to identify the listening process.
    • Result: The process was identified as spiceproxy, a core Proxmox service: users:(("spiceproxy work",pid=1594,fd=6),("spiceproxy",pid=1593,fd=6)).
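
For reference, the proxy-configuration search in step 4 looked roughly like this (a sketch reconstructed after the fact; these are the standard Debian/Proxmox locations, and the exact commands may have varied):
Bash:
 # system-wide and APT proxy settings
 grep -ri proxy /etc/environment /etc/apt/apt.conf.d/ 2>/dev/null
 # datacenter-wide http_proxy option
 grep -i proxy /etc/pve/datacenter.cfg
 # pveproxy unit file plus any systemd drop-in overrides
 systemctl cat pveproxy.service
 # live environment of the running pveproxy process
 tr '\0' '\n' < /proc/"$(pidof -s pveproxy)"/environ | grep -i proxy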


Final Conclusion

The node pve01 is in a corrupted state, causing its spiceproxy service to start with a non-standard, persistent configuration. This faulty configuration has been propagated to all other nodes via the Proxmox Cluster File System (pmxcfs), effectively "poisoning" the cluster. As a result, every node now incorrectly attempts to route its console traffic through the broken service on pve01, causing all remote console connections to fail.

The misconfiguration is not present in any standard config file and survived both a full reboot and a package update, indicating a deep, non-standard system state issue on pve01.

Current Action: The only remaining solution is to remove the source of the problem. We are currently moving all guests off pve01 in preparation to gracefully remove it from the cluster and perform a fresh Proxmox installation on the hardware.


Next steps (unless you get back to me with a better answer) are to migrate the workloads off pve01, remove it from the cluster, re-install Proxmox, re-join it to the cluster, and migrate the workloads back.
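
Roughly, that plan looks like this (a sketch only; the guest IDs and target nodes are placeholders, and pvecm delnode has to be run from a node that stays in the cluster, after pve01 has been emptied and powered off):
Bash:
 # migrate guests off pve01 (IDs/targets are examples)
 qm migrate 101 pve02 --online       # running VM, live migration
 pct migrate 201 pve03 --restart     # container, restart migration

 # once pve01 is empty and shut down, remove it from the cluster
 pvecm delnode pve01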
 
Not sure how to close this, but thank you for the assistance! I've resolved the problem... I feel kinda stupid for having jumped so far down the rabbit hole chasing this, since it turned out to be a recently introduced bug in my .bashrc for non-interactive shells... FML :-) Appreciate your help nonetheless!
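
For anyone who lands here later: the problem was that my .bashrc produced output for non-interactive shells, which (as far as I understand) is how the nodes reach each other over SSH for the remote consoles. The fix amounts to making sure .bashrc does nothing for non-interactive shells; a minimal sketch of the standard Debian-style guard (not necessarily the exact change I made), plus a quick sanity test:
Bash:
 # at the top of root's ~/.bashrc: bail out early when the shell is not
 # interactive, so commands run over SSH from other nodes see no extra output
 case $- in
     *i*) ;;
       *) return ;;
 esac

 # quick test from another node: this should print only the hostname
 ssh root@pve01 hostname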
 
Glad you managed to solve it. ;)
Please edit the title and include [SOLVED] at the beginning to close the thread.
 
To close the thread as solved, please edit the first post of the thread and select Solved from the pull-down menu.