Inconsistencies with VM console connections.

TimRyan

I have an 8-node PVE 8.0.4 cluster made up of 2 Dell R610 and 6 Dell R710 servers. The most recent addition was 3 R710s, added some months after the first 5 servers were clustered. The VM ISO image I am using is Ubuntu 22.04.3-live-server-amd64.iso, which I have used to create VMs on the other nodes without issue. However, on the three most recently added nodes, prox-6, prox-7, and prox-8, which are all functional and joined to the 8-node quorate cluster, I am seeing some VMs that are unable to connect to the noVNC server; when the attempt is made, the console shows a red strip at the top of the screen with "X Failed to connect to server".

Why is this happening? Of the three newly added nodes, one is able to display its VM's console and the other two are not.

There are also differences in the syslogs on these nodes.

prox-7's VM117 reaches the noVNC shell, with this syslog data:
Nov 20 12:52:10 prox-7 systemd[1]: user@0.service: Deactivated successfully.
Nov 20 12:52:10 prox-7 systemd[1]: Stopped user@0.service - User Manager for UID 0.
Nov 20 12:52:10 prox-7 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 20 12:52:10 prox-7 systemd[1]: run-user-0.mount: Deactivated successfully.
Nov 20 12:52:10 prox-7 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Nov 20 12:52:10 prox-7 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 20 12:52:10 prox-7 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Nov 20 12:52:10 prox-7 systemd[1]: user-0.slice: Consumed 1.153s CPU time.

The other two display this in their syslogs:
Nov 20 12:53:34 prox-6 pmxcfs[1123]: [status] notice: received log
Nov 20 12:53:34 prox-6 sshd[136491]: Connection closed by 192.168.0.112 port 34026 [preauth]
Nov 20 12:53:34 prox-6 pmxcfs[1123]: [status] notice: received log

All other nodes report like this on a successful connection of a VM to a noVNC shell display:
Nov 20 13:02:38 prox-5 systemd[1]: user@0.service: Deactivated successfully.
Nov 20 13:02:38 prox-5 systemd[1]: Stopped user@0.service - User Manager for UID 0.
Nov 20 13:02:39 prox-5 systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 20 13:02:39 prox-5 systemd[1]: run-user-0.mount: Deactivated successfully.
Nov 20 13:02:39 prox-5 systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Nov 20 13:02:39 prox-5 systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 20 13:02:39 prox-5 systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Nov 20 13:02:39 prox-5 systemd[1]: user-0.slice: Consumed 1.249s CPU time.

What is the issue that might be causing these failures on two of the three new nodes?
 
Hi Tim Ryan,

Connection closed by 192.168.0.112 port 34026 [preauth]
That looks like SSH dropping an incoming connection. Stack Exchange has a thread discussing some possible causes, troubleshooting steps, and workarounds.

Were servers 1-5 upgraded from PVE 7 (or earlier), perhaps with a different default algorithm for key generation?
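
A quick way to compare (a sketch only, assuming the GUI you connect to is served from prox-1 at 192.168.0.112 and that the console proxy relies on root key-based SSH between nodes) would be to test that SSH hop by hand and look at the key types on an old and a new node:

# from prox-1: does non-interactive key-based SSH to a problem node work at all?
ssh -o BatchMode=yes root@prox-8 /bin/true && echo OK || echo FAILED

# key types and fingerprints of each node's SSH host keys
for f in /etc/ssh/ssh_host_*_key.pub; do ssh-keygen -lf "$f"; done

# Proxmox ships a helper that refreshes cluster certificates and the shared known_hosts data;
# check its man page on your version before running it
pvecm updatecerts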
 
To add to this observation: the three new Proxmox nodes, 6, 7, and 8, are all able to access the node shell without issue, and all display the same shell header:

Linux prox-6 6.2.16-19-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-19 (2023-10-24T12:07Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Nov 20 13:40:55 PST 2023 from 192.168.0.112 on pts/1
root@prox-6:~#

Linux prox-7 6.2.16-19-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-19 (2023-10-24T12:07Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Nov 20 13:41:37 PST 2023 from 192.168.0.112 on pts/0
root@prox-7:~#

As does the "host server" on https://192.168.0.112:8006/#v1:0:=node/prox-1:3:=jsconsole::::4::

Linux prox-1 6.2.16-19-pve #1 SMP PREEMPT_DYNAMIC PMX 6.2.16-19 (2023-10-24T12:07Z) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Mon Nov 20 13:30:49 PST 2023 on pts/0
root@prox-1:~#

It is only the Ubuntu VMs created on the three new servers that have their console sessions refused. To add to this, I rebooted the host server, prox-1, and now none of the VMs hosted on prox-6 through prox-8 can connect to the noVNC shell.

In each case, the syslog of the prox-6, prox-7, or prox-8 node shows this error:

Nov 20 13:48:17 prox-8 sshd[143270]: Connection closed by 192.168.0.112 port 37584 [preauth]

This is consistent for the VMs located on any of these three PVE 8.0.4 nodes.
 
To add to this, these are entries in the syslog of the prox-1 PVE 8.0.4 host server.

This is the result of an attempt to access the console of VM118 on the prox-8 node (the new install, built from the PVE 8.0.2 ISO and upgraded to 8.0.4):

Nov 20 14:10:45 prox-1 pvedaemon[10628]: starting vnc proxy UPID:prox-1:00002984:00054112:655BD965:vncproxy:118:root@pam:
Nov 20 14:10:45 prox-1 pvedaemon[1172]: <root@pam> starting task UPID:prox-1:00002984:00054112:655BD965:vncproxy:118:root@pam:
Nov 20 14:10:46 prox-1 pvedaemon[10628]: Failed to run vncproxy.
Nov 20 14:10:46 prox-1 pvedaemon[1172]: <root@pam> end task UPID:prox-1:00002984:00054112:655BD965:vncproxy:118:root@pam: Failed to run vncproxy.

This was in the prox-8 syslog at the same time:

Nov 20 14:10:46 prox-8 sshd[147234]: Connection closed by 192.168.0.112 port 36880 [preauth]

This is the result of a console access to VM102 on the prox-4 node (a node upgraded from PVE 7):

Nov 20 14:11:45 prox-1 pvedaemon[10911]: starting vnc proxy UPID:prox-1:00002A9F:0005583A:655BD9A1:vncproxy:102:root@pam:
Nov 20 14:11:45 prox-1 pvedaemon[1173]: <root@pam> starting task UPID:prox-1:00002A9F:0005583A:655BD9A1:vncproxy:102:root@pam:
Nov 20 14:11:54 prox-1 pvedaemon[1173]: <root@pam> end task UPID:prox-1:00002A9F:0005583A:655BD9A1:vncproxy:102:root@pam: OK

This was in the prox-4 syslog at the same time:

Nov 20 14:11:45 prox-4 sshd[178297]: Accepted publickey for root from 192.168.0.112 port 45288 ssh2: RSA SHA256:xjwtiU+0Si64jHl7UmAIgKJz+fCZpRDiSSzT+wG91gc
Nov 20 14:11:45 prox-4 sshd[178297]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Nov 20 14:11:45 prox-4 systemd-logind[816]: New session 30 of user root.
Nov 20 14:11:45 prox-4 systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Nov 20 14:11:45 prox-4 systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Nov 20 14:11:45 prox-4 systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Nov 20 14:11:45 prox-4 systemd[1]: Starting user@0.service - User Manager for UID 0...
Nov 20 14:11:45 prox-4 (systemd)[178300]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Nov 20 14:11:46 prox-4 systemd[178300]: Queued start job for default target default.target.
Nov 20 14:11:46 prox-4 systemd[178300]: Created slice app.slice - User Application Slice.
Nov 20 14:11:46 prox-4 systemd[178300]: Reached target paths.target - Paths.
Nov 20 14:11:46 prox-4 systemd[178300]: Reached target timers.target - Timers.
Nov 20 14:11:46 prox-4 systemd[178300]: Starting dbus.socket - D-Bus User Message Bus Socket...
Nov 20 14:11:46 prox-4 systemd[178300]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
Nov 20 14:11:46 prox-4 systemd[178300]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 20 14:11:46 prox-4 systemd[178300]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Nov 20 14:11:46 prox-4 systemd[178300]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Nov 20 14:11:46 prox-4 systemd[178300]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Nov 20 14:11:46 prox-4 systemd[178300]: Listening on dbus.socket - D-Bus User Message Bus Socket.
Nov 20 14:11:46 prox-4 systemd[178300]: Reached target sockets.target - Sockets.
Nov 20 14:11:46 prox-4 systemd[178300]: Reached target basic.target - Basic System.
Nov 20 14:11:46 prox-4 systemd[178300]: Reached target default.target - Main User Target.
Nov 20 14:11:46 prox-4 systemd[178300]: Startup finished in 197ms.
Nov 20 14:11:46 prox-4 systemd[1]: Started user@0.service - User Manager for UID 0.
Nov 20 14:11:46 prox-4 systemd[1]: Started session-30.scope - Session 30 of User root.
Nov 20 14:11:46 prox-4 sshd[178297]: pam_env(sshd:session): deprecated reading of user environment enabled
Nov 20 14:11:54 prox-4 sshd[178297]: Received disconnect from 192.168.0.112 port 45288:11: disconnected by user
Nov 20 14:11:54 prox-4 sshd[178297]: Disconnected from user root 192.168.0.112 port 45288
Nov 20 14:11:54 prox-4 sshd[178297]: pam_unix(sshd:session): session closed for user root
Nov 20 14:11:54 prox-4 systemd[1]: session-30.scope: Deactivated successfully.

Is this evidence that the VMs on the new servers are failing to initiate secure sessions for some reason?
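
For what it's worth (a sketch, not a confirmed diagnosis): the "[preauth]" close on prox-8 against the "Accepted publickey" line on prox-4 points at the SSH hop the vncproxy task makes from prox-1 to the target node failing before authentication completes. Reproducing that hop manually from prox-1 with verbose output may show where it stops:

# from prox-1: the same root hop the console proxy needs, with verbose output
ssh -vv -o BatchMode=yes root@prox-8 /bin/true

# compare against a node where the console works
ssh -vv -o BatchMode=yes root@prox-4 /bin/true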
 
Further to all of these issues, I have yet to solve the basic inability to access the console of new Ubuntu 22.04.3 server ISO VMs on the latest three nodes of my 8-node cluster.

After seeing other users' issues where the sequence of installation and updating resulted in odd behaviour, I ran a manual update and upgrade on each cluster node from its command shell, as follows:

Log in as root with the root password.

apt-get update
This resulted in differing behaviour depending on when each node was built and which ISO image was used. The versions ranged from the last 6.x releases to 7.1, 7.2, and 8.2.
apt-get dist-upgrade
This triggered a range of differing upgrades that made me realize that servers I thought were identical were anything but.

This was followed by a full shutdown and cold-boot cycle on each node.
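
For anyone repeating this across a cluster, a small loop (a sketch only; it assumes the hostnames prox-1 through prox-8 resolve, that root SSH between the nodes works, and that the reboots are still done one node at a time to keep quorum) saves logging in to each node by hand:

for n in prox-1 prox-2 prox-3 prox-4 prox-5 prox-6 prox-7 prox-8; do
  echo "=== $n ==="
  ssh root@$n 'apt-get update && apt-get dist-upgrade -y'
done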

After this process, all 8 of my server nodes return this output when the commands are repeated:

root@prox-1:~# apt-get update
Hit:1 http://ftp.ca.debian.org/debian bookworm InRelease
Hit:2 http://ftp.ca.debian.org/debian bookworm-updates InRelease
Hit:3 http://security.debian.org bookworm-security InRelease
Hit:4 http://security.debian.org/debian-security bookworm-security InRelease
Hit:5 http://download.proxmox.com/debian/ceph-quincy bookworm InRelease
Hit:6 http://download.proxmox.com/debian/pve bookworm InRelease
Reading package lists... Done
root@prox-1:~# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
root@prox-1:~#

This was not the case before the command-line apt-get update and dist-upgrade processes were carried out. I was under the impression that the nightly process on the cluster took care of this. Obviously not.

However, the issue of the failing noVNC console sessions is still blocking Ubuntu 22.04.3 VMs from being installed on the last three nodes, which were built from the PVE 8.0.2 ISO.

This is the only snag I have not been able to resolve!


All nodes are now at PVE 8.0.9
 
Hi TimRyan,

Obscure troubles!
This was not the case before the command-line apt-get update and dist-upgrade processes were carried out. I was under the impression that the nightly process on the cluster took care of this. Obviously not.
Which process do you refer to?
 
I was under the impression, after reading many of the syslogs, that there was an apt update process on each node at midnight every 24-hour cycle, and that this kept everything updated.

I still have to solve the inability of the three later nodes to allow access to the noVNC console when new VMs are created.
 
an apt update process on each node at midnight every 24-hour cycle, and that this kept everything updated.
Ah, then it is good to know:
  • apt update pulls the latest package definitions; after this the apt package database is up-to-date (this way the GUI can show you upgrades are pending)
  • apt upgrade pulls the actual changes; after this your system is up-to-date
Automatic application of security upgrades can be configured in apt. You could run apt upgrade from a cron job, but since not all updates come without functional changes, implementing such a job is not common.
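
For completeness, a minimal sketch of the usual Debian mechanism for automatic security updates, the unattended-upgrades package (the origin pattern shown is my recollection of the stock bookworm default, so treat it as an assumption and check /usr/share/doc/unattended-upgrades on your system; note it would not cover the Proxmox repositories by default):

apt-get install unattended-upgrades

# /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";

# /etc/apt/apt.conf.d/50unattended-upgrades (excerpt)
Unattended-Upgrade::Origins-Pattern {
        "origin=Debian,codename=${distro_codename}-security,label=Debian-Security";
};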
 
@wbk: Did you mean apt dist-upgrade instead of apt upgrade?

You should never use apt upgrade with Proxmox products. This won't upgrade packages if another package needs to be removed. This can result in damaged installations (incompatible package versions). Some time ago I killed a cluster with this...
Use apt dist-upgrade instead.
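
To see the difference without touching anything, both commands accept apt's simulation flag; upgrade leaves conflicting packages behind as "not upgraded", while dist-upgrade is allowed to install or remove packages to satisfy the new dependencies (a sketch, not output from this cluster):

apt-get update
apt-get upgrade -s        # simulation: packages it cannot handle show up as "... not upgraded"
apt-get dist-upgrade -s   # simulation: includes the removals and new installs it would perform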
 
Hi Azunai,

Did you mean apt dist-upgrade instead of apt upgrade?
I did mean apt upgrade

Some time ago I killed a cluster with this
Thank you for the heads-up!

I am quite surprised a regular apt upgrade, with no forcing, would break an installation. What I could imagine, in a cluster, is that one node has some specific packages installed that are not installed on another, and that these specific packages prevented an upgrade of a component critical for the workings of the cluster while it was upgraded on the other nodes.

Did you ever look into the cause of the mismatch?

For systems managed by myself, I hardly ever use dist-upgrade outside of Debian version upgrades.
 
What I could imagine, in a cluster, is that one node has some specific packages installed that are not installed on another, and that these specific packages prevented an upgrade of a component critical for the workings of the cluster while it was upgraded on the other nodes.
That could have been the case.

Did you ever look into the cause of the mismatch?
No, unfortunately not. I remember "/etc/pve" not being mounted, so many other dependent services didn't start correctly.

Nowadays I try not to install extra/special software directly on the host but in LXCs. Well, almost, as we still use htop, for example. This way the host stays clean and has no problems with updates anymore. And backup/restore is easier, too.

And now we use Ansible to roll out updates and configurations (for quicker node recovery/reinstall), so it makes no difference to us that Ansible executes the longer command, as we don't have to type it in anymore. Bye bye snowflakes ;)
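
If it helps anyone, rolling that out can be as small as an ad-hoc call to Ansible's apt module (a sketch only; the inventory group name "pve" is an assumption, and it presumes SSH access from the control machine to every node):

ansible pve -m ansible.builtin.apt -a "update_cache=yes upgrade=dist" -b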
 
For the record, I have resolved the problem by using the latest PVE 8.1 ISO release. I have eliminated all three servers built from the early 8.0.2 ISO, reinstalled them from the latest 8.1 ISO, and the problem is GONE!!!