VM noVNC failures traceable to machines built from 8.1-1 November 23 ISO

@TimRyan
do i understand correctly that it still does not work?

i'd make sure that on every node the file /etc/ssh/ssh_known_hosts is a symlink to /etc/pve/priv/known_hosts
e.g. with
Code:
ln -s /etc/pve/priv/known_hosts /etc/ssh/ssh_known_hosts

(you might need to remove the original file first)
and then update the known_hosts file by executing
Code:
pvecm updatecerts
on every node

this should fix the known_hosts file and allow proper ssh tunneling again
if in the future you remove a node from the cluster, please do as the documentation says:

After removal of the node, its SSH fingerprint will still reside in the known_hosts of the other nodes. If you receive an SSH error after rejoining a node with the same IP or hostname, run pvecm updatecerts once on the re-added node to update its fingerprint cluster wide.
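For anyone who wants to check before blindly running ln -s, the symlink step from this post can be sketched as a small shell function. This is only a sketch, not anything shipped with Proxmox: the function name ensure_symlink and the .bak backup naming are my own assumptions, and the two paths are simply the ones named above.

```shell
#!/bin/sh
# Sketch (hypothetical helper, not a Proxmox tool): make sure a path is a
# symlink to the expected target, keeping a backup of whatever was there.

ensure_symlink() {
    target=$1   # file the link should point at
    link=$2     # path that should be the symlink

    # Already a symlink to the right place: nothing to do.
    if [ -L "$link" ] && [ "$(readlink "$link")" = "$target" ]; then
        echo "ok: $link -> $target"
        return 0
    fi

    # Move aside whatever is there (regular file or wrong link) before replacing it.
    if [ -e "$link" ] || [ -L "$link" ]; then
        mv "$link" "$link.bak.$$"
        echo "moved old $link to $link.bak.$$"
    fi

    ln -s "$target" "$link"
    echo "linked: $link -> $target"
}

# Usage on each node (as root), followed by the command from the post:
#   ensure_symlink /etc/pve/priv/known_hosts /etc/ssh/ssh_known_hosts
#   pvecm updatecerts
```

Moving the old file aside instead of deleting it means you can still compare it with the cluster-wide file afterwards if ssh keeps complaining.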
 
The problem was a messy and confused set of ssh keys, compounded by the pattern of upgrades and my lack of understanding of the mechanisms used internally at the cluster level. When you create confused garbage and it is replicated all over the cluster, it becomes obvious that you have to clean out the existing rubbish to restore order. It's a tough way to learn and a lot of lost time, but now that I know how the cluster host and key replication works, even if I break it again I will know how to fix it.
 
Let's face it, the construction and "logical operational plumbing" of a cluster is anything but simple. When that is compounded by the complexities of key-encrypted communications amongst the pile of components in a well-populated cluster, maintaining secure and reliable functionality is hard work.

I have read almost all of the documentation once, and the challenge is getting a really good picture of how all the pieces fit together and interact, compounded by the continuous stream of updates and changes. I have been hacking at this platform for about 8 months now, from the PVE 6 to 7 transition, which is where I started, to PVE 8 today. I have seen a number of complications, including this last one, which was triggered by my expansion build-out coinciding with PVE 8. Now that I fully understand how the host node and all the others interact as far as communications and secure sessions go, I am pretty sure that it's a viable platform. The tests I need to do now to satisfy myself of the platform's viability are these:

How easy is it to secure while using it as a reverse-proxy routed server platform, to maximize static IP efficiency with Caddy 2.7.6?

Where might there be security issues with segmented user groups using parts of a single physical site cluster?

What will have to be done to make this viable as a bridged cluster with physically separated sites on secured fiber backbones? Will it still be manageable from a single location?

Once I have dealt with these issues it will be ready for prime time and commercial service offerings. When I get there I will write up the build and system so others can benefit from my effort. I think I have what I need for my proof-of-concept first site.
 
I just find it incredible that you were advised the very same thing that is in the docs, which is the very same thing that got you into this situation in the first place. Because it got too complex to reproduce, this so-called advice (with the implication that you must have done something wrong) will live on.

The reason you were getting those errors is that pvecm updatecerts corrupts the known_hosts file. Now of course, if you remove those files entirely and then run pvecm updatecerts, there's nothing to corrupt, so it just adds the missing entries and everything starts working. But that's the very reason why threads like yours appear and reappear, and have kept doing so for 10 years.
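What exactly the corrupted file looks like isn't spelled out in this thread. One thing worth checking, purely as an assumption on my part, is whether the same host shows up with two different keys of the same type. A rough diagnostic sketch follows; conflicting_hosts is a made-up helper name, and it only handles plain (unhashed) known_hosts lines.

```shell
#!/bin/sh
# Sketch (hypothetical helper): list "host keytype" pairs that appear in a
# known_hosts file with more than one distinct key.
# Assumes plain lines of the form: host[,host...] keytype base64key [comment]

conflicting_hosts() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        $1 ~ /^[@|]/ { next }           # skip @markers and hashed entries (not handled)
        {
            n = split($1, names, ",")
            for (i = 1; i <= n; i++) {
                id = names[i] " " $2
                if (id in seen) {
                    # Same host and key type seen before with a different key.
                    if (seen[id] != $3 && !(id in reported)) {
                        print id
                        reported[id] = 1
                    }
                } else {
                    seen[id] = $3
                }
            }
        }
    ' "$1"
}

# Usage:
#   conflicting_hosts /etc/pve/priv/known_hosts
```

An empty result only means there are no conflicting plain entries; it says nothing about hashed entries or other kinds of damage.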

Meanwhile, filed bug reports get ignored, available patches get ignored, and staff are (some knowingly, Dominik probably not) cajoling unsuspecting users into believing that all is well (or that something was wrong between the keyboard and the screen), while otherwise doing a good PR job. All this instead of editing 5 lines of code for everyone. It is a pretty scary realisation of how the PR effort works.
 
