[SOLVED] [BUG][SSH] Lock yourself out of a node - PermitRootLogin prohibit-password

esi_y

Renowned Member
Nov 29, 2023
It must be my destiny to find all the SSH-related bugs in the PVE stack first-hand [1abcd]. Here's a new one!

Suppose you have done the reasonable thing and ssh-copy-id'd your keys into root's authorized_keys on your nodes. This is actually only necessary on one node, as the information is shared across the cluster; after all, "root on any node is already effectively root on every node" [1e]. Well, that's convenient.

Then you have done the next reasonable thing and set PermitRootLogin prohibit-password in /etc/ssh/sshd_config, as you should. Now all is well until, at some point, one of your nodes shows as offline. Everything running on it is still up and accessible, except the node itself: no SSH, no GUI.
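For reference, the relevant fragment (a minimal sketch; assumes a stock Debian-based PVE node):

Code:
# /etc/ssh/sshd_config -- allow root logins, but never with a password
PermitRootLogin prohibit-password

After editing, systemctl reload ssh applies it without dropping existing sessions.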

(If you have instead left root able to authenticate by password, it will lock you out of the GUI "only". Relay-connecting via another node confusingly fails with root@node: Permission denied (publickey,password), yet you can still SSH in directly; you are just suddenly asked for a password.)

Well, as a matter of that convenience, all is great as long as the symlink /root/.ssh/authorized_keys -> /etc/pve/priv/authorized_keys on each node can be followed. But /etc/pve [2] is a virtual filesystem mounted at runtime only (and read-only if inquorate); more precisely, it is mounted while pve-cluster.service is running. Should the service fail, the symlink points nowhere, and as far as sshd is concerned your authorized_keys do not exist.
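You can see the fragility for yourself on any node (a sketch; exact output will vary):

Code:
# the symlink target lives inside the pmxcfs FUSE mount:
ls -l /root/.ssh/authorized_keys
# ... /root/.ssh/authorized_keys -> /etc/pve/priv/authorized_keys

# it resolves only while pve-cluster.service keeps /etc/pve mounted:
readlink -e /root/.ssh/authorized_keys || echo "dangling - sshd sees no keys"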

So what can you do to prevent this? The quickest (non-systematic) way is:

Code:
cp /etc/pve/priv/authorized_keys ~/.ssh/authorized_keys2

On every single node. Yes, authorized_keys2 was never meant for this (it is a legacy leftover that sshd nevertheless still reads by default), but it works. Also, you should probably have your own access infrastructure set up, maybe even an extra user. If you go all the way and create e.g. a service that regularly copies the content over (see the sketch below), you would actually be able to connect even via the GUI, except, counter-intuitively, ONLY from another node.
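For illustration, a minimal sketch of such a sync with a systemd timer; the pve-authkeys-sync.* unit names are made up, and you may want a different interval:

Code:
# /etc/systemd/system/pve-authkeys-sync.service
[Unit]
Description=Copy cluster authorized_keys to a local fallback
# skip quietly whenever /etc/pve is not mounted:
ConditionPathExists=/etc/pve/priv/authorized_keys

[Service]
Type=oneshot
ExecStart=/usr/bin/install -m 600 /etc/pve/priv/authorized_keys /root/.ssh/authorized_keys2

# /etc/systemd/system/pve-authkeys-sync.timer
[Unit]
Description=Run the authorized_keys sync periodically

[Timer]
OnBootSec=5min
OnUnitActiveSec=1h

[Install]
WantedBy=timers.target

Enable it with systemctl enable --now pve-authkeys-sync.timer.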

Hope this helps someone.




[1a] https://bugzilla.proxmox.com/show_bug.cgi?id=4252
[1b] https://bugzilla.proxmox.com/show_bug.cgi?id=4670
[1c] https://bugzilla.proxmox.com/show_bug.cgi?id=4886
[1d] https://bugzilla.proxmox.com/show_bug.cgi?id=5174
[1e] https://bugzilla.proxmox.com/show_bug.cgi?id=5060#c1
[2] https://pve.proxmox.com/wiki/Proxmox_Cluster_File_System_(pmxcfs)
 
I just realised only today, although it's quite obvious (I am not that much of a fan of the GUI), that the above (having authorized_keys as a symlink into /etc/pve) is actually ALSO the ONLY reason why:

- when a node has any issue starting up pve-cluster (and thus cannot mount /etc/pve), its SHELL IS ALSO INACCESSIBLE from the GUI of ANOTHER (healthy) node;
- this holds even when there is no problem with SSH itself, because the keys (inaccessible on the target node) are needed for the auto-login used for relaying (sketched below), e.g. to guest VMs, etc.
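To illustrate the second point: the GUI relay is, in effect, non-interactive SSH from the healthy node, roughly like this (a simplification; the actual invocation differs):

Code:
# BatchMode forbids password prompts, so missing keys on the target
# fail immediately instead of falling back to a password prompt:
ssh -o BatchMode=yes -t root@<target-node>
# root@<target-node>: Permission denied (publickey,password).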


I find it quite disturbing that this has been the design from the very beginning, considering one could often troubleshoot such a node right from the shell in the GUI.

Back when the other SSH bugs (#4252, #4886 above) were addressed regarding the host keys, the client keys were left intact; coming back to the discussion [1], the idea was to split them off but RETAIN them in /etc/pve.

I remember there was no love for my SSH certificates out of the box (the only professional way to actually do this), but I thought I would point this one out here, especially as THIS bug (above) remains completely unattended. Any chance to reconsider this, @fabian?


[1] https://lore.proxmox.com/pve-devel/20231221095313.156390-1-f.gruenbichler@proxmox.com/
 
no, we still want to get rid of SSH for internal usage (or split it off from the default instance), without further complicating the setup using certificates.
 

Thanks for the reply, I think I did not phrase it too well...

we still want to get rid of SSH for internal usage (or split it off from the default instance),

(I got this part (in the long run), but it does not even seem to be on the roadmap.)

without further complicating the setup using certificates.

I got your take on the certificates already from the host-keys discussion. I meant the bug related to authorized_keys, i.e. the client keys. The issue above seems more straightforward to fix than the host keys were: the file simply should not be a symlink.

Option 1: Since any compromised host already compromises the entire cluster (derived from the statement above), the cluster might as well use the same client key all the time, passed on the command line (-i) by the (already fixed) helper (see the sketch after these options). Adding a new node to the cluster would then mean adding this client key to its local authorized_keys file.

Option 2: If you absolutely want to preserve all the individual keys (you are not removing stale ones today anyway), the file could also be a watched file (like corosync.conf), updated in place on every machine instead of symlinked.

Option 3: Allow pmxcfs to start with an unresolved hostname and keep it read-only, with local -> pointing correctly to the common files.
EDIT: I take this back; the FUSE mount might fail to start for other reasons (e.g. a corrupt DB) anyway.
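A sketch of what Option 1 could look like; the key path below is hypothetical:

Code:
# hypothetical: one cluster-wide client key, stored locally on every node
ssh -i /root/.ssh/pve_cluster_key root@other-node

# joining a node would then mean appending the public half to that
# node's LOCAL (non-symlinked) /root/.ssh/authorized_keys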

Anything is better than the current state of symlinking these files into a filesystem that might not be mounted.

EDIT2: I added my further points to the BZ ticket. Thanks for taking care of it.
 
