Proxmox cluster ssh not working between some nodes

dassic · May 3, 2024

Hi!

I ran into an odd issue where 3 proxmox nodes (out of a cluster of 11) cannot ssh to each other on the primary interface. On the secondary interface all works. All other nodes have no issues.

When trying on the 3 nodes with issues, ssh gets stuck at this point (using verbose debugging) :

debug2: local client KEXINIT proposal
debug2: KEX algorithms: sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,ext-info-c,kex-strict-c-v00@openssh.com
debug2: host key algorithms: ssh-ed25519-cert-v01@openssh.com,ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,sk-ssh-ed25519-cert-v01@openssh.com,sk-ecdsa-sha2-nistp256-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-ed25519,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ssh-ed25519@openssh.com,sk-ecdsa-sha2-nistp256@openssh.com,rsa-sha2-512,rsa-sha2-256
debug2: ciphers ctos: aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,chacha20-poly1305@openssh.com
debug2: ciphers stoc: aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,chacha20-poly1305@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com,zlib
debug2: compression stoc: none,zlib@openssh.com,zlib
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,kex-strict-s-v00@openssh.com
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com
debug2: compression stoc: none,zlib@openssh.com
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug3: kex_choose_conf: will use strict KEX ordering
debug1: kex: algorithm: sntrup761x25519-sha512@openssh.com
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
debug1: kex: client->server cipher: aes128-ctr MAC: umac-64-etm@openssh.com compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

I ran pvecm updatecert -f from one of the nodes where all works. It had no impact.

I checked that /etc/ssh/ssh_known_hosts links to /etc/pve/priv/known_hosts on all nodes, so all are using the same (cluster) file.
I checked that /root/.ssh/authorized_keys links to /etc/pve/priv/authorized_keys on all nodes, so all are using the same (cluster) file.

I see that known_hosts in /root/.ssh does not link to /etc/pve/priv, is that correct ? (it doesn't on the nodes that works either, but I wonder if this is causing the trouble and I need to either sync or wipe & re-create these files on at least the troubled nodes).

Any thoughts ?

All nodes have MTU at 9000 and the switch they use is set to 9214 (standard on Arista). I checked this specifically as I've seen this appears to be a common root cause.
Also interesting is that if I reboot every node in the cluster, I may end up with this issue on other nodes, but it always ends up appearing on at least one node and sometimes more and only on the first interface.
The interfaces are all LACP bonds.

kohl42 · May 14, 2024

I realized I had the same issue today, a week after migrating both nodes to Proxmox 8.

My guess is that cipher aes128-ctr is not supported anymore, though I could not find much info about that change (haven't searched extensively though).

Easiest workaround is to to use a more secure cipher, like aes192-ctr.

Code:

ssh -c aes192-ctr <other_node>

kohl42 · May 14, 2024

Nope, my bad, I still get random timeouts upon connection.

dassic · May 14, 2024

So I did some more testing and it appears the trigger in my case is related to setting the MTU to 9000.

In theory this should work as the switches are all configured for this and it works for other devices (non-proxmox), but with the Proxmox servers (running 8.2) I get these random SSH issues....and other things in Proxmox (OSDs going randomly down, parallel backups causing the entire cluster to become unstable etc. etc.).

My setup is with 2 NICs in a 802.3ad LACP bond across Arista 2 switches in MLAG. Then an bridge with the IP address. Both interfaces, bond & bridge are set to MTU 9000.
All parameters related to MLAG & LACP look OK.

As soon as I delete the MTU size on all cluster nodes and go back to the default (1500) the problem goes away.

I have a 2nd redundant link for all nodes that always ran at 1500 MTU, but also with a 802.3ad bond & bridge across Arista MLAG and this link always worked.

So my conclusion is that my specific issue relates to jumbo frames not working properly.

noacess · Jun 2, 2024

Anyone have any ideas on this? I tried moving my cluster network to a standalone NIC with jumbo frames enabled (MTU 9000) to speed up migration speeds and ran into the same problem. Everything works fine if the dedicated interface has a MTU of 1500.

levifig · Jun 14, 2024

noacess said:
Anyone have any ideas on this? I tried moving my cluster network to a standalone NIC with jumbo frames enabled (MTU 9000) to speed up migration speeds and ran into the same problem. Everything works fine if the dedicated interface has a MTU of 1500.

I had the same thing happen to me when I started moving my nodes to MTU 9000.

I tried the temporary workaround of adding -o KexAlgorithms=ecdh-sha2-nistp521 to the SSH command and it worked and I was about to add that to my ssh_config, but it didn't "feel right" because other nodes have no trouble accessing each other and don't have any additional cyphers.

Then I came across your post and noticed the similiarities of the MTU changes. I went back through all my nodes and, lo and behold, the "problematic" node didn't have the MTU set on the physical NIC, just had it on the bridge... I added the MTU to the physical NIC, systemctl restart networking and it all started working... :X

Hope this helps you or anyone else going through similar troubles...

--LF

Search

Search

Proxmox cluster ssh not working between some nodes

dassic

New Member

kohl42

Active Member

kohl42

Active Member

dassic

New Member

noacess

New Member

levifig

Active Member

We value your privacy