Failing node in a two-node cluster, I can't find anything

m-electronics

Jan 12, 2022
Hello Proxmox community,

I looked at my logs because one host of a 2-node cluster (with an external quorum device) keeps becoming unreachable at irregular intervals.
There I found this line in the syslog (journalctl):
Code:
got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=dell' -o 'UserKnownHostsFile=/etc/pve/nodes/dell/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.88.254 pvecm mtunnel -migration_network 192.168.34.10/25 -get_migration_ip' failed: exit code 255

Then I looked at other VMs and CTs, and there it is working fine. What does exit code 255 mean? Is it a known problem or bug?

I hope the community can help me here :)
Answers in German are also welcome.

UPDATE: Maybe the replication errors occurred because I changed the sshd_config a few months ago to disable ssh-rsa, while there were still old ssh-rsa host keys in the /etc/pve/nodes/<nodename>/ssh_known_hosts files...

But the original problem that made me start troubleshooting is that one node keeps failing again and again. I don't see any messages in the syslog that look responsible for that. The memory is working properly; I ran memtest a few weeks ago, after an earlier outage...

I hope you have some other ideas :)
 
Hi,

To help you further, could you please provide the following:
  • Output of pveversion -v
  • Output of cat /etc/pve/corosync.conf (please redact any sensitive information like IPs or hostnames if needed)
  • Your task logs for the failed replication job: pvesh get /nodes/NODE/tasks --errors 1
  • Journal logs from around the time the node lost connection (again, redact any information you do not want to share)
Regarding disabling ssh-rsa: the old ssh-rsa host keys in /etc/pve/nodes/NODE/ssh_known_hosts would explain the replication errors. Running:

Code:
pvecm updatecerts --force
could fix further issues.
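
To check whether stale ssh-rsa entries are still present before or after running the command above, something like the following might help. It is only a rough sketch: the function name list_key_types is made up for illustration, and it assumes the usual known_hosts layout (optional @marker, host names, key type, key).

```shell
#!/bin/sh
# Print the distinct host key types recorded in a known_hosts file.
# Assumed line format: "[@marker] hostnames keytype base64-key [comment]".
# list_key_types is a hypothetical helper name, not a Proxmox tool.
list_key_types() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }   # skip blank lines and comments
        $1 ~ /^@/ { print $3; next }    # marker line: key type is field 3
        { print $2 }                    # normal line: key type is field 2
    ' "$1" | sort -u
}

# Example (path taken from the log line earlier in this thread):
# list_key_types /etc/pve/nodes/dell/ssh_known_hosts
```

If ssh-rsa still shows up in the output for a node whose sshd no longer offers that key type, the known_hosts file is probably out of date.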


If there is still a problem with your SSH connection, run this from the other node and check whether anything looks suspicious:
Code:
ssh -vvv -o 'BatchMode=yes' -o 'HostKeyAlias=dell' -o 'UserKnownHostsFile=/etc/pve/nodes/dell/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.88.254 echo ok
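
On the exit code 255 question: according to the ssh(1) man page, ssh returns the exit status of the remote command, or 255 if an error occurred in ssh itself. A small helper along these lines could make that visible when testing. This is a sketch only: explain_ssh_exit is a hypothetical name, and in rare cases a remote command could also return 255 on its own.

```shell
#!/bin/sh
# Turn an ssh exit status into a short hint.
# ssh(1): exit status is that of the remote command, or 255 on an ssh error.
# explain_ssh_exit is a hypothetical helper for illustration.
explain_ssh_exit() {
    case "$1" in
        0)   echo "ok: connection and remote command succeeded" ;;
        255) echo "ssh error: connection, authentication or host key check likely failed" ;;
        *)   echo "remote command failed with exit code $1" ;;
    esac
}

# Example usage with the test command from this thread:
# ssh -o 'BatchMode=yes' -o 'HostKeyAlias=dell' \
#     -o 'UserKnownHostsFile=/etc/pve/nodes/dell/ssh_known_hosts' \
#     -o 'GlobalKnownHostsFile=none' root@192.168.88.254 echo ok
# explain_ssh_exit "$?"
```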

What exactly was the purpose of disabling ssh-rsa?