Random Replication Job: '110-0' failed then it doesn't then it does then it doesn't...

drjaymz@

I have slowly been seeing more and more replication job failed errors. By the time I get to them, the problem has gone and I cannot find a sensible log to work out what the cause was. I would suggest the configuration isn't wrong per se, otherwise it wouldn't work most of the time. I get a bunch of emails for various guest containers or VMs around the same time.

My replication runs over a dedicated network, shown below on the 200 subnet. It's simply a switch with 6 machines (5 PVE and 1 PBS) and that is it.
I might have assumed it was a dodgy switch, but it's not just that one site; it's starting to affect two other sites.

Code:
Replication job '110-0' with target 'proxmoxmon1' and schedule '09/15' failed!

Last successful sync: 2025-03-17 08:24:52
Next sync try: 2025-03-17 08:44:00
Failure count: 1

Error:
command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmoxmon1' -o 'UserKnownHostsFile=/etc/pve/nodes/proxmoxmon1/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.200.6 -- pvesr prepare-local-job 110-0 --scan local-zfs local-zfs:subvol-110-disk-0 --last_sync 1742199892' failed: exit code 255
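Presumably the next step when it fails again is to re-run the same SSH command by hand with verbose output, to see whether it's the TCP connection, the host key check or the remote side that breaks. This is just a sketch: it uses the same options as the failing command above, but swaps the remote pvesr call for a harmless 'true' so it doesn't interfere with the real job.

Code:
# From the source node of job '110-0': same SSH options pvesr uses, with -v for
# verbose handshake output, and a harmless 'true' instead of the real pvesr call.
/usr/bin/ssh -v -e none -o 'BatchMode=yes' -o 'HostKeyAlias=proxmoxmon1' \
  -o 'UserKnownHostsFile=/etc/pve/nodes/proxmoxmon1/ssh_known_hosts' \
  -o 'GlobalKnownHostsFile=none' root@192.168.200.6 -- true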

I changed the replication schedules to spread them out, but that didn't make any difference.
I checked all the guests that had a problem, and by the time I get to them they are working again, with replication taking just a second or two, so as far as I can see the jobs are not running into each other.
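For anyone wanting to check the same thing, the per-job state can be listed with the standard CLI on each node (a quick sketch; on my nodes this shows last sync, duration and fail count):

Code:
# List configured replication jobs and show their current state.
pvesr list
pvesr status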

I'm running 8.3.2 and they were updated not that long ago.

What I'm asking for is: how do I find out what the problem is?

Possibly related to https://forum.proxmox.com/threads/random-zfs-replication-errors.82486/
Same symptom, but it doesn't seem applicable. It doesn't help that the logs are not persistent.
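One workaround for the non-persistent logs would presumably be to switch journald to persistent storage, so the next failure leaves something to look at after the fact (a sketch, assuming the default journald setup):

Code:
# Keep the systemd journal across reboots so logs around a failed run survive.
mkdir -p /var/log/journal
systemctl restart systemd-journald

# Later, pull everything logged around a failure window, e.g. the 08:44 retry:
journalctl --since "2025-03-17 08:40" --until "2025-03-17 08:50"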
 
Restarted the two nodes that seemed to be having issues the most while upgrading to 8.3.5, and the issue hasn't resurfaced yet. Will report back in a few days. This means I don't really have an answer for the cause.
 
Hello, I have had this type of problem. In my case it was openssh-server that stopped serving SSH client connections as a security measure. Maybe you can look at the logs to see whether the SSH server is complaining?
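If that is what's happening, it normally shows up in the sshd logs as connections being dropped or MaxStartups throttling. A rough way to check (the exact log wording varies between OpenSSH versions):

Code:
# Look for sshd refusing or dropping connections, e.g. MaxStartups throttling.
journalctl -u ssh | grep -Ei 'maxstartups|refused|drop'

# Show the effective limit on concurrent unauthenticated connections.
sshd -T | grep -i maxstartups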
 
I have had a look. I did see one or two of these:

Code:
Mar 09 07:15:09 proxmoxmon3 sshd[2062129]: fatal: ssh_packet_send_debug: send DEBUG: Connection reset by peer
Mar 10 08:30:13 proxmoxmon3 sshd[335445]: fatal: ssh_packet_send_debug: send DEBUG: Connection reset by peer
Mar 11 21:45:06 proxmoxmon3 sshd[4001944]: fatal: ssh_packet_send_debug: send DEBUG: Connection reset by peer
Mar 13 00:15:09 proxmoxmon3 sshd[2446943]: fatal: ssh_packet_send_debug: send DEBUG: Connection reset by peer
Mar 16 16:18:15 proxmoxmon3 sshd[3172737]: fatal: ssh_packet_send_debug: send DEBUG: Connection reset by peer

These don't actually coincide with the replication errors and might be unrelated. It does look like a networking problem, but one that cleared itself when the server was rebooted. 30 hours further on, it has not occurred again.
 
Yes, there were no interface problems. Occasionally, when it has gone bonkers, you can see the interface go down and up; in this case none of the interfaces ever changed state. Again, the physical network isn't to blame, rather something in the Proxmox network stack itself, because nothing was changed. In fact I haven't even physically been to the servers, just rebooted them, and then everything was fine.

This cluster has 5 servers with a dedicated replication network and subnet, and node 4 died just before we had issues; it looks like a board fault. The system fenced the 4 VMs and tried to move them to node 1, which failed because node 1 didn't have the base image (despite the fact that it did). With node 4 out of the picture I found that node 3 had a full set of replicated images and I brought them back online, so we were in business just a few moments after the fault, but HA failover itself had failed. Then the VMs I had brought up on node 3 had trouble replicating intermittently. This is what was fixed by rebooting nodes 1 and 3.
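For reference, a quick way to see which node actually holds a usable replica of a guest (a sketch; rpool/data is the default dataset for local-zfs, adjust if yours differs) is to look for the guest's datasets and replication snapshots on each node:

Code:
# On each node, list datasets for guest 110 and their snapshots;
# replication snapshots are named like __replicate_110-0_<timestamp>__.
zfs list -r rpool/data | grep 110
zfs list -t snapshot -o name,creation | grep __replicate_110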

My theory, the evidence for which is probably lost by now, is that the failure of node 4 left connections stale and that somewhere there is a connection limit it kept hitting. This is obviously cleared by a restart, although if I knew exactly what it was, I could probably have just restarted the relevant service. Your suggestion that it was SSH was a good one; it's probably something like that.
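If it happens again, a less drastic check than a reboot (just a sketch based on that theory, not something I've verified) would be to count the SSH sessions on the replication subnet and, if they look stuck, restart the services involved rather than the whole node. As far as I understand, pvescheduler is what runs the replication jobs on current PVE versions.

Code:
# Count established SSH connections involving the replication network.
ss -tn state established '( sport = :22 or dport = :22 )' | grep 192.168.200 | wc -l

# If connections look stale, restart sshd and the replication scheduler
# instead of rebooting the whole node.
systemctl restart ssh pvescheduler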