[SOLVED] CHR Mikrotik HA replication failed

Shyciii

New Member
Apr 7, 2025
Hello. I have a problem with a Proxmox (version 7.4.17) cluster of 2 members. It runs over 10 VMs plus a CHR Mikrotik that provides internet access for the VMs. Replication of the VMs works fine, but replication of the CHR Mikrotik runs into an error. When I bring it up under the Replication menu, it throws the following error (which is also written to the journalctl log):

Code:
2025-04-07 13:08:29 110-0: start replication job
2025-04-07 13:08:29 110-0: guest => VM 110, running => 2341709
2025-04-07 13:08:29 110-0: volumes => ssd:base-106-disk-0/vm-110-disk-0
2025-04-07 13:08:30 110-0: create snapshot '__replicate_110-0_1744024109__' on ssd:base-106-disk-0/vm-110-disk-0
2025-04-07 13:08:30 110-0: using secure transmission, rate limit: none
2025-04-07 13:08:30 110-0: full sync 'ssd:base-106-disk-0/vm-110-disk-0' (__replicate_110-0_1744024109__)
2025-04-07 13:08:31 110-0: full send of ssd/vm-110-disk-0@__replicate_110-0_1744024109__ estimated size is 135M
2025-04-07 13:08:31 110-0: total estimated size is 135M
2025-04-07 13:08:32 110-0: cannot receive: local origin for clone ssd/vm-110-disk-0@__replicate_110-0_1744024109__ does not exist
2025-04-07 13:08:32 110-0: cannot open 'ssd/vm-110-disk-0': dataset does not exist
2025-04-07 13:08:32 110-0: command 'zfs recv -F -- ssd/vm-110-disk-0' failed: exit code 1
2025-04-07 13:08:32 110-0: warning: cannot send 'ssd/vm-110-disk-0@__replicate_110-0_1744024109__': signal received
2025-04-07 13:08:32 110-0: cannot send 'ssd/vm-110-disk-0': I/O error
2025-04-07 13:08:32 110-0: command 'zfs send -Rpv -- ssd/vm-110-disk-0@__replicate_110-0_1744024109__' failed: exit code 1
2025-04-07 13:08:32 110-0: delete previous replication snapshot '__replicate_110-0_1744024109__' on ssd:base-106-disk-0/vm-110-disk-0
2025-04-07 13:08:32 110-0: end replication job with error: command 'set -o pipefail && pvesm export ssd:base-106-disk-0/vm-110-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_110-0_1744024109__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cluster01' root@172.16.1.1 -- pvesm import ssd:base-106-disk-0/vm-110-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_110-0_1744024109__ -allow-rename 0' failed: exit code 1
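
For reference, the datasets can be compared on the two nodes like this (only a sketch, reusing the pool name ssd and the target address 172.16.1.1 from the log above):

Code:
# On the source node: list the VM's dataset and any base-106 volume it depends on
zfs list -t all -o name,origin -r ssd | grep -E 'vm-110|base-106'

# On the target node: check whether either dataset exists there at all
ssh root@172.16.1.1 "zfs list -t all -o name,origin -r ssd | grep -E 'vm-110|base-106'"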

The VM's config file (/etc/pve/qemu-server/110.conf):

Code:
balloon: 0
boot: order=scsi0;ide2;net0
cores: 4
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=7.2.0,ctime=1698644063
name: 172.16.1.220-CHR
net0: virtio=9E:55:A7:3B:54:32,bridge=vmbr1,firewall=1
net1: virtio=06:AE:AD:55:DA:56,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: ssd:base-106-disk-0/vm-110-disk-0,cache=writethrough,iothread=1,size=2G
scsihw: virtio-scsi-single
smbios1: uuid=d1daddb1-4062-4854-86a5-6589975e9d2a
sockets: 1
vmgenid: f9c01868-093b-48e1-9225-9f9f1f720e4f

Data under the Options menu:

1744025315169.png

What could be the problem?
 
This is something that's interesting to me:

Code:
2025-04-07 13:08:30 110-0: full sync 'ssd:base-106-disk-0/vm-110-disk-0' (__replicate_110-0_1744024109__)
2025-04-07 13:08:31 110-0: full send of ssd/vm-110-disk-0@__replicate_110-0_1744024109__ estimated size is 135M
2025-04-07 13:08:32 110-0: cannot receive: local origin for clone ssd/vm-110-disk-0@__replicate_110-0_1744024109__ does not exist
2025-04-07 13:08:32 110-0: cannot open 'ssd/vm-110-disk-0': dataset does not exist

ssd:base-106-disk-0/vm-110-disk-0 vs ssd/vm-110-disk-0@...: dataset does not exist

I'm pretty sure I've seen `base-` prepended before - I think that happens when you create the VM as a linked clone from a template.
Does the template still exist?
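
To check that, ZFS can tell you directly whether the disk is a linked clone and what it depends on. A minimal sketch, assuming the pool is called ssd as in your config:

Code:
# Show the clone's origin snapshot - for a linked clone this should point at a base-106 snapshot
zfs get -H -o value origin ssd/vm-110-disk-0

# Check whether that base volume (the template's disk) still exists locally
zfs list -t all | grep base-106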

Regardless, here's something I'm pretty sure will get things back in sync:
  1. Remove the failing replication job (at the Datacenter, Node, or VM level) - a CLI sketch follows below this list.
  2. If you can't remove it there, or it throws an error, remove it from /etc/pve/replication.cfg directly.
  3. Do a quick sanity check that the Datacenter storage pool you want is available to both nodes
    (though it sounds like you only have one, and the other VMs are replicating fine).
    While you're in there, you may also want to confirm that "Thin provision" is checked, if you want that.
  4. If you're okay with the VM growing in size once it's unlinked from the base, live-migrate its disk to a different storage temporarily
    (if the QEMU guest agent and discard are enabled, it can be thinned again as part of the migration,
    or later with sudo fstrim / inside the guest).
  5. Live-migrate it back to the ZFS storage you use for replication.
  6. Re-add the replication job.
At that point the disk image should no longer contain a reference to the base volume, and replication should re-create the volume on the target node without any issue.

This is not a risky procedure, but just as a matter of course you may also want to create a backup before doing the disk migration since you don't have replication working presently.
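
If the GUI won't let you remove or re-add the job, the same steps work from the shell. A minimal sketch - the job ID 110-0 comes from your log, while pve2 and the 15-minute schedule are just placeholders to adjust:

Code:
# List the configured replication jobs and remove the failing one
pvesr list
pvesr delete 110-0

# ...move the disk off the linked-clone volume and back (steps 4-5 above)...

# Re-create the job towards the other node and watch its status
pvesr create-local-job 110-0 pve2 --schedule "*/15"
pvesr status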

Sanity check that the volume exists everywhere you expect.
Screenshot 2025-04-08 at 10.16.43 PM.png
Checking that thin-provisioning is enabled.
Screenshot 2025-04-08 at 9.28.45 PM.png

Enabling Qemu Agent to re-thin disks.
Screenshot 2025-04-08 at 9.36.26 PM.png
Live-migrate storage to another volume (and then back).
Screenshot 2025-04-08 at 10.23.24 PM.png
(continued)
Screenshot 2025-04-08 at 10.24.13 PM.png
 
I was able to reproduce the error by creating a clone of a template and adding replication to the clone without first replicating the base template.

There are two easy fixes:
  1. Replicate the template BEFORE replicating the clone
    (since it will never change you can choose yearly and then "Schedule Now" - see the CLI sketch below this list)
  2. Unlink the clone from the template
    (by migrating it as described above)
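
For fix 1, the same thing can be done from the CLI. A minimal sketch, assuming the template is VMID 106 (which the base-106 prefix suggests) and the other node is called pve2 - adjust both:

Code:
# Replicate the template itself; it never changes, so any schedule works (the GUI's "yearly", or e.g. a weekly slot)
pvesr create-local-job 106-0 pve2 --schedule "sun 01:00"

# Trigger it right away instead of waiting for the schedule
pvesr schedule-now 106-0

# Once the base volume exists on the other node, the clone's job (110-0) should run cleanly
pvesr status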

Replication of cloned template fails.

Screenshot 2025-04-08 at 10.44.37 PM.png
Replication of template succeeds.

Screenshot 2025-04-08 at 10.46.31 PM.png
Now replication of clone succeeds.


Screenshot 2025-04-08 at 10.47.51 PM.png
 

Can I move the disk live? This VM is the Mikrotik router; if it is not running, all the VMs will be unreachable, because it provides their internet access.
 
Yes. VMs support live, zero-downtime disk migration between the various storage types - LVM-thin, ZFS, Ceph RBD, etc.

During the final moment you'll have high latency for 1-4 seconds, but it's just that - latency.
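
For reference, the live move can also be started from the CLI. A minimal sketch - local-lvm stands in for whatever second storage you have available:

Code:
# Move the running VM's disk off the linked-clone ZFS volume (deletes the source copy when done)
qm move_disk 110 scsi0 local-lvm --delete 1

# Then move it back onto the ZFS storage that replication uses
qm move_disk 110 scsi0 ssd --delete 1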

I tried it, and it's working fine now. Thank you so much!
 
@Shyciii I'm glad to hear it! Will you put a thumbs up on whichever posts were the most helpful and mark this thread as Solved?

Edit: Nevermind, I see that you did change the title to include "[Solved]" - I missed it at first because it wasn't orange.
(there's also a "Prefix" option in the dropdown when you edit the thread - just for future reference)
Screenshot 2025-04-11 at 1.08.29 AM.png
 