[SOLVED] CHR Mikrotik HA replication failed

Shyciii

New Member
Apr 7, 2025
Hello. I have a problem with a Proxmox (version 7.4.17) cluster of 2 members. It runs over 10 VMs plus a CHR Mikrotik that provides internet access for the VMs. Replication of the VMs works fine, but replication of the CHR Mikrotik runs into an error. When I bring it up under the Replication menu, it throws the following error (which is also written to the journalctl log):

Code:
2025-04-07 13:08:29 110-0: start replication job
2025-04-07 13:08:29 110-0: guest => VM 110, running => 2341709
2025-04-07 13:08:29 110-0: volumes => ssd:base-106-disk-0/vm-110-disk-0
2025-04-07 13:08:30 110-0: create snapshot '__replicate_110-0_1744024109__' on ssd:base-106-disk-0/vm-110-disk-0
2025-04-07 13:08:30 110-0: using secure transmission, rate limit: none
2025-04-07 13:08:30 110-0: full sync 'ssd:base-106-disk-0/vm-110-disk-0' (__replicate_110-0_1744024109__)
2025-04-07 13:08:31 110-0: full send of ssd/vm-110-disk-0@__replicate_110-0_1744024109__ estimated size is 135M
2025-04-07 13:08:31 110-0: total estimated size is 135M
2025-04-07 13:08:32 110-0: cannot receive: local origin for clone ssd/vm-110-disk-0@__replicate_110-0_1744024109__ does not exist
2025-04-07 13:08:32 110-0: cannot open 'ssd/vm-110-disk-0': dataset does not exist
2025-04-07 13:08:32 110-0: command 'zfs recv -F -- ssd/vm-110-disk-0' failed: exit code 1
2025-04-07 13:08:32 110-0: warning: cannot send 'ssd/vm-110-disk-0@__replicate_110-0_1744024109__': signal received
2025-04-07 13:08:32 110-0: cannot send 'ssd/vm-110-disk-0': I/O error
2025-04-07 13:08:32 110-0: command 'zfs send -Rpv -- ssd/vm-110-disk-0@__replicate_110-0_1744024109__' failed: exit code 1
2025-04-07 13:08:32 110-0: delete previous replication snapshot '__replicate_110-0_1744024109__' on ssd:base-106-disk-0/vm-110-disk-0
2025-04-07 13:08:32 110-0: end replication job with error: command 'set -o pipefail && pvesm export ssd:base-106-disk-0/vm-110-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_110-0_1744024109__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=cluster01' root@172.16.1.1 -- pvesm import ssd:base-106-disk-0/vm-110-disk-0 zfs - -with-snapshots 1 -snapshot __replicate_110-0_1744024109__ -allow-rename 0' failed: exit code 1
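
For reference, the datasets can be compared on the two nodes like this (only a sketch, reusing the pool name ssd and the target address 172.16.1.1 from the log above):

Code:
# On the source node: list the VM's dataset and any base-106 volume it depends on
zfs list -t all -o name,origin -r ssd | grep -E 'vm-110|base-106'

# On the target node: check whether either dataset exists there at all
ssh root@172.16.1.1 "zfs list -t all -o name,origin -r ssd | grep -E 'vm-110|base-106'"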

The VM's config file (/etc/pve/qemu-server/110.conf):

Code:
balloon: 0
boot: order=scsi0;ide2;net0
cores: 4
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=7.2.0,ctime=1698644063
name: 172.16.1.220-CHR
net0: virtio=9E:55:A7:3B:54:32,bridge=vmbr1,firewall=1
net1: virtio=06:AE:AD:55:DA:56,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: ssd:base-106-disk-0/vm-110-disk-0,cache=writethrough,iothread=1,size=2G
scsihw: virtio-scsi-single
smbios1: uuid=d1daddb1-4062-4854-86a5-6589975e9d2a
sockets: 1
vmgenid: f9c01868-093b-48e1-9225-9f9f1f720e4f

Data under the Options menu:

1744025315169.png

What could be the problem?
 
This is something that's interesting to me:

Code:
2025-04-07 13:08:30 110-0: full sync 'ssd:base-106-disk-0/vm-110-disk-0' (__replicate_110-0_1744024109__)
2025-04-07 13:08:31 110-0: full send of ssd/vm-110-disk-0@__replicate_110-0_1744024109__ estimated size is 135M
2025-04-07 13:08:32 110-0: cannot receive: local origin for clone ssd/vm-110-disk-0@__replicate_110-0_1744024109__ does not exist
2025-04-07 13:08:32 110-0: cannot open 'ssd/vm-110-disk-0': dataset does not exist

ssd:base-106-disk-0/vm-110-disk-0 vs ssd/vm-110-disk-0@...: dataset does not exist

I'm pretty sure I've seen `base-` prepended before - I think that happens when you create the VM as a linked clone from a template.
Does the template still exist?
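
To check that, ZFS can tell you directly whether the disk is a linked clone and what it depends on. A minimal sketch, assuming the pool is called ssd as in your config:

Code:
# Show the clone's origin snapshot - for a linked clone this should point at a base-106 snapshot
zfs get -H -o value origin ssd/vm-110-disk-0

# Check whether that base volume (the template's disk) still exists locally
zfs list -t all | grep base-106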

Regardless, here's something I'm pretty sure will get things back in sync:
  1. Remove the failing replication job (at the Datacenter, Node, or VM level) - a CLI sketch follows below this list.
  2. If you can't remove it there, or it throws an error, remove it from /etc/pve/replication.cfg directly.
  3. Do a quick sanity check that the Datacenter storage pool you want is available to both nodes
    (though it sounds like you only have one, and the other VMs are replicating fine).
    While you're in there, you may also want to confirm that "Thin provision" is checked, if you want that.
  4. If you're okay with the VM growing in size once it's unlinked from the base, live-migrate its disk to a different storage temporarily
    (if the QEMU guest agent and discard are enabled, it can be thinned again as part of the migration,
    or later with sudo fstrim / inside the guest).
  5. Live-migrate it back to the ZFS storage you use for replication.
  6. Re-add the replication job.
At that point the disk image should no longer contain a reference to the base volume, and replication should re-create the volume on the target node without any issue.

This is not a risky procedure, but just as a matter of course you may also want to create a backup before doing the disk migration since you don't have replication working presently.
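
If the GUI won't let you remove or re-add the job, the same steps work from the shell. A minimal sketch - the job ID 110-0 comes from your log, while pve2 and the 15-minute schedule are just placeholders to adjust:

Code:
# List the configured replication jobs and remove the failing one
pvesr list
pvesr delete 110-0

# ...move the disk off the linked-clone volume and back (steps 4-5 above)...

# Re-create the job towards the other node and watch its status
pvesr create-local-job 110-0 pve2 --schedule "*/15"
pvesr status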

Sanity check that the volume exists everywhere you expect.
Screenshot 2025-04-08 at 10.16.43 PM.png
Checking that thin-provisioning is enabled.
Screenshot 2025-04-08 at 9.28.45 PM.png

Enabling Qemu Agent to re-thin disks.
Screenshot 2025-04-08 at 9.36.26 PM.png
Live-migrate storage to another volume (and then back).
Screenshot 2025-04-08 at 10.23.24 PM.png
(continued)
Screenshot 2025-04-08 at 10.24.13 PM.png
 
I was able to reproduce the error by creating a clone of a template and adding replication to the clone without first replicating the base template.

There are two easy fixes:
  1. Replicate the template BEFORE replicating the clone
    (since it will never change you can choose yearly and then "Schedule Now" - see the CLI sketch below this list)
  2. Unlink the clone from the template
    (by migrating it as described above)
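
For fix 1, the same thing can be done from the CLI. A minimal sketch, assuming the template is VMID 106 (which the base-106 prefix suggests) and the other node is called pve2 - adjust both:

Code:
# Replicate the template itself; it never changes, so any schedule works (the GUI's "yearly", or e.g. a weekly slot)
pvesr create-local-job 106-0 pve2 --schedule "sun 01:00"

# Trigger it right away instead of waiting for the schedule
pvesr schedule-now 106-0

# Once the base volume exists on the other node, the clone's job (110-0) should run cleanly
pvesr status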

Replication of cloned template fails.

Screenshot 2025-04-08 at 10.44.37 PM.png
Replication of template succeeds.

Screenshot 2025-04-08 at 10.46.31 PM.png
Now replication of clone succeeds.


Screenshot 2025-04-08 at 10.47.51 PM.png
 

Can I move the disk live? This VM is the Mikrotik router; if it is not running, all the VMs will be unreachable, because it provides their internet access.
 
Yes. VMs support live, zero-downtime disk migration between the various storage types - LVM-thin, ZFS, Ceph RBD, etc.

During the final moment you'll have high latency for 1-4 seconds, but it's just that - latency.
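
For reference, the live move can also be started from the CLI. A minimal sketch - local-lvm stands in for whatever second storage you have available:

Code:
# Move the running VM's disk off the linked-clone ZFS volume (deletes the source copy when done)
qm move_disk 110 scsi0 local-lvm --delete 1

# Then move it back onto the ZFS storage that replication uses
qm move_disk 110 scsi0 ssd --delete 1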

I tried it, and it's working fine now. Thank you so much!
 
@Shyciii I'm glad to hear it! Will you put a thumbs up on whichever posts were the most helpful and mark this thread as Solved?

Edit: Nevermind, I see that you did change the title to include "[Solved]" - I missed it at first because it wasn't orange.
(there's also a "Prefix" option in the dropdown when you edit the thread - just for future reference)
Screenshot 2025-04-11 at 1.08.29 AM.png
 