Replication failing to replicate cloudinit drive

FingerlessGloves

Well-Known Member
Hi,

I have two VMs with cloudinit drives whose replication jobs fail to replicate the cloudinit drive. Other VMs with cloudinit drives replicate theirs fine, but these two don't; granted, those other VMs' replication jobs were created many months ago. I'm trying to replicate from pve-1 to pve-2.

VM 102, for example, has this replication log:
Code:
2025-09-22 09:30:00 102-0: start replication job
2025-09-22 09:30:00 102-0: guest => VM 102, running => 52759
2025-09-22 09:30:00 102-0: volumes => local-zfs:vm-102-disk-0
2025-09-22 09:30:01 102-0: freeze guest filesystem
2025-09-22 09:30:01 102-0: create snapshot '__replicate_102-0_1758529800__' on local-zfs:vm-102-disk-0
2025-09-22 09:30:01 102-0: thaw guest filesystem
2025-09-22 09:30:02 102-0: using insecure transmission, rate limit: none
2025-09-22 09:30:02 102-0: incremental sync 'local-zfs:vm-102-disk-0' (__replicate_102-0_1758529560__ => __replicate_102-0_1758529800__)
2025-09-22 09:30:02 102-0: send from @__replicate_102-0_1758529560__ to rpool/data/vm-102-disk-0@__replicate_102-0_1758529800__ estimated size is 17.9M
2025-09-22 09:30:02 102-0: total estimated size is 17.9M
2025-09-22 09:30:02 102-0: TIME        SENT   SNAPSHOT rpool/data/vm-102-disk-0@__replicate_102-0_1758529800__
2025-09-22 09:30:03 102-0: [pve-2] successfully imported 'local-zfs:vm-102-disk-0'
2025-09-22 09:30:03 102-0: delete previous replication snapshot '__replicate_102-0_1758529560__' on local-zfs:vm-102-disk-0
2025-09-22 09:30:03 102-0: (remote_finalize_local_job) delete stale replication snapshot '__replicate_102-0_1758529560__' on local-zfs:vm-102-disk-0
2025-09-22 09:30:03 102-0: end replication job

As you can see, it's not replicating the `vm-102-cloudinit` disk. This is my VM configuration; comparing it to a VM that did replicate its cloudinit drive, the scsi config lines are the same, and the only real differences are UUIDs/IPs/name, the usual suspects.
Code:
agent: 1
boot: order=scsi0;net0
ciuser: fingerlessgloves
cores: 4
cpu: x86-64-v4
hotplug: disk,network,usb,memory
ipconfig0: ip=78.x.x.x/28,gw=78.x.x.x
memory: 6144
meta: creation-qemu=9.0.0,ctime=1721506383
name: mailcow
nameserver: 172.30.0.1
net0: virtio=BC:24:11:58:ED:72,bridge=vmbr0,tag=1100
numa: 1
onboot: 1
ostype: l26
scsi0: local-zfs:vm-102-disk-0,iothread=1,size=60G
scsi1: local-zfs:vm-102-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-single
searchdomain: local
smbios1: uuid=0b5a9762-1ddf-4a1c-a1c5-5fca160eea87
sockets: 1
sshkeys: ssh-ed25519%20AAAAC3NzaC1lZDI1NTE5AAAAIMPkBlg2JGmmNMy77VBiYzRnCIJsq1GrWBYoTt5we5cw%20jonny%40sshkey
startup: order=1,up=10
tablet: 0
vmgenid: c0a65ff9-5ba4-490c-bec3-a8cdfdd6d193
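
For the comparison itself, diffing the two configs is the quickest check; something like this (VM 103 here is just a stand-in for one of the VMs whose cloudinit drive does replicate):
Code:
root@pve-1# diff <(qm config 102) <(qm config 103)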

The disk does exist, and the VM boots just fine:
Code:
root@pve-1# zfs list | grep 102
rpool/data/vm-102-cloudinit                72K  2.64T    72K  -
rpool/data/vm-102-disk-0                 43.3G  2.64T  43.3G  -

Code:
root@pve-2# zfs list | grep 102
rpool/data/vm-102-disk-0                 43.3G   642G  43.3G  -

I've tried searching for the problem, but the only thing I found was an old topic here which sadly no one replied to. Is there something I can check to see why it thinks it doesn't need to replicate the cloudinit disk?

pve-1: pve-manager/8.4.12/c2ea8261d32a5020 (running kernel: 6.8.12-14-pve)
pve-2: pve-manager/8.4.12/c2ea8261d32a5020 (running kernel: 6.8.12-14-pve)

Is there a service I need to restart to make it re-evaluate the VM configurations (assuming I need to do something like that), or do I need to plan a reboot of the host?
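
In case it helps, this is roughly how the job can be listed and re-run manually with verbose output (job ID 102-0 taken from the log above):
Code:
# list replication jobs and their current state
root@pve-1# pvesr status
# re-run job 102-0 immediately with verbose output
root@pve-1# pvesr run --id 102-0 --verbose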
 
Hi,
is the cloudinit drive marked as media=cdrom on the other VMs? Drives with that setting are currently excluded from replication.

It doesn't really need to be replicated, since it can/will be generated fresh from the cloudinit parameters even when being recovered on another node after a host failure. Or do you see any actual issues (other than the drive not being replicated)?
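
A quick way to check which VMs attach their cloudinit drive with media=cdrom (paths assume the standard per-node config directory):
Code:
# show how the cloudinit drive is attached on a given VM
root@pve-1# qm config 102 | grep cloudinit
# list every VM config on this node that attaches a drive with media=cdrom
root@pve-1# grep -H 'media=cdrom' /etc/pve/qemu-server/*.conf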
 
Hi,
is the cloudinit drive marked as media=cdrom on the other VMs? Drives with that setting are currently excluded from replication.

It doesn't really need to be replicated, since it can/will be generated fresh from the cloudinit parameters even when being recovered on another node after a host failure. Or do you see any actual issues (other than the drive not being replicated)?
Indeed, they all do. Ah, if it'll get regenerated when it fails over via HA, that'll be fine then.
 
[screenshot attachment]

Err, now it's created a new disk-1 and thinks it's a normal disk image, not a cloudinit drive.

Edit: Editing `102.conf` and manually changing scsi1 back to `local-zfs:vm-102-cloudinit,media=cdrom,size=4M` has allowed me to live migrate the VM again. After it migrated, it still showed `vm-102-cloudinit`.
 
Err, now it's created a new disk-1 and thinks it's a normal disk image, not a cloudinit drive.
What does "now" mean? What operations led up to this? Please share the relevant task logs (see Task History).
 
What does "now" mean? What operations led up to this? Please share the relevant task logs (see Task History).
Hi, sorry, I did mean "now". I have a bad habit of typing "not" when I mean "now".

Below is the task log from migrating the VM; it says the cloudinit disk already exists for this VM, which it might have done at the ZFS dataset level, since I had a host die and HA started the replicated VM on the other host. I'm wondering if some logic needs amending to overwrite the cloudinit disk should it already exist, rather than generating a new cloudinit disk as disk-1. I'm sure I've had this happen without a host failing; if I see it happen again, I'll grab those logs, but in this instance that's how the problem started.

I did have a VM with three cloudinit disks: vm-110-cloudinit, vm-110-disk-1 and vm-110-disk-2. I edited the .conf to change it back to the original vm-110-cloudinit, then removed disk-1 and disk-2 from the ZFS datasets. I also cleaned up all the other dangling disk-1 cloudinit disks from the other VMs I fixed, on the ZFS pools on each host.
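
For reference, that cleanup amounts to something like this (a rough sketch; the scsi1 line is assumed to mirror VM 102's config above, and the zfs destroy calls are only safe once nothing in any VM config references the stale datasets):
Code:
# in /etc/pve/qemu-server/110.conf, point scsi1 back at the original cloudinit volume:
#   scsi1: local-zfs:vm-110-cloudinit,media=cdrom,size=4M
# then confirm what is left over and remove the duplicate volumes
root@pve-1# zfs list | grep vm-110
root@pve-1# zfs destroy rpool/data/vm-110-disk-1
root@pve-1# zfs destroy rpool/data/vm-110-disk-2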

Code:
task started by HA resource agent
2025-09-23 17:44:55 use dedicated network address for sending migration traffic (172.30.254.1)
2025-09-23 17:44:55 starting migration of VM 110 to node 'pve-1' (172.30.254.1)
2025-09-23 17:44:55 found generated disk 'local-zfs:vm-110-cloudinit' (in current VM config)
2025-09-23 17:44:55 found local, replicated disk 'local-zfs:vm-110-disk-0' (attached)
2025-09-23 17:44:55 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2025-09-23 17:44:55 replicating disk images
2025-09-23 17:44:55 start replication job
2025-09-23 17:44:55 guest => VM 110, running => 2043214
2025-09-23 17:44:55 volumes => local-zfs:vm-110-disk-0
2025-09-23 17:44:56 freeze guest filesystem
2025-09-23 17:44:56 create snapshot '__replicate_110-0_1758645895__' on local-zfs:vm-110-disk-0
2025-09-23 17:44:56 thaw guest filesystem
2025-09-23 17:44:56 using insecure transmission, rate limit: none
2025-09-23 17:44:56 incremental sync 'local-zfs:vm-110-disk-0' (__replicate_110-0_1758639601__ => __replicate_110-0_1758645895__)
2025-09-23 17:44:57 send from @__replicate_110-0_1758639601__ to rpool/data/vm-110-disk-0@__replicate_110-0_1758645895__ estimated size is 51.9M
2025-09-23 17:44:57 total estimated size is 51.9M
2025-09-23 17:44:57 TIME        SENT   SNAPSHOT rpool/data/vm-110-disk-0@__replicate_110-0_1758645895__
2025-09-23 17:44:57 [pve-1] successfully imported 'local-zfs:vm-110-disk-0'
2025-09-23 17:44:57 delete previous replication snapshot '__replicate_110-0_1758639601__' on local-zfs:vm-110-disk-0
2025-09-23 17:44:58 (remote_finalize_local_job) delete stale replication snapshot '__replicate_110-0_1758639601__' on local-zfs:vm-110-disk-0
2025-09-23 17:44:58 end replication job
2025-09-23 17:44:58 copying local disk images
2025-09-23 17:44:58 full send of rpool/data/vm-110-cloudinit@__migration__ estimated size is 65.3K
2025-09-23 17:44:58 total estimated size is 65.3K
2025-09-23 17:44:58 TIME        SENT   SNAPSHOT rpool/data/vm-110-cloudinit@__migration__
2025-09-23 17:44:58 [pve-1] volume 'rpool/data/vm-110-cloudinit' already exists - importing with a different name
2025-09-23 17:44:58 [pve-1] successfully imported 'local-zfs:vm-110-disk-1'
2025-09-23 17:44:58 volume 'local-zfs:vm-110-cloudinit' is 'local-zfs:vm-110-disk-1' on the target
2025-09-23 17:44:58 starting VM 110 on remote node 'pve-1'
2025-09-23 17:45:00 volume 'local-zfs:vm-110-disk-0' is 'local-zfs:vm-110-disk-0' on the target
2025-09-23 17:45:00 start remote tunnel
2025-09-23 17:45:00 ssh tunnel ver 1
2025-09-23 17:45:00 starting storage migration
2025-09-23 17:45:00 scsi0: start migration to nbd:172.30.254.1:60003:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0
drive-scsi0: transferred 0.0 B of 3.1 MiB (0.00%) in 0s
drive-scsi0: transferred 3.1 MiB of 3.1 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2025-09-23 17:45:01 switching mirror jobs to actively synced mode
drive-scsi0: switching to actively synced mode
drive-scsi0: successfully switched to actively synced mode
2025-09-23 17:45:02 starting online/live migration on tcp:172.30.254.1:60002
2025-09-23 17:45:02 set migration capabilities
2025-09-23 17:45:02 migration downtime limit: 100 ms
2025-09-23 17:45:02 migration cachesize: 512.0 MiB
2025-09-23 17:45:02 set migration parameters
2025-09-23 17:45:02 start migrate command to tcp:172.30.254.1:60002
2025-09-23 17:45:03 migration active, transferred 293.7 MiB of 3.0 GiB VM-state, 412.0 MiB/s
2025-09-23 17:45:04 migration active, transferred 590.0 MiB of 3.0 GiB VM-state, 485.9 MiB/s
2025-09-23 17:45:05 migration active, transferred 885.0 MiB of 3.0 GiB VM-state, 397.7 MiB/s
2025-09-23 17:45:06 migration active, transferred 1.2 GiB of 3.0 GiB VM-state, 295.3 MiB/s
2025-09-23 17:45:07 migration active, transferred 1.4 GiB of 3.0 GiB VM-state, 405.2 MiB/s
2025-09-23 17:45:08 migration active, transferred 1.7 GiB of 3.0 GiB VM-state, 314.9 MiB/s
2025-09-23 17:45:09 migration active, transferred 2.0 GiB of 3.0 GiB VM-state, 301.0 MiB/s
2025-09-23 17:45:10 migration active, transferred 2.3 GiB of 3.0 GiB VM-state, 358.6 MiB/s
2025-09-23 17:45:11 migration active, transferred 2.6 GiB of 3.0 GiB VM-state, 489.4 MiB/s
2025-09-23 17:45:12 average migration speed: 308.9 MiB/s - downtime 36 ms
2025-09-23 17:45:12 migration completed, transferred 2.7 GiB VM-state
2025-09-23 17:45:12 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job...
drive-scsi0: Completed successfully.
drive-scsi0: mirror-job finished
2025-09-23 17:45:14 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve-1' -o 'UserKnownHostsFile=/etc/pve/nodes/pve-1/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@172.30.254.1 pvesr set-state 110 \''{"local/pve-2":{"storeid_list":["local-zfs"],"last_iteration":1758645895,"last_node":"pve-2","last_sync":1758645895,"last_try":1758645895,"duration":2.743885,"fail_count":0}}'\'
2025-09-23 17:45:14 stopping NBD storage migration server on target.
2025-09-23 17:45:17 migration finished successfully (duration 00:00:23)
TASK OK
 