VM failback with cloudinit and ZFS replication fails

ddtlabs

Member
Apr 1, 2023
This is a two-node cluster running PVE 9.1.5 with an additional QDevice.

ZFS synchronization and migration generally work without problems. If a node fails, VMs with cloudinit are restarted on the remaining node. So far, so good.

However, when the failed node becomes available again and VMs are supposed to be automatically migrated back to it, this fails. Every 10 seconds, a new migration attempt is started, which then aborts.

If the cloudinit image is deleted on the restarted host, the migration works again.

Using shared storage for the cloudinit image is not an option here.


Code:
task started by HA resource agent
2026-03-07 10:13:48 conntrack state migration not supported or disabled, active connections might get dropped
2026-03-07 10:13:48 starting migration of VM 101 to node 'n2' (192.168.30.52)
2026-03-07 10:13:48 found generated disk 'zfs2:vm-101-cloudinit' (in current VM config)
2026-03-07 10:13:48 found local, replicated disk 'zfs2:vm-101-disk-0' (attached)
2026-03-07 10:13:48 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2026-03-07 10:13:48 replicating disk images
2026-03-07 10:13:48 start replication job
2026-03-07 10:13:48 guest => VM 101, running => 86469
2026-03-07 10:13:48 volumes => zfs2:vm-101-disk-0
2026-03-07 10:13:49 freeze guest filesystem
2026-03-07 10:13:49 create snapshot '__replicate_101-0_1772874828__' on zfs2:vm-101-disk-0
2026-03-07 10:13:49 thaw guest filesystem
2026-03-07 10:13:49 using secure transmission, rate limit: none
2026-03-07 10:13:49 incremental sync 'zfs2:vm-101-disk-0' (__replicate_101-0_1772874818__ => __replicate_101-0_1772874828__)
2026-03-07 10:13:50 send from @__replicate_101-0_1772874818__ to zfs2/vm-101-disk-0@__replicate_101-0_1772874828__ estimated size is 931K
2026-03-07 10:13:50 total estimated size is 931K
2026-03-07 10:13:50 TIME        SENT   SNAPSHOT zfs2/vm-101-disk-0@__replicate_101-0_1772874828__
2026-03-07 10:13:50 successfully imported 'zfs2:vm-101-disk-0'
2026-03-07 10:13:50 delete previous replication snapshot '__replicate_101-0_1772874818__' on zfs2:vm-101-disk-0
2026-03-07 10:13:51 (remote_finalize_local_job) delete stale replication snapshot '__replicate_101-0_1772874818__' on zfs2:vm-101-disk-0
2026-03-07 10:13:51 end replication job
2026-03-07 10:13:51 copying local disk images
2026-03-07 10:13:51 full send of zfs2/vm-101-cloudinit@__migration__ estimated size is 81.5K
2026-03-07 10:13:51 total estimated size is 81.5K
2026-03-07 10:13:51 TIME        SENT   SNAPSHOT zfs2/vm-101-cloudinit@__migration__
2026-03-07 10:13:51 volume 'zfs2/vm-101-cloudinit' already exists
send/receive failed, cleaning up snapshot(s)..
2026-03-07 10:13:51 ERROR: storage migration for 'zfs2:vm-101-cloudinit' to storage 'zfs2' failed - command 'set -o pipefail && pvesm export zfs2:vm-101-cloudinit zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=n2' -o 'UserKnownHostsFile=/etc/pve/nodes/n2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.30.52 -- pvesm import zfs2:vm-101-cloudinit zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 0' failed: exit code 255
2026-03-07 10:13:51 aborting phase 1 - cleanup resources
2026-03-07 10:13:51 scsi0: removing block-dirty-bitmap 'repl_scsi0'
2026-03-07 10:13:51 ERROR: migration aborted (duration 00:00:03): storage migration for 'zfs2:vm-101-cloudinit' to storage 'zfs2' failed - command 'set -o pipefail && pvesm export zfs2:vm-101-cloudinit zfs - -with-snapshots 0 -snapshot __migration__ | /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=n2' -o 'UserKnownHostsFile=/etc/pve/nodes/n2/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.30.52 -- pvesm import zfs2:vm-101-cloudinit zfs - -with-snapshots 0 -snapshot __migration__ -delete-snapshot __migration__ -allow-rename 0' failed: exit code 255
TASK ERROR: migration

101.conf
Code:
agent: enabled=1
balloon: 1024
boot: c
bootdisk: scsi0
cicustom: vendor=snippets:snippets/ci-vendor-9501.yml
ciupgrade: 1
cores: 1
cpu: host
ipconfig0: ip=dhcp
memory: 2048
meta: creation-qemu=9.0.2,ctime=1724658870
name: ac1.int.example.com
nameserver: 1.1.1.1
net0: virtio=BC:24:11:41:A2:46,bridge=vmbr0
numa: 0
ostype: l26
scsi0: zfs2:vm-101-disk-0,cache=writeback,discard=on,format=raw,size=36352M,ssd=1
scsi2: zfs2:vm-101-cloudinit,media=cdrom,size=4M
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=1a6529e2-cf0b-4ce0-a89f-57fa394c2d55
sockets: 1
vga: std
vmgenid: 2517065f-a8d9-4ca7-a6f2-3208a2ace7db
 
Hi!
2026-03-07 10:13:51 volume 'zfs2/vm-101-cloudinit' already exists
Thanks for the report! I suppose there is an HA node affinity rule that makes the HA resource fail back to its old node. When the node fails, the HA Manager moves the HA resource but does not clean up the cloudinit image on the failed node, as would happen under normal circumstances... I suppose we could forcefully overwrite cloudinit images on the target, since these are auto-generated, but I'll investigate and get back here later.
 
To mitigate this for now, zfs2/vm-101-cloudinit can be removed on the failed node so the VM can be migrated back there.
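A minimal sketch of that cleanup, assuming the pool and dataset names from the log above (zfs2/vm-101-cloudinit); adjust the VMID and pool name to your setup, and make sure you run it on the node that just rejoined, not on the node currently running the VM:

```shell
# Hypothetical cleanup sketch for the stale cloud-init zvol left behind
# after a node failure. Names taken from the migration log in this thread.
VMID=101
POOL=zfs2
DATASET="${POOL}/vm-${VMID}-cloudinit"

# Only destroy the dataset if it actually exists on this node; if the
# zfs tool or the dataset is absent, the guard is skipped harmlessly.
if zfs list -H -o name "$DATASET" >/dev/null 2>&1; then
    zfs destroy "$DATASET"
fi
```

After removing the dataset, the next automatic migration attempt should no longer hit the "volume already exists" error, because the full send of the cloudinit image can then be received cleanly.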
 
Thanks for your answer. Yes, you are right, there is an affinity rule with failback enabled. I will disable it for now.
Will there be a fix that overwrites cloudinit images in the near future?