HA live migration with ZFS replication still fails in PVE 8.2

prudentcircle

I have a VM live migration issue on PVE 8.2.

I searched the forum thoroughly and found that the issue was already present in an older version of Proxmox VE:

https://forum.proxmox.com/threads/ha-live-migration-with-zfs-replication-on-pve-6-2.69708/


I am running PVE 8.2, and live migration still fails for HA VMs with ZFS replication.
Is anyone else having the same issue?

Is this disallowed by design, or is it a bug?
I was not able to find an answer even after hours of searching the forum.


Code:
task started by HA resource agent
2024-08-09 15:03:24 use dedicated network address for sending migration traffic (10.21.250.106)
2024-08-09 15:03:25 starting migration of VM 104 to node 'ozone-set00-j09-svr06' (10.21.250.106)
2024-08-09 15:03:25 found local, replicated disk 'hdd-2m-data:vm-104-disk-0' (attached)
2024-08-09 15:03:25 found generated disk 'local-zfs:vm-104-cloudinit' (in current VM config)
2024-08-09 15:03:25 virtio0: start tracking writes using block-dirty-bitmap 'repl_virtio0'
2024-08-09 15:03:25 replicating disk images
2024-08-09 15:03:25 start replication job
QEMU Guest Agent is not running - VM 104 qmp command 'guest-ping' failed - got timeout
2024-08-09 15:03:28 guest => VM 104, running => 1749892
2024-08-09 15:03:28 volumes => hdd-2m-data:vm-104-disk-0
2024-08-09 15:03:30 create snapshot '__replicate_104-0_1723183405__' on hdd-2m-data:vm-104-disk-0
2024-08-09 15:03:30 using secure transmission, rate limit: none
2024-08-09 15:03:30 incremental sync 'hdd-2m-data:vm-104-disk-0' (__replicate_104-0_1723183203__ => __replicate_104-0_1723183405__)
2024-08-09 15:03:32 send from @__replicate_104-0_1723183203__ to hdd-2m-data/vm-104-disk-0@__replicate_104-0_1723183405__ estimated size is 170K
2024-08-09 15:03:32 total estimated size is 170K
2024-08-09 15:03:32 TIME        SENT   SNAPSHOT hdd-2m-data/vm-104-disk-0@__replicate_104-0_1723183405__
2024-08-09 15:03:33 successfully imported 'hdd-2m-data:vm-104-disk-0'
2024-08-09 15:03:33 delete previous replication snapshot '__replicate_104-0_1723183203__' on hdd-2m-data:vm-104-disk-0
2024-08-09 15:03:34 (remote_finalize_local_job) delete stale replication snapshot '__replicate_104-0_1723183203__' on hdd-2m-data:vm-104-disk-0
2024-08-09 15:03:35 end replication job
2024-08-09 15:03:35 copying local disk images
2024-08-09 15:03:36 full send of rpool/data/vm-104-cloudinit@__migration__ estimated size is 65.3K
2024-08-09 15:03:36 total estimated size is 65.3K
2024-08-09 15:03:36 TIME        SENT   SNAPSHOT rpool/data/vm-104-cloudinit@__migration__
2024-08-09 15:03:36 successfully imported 'local-zfs:vm-104-cloudinit'
2024-08-09 15:03:36 volume 'local-zfs:vm-104-cloudinit' is 'local-zfs:vm-104-cloudinit' on the target
2024-08-09 15:03:36 starting VM 104 on remote node 'ozone-set00-j09-svr06'
2024-08-09 15:03:39 volume 'hdd-2m-data:vm-104-disk-0' is 'hdd-2m-data:vm-104-disk-0' on the target
2024-08-09 15:03:39 start remote tunnel
2024-08-09 15:03:40 ssh tunnel ver 1
2024-08-09 15:03:40 starting storage migration
2024-08-09 15:03:40 virtio0: start migration to nbd:unix:/run/qemu-server/104_nbd.migrate:exportname=drive-virtio0
drive mirror re-using dirty bitmap 'repl_virtio0'
drive mirror is starting for drive-virtio0
channel 3: open failed: connect failed: open failed
drive-virtio0: Cancelling block job
drive-virtio0: Done.
2024-08-09 15:03:40 ERROR: online migrate failure - mirroring error: VM 104 qmp command 'drive-mirror' failed - Failed to read initial magic: Unexpected end-of-file before all data were read
2024-08-09 15:03:40 aborting phase 2 - cleanup resources
2024-08-09 15:03:40 migrate_cancel


Here is the output of pveversion -v:

Code:
root@ozone-set00-j09-svr06:~/.ssh# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-2
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.0-1
proxmox-backup-file-restore: 3.2.0-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.1
pve-cluster: 8.0.6
pve-container: 5.0.10
pve-docs: 8.2.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.5
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-5
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2
 
I found that merely having replication schedules configured is enough to make live migration fail.
Even without any HA configuration, the VM fails to migrate to a node that holds a replicated volume.

Code:
2024-08-09 15:21:35 use dedicated network address for sending migration traffic (10.21.250.110)
2024-08-09 15:21:35 starting migration of VM 104 to node 'ozone-set00-j09-svr10' (10.21.250.110)
2024-08-09 15:21:35 found local, replicated disk 'hdd-2m-data:vm-104-disk-0' (attached)
2024-08-09 15:21:35 found generated disk 'local-zfs:vm-104-cloudinit' (in current VM config)
2024-08-09 15:21:35 virtio0: start tracking writes using block-dirty-bitmap 'repl_virtio0'
2024-08-09 15:21:35 replicating disk images
2024-08-09 15:21:35 start replication job
QEMU Guest Agent is not running - VM 104 qmp command 'guest-ping' failed - got timeout
2024-08-09 15:21:38 guest => VM 104, running => 1232761
2024-08-09 15:21:38 volumes => hdd-2m-data:vm-104-disk-0
2024-08-09 15:21:40 create snapshot '__replicate_104-0_1723184495__' on hdd-2m-data:vm-104-disk-0
2024-08-09 15:21:40 using secure transmission, rate limit: none
2024-08-09 15:21:40 incremental sync 'hdd-2m-data:vm-104-disk-0' (__replicate_104-0_1723184403__ => __replicate_104-0_1723184495__)
2024-08-09 15:21:42 send from @__replicate_104-0_1723184403__ to hdd-2m-data/vm-104-disk-0@__replicate_104-1_1723184413__ estimated size is 624B
2024-08-09 15:21:42 send from @__replicate_104-1_1723184413__ to hdd-2m-data/vm-104-disk-0@__replicate_104-0_1723184495__ estimated size is 154K
2024-08-09 15:21:42 total estimated size is 155K
2024-08-09 15:21:42 TIME        SENT   SNAPSHOT hdd-2m-data/vm-104-disk-0@__replicate_104-1_1723184413__
2024-08-09 15:21:42 TIME        SENT   SNAPSHOT hdd-2m-data/vm-104-disk-0@__replicate_104-0_1723184495__
2024-08-09 15:21:44 successfully imported 'hdd-2m-data:vm-104-disk-0'
2024-08-09 15:21:44 delete previous replication snapshot '__replicate_104-0_1723184403__' on hdd-2m-data:vm-104-disk-0
2024-08-09 15:21:45 (remote_finalize_local_job) delete stale replication snapshot '__replicate_104-0_1723184403__' on hdd-2m-data:vm-104-disk-0
2024-08-09 15:21:45 end replication job
2024-08-09 15:21:45 copying local disk images
2024-08-09 15:21:47 full send of rpool/data/vm-104-cloudinit@__migration__ estimated size is 65.3K
2024-08-09 15:21:47 total estimated size is 65.3K
2024-08-09 15:21:47 TIME        SENT   SNAPSHOT rpool/data/vm-104-cloudinit@__migration__
2024-08-09 15:21:47 successfully imported 'local-zfs:vm-104-cloudinit'
2024-08-09 15:21:47 volume 'local-zfs:vm-104-cloudinit' is 'local-zfs:vm-104-cloudinit' on the target
2024-08-09 15:21:47 starting VM 104 on remote node 'ozone-set00-j09-svr10'
2024-08-09 15:21:50 volume 'hdd-2m-data:vm-104-disk-0' is 'hdd-2m-data:vm-104-disk-0' on the target
2024-08-09 15:21:50 start remote tunnel
2024-08-09 15:21:51 ssh tunnel ver 1
2024-08-09 15:21:51 starting storage migration
2024-08-09 15:21:51 virtio0: start migration to nbd:unix:/run/qemu-server/104_nbd.migrate:exportname=drive-virtio0
drive mirror re-using dirty bitmap 'repl_virtio0'
drive mirror is starting for drive-virtio0
channel 3: open failed: connect failed: open failed
drive-virtio0: Cancelling block job
drive-virtio0: Done.
2024-08-09 15:21:51 ERROR: online migrate failure - mirroring error: VM 104 qmp command 'drive-mirror' failed - Failed to read initial magic: Unexpected end-of-file before all data were read
2024-08-09 15:21:51 aborting phase 2 - cleanup resources
2024-08-09 15:21:51 migrate_cancel
2024-08-09 15:21:51 virtio0: removing block-dirty-bitmap 'repl_virtio0'
2024-08-09 15:21:54 ERROR: migration finished with problems (duration 00:00:20)
TASK ERROR: migration problems
 
Hi,
please upgrade to the latest version and see if the issue persists. Can you check if using insecure migration mode exposes the same issue (should not be used if you don't trust your local network!)? Please also share the VM configuration: qm config 104.

Code:
2024-08-09 15:21:51 virtio0: start migration to nbd:unix:/run/qemu-server/104_nbd.migrate:exportname=drive-virtio0
drive mirror re-using dirty bitmap 'repl_virtio0'
drive mirror is starting for drive-virtio0
channel 3: open failed: connect failed: open failed
This sounds like there might be a connection issue with the Unix sockets.
 
Hi @fiona,
As per your suggestion, I ran the command below and the migration worked.

qm migrate 104 ozone-set00-j09-svr06 --online --with-local-disks 1 --migration_type insecure

The log below is the output of a successful migration.

Code:
root@ozone-set00-j09-svr03:~# qm migrate 104 ozone-set00-j09-svr06 --online --with-local-disks 1 --migration_type insecure
2024-08-09 17:09:38 use dedicated network address for sending migration traffic (10.21.250.106)
2024-08-09 17:09:39 starting migration of VM 104 to node 'ozone-set00-j09-svr06' (10.21.250.106)
2024-08-09 17:09:39 found local, replicated disk 'hdd-2m-data:vm-104-disk-0' (attached)
2024-08-09 17:09:39 found generated disk 'local-zfs:vm-104-cloudinit' (in current VM config)
2024-08-09 17:09:39 virtio0: start tracking writes using block-dirty-bitmap 'repl_virtio0'
2024-08-09 17:09:39 replicating disk images
2024-08-09 17:09:39 start replication job
QEMU Guest Agent is not running - VM 104 qmp command 'guest-ping' failed - got timeout
2024-08-09 17:09:42 guest => VM 104, running => 1232761
2024-08-09 17:09:42 volumes => hdd-2m-data:vm-104-disk-0
2024-08-09 17:09:44 create snapshot '__replicate_104-1_1723190979__' on hdd-2m-data:vm-104-disk-0
2024-08-09 17:09:44 using secure transmission, rate limit: none
2024-08-09 17:09:44 incremental sync 'hdd-2m-data:vm-104-disk-0' (__replicate_104-1_1723190715__ => __replicate_104-1_1723190979__)
2024-08-09 17:09:45 send from @__replicate_104-1_1723190715__ to hdd-2m-data/vm-104-disk-0@__replicate_104-1_1723190979__ estimated size is 187K
2024-08-09 17:09:45 total estimated size is 187K
2024-08-09 17:09:45 TIME        SENT   SNAPSHOT hdd-2m-data/vm-104-disk-0@__replicate_104-1_1723190979__
2024-08-09 17:09:46 successfully imported 'hdd-2m-data:vm-104-disk-0'
2024-08-09 17:09:46 delete previous replication snapshot '__replicate_104-1_1723190715__' on hdd-2m-data:vm-104-disk-0
2024-08-09 17:09:47 (remote_finalize_local_job) delete stale replication snapshot '__replicate_104-1_1723190715__' on hdd-2m-data:vm-104-disk-0
2024-08-09 17:09:47 end replication job
2024-08-09 17:09:47 copying local disk images
2024-08-09 17:09:49 full send of rpool/data/vm-104-cloudinit@__migration__ estimated size is 65.3K
2024-08-09 17:09:49 total estimated size is 65.3K
2024-08-09 17:09:49 TIME        SENT   SNAPSHOT rpool/data/vm-104-cloudinit@__migration__
2024-08-09 17:09:50 [ozone-set00-j09-svr06] successfully imported 'local-zfs:vm-104-cloudinit'
2024-08-09 17:09:50 volume 'local-zfs:vm-104-cloudinit' is 'local-zfs:vm-104-cloudinit' on the target
2024-08-09 17:09:50 starting VM 104 on remote node 'ozone-set00-j09-svr06'
2024-08-09 17:09:52 volume 'hdd-2m-data:vm-104-disk-0' is 'hdd-2m-data:vm-104-disk-0' on the target
2024-08-09 17:09:52 start remote tunnel
2024-08-09 17:09:53 ssh tunnel ver 1
2024-08-09 17:09:53 starting storage migration
2024-08-09 17:09:53 virtio0: start migration to nbd:10.21.250.106:60002:exportname=drive-virtio0
drive mirror re-using dirty bitmap 'repl_virtio0'
drive mirror is starting for drive-virtio0
drive-virtio0: transferred 64.0 KiB of 64.0 KiB (100.00%) in 0s
drive-virtio0: transferred 64.0 KiB of 64.0 KiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2024-08-09 17:09:54 starting online/live migration on tcp:10.21.250.106:60001
2024-08-09 17:09:54 set migration capabilities
2024-08-09 17:09:54 migration downtime limit: 100 ms
2024-08-09 17:09:54 migration cachesize: 1.0 GiB
2024-08-09 17:09:54 set migration parameters
2024-08-09 17:09:54 start migrate command to tcp:10.21.250.106:60001
2024-08-09 17:09:55 migration active, transferred 567.7 MiB of 8.0 GiB VM-state, 10.3 GiB/s
2024-08-09 17:09:56 average migration speed: 4.0 GiB/s - downtime 58 ms
2024-08-09 17:09:56 migration status: completed
all 'mirror' jobs are ready
drive-virtio0: Completing block job_id...
drive-virtio0: Completed successfully.
drive-virtio0: mirror-job finished
2024-08-09 17:09:57 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=ozone-set00-j09-svr06' -o 'UserKnownHostsFile=/etc/pve/nodes/ozone-set00-j09-svr06/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@10.21.250.106 pvesr set-state 104 \''{"local/ozone-set00-j09-svr10":{"duration":10.890926,"storeid_list":["hdd-2m-data"],"last_sync":1723190704,"fail_count":0,"last_iteration":1723190704,"last_node":"ozone-set00-j09-svr03","last_try":1723190704},"local/ozone-set00-j09-svr03":{"duration":8.491385,"last_sync":1723190979,"storeid_list":["hdd-2m-data"],"fail_count":0,"last_node":"ozone-set00-j09-svr03","last_try":1723190979,"last_iteration":1723190979}}'\'
2024-08-09 17:09:58 stopping NBD storage migration server on target.
2024-08-09 17:10:02 migration finished successfully (duration 00:00:24)

root@ozone-set00-j09-svr03:~#

Then I tried the migration with HA enabled for VM 104.
The migration appears to fail with the same problem.

Code:
root@ozone-set00-j09-svr06:~/.ssh# qm migrate 104 ozone-set00-j09-svr10 --online --with-local-disks 1 --migration_type insecure
Requesting HA migration for VM 104 to node ozone-set00-j09-svr10

I don't know whether I have the same level of control over HA-managed migrations. For now I still see the same issue.

Code:
Aug 09 14:14:54 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570749]: migration problems
Aug 09 14:14:54 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570748]: <root@pam> end task UPID:ozone-set00-j09-svr06:0017F7BD:032398F2:66B5A5C2:qmigrate:104:root@pam: migration problems
Aug 09 14:14:54 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570748]: service vm:104 not moved (migration error)
Aug 09 14:15:01 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570865]: <root@pam> starting task UPID:ozone-set00-j09-svr06:0017F832:0323A0A4:66B5A5D5:qmigrate:104:root@pam:
Aug 09 14:15:06 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570865]: Task 'UPID:ozone-set00-j09-svr06:0017F832:0323A0A4:66B5A5D5:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:15:09 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570866]: VM 104 qmp command failed - VM 104 qmp command 'drive-mirror' failed - Failed to read initial magic: Unexpected end-of-file before >
Aug 09 14:15:09 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570866]: VM 104 qmp command failed - VM 104 qmp command 'block-job-cancel' failed - Block job 'drive-virtio0' not found
Aug 09 14:15:11 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570865]: Task 'UPID:ozone-set00-j09-svr06:0017F832:0323A0A4:66B5A5D5:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:15:14 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570866]: migration problems
Aug 09 14:15:14 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570865]: <root@pam> end task UPID:ozone-set00-j09-svr06:0017F832:0323A0A4:66B5A5D5:qmigrate:104:root@pam: migration problems
Aug 09 14:15:14 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570865]: service vm:104 not moved (migration error)
Aug 09 14:15:21 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570952]: <root@pam> starting task UPID:ozone-set00-j09-svr06:0017F889:0323A854:66B5A5E9:qmigrate:104:root@pam:
Aug 09 14:15:26 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570952]: Task 'UPID:ozone-set00-j09-svr06:0017F889:0323A854:66B5A5E9:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:15:29 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570953]: VM 104 qmp command failed - VM 104 qmp command 'drive-mirror' failed - Failed to read initial magic: Unexpected end-of-file before >
Aug 09 14:15:29 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570953]: VM 104 qmp command failed - VM 104 qmp command 'block-job-cancel' failed - Block job 'drive-virtio0' not found
Aug 09 14:15:31 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570952]: Task 'UPID:ozone-set00-j09-svr06:0017F889:0323A854:66B5A5E9:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:15:34 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570953]: migration problems
Aug 09 14:15:34 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570952]: <root@pam> end task UPID:ozone-set00-j09-svr06:0017F889:0323A854:66B5A5E9:qmigrate:104:root@pam: migration problems
Aug 09 14:15:34 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1570952]: service vm:104 not moved (migration error)
Aug 09 14:15:41 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571059]: <root@pam> starting task UPID:ozone-set00-j09-svr06:0017F8F4:0323B008:66B5A5FD:qmigrate:104:root@pam:
Aug 09 14:15:46 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571059]: Task 'UPID:ozone-set00-j09-svr06:0017F8F4:0323B008:66B5A5FD:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:15:49 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571060]: VM 104 qmp command failed - VM 104 qmp command 'drive-mirror' failed - Failed to read initial magic: Unexpected end-of-file before >
Aug 09 14:15:49 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571060]: VM 104 qmp command failed - VM 104 qmp command 'block-job-cancel' failed - Block job 'drive-virtio0' not found
Aug 09 14:15:51 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571059]: Task 'UPID:ozone-set00-j09-svr06:0017F8F4:0323B008:66B5A5FD:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:15:54 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571060]: migration problems
Aug 09 14:15:54 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571059]: <root@pam> end task UPID:ozone-set00-j09-svr06:0017F8F4:0323B008:66B5A5FD:qmigrate:104:root@pam: migration problems
Aug 09 14:15:54 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571059]: service vm:104 not moved (migration error)
Aug 09 14:16:01 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571144]: <root@pam> starting task UPID:ozone-set00-j09-svr06:0017F949:0323B7DE:66B5A611:qmigrate:104:root@pam:
Aug 09 14:16:06 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571144]: Task 'UPID:ozone-set00-j09-svr06:0017F949:0323B7DE:66B5A611:qmigrate:104:root@pam:' still active, waiting
Aug 09 14:16:09 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571145]: VM 104 qmp command failed - VM 104 qmp command 'drive-mirror' failed - Failed to read initial magic: Unexpected end-of-file before >
Aug 09 14:16:09 ozone-set00-j09-svr06.smsolutions.io pve-ha-lrm[1571145]: VM 104 qmp command failed - VM 104 qmp command 'block-job-cancel' failed - Block job 'drive-virtio0' not found


Here is the config for VM 104.

Code:
root@ozone-set00-j09-svr06:~/.ssh# qm config 104
agent: 1
boot: order=virtio0
cicustom: user=local:snippets/debian12-lab.yaml
cores: 4
ide2: local-zfs:vm-104-cloudinit,media=cdrom,size=4M
ipconfig0: gw=10.21.249.1,ip=10.21.249.14/24
memory: 8192
meta: creation-qemu=8.1.5,ctime=1722704622
name: btv-dr-deploy
net0: virtio=BC:24:11:AB:6D:03,bridge=vmbr0,tag=2189
ostype: l26
serial0: socket
smbios1: uuid=d5716907-5aa4-489b-b4dc-de4b60f2d92e
sockets: 1
tags: btv;dw
virtio0: hdd-2m-data:vm-104-disk-0,format=raw,size=40G
vmgenid: 831b6c42-6bb0-48fb-82cd-5ecb0aba7a43
root@ozone-set00-j09-svr06:~/.ssh#
 
The migration options can be set cluster-wide in Datacenter > Options > Migration Settings.
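For example, via the CLI this corresponds to the migration property in /etc/pve/datacenter.cfg; a sketch, with the network taken from the logs above (adjust it to your setup):

Code:
# /etc/pve/datacenter.cfg
migration: type=insecure,network=10.21.250.0/24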

To further narrow things down: does secure mode online migration with local disks work if there is no replication?
 
Is secure migration not handled well when used with ZFS local storage?
Are my settings not playing well with secure migration?

It should work; secure mode is the default, so you can bet that lots of people are using it, or there would be many more complaints. You'll have to find out why the connection with Unix sockets doesn't work on your system. It could be related to network settings, but that is just a guess.
 
Did you disable Unix socket forwarding on either end in your SSH config?
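For example, one quick way to check the effective values on each node (a sketch; sshd -T prints the configuration sshd actually uses):

Code:
# show the sshd settings that govern TCP and Unix-socket forwarding
sshd -T | grep -iE 'allowtcpforwarding|allowstreamlocalforwarding|disableforwarding'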
 
@fabian
Good point.

I did indeed set some forwarding options to "no" in my sshd_config:

Code:
AllowAgentForwarding no
AllowTcpForwarding no

I will test whether setting AllowTcpForwarding to "yes" solves the problem.
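For reference, this is roughly the change I plan to roll out on each node (a sketch assuming the default /etc/ssh/sshd_config and Debian's ssh service name):

Code:
# /etc/ssh/sshd_config on every cluster node
AllowAgentForwarding no
AllowTcpForwarding yes

# apply the change without interrupting existing sessions
systemctl reload ssh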
 
I'm happy to report that setting AllowTcpForwarding to "yes" resolved the issue.

Here in Korea, some government agencies require certain SSH settings, including setting "AllowTcpForwarding" to "no", for security purposes. It is good to know that this requirement will cause functionality issues. I will make this an exception to the security compliance requirement when running Proxmox in such agencies.
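If a blanket "yes" is not acceptable to the auditors, a possible middle ground (an untested sketch; the CIDR is assumed to be our migration network from the logs above) would be to keep forwarding disabled globally and re-enable it only for root logins coming from the migration network:

Code:
# sketch for /etc/ssh/sshd_config -- global default stays locked down
AllowTcpForwarding no

# re-enable forwarding only for root connecting over the migration network
Match Address 10.21.250.0/24 User root
    AllowTcpForwarding yes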

Thank you for your help @fabian @fiona
 
Here in Korea, some government agencies require certain SSH settings, including setting "AllowTcpForwarding" to "no", for security purposes.
You might want to point whoever is setting those policies at the SSH documentation ;)

https://manpages.debian.org/bookworm/openssh-server/sshd_config.5.en.html#AllowTcpForwarding

Note that disabling TCP forwarding does not improve security unless users are also denied shell access, as they can always install their own forwarders.
 
Good point there ;)

I can definitely use that man page to my advantage when asking for exceptions, since setting it to "yes" does not make the system any less secure.
 