Live ZFS migration with later ext4 errors in VM

thorro

Active Member
Sep 17, 2017
Busy web host VM with 3 ZFS volumes. It migrates with OK status and the VM seems to run fine both before and after the migration. ZFS is set to sync=standard. All volumes are replicated to all hosts, so the replication snapshot during migration only takes a little time.
A few hours later, one volume gets remounted read-only because of ext4 errors. While setting up a new machine I moved the VM around, and this happened 2 out of 2 times. Other, less busy VMs migrate fine. Can anything be done here? Maybe setting sync=always before the live migration would help?
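Something like this is what I had in mind, wrapped around the migration (zvol names taken from the migration log below; I haven't verified that this actually prevents the problem):
Code:
# on the source node, before starting the live migration
zfs set sync=always storage/vm-113-disk-0
zfs set sync=always storage/vm-113-disk-1
zfs set sync=always storage/vm-113-disk-2
# ... live-migrate the VM ...
# afterwards, on the target node, back to the normal setting
zfs set sync=standard storage/vm-113-disk-0
zfs set sync=standard storage/vm-113-disk-1
zfs set sync=standard storage/vm-113-disk-2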

In VM:
Code:
Apr 24 22:22:35 xxx kernel: EXT4-fs error (device sdb1): ext4_lookup:1442: inode #19006119: comm rsync: deleted inode referenced: 23650227
Apr 24 22:22:35 xxx kernel: Aborting journal on device sdb1-8.
Apr 24 22:22:35 xxx kernel: EXT4-fs (sdb1): Remounting filesystem read-only

ZFS shows no issues anywhere.
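For reference, this is the kind of check I mean, on every node holding a replica (pool name taken from the migration log below):
Code:
# no read/write/checksum errors on any vdev, no "errors:" entries
zpool status -v
# a full scrub can also be started to re-verify all data
zpool scrub storage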

Migration log:
Code:
Proxmox Virtual Environment 8.3.5
Virtual Machine 113 (xxx) on node 'pve2'
2025-04-24 19:46:33 use dedicated network address for sending migration traffic (10.10.10.117)
2025-04-24 19:46:33 starting migration of VM 113 to node 'pve2' (10.10.10.117)
2025-04-24 19:46:33 found local, replicated disk 'storage:vm-113-disk-0' (attached)
2025-04-24 19:46:33 found local, replicated disk 'storage:vm-113-disk-1' (attached)
2025-04-24 19:46:33 found local, replicated disk 'storage:vm-113-disk-2' (attached)
2025-04-24 19:46:34 scsi0: start tracking writes using block-dirty-bitmap 'repl_scsi0'
2025-04-24 19:46:34 scsi2: start tracking writes using block-dirty-bitmap 'repl_scsi2'
2025-04-24 19:46:34 scsi3: start tracking writes using block-dirty-bitmap 'repl_scsi3'
2025-04-24 19:46:34 replicating disk images
2025-04-24 19:46:34 start replication job
2025-04-24 19:46:34 guest => VM 113, running => 1093821
2025-04-24 19:46:34 volumes => storage:vm-113-disk-0,storage:vm-113-disk-1,storage:vm-113-disk-2
2025-04-24 19:46:36 create snapshot '__replicate_113-0_1745516794__' on storage:vm-113-disk-0
2025-04-24 19:46:36 create snapshot '__replicate_113-0_1745516794__' on storage:vm-113-disk-1
2025-04-24 19:46:36 create snapshot '__replicate_113-0_1745516794__' on storage:vm-113-disk-2
2025-04-24 19:46:36 using insecure transmission, rate limit: none
2025-04-24 19:46:36 incremental sync 'storage:vm-113-disk-0' (__replicate_113-0_1745516709__ => __replicate_113-0_1745516794__)
2025-04-24 19:46:39 send from @__replicate_113-0_1745516709__ to storage/vm-113-disk-0@__replicate_113-0_1745516794__ estimated size is 103M
2025-04-24 19:46:39 total estimated size is 103M
2025-04-24 19:46:39 TIME        SENT   SNAPSHOT storage/vm-113-disk-0@__replicate_113-0_1745516794__
2025-04-24 19:46:40 [pve2] successfully imported 'storage:vm-113-disk-0'
2025-04-24 19:46:40 incremental sync 'storage:vm-113-disk-1' (__replicate_113-0_1745516709__ => __replicate_113-0_1745516794__)
2025-04-24 19:46:43 send from @__replicate_113-0_1745516709__ to storage/vm-113-disk-1@__replicate_113-0_1745516794__ estimated size is 56.4M
2025-04-24 19:46:43 total estimated size is 56.4M
2025-04-24 19:46:43 TIME        SENT   SNAPSHOT storage/vm-113-disk-1@__replicate_113-0_1745516794__
2025-04-24 19:46:43 [pve2] successfully imported 'storage:vm-113-disk-1'
2025-04-24 19:46:43 incremental sync 'storage:vm-113-disk-2' (__replicate_113-0_1745516709__ => __replicate_113-0_1745516794__)
2025-04-24 19:46:47 send from @__replicate_113-0_1745516709__ to storage/vm-113-disk-2@__replicate_113-0_1745516794__ estimated size is 252M
2025-04-24 19:46:47 total estimated size is 252M
2025-04-24 19:46:47 TIME        SENT   SNAPSHOT storage/vm-113-disk-2@__replicate_113-0_1745516794__
2025-04-24 19:46:48 [pve2] successfully imported 'storage:vm-113-disk-2'
2025-04-24 19:46:48 delete previous replication snapshot '__replicate_113-0_1745516709__' on storage:vm-113-disk-0
2025-04-24 19:46:48 delete previous replication snapshot '__replicate_113-0_1745516709__' on storage:vm-113-disk-1
2025-04-24 19:46:48 delete previous replication snapshot '__replicate_113-0_1745516709__' on storage:vm-113-disk-2
2025-04-24 19:46:49 (remote_finalize_local_job) delete stale replication snapshot '__replicate_113-0_1745516709__' on storage:vm-113-disk-0
2025-04-24 19:46:49 (remote_finalize_local_job) delete stale replication snapshot '__replicate_113-0_1745516709__' on storage:vm-113-disk-1
2025-04-24 19:46:49 (remote_finalize_local_job) delete stale replication snapshot '__replicate_113-0_1745516709__' on storage:vm-113-disk-2
2025-04-24 19:46:50 end replication job
2025-04-24 19:46:50 starting VM 113 on remote node 'pve2'
2025-04-24 19:46:53 volume 'storage:vm-113-disk-0' is 'storage:vm-113-disk-0' on the target
2025-04-24 19:46:54 volume 'storage:vm-113-disk-1' is 'storage:vm-113-disk-1' on the target
2025-04-24 19:46:54 volume 'storage:vm-113-disk-2' is 'storage:vm-113-disk-2' on the target
2025-04-24 19:46:54 start remote tunnel
2025-04-24 19:46:55 ssh tunnel ver 1
2025-04-24 19:46:55 starting storage migration
2025-04-24 19:46:55 scsi2: start migration to nbd:10.10.10.117:60001:exportname=drive-scsi2
drive mirror re-using dirty bitmap 'repl_scsi2'
drive mirror is starting for drive-scsi2 with bandwidth limit: 10240000 KB/s
drive-scsi2: transferred 11.0 MiB of 21.9 MiB (50.14%) in 0s
drive-scsi2: transferred 22.0 MiB of 22.0 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2025-04-24 19:46:56 scsi3: start migration to nbd:10.10.10.117:60001:exportname=drive-scsi3
drive mirror re-using dirty bitmap 'repl_scsi3'
drive mirror is starting for drive-scsi3 with bandwidth limit: 10240000 KB/s
drive-scsi3: transferred 25.8 MiB of 97.9 MiB (26.31%) in 0s
drive-scsi3: transferred 98.0 MiB of 98.0 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2025-04-24 19:46:57 scsi0: start migration to nbd:10.10.10.117:60001:exportname=drive-scsi0
drive mirror re-using dirty bitmap 'repl_scsi0'
drive mirror is starting for drive-scsi0 with bandwidth limit: 10240000 KB/s
drive-scsi0: transferred 320.0 KiB of 59.8 MiB (0.52%) in 0s
drive-scsi0: transferred 60.6 MiB of 60.6 MiB (100.00%) in 1s, ready
all 'mirror' jobs are ready
2025-04-24 19:46:58 starting online/live migration on tcp:10.10.10.117:60000
2025-04-24 19:46:58 set migration capabilities
2025-04-24 19:46:58 migration speed limit: 9.8 GiB/s
2025-04-24 19:46:58 migration downtime limit: 100 ms
2025-04-24 19:46:58 migration cachesize: 8.0 GiB
2025-04-24 19:46:58 set migration parameters
2025-04-24 19:46:58 start migrate command to tcp:10.10.10.117:60000
2025-04-24 19:46:59 migration active, transferred 244.7 MiB of 50.0 GiB VM-state, 350.1 MiB/s
2025-04-24 19:47:00 migration active, transferred 541.6 MiB of 50.0 GiB VM-state, 302.0 MiB/s
2025-04-24 19:47:01 migration active, transferred 874.8 MiB of 50.0 GiB VM-state, 270.0 MiB/s
2025-04-24 19:47:02 migration active, transferred 1.2 GiB of 50.0 GiB VM-state, 328.6 MiB/s
2025-04-24 19:47:03 migration active, transferred 1.5 GiB of 50.0 GiB VM-state, 345.3 MiB/s
2025-04-24 19:47:04 migration active, transferred 1.7 GiB of 50.0 GiB VM-state, 300.6 MiB/s
[snip ... ]
[snip ... ]
2025-04-24 19:49:30 migration active, transferred 51.0 GiB of 50.0 GiB VM-state, 441.2 MiB/s
2025-04-24 19:49:31 migration active, transferred 51.4 GiB of 50.0 GiB VM-state, 456.4 MiB/s
2025-04-24 19:49:32 migration active, transferred 51.8 GiB of 50.0 GiB VM-state, 473.1 MiB/s
2025-04-24 19:49:33 migration active, transferred 52.3 GiB of 50.0 GiB VM-state, 441.8 MiB/s
2025-04-24 19:49:34 migration active, transferred 52.5 GiB of 50.0 GiB VM-state, 309.4 MiB/s
2025-04-24 19:49:34 xbzrle: send updates to 53413 pages in 114.6 MiB encoded memory, cache-miss 98.87%, overflow 4347
2025-04-24 19:49:35 migration active, transferred 52.7 GiB of 50.0 GiB VM-state, 376.0 MiB/s
2025-04-24 19:49:35 xbzrle: send updates to 141473 pages in 279.7 MiB encoded memory, cache-miss 98.87%, overflow 11287
2025-04-24 19:49:36 migration active, transferred 53.0 GiB of 50.0 GiB VM-state, 365.8 MiB/s
2025-04-24 19:49:36 xbzrle: send updates to 202843 pages in 405.5 MiB encoded memory, cache-miss 98.87%, overflow 17005
2025-04-24 19:49:37 migration active, transferred 53.4 GiB of 50.0 GiB VM-state, 715.1 MiB/s
2025-04-24 19:49:37 xbzrle: send updates to 288209 pages in 451.1 MiB encoded memory, cache-miss 98.87%, overflow 19434
2025-04-24 19:49:38 migration active, transferred 53.7 GiB of 50.0 GiB VM-state, 397.6 MiB/s
2025-04-24 19:49:38 xbzrle: send updates to 318708 pages in 514.2 MiB encoded memory, cache-miss 98.87%, overflow 23483
2025-04-24 19:49:39 migration active, transferred 54.0 GiB of 50.0 GiB VM-state, 372.3 MiB/s
2025-04-24 19:49:39 xbzrle: send updates to 367155 pages in 570.5 MiB encoded memory, cache-miss 50.16%, overflow 26584
2025-04-24 19:49:40 migration active, transferred 54.3 GiB of 50.0 GiB VM-state, 472.9 MiB/s
2025-04-24 19:49:40 xbzrle: send updates to 464339 pages in 638.7 MiB encoded memory, cache-miss 50.16%, overflow 29037
2025-04-24 19:49:42 migration active, transferred 54.7 GiB of 50.0 GiB VM-state, 374.9 MiB/s
2025-04-24 19:49:42 xbzrle: send updates to 539857 pages in 691.8 MiB encoded memory, cache-miss 50.16%, overflow 33014
2025-04-24 19:49:43 migration active, transferred 54.9 GiB of 50.0 GiB VM-state, 979.7 MiB/s
2025-04-24 19:49:43 xbzrle: send updates to 657579 pages in 767.3 MiB encoded memory, cache-miss 47.82%, overflow 35364
2025-04-24 19:49:44 migration active, transferred 55.2 GiB of 50.0 GiB VM-state, 477.5 MiB/s
2025-04-24 19:49:44 xbzrle: send updates to 754413 pages in 864.4 MiB encoded memory, cache-miss 37.55%, overflow 39128
2025-04-24 19:49:45 auto-increased downtime to continue migration: 200 ms
2025-04-24 19:49:45 migration active, transferred 55.4 GiB of 50.0 GiB VM-state, 427.2 MiB/s
2025-04-24 19:49:45 xbzrle: send updates to 885805 pages in 954.2 MiB encoded memory, cache-miss 24.63%, overflow 43453
2025-04-24 19:49:46 migration active, transferred 55.6 GiB of 50.0 GiB VM-state, 542.4 MiB/s
2025-04-24 19:49:46 xbzrle: send updates to 1025560 pages in 1.0 GiB encoded memory, cache-miss 21.88%, overflow 48909
2025-04-24 19:49:47 migration active, transferred 55.8 GiB of 50.0 GiB VM-state, 588.1 MiB/s, VM dirties lots of memory: 600.0 MiB/s
2025-04-24 19:49:47 xbzrle: send updates to 1202711 pages in 1.1 GiB encoded memory, cache-miss 18.01%, overflow 53232
2025-04-24 19:49:47 auto-increased downtime to continue migration: 400 ms
2025-04-24 19:49:48 migration active, transferred 55.9 GiB of 50.0 GiB VM-state, 770.2 MiB/s
2025-04-24 19:49:48 xbzrle: send updates to 1402794 pages in 1.1 GiB encoded memory, cache-miss 8.87%, overflow 58743
2025-04-24 19:49:49 average migration speed: 299.5 MiB/s - downtime 325 ms
2025-04-24 19:49:49 migration status: completed
all 'mirror' jobs are ready
drive-scsi0: Completing block job_id...
drive-scsi0: Completed successfully.
drive-scsi2: Completing block job_id...
drive-scsi2: Completed successfully.
drive-scsi3: Completing block job_id...
drive-scsi3: Completed successfully.
drive-scsi0: mirror-job finished
drive-scsi2: mirror-job finished
drive-scsi3: mirror-job finished
2025-04-24 19:49:50 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@10.10.10.117 pvesr set-state 113 \''{"local/pve1":{"storeid_list":["storage"],"last_try":1745516794,"fail_count":0,"last_sync":1745516794,"duration":15.936542,"last_iteration":1745516794,"last_node":"pve1"},"local/pve3":{"last_iteration":1745510407,"last_node":"pve1","last_try":1745510431,"fail_count":0,"last_sync":1745510431,"duration":44.758007,"storeid_list":["storage"]}}'\'
2025-04-24 19:49:53 stopping NBD storage migration server on target.
2025-04-24 19:50:13 migration finished successfully (duration 00:03:41)
TASK OK


Version info:
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-9-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
pve-kernel-5.15: 7.4-9
proxmox-kernel-6.8: 6.8.12-9
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.11-7-pve-signed: 6.5.11-7
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph-fuse: 16.2.15+ds-0+deb12u1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.1
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.0
libpve-guest-common-perl: 5.2.0
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.3
libpve-storage-perl: 8.3.5
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.3.7-1
proxmox-backup-file-restore: 3.3.7-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.8
pve-cluster: 8.1.0
pve-container: 5.2.5
pve-docs: 8.3.1
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.15-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.1
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-1
pve-zsync: 2.3.1
qemu-server: 8.3.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2
 
Is the error always the same? Is it always the same disk? Anything visible in the host logs?
 
Yes, the busiest disk both times. The host is not available anymore. I'll try again in the next few days and check the host logs for anything suspicious.
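I'd probably start with something along these lines on the source and target node, around the migration window and the time of the ext4 error (timestamps are just examples taken from this run):
Code:
# kernel messages (ZFS, storage, OOM, ...) from the current boot
journalctl -k -b
# everything logged around the migration and the later ext4 error
journalctl --since "2025-04-24 19:40" --until "2025-04-24 23:00"
# Proxmox daemon and replication scheduler logs for that day
journalctl -u pvedaemon -u pvescheduler --since "2025-04-24"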