Stuck ZFS replication on VMs with QEMU Guest Agent

carles89

Hi,

In a cluster with several nodes replicating between each other, we have noticed that the ZFS replication task sometimes gets stuck on Windows VMs while freezing the guest filesystem. The guest agents are updated to the latest version.
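
When a job gets stuck like this, we can check from the host whether the guest is still frozen and what the replication scheduler sees. This is just a diagnostic sketch (VM 321 as an example; the qm guest cmd calls can themselves hang if the agent is stuck):

Code:
# Does the guest agent still report the filesystem as frozen?
qm guest cmd 321 fsfreeze-status

# Does the agent answer at all?
qm guest cmd 321 ping

# State of all replication jobs on this node
pvesr status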

Here's the replication log when the issue happens:

Code:
2024-02-08 13:50:01 321-2: start replication job
2024-02-08 13:50:01 321-2: guest => VM 321, running => 1737073
2024-02-08 13:50:01 321-2: volumes => local-zfs:vm-321-disk-0,local-zfs:vm-321-disk-1
2024-02-08 13:50:02 321-2: freeze guest filesystem
2024-02-08 14:50:02 321-2: create snapshot '__replicate_321-2_1707396601__' on local-zfs:vm-321-disk-0
2024-02-08 14:50:02 321-2: create snapshot '__replicate_321-2_1707396601__' on local-zfs:vm-321-disk-1

Today it happened again, but this time the job hung before even logging the snapshot lines.

Code:
2024-02-15 08:31:43 501-2: start replication job
2024-02-15 08:31:43 501-2: guest => VM 501, running => 164978
2024-02-15 08:31:43 501-2: volumes => local-zfs:vm-501-disk-0,local-zfs:vm-501-disk-1

It does not happen if I disable the QEMU guest agent.
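
For reference, toggling the agent integration on a test VM is just the agent option; a sketch with an arbitrary VMID:

Code:
# Disable guest agent integration (no fs-freeze before snapshots/replication)
qm set 321 --agent enabled=0

# Re-enable it afterwards
qm set 321 --agent enabled=1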

The main problem is that this kind of failure blocks all other pending replications, which stay queued until the failing one reaches its timeout. Maybe an option to set a lower timeout for the guest filesystem freeze could be a solution?
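
Until such an option exists, the only host-side workaround I can think of is a sketch like this: probe the agent with a short timeout shortly before the replication window and alert if it does not answer, instead of letting the freeze hang for an hour. The 10-second limit, the VMID and the mail command are only assumptions:

Code:
#!/bin/bash
# Probe the guest agent before replication; warn if it does not respond.
VMID=501
if ! timeout 10 qm guest cmd "$VMID" ping >/dev/null 2>&1; then
    echo "Guest agent of VM $VMID not responding, replication will likely hang" \
        | mail -s "QGA check failed on $(hostname)" admin@example.com
fi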

Underlying storage on all nodes is 4 x NVMe configured as ZFS striped mirrors.
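
For context, the layout is the usual RAID10-like setup, two mirror vdevs striped in one pool, roughly equivalent to this (device names are placeholders; the pool was actually created by the PVE installer):

Code:
zpool create rpool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1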

Here are the package versions:
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
openvswitch-switch: 2.15.0+ds1-2+deb11u4
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Thank you
 
Update:

The replication of VM 501 has just finished. Here's the complete replication log. The snapshot creation starts exactly one hour after "freeze guest filesystem", which suggests the host-side agent call hits a fixed timeout rather than the freeze ever completing.

Between 8:31 and 9:31 we stopped the qemu-guest-agent service and VSSVC.exe inside the VM. Sometimes that makes the replication continue, but it wasn't the case today (a host-side alternative is sketched after the log below).

Code:
2024-02-15 08:31:43 501-2: start replication job
2024-02-15 08:31:43 501-2: guest => VM 501, running => 164978
2024-02-15 08:31:43 501-2: volumes => local-zfs:vm-501-disk-0,local-zfs:vm-501-disk-1
2024-02-15 08:31:44 501-2: freeze guest filesystem
2024-02-15 09:31:45 501-2: create snapshot '__replicate_501-2_1707982303__' on local-zfs:vm-501-disk-0
2024-02-15 09:31:45 501-2: create snapshot '__replicate_501-2_1707982303__' on local-zfs:vm-501-disk-1
2024-02-15 09:31:45 501-2: thaw guest filesystem
2024-02-15 09:34:45 501-2: using secure transmission, rate limit: 175 MByte/s
2024-02-15 09:34:45 501-2: incremental sync 'local-zfs:vm-501-disk-0' (__replicate_501-2_1707981636__ => __replicate_501-2_1707982303__)
2024-02-15 09:34:45 501-2: using a bandwidth limit of 175000000 bps for transferring 'local-zfs:vm-501-disk-0'
2024-02-15 09:34:46 501-2: send from @__replicate_501-2_1707981636__ to rpool/data/vm-501-disk-0@__replicate_501-1_1707982242__ estimated size is 628M
2024-02-15 09:34:46 501-2: send from @__replicate_501-1_1707982242__ to rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__ estimated size is 6.27G
2024-02-15 09:34:46 501-2: total estimated size is 6.88G
2024-02-15 09:34:47 501-2: TIME        SENT   SNAPSHOT rpool/data/vm-501-disk-0@__replicate_501-1_1707982242__
2024-02-15 09:34:47 501-2: 09:34:47    145M   rpool/data/vm-501-disk-0@__replicate_501-1_1707982242__
2024-02-15 09:34:48 501-2: 09:34:48    307M   rpool/data/vm-501-disk-0@__replicate_501-1_1707982242__
2024-02-15 09:34:49 501-2: 09:34:49    461M   rpool/data/vm-501-disk-0@__replicate_501-1_1707982242__
2024-02-15 09:34:50 501-2: TIME        SENT   SNAPSHOT rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:50 501-2: 09:34:50    167M   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:51 501-2: 09:34:51    348M   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:52 501-2: 09:34:52    530M   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:53 501-2: 09:34:53    685M   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:54 501-2: 09:34:54    834M   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:55 501-2: 09:34:55    995M   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:56 501-2: 09:34:56   1.12G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:57 501-2: 09:34:57   1.25G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:58 501-2: 09:34:58   1.39G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:34:59 501-2: 09:34:59   1.54G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:00 501-2: 09:35:00   1.69G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:01 501-2: 09:35:01   1.85G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:02 501-2: 09:35:02   1.98G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:03 501-2: 09:35:03   2.10G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:04 501-2: 09:35:04   2.23G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:05 501-2: 09:35:05   2.37G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:06 501-2: 09:35:06   2.51G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:07 501-2: 09:35:07   2.65G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:08 501-2: 09:35:08   2.81G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:09 501-2: 09:35:09   2.97G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:10 501-2: 09:35:10   3.13G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:11 501-2: 09:35:11   3.30G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:12 501-2: 09:35:12   3.47G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:13 501-2: 09:35:13   3.65G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:14 501-2: 09:35:14   3.82G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:15 501-2: 09:35:15   4.00G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:16 501-2: 09:35:16   4.18G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:17 501-2: 09:35:17   4.35G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:18 501-2: 09:35:18   4.53G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:19 501-2: 09:35:19   4.71G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:20 501-2: 09:35:20   4.89G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:21 501-2: 09:35:21   5.07G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:22 501-2: 09:35:22   5.25G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:23 501-2: 09:35:23   5.42G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:24 501-2: 09:35:24   5.60G   rpool/data/vm-501-disk-0@__replicate_501-2_1707982303__
2024-02-15 09:35:24 501-2: successfully imported 'local-zfs:vm-501-disk-0'
2024-02-15 09:35:24 501-2: incremental sync 'local-zfs:vm-501-disk-1' (__replicate_501-2_1707981636__ => __replicate_501-2_1707982303__)
2024-02-15 09:35:25 501-2: using a bandwidth limit of 175000000 bps for transferring 'local-zfs:vm-501-disk-1'
2024-02-15 09:35:25 501-2: send from @__replicate_501-2_1707981636__ to rpool/data/vm-501-disk-1@__replicate_501-1_1707982242__ estimated size is 1.21M
2024-02-15 09:35:25 501-2: send from @__replicate_501-1_1707982242__ to rpool/data/vm-501-disk-1@__replicate_501-2_1707982303__ estimated size is 5.70M
2024-02-15 09:35:25 501-2: total estimated size is 6.91M
2024-02-15 09:35:25 501-2: successfully imported 'local-zfs:vm-501-disk-1'
2024-02-15 09:35:25 501-2: delete previous replication snapshot '__replicate_501-2_1707981636__' on local-zfs:vm-501-disk-0
2024-02-15 09:35:25 501-2: delete previous replication snapshot '__replicate_501-2_1707981636__' on local-zfs:vm-501-disk-1
2024-02-15 09:35:26 501-2: (remote_finalize_local_job) delete stale replication snapshot '__replicate_501-2_1707981636__' on local-zfs:vm-501-disk-0
2024-02-15 09:35:26 501-2: (remote_finalize_local_job) delete stale replication snapshot '__replicate_501-2_1707981636__' on local-zfs:vm-501-disk-1
2024-02-15 09:35:26 501-2: end replication job
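
As mentioned above, a host-side alternative to stopping services inside Windows would be asking the agent to thaw the filesystems directly. This is only a sketch and may itself block if the agent is unresponsive:

Code:
# Ask the guest agent to thaw the filesystems from the host
qm guest cmd 501 fsfreeze-thaw

# Verify
qm guest cmd 501 fsfreeze-status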


Thank you
 
