Storage replication hangs regularly

Nico Kroll

Hi everyone! :)

We have a setup of 23 hosts with a bunch of LXCs running on them, and we use the replication feature to replicate all ZFS resources. Everything works beautifully!
But regularly we face the problem that something freezes. When this happens, we find the following state:
- The LXC whose replication is hanging freezes (suspended?)
- The host is shown in the GUI with question marks, as are all its other LXCs/resources
- The other LXCs on the host keep running normally
- Load is normal; nothing at 100 % CPU and nothing in D-state (see the check below)
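
For anyone who wants to reproduce that check, one way to list D-state tasks and what they are waiting on (wchan shows the kernel function a blocked task sleeps in):

# List processes in uninterruptible sleep (D state) plus the header line:
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'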

I found this thread: https://forum.proxmox.com/threads/storage-replication-regulary-hangs-after-upgrade.72690/ , but it did not help in our case.

root@pvehe18:~# journalctl -u pvesr
-- No entries --
root@pvehe18:~#

root@pvehe18:~# pvesr status
JobID Enabled Target LastSync NextSync Duration FailCount State
1003052-0 Yes local/pvehe19 2023-08-21_22:15:01 pending 78.254903 0 SYNCING
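
For cross-checking, the job definition can be read from the cluster filesystem, and a single job can be re-run by hand with verbose output (paths and flags as I understand them on PVE 8, so treat this as a sketch):

# Show the configured replication jobs:
cat /etc/pve/replication.cfg
# Re-run just this job manually with verbose logging:
pvesr run --id 1003052-0 --verbose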

root@pvehe18:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-6-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
proxmox-kernel-6.2: 6.2.16-7
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1

root@pvehe19:~# pveversion -v
(output identical to pvehe18 above)

root@pvehe18:~# zfs list -t snapshot | grep 1003052
zp_pve/subvol-1003052-disk-0@__replicate_1003052-0_1692648901__ 6.31M - 8.65G -
zp_pve/subvol-1003052-disk-1@__replicate_1003052-0_1692648901__ 327M - 54.5G -
zp_pve/subvol-1003052-disk-2@__replicate_1003052-0_1692648901__ 1.39M - 317G -
zp_pve/subvol-1003052-disk-3@__replicate_1003052-0_1692648901__ 284K - 1.61G -
zp_pve/subvol-1003052-disk-4@__replicate_1003052-0_1692648901__ 0B - 96K -
zp_pve/subvol-1003052-disk-5@__replicate_1003052-0_1692648901__ 0B - 96K -
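
To see whether the underlying zfs send/receive is still alive at all, something like this helps (a sketch; the exact process names may differ):

# Look for the replication's zfs send/receive workers and their state
# (a D in the STAT column means blocked in the kernel):
ps -axo pid,stat,etime,wchan:30,cmd | grep -E "zfs (send|recv)" | grep -v grep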

root@pvehe18:~# service pvesr status
Unit pvesr.service could not be found.
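
That unit not existing is expected, as far as I know: since PVE 7 the replication runner is driven by the pvescheduler daemon (which replaced the old pvesr.timer), so its journal is where the runner messages end up:

systemctl status pvescheduler
journalctl -u pvescheduler --since "-2h"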

The replication log of the hanging job always stops at this point:

2023-08-21 22:30:01 1003052-0: start replication job
2023-08-21 22:30:01 1003052-0: guest => CT 1003052, running => 1
2023-08-21 22:30:01 1003052-0: volumes => zp_pve:subvol-1003052-disk-0,zp_pve:subvol-1003052-disk-1,zp_pve:subvol-1003052-disk-2,zp_pve:subvol-1003052-disk-3,zp_pve:subvol-1003052-disk-4,zp_pve:subvol-1003052-disk-5

"pct status 1003052": This command hangs and never returns.
"strace <PID of pct status>": strace: Can't stat '<PID>': file or directory not found

Other considerations: we had already read in the forum that it is not a good idea to have replication and backup running at the same time, so we scheduled the replication job to avoid the backup window.

Do you have any idea how to figure out what exactly is hanging or what the system is waiting for?
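
One thing worth trying the next time it happens (assuming the magic SysRq key is enabled): ask the kernel to log every blocked task with a stack trace, then read the result from the kernel log:

# 'w' dumps all tasks in uninterruptible (blocked) state with stack traces:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100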
 
I figured out more details. During such a hang (a different container, but I unified the VM ID for this thread) there is the following output:

root@pvehe18:~# ps -ax | grep 1003052
28630 ? Ss 0:38 /usr/bin/lxc-start -F -n 1003052
1401558 ? Ss 0:00 /usr/bin/dtach -A /var/run/dtach/vzctlconsole1003052 -r winch -z lxc-attach --clear-env -n 1003052
1401559 pts/2 Ss+ 0:00 lxc-attach --clear-env -n 1003052
2211191 ? S 0:00 lxc-info -n 1003052 -p
3913359 pts/0 S+ 0:00 grep 1003052

By the looks of it, there were two lxc-attach processes running. Hmm... but they could also just be hanging, waiting for the container.
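
To see where exactly those helpers hang, their state and kernel stacks can be read from /proc (PIDs taken from the ps output above; needs root, and /proc/<pid>/stack may be empty on some kernels):

for pid in 1401559 2211191; do
  echo "== $pid =="
  grep ^State: /proc/$pid/status
  cat /proc/$pid/stack 2>/dev/null
done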

(looking further...)

In the log files (source and target system) I could not find anything helpful. There is just the normal log output until it stops; no message about the stop itself. The last thing I could see was a root login on the target system with a logout in the same second (a successful remote execution job, maybe).
 
I saw this when I had one disk in a bad state; when I then clicked to look at other disks, it hung the UI, and it was a bad time all round. In my case it was because my Ceph volume was effectively read-only, since Ceph replication couldn't communicate.

Point is: check that all the items under Disks and Storage are healthy, and whether clicking one of them hangs the UI.
 
Thanks a lot, scyto, but the disks were all okay and under 5 % wearout. Nothing hangs, no matter what I click.

I'm still stuck on the problem. The replication log always hangs at the same point as above; "freeze guest filesystem" would be the next log entry, but it never appears.

[Edit]
In other threads I found that it could be fuse=1 in the container options. I'll try that.
 
Update.

Without the fuse option, it's working as expected.
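
For anyone else hitting this, a minimal sketch of how to check and remove the option (CT ID from this thread; as far as I know, the change only takes effect on the next container start):

# Show whether the fuse feature is set on the container:
pct config 1003052 | grep -i features
# Disable it (if other features like nesting=1 are set, keep them in the list),
# then restart the container:
pct set 1003052 --features fuse=0
pct reboot 1003052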

@proxmox-Team: It would be pretty helpful if there were a warning message in the replication (and backup) log, something like "FUSE is active, I/O deadlocks can occur, see the documentation." That would save a lot of time. And judging by how many forum posts there already are about this topic, I was not the first one.
 
Hi,

@proxmox-Team: It would be pretty helpful if there were a warning message in the replication (and backup) log, something like "FUSE is active, I/O deadlocks can occur, see the documentation." That would save a lot of time. And judging by how many forum posts there already are about this topic, I was not the first one.

please note that this is not an official Proxmox-associated account. Feel free to open an enhancement request on the Bugzilla for adding the warning: https://bugzilla.proxmox.com/
 