Hi everyone!
We run a cluster of 23 hosts with a bunch of LXCs on them and use the storage replication feature to replicate all ZFS resources. Usually everything works beautifully!
But regularly we face the problem that something freezes. When this happens, we find the following state:
- The LXC whose replication is hanging freezes (suspended?)
- The host is shown in the GUI with question marks, as are all its other LXCs/resources
- The other LXCs on the host keep running normally
- Load is normal; nothing at 100 % CPU and nothing in D-state
I found this thread: https://forum.proxmox.com/threads/storage-replication-regulary-hangs-after-upgrade.72690/ , but unfortunately it did not help in our case.
root@pvehe18:~# journalctl -u pvesr
-- No entries --
root@pvehe18:~# pvesr status
JobID Enabled Target LastSync NextSync Duration FailCount State
1003052-0 Yes local/pvehe19 2023-08-21_22:15:01 pending 78.254903 0 SYNCING
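Since the job sits in SYNCING forever, what we now try first is to look for the worker processes behind it. A rough sketch (the grep patterns are just our guess at the relevant process names):

```shell
# Look for the replication worker and any ZFS transfer it started.
# A hung "zfs send" / "zfs recv" (or the ssh pipe between the nodes)
# is usually the process everything else is waiting on.
ps -ef | grep -E 'zfs (send|recv)|pvesr|__replicate_' | grep -v grep

# If a PID shows up, check its state and kernel wait channel:
# ps -o pid,stat,wchan:32,cmd -p <PID>    # <PID> is a placeholder
```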
root@pvehe18:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-6-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
proxmox-kernel-6.2: 6.2.16-7
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
root@pvehe19:~# pveversion -v
proxmox-ve: 8.0.2 (running kernel: 6.2.16-6-pve)
pve-manager: 8.0.4 (running version: 8.0.4/d258a813cfa6b390)
pve-kernel-6.2: 8.0.5
proxmox-kernel-helper: 8.0.3
proxmox-kernel-6.2.16-6-pve: 6.2.16-7
proxmox-kernel-6.2: 6.2.16-7
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.4
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.7
libpve-guest-common-perl: 5.0.4
libpve-http-server-perl: 5.0.4
libpve-rs-perl: 0.8.5
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
proxmox-backup-client: 3.0.2-1
proxmox-backup-file-restore: 3.0.2-1
proxmox-kernel-helper: 8.0.3
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.3
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.3
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-4
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
root@pvehe18:~# zfs list -t snapshot | grep 1003052
zp_pve/subvol-1003052-disk-0@__replicate_1003052-0_1692648901__ 6.31M - 8.65G -
zp_pve/subvol-1003052-disk-1@__replicate_1003052-0_1692648901__ 327M - 54.5G -
zp_pve/subvol-1003052-disk-2@__replicate_1003052-0_1692648901__ 1.39M - 317G -
zp_pve/subvol-1003052-disk-3@__replicate_1003052-0_1692648901__ 284K - 1.61G -
zp_pve/subvol-1003052-disk-4@__replicate_1003052-0_1692648901__ 0B - 96K -
zp_pve/subvol-1003052-disk-5@__replicate_1003052-0_1692648901__ 0B - 96K -
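The replication snapshots are still present on the source. In case something (e.g. a hung receive on the target side) keeps a hold on them, checking holds might be worth it; a sketch, with the dataset name taken from the listing above:

```shell
# Check for user holds on one of the stuck replication snapshots.
# An empty result means nothing is holding this snapshot; repeat per
# disk, and run the same on the target node (pvehe19) as well.
zfs holds zp_pve/subvol-1003052-disk-0@__replicate_1003052-0_1692648901__
```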
root@pvehe18:~# service pvesr status
Unit pvesr.service could not be found.
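As far as I understand, that is expected: since PVE 7 there is no pvesr.service anymore and scheduled replication is run by the pvescheduler daemon (unit name assumed to be the default), so presumably its journal is the place to look:

```shell
# Replication jobs are executed by pvescheduler, not a pvesr unit,
# so inspect its journal around the time of the hang:
journalctl -u pvescheduler --since "2023-08-21 22:00"
```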
2023-08-21 22:30:01 1003052-0: start replication job
2023-08-21 22:30:01 1003052-0: guest => CT 1003052, running => 1
2023-08-21 22:30:01 1003052-0: volumes => zp_pve:subvol-1003052-disk-0,zp_pve:subvol-1003052-disk-1,zp_pve:subvol-1003052-disk-2,zp_pve:subvol-1003052-disk-3,zp_pve:subvol-1003052-disk-4,zp_pve:subvol-1003052-disk-5
"pct status 1003052": This command hangs and never returns.
"strace <PID of pct status>": strace: Can't stat '<PID>': file or directory not found
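A note on the strace error: without -p, strace treats its argument as a program to execute, which is why it complains that it cannot stat the PID. Attaching to the already-running process should work like this (the pgrep lookup is just an illustration):

```shell
# Attach to the hung "pct status" process instead of trying to execute
# the PID as a program; -f follows forks, -tt adds timestamps.
strace -f -tt -p "$(pgrep -of 'pct status')"
```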
Other considerations: we already found in the forum that it's not a good idea to have replication and backup running at the same time, so we scheduled the replication jobs to not overlap with the backup window.
Do you have any idea how to figure out what exactly is hanging, or what the system is waiting for?
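For completeness, this is roughly how we checked for blocked tasks so far (nothing suspicious turned up); pointers to deeper inspection would be very welcome:

```shell
# List any process in uninterruptible sleep (D in the STAT column)
# together with the kernel function it is waiting in.
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# For a candidate PID, the kernel stack usually names the subsystem:
# cat /proc/<PID>/stack          # <PID> is a placeholder
# echo w > /proc/sysrq-trigger   # dumps all blocked tasks to dmesg
```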