Hello, all.
After upgrading one node of our cluster (3 nodes, 2 of them on PVE 6.2; replication runs between the 6.2 nodes: pve2 -> pve1), we have seen that replication regularly hangs on different containers, and only rebooting the whole node helps.
Before the upgrade, everything worked fine in this configuration.
pvesr status shows one container stuck in the SYNCING state and the others pending (for hours and days):
Code:
JobID      Enabled    Target        LastSync              NextSync   Duration   FailCount  State
1029-0     Yes        local/pve1    2020-07-10_21:01:54   pending    5.302523   0          OK
1031-0     Yes        local/pve1    2020-07-10_20:51:55   pending    5.658419   0          SYNCING
1033-0     Yes        local/pve1    2020-07-10_20:52:01   pending    5.484414   0          OK
In the replication log and in journalctl -u pvesr there is nothing unusual:
Code:
2020-07-10 21:02:00 1031-0: start replication job
2020-07-10 21:02:00 1031-0: guest => CT 1031, running => 1
2020-07-10 21:02:00 1031-0: volumes => local-zfs:subvol-1031-disk-1
root@pve2:~# journalctl -u pvesr --since "2 hours ago"
-- Logs begin at Fri 2020-06-26 10:44:19 +10, end at Fri 2020-07-10 22:51:10 +10. --
Jul 10 20:57:25 pve2 systemd[1]: pvesr.service: Succeeded.
Jul 10 20:57:25 pve2 systemd[1]: Started Proxmox VE replication runner.
Jul 10 20:57:25 pve2 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 20:57:27 pve2 systemd[1]: pvesr.service: Succeeded.
Jul 10 20:57:27 pve2 systemd[1]: Started Proxmox VE replication runner.
Jul 10 20:58:00 pve2 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 20:58:07 pve2 systemd[1]: pvesr.service: Succeeded.
Jul 10 20:58:07 pve2 systemd[1]: Started Proxmox VE replication runner.
Jul 10 20:59:00 pve2 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 20:59:01 pve2 systemd[1]: pvesr.service: Succeeded.
Jul 10 20:59:01 pve2 systemd[1]: Started Proxmox VE replication runner.
Jul 10 21:00:00 pve2 systemd[1]: Starting Proxmox VE replication runner...
Jul 10 21:01:16 pve2 pvesr[95858]: trying to acquire cfs lock 'file-replication_cfg' ...
The pvesr runner itself is still there, busy in state R and accumulating CPU time:
Code:
root@pve2:~# ps axf | grep pvesr
95858 ? Rs 67:21 /usr/bin/perl -T /usr/bin/pvesr run --mail 1
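We attached strace to that PID roughly like this (the exact flags are from memory, not a transcript):
Code:
strace -f -tt -p 95858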
The strace output shows (I think) that the problem is with freezing the container:
Code:
openat(AT_FDCWD, "/sys/fs/cgroup/freezer///lxc/1031/freezer.state", O_RDONLY|O_CLOEXEC) = 8
ioctl(8, TCGETS, 0x7ffc48c39de0) = -1 ENOTTY (Not applicable to this ioctl device.)
lseek(8, 0, SEEK_CUR) = 0
fstat(8, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(8, "FREEZING\n", 8192) = 9
read(8, "", 8192) = 0
close(8) = 0
openat(AT_FDCWD, "/sys/fs/cgroup/freezer///lxc/1031/freezer.state", O_RDONLY|O_CLOEXEC) = 8
ioctl(8, TCGETS, 0x7ffc48c39de0) = -1 ENOTTY (Not applicable to this ioctl device.)
lseek(8, 0, SEEK_CUR) = 0
fstat(8, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(8, "FREEZING\n", 8192) = 9
read(8, "", 8192) = 0
close(8) = 0
[... the same open/read/close loop repeats indefinitely, always reading "FREEZING" ...]
lxc-freeze -n 1031 hangs forever (as do other commands: pct stop, pct stop with force, pct enter, etc.).
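For what it's worth, the freezer cgroup can also be inspected directly while it hangs; this is roughly what we look at (cgroup v1 freezer paths, as seen in the strace above):
Code:
# state file stays at FREEZING and never reaches FROZEN
cat /sys/fs/cgroup/freezer/lxc/1031/freezer.state
# kernel stack of every task in the container's freezer cgroup,
# to see which call they are stuck in (needs root)
for pid in $(cat /sys/fs/cgroup/freezer/lxc/1031/tasks); do
    echo "== $pid =="
    cat /proc/$pid/stack 2>/dev/null
done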
lxc-unfreeze shows that the container is stuck in the FREEZING state:
Code:
lxc-unfreeze 1031 20200710131309.947 DEBUG commands - commands.c:lxc_cmd_rsp_recv:162 - Response data length for command "get_init_pid" is 0
lxc-unfreeze 1031 20200710131309.947 DEBUG commands - commands.c:lxc_cmd_rsp_recv:162 - Response data length for command "get_state" is 0
lxc-unfreeze 1031 20200710131309.947 DEBUG commands - commands.c:lxc_cmd_get_state:713 - Container "1031" is in "FREEZING" state
We can kill -KILL the PID of lxc-start, but after that we cannot start the container again (the /sys/fs/cgroup structures for this container are not correctly removed by the kernel), so the only way out is a reboot of the whole node. The alternative is a hard and incorrect way: disable the replication job -> restart pvesr -> delete all snapshots and volumes on the replication target -> re-create the replication job -> full replication of the container to the backup node -> start the container on the other node.
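For reference, the "hard way" above is roughly this sequence (job id 1031-0 and target pve1 from our setup; the schedule value is only an example):
Code:
# on the source node (pve2), after the hung lxc-start / pvesr processes are gone:
pvesr disable 1031-0        # stop the stuck job from being scheduled again
pvesr delete 1031-0         # drop the job definition

# on the target node (pve1): remove the stale replica and all its snapshots
zfs destroy -r rpool/data/subvol-1031-disk-1

# back on the source: re-create the job; the next run does a full sync
pvesr create-local-job 1031-0 pve1 --schedule '*/15'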
pveversion -v:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-4.15: 5.4-19
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-30-pve: 4.15.18-58
pve-kernel-4.15.18-27-pve: 4.15.18-55
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-23-pve: 4.15.18-51
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-15-pve: 4.15.18-40
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.10.17-2-pve: 4.10.17-20
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
pve-zsync: 2.0-3
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
ZFS snapshots of the replicated volume on source and target:
Code:
source:
root@pve2:~# zfs list -t snapshot | grep 1031
rpool/data/subvol-1031-disk-1@__replicate_1031-0_1594378315__ 1,38M - 2,13G -
target:
root@pve1:~# zfs list -t snapshot | grep 1031
rpool/data/subvol-1031-disk-1@__replicate_1031-0_1594378315__ 192K - 2,13G -
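A quick way to confirm that the common replication base is still intact is to compare the snapshot GUID on both sides (snapshot name taken from the listing above); it should print the same value on pve2 and pve1:
Code:
zfs get -H -o value guid rpool/data/subvol-1031-disk-1@__replicate_1031-0_1594378315__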
Any help would be appreciated.