LXC container with Wekan/snapd freezes completely

rholighaus

We are running the wekan snap package in an LXC container and it works well, but about once a day it freezes completely and stops responding to ping requests.
The only way to recover is to stop it with `lxc-stop -n 121 --kill` and restart it with `pct start 121`.

The container's /var/log/syslog just stops at the time it stops responding, so no indication there...

On the PVE node, syslog shows four errors at the time of the freeze, which could be a hint:

Code:
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/core/9066: Permission denied
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/wekan/807: Permission denied
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/core/8935: Permission denied
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/wekan/813: Permission denied

Any idea what we can do to find out what's happening?
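
One way to narrow this down is to collect every path that pvesr failed to open, so the list can be compared with the snap mounts inside the container. The following is a minimal sketch; the function name `pvesr_denied_paths` is made up for illustration, and the log location in the usage comment is an assumption.

```shell
# Extract the paths that pvesr failed to open from syslog-style input.
# Reads log lines on stdin and prints one path per line, duplicates removed.
pvesr_denied_paths() {
    grep -E 'pvesr\[[0-9]+\]: failed to open ' \
        | sed -E 's/.*failed to open ([^:]+): .*/\1/' \
        | sort -u
}

# Usage on the PVE node (log location is an assumption, adjust as needed):
#   pvesr_denied_paths < /var/log/syslog
```

If the same paths show up at every freeze, that would at least tie the messages to the replication run rather than to something inside the container.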

This is the container's config file (/etc/pve/lxc/121.conf):

Code:
arch: amd64
cores: 4
features: keyctl=1,nesting=1,fuse=1
hookscript: local:snippets/pve-hook
hostname: projekte
memory: 2048
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.100.1,hwaddr=B2:7D:AA:xx:xx:xx,ip=192.168.100.121/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: rpool:subvol-121-disk-0,acl=1,size=32G
swap: 512
unprivileged: 1

pveversion --verbose results in:
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Attachment: Bild 15.05.20 um 10.31.jpg (65.2 KB)

Additional information: Looks like the "permission denied" error messages are related to these processes (found after restarting the container):

Code:
root        91     1  0 07:23 ?        00:00:00 squashfuse /var/lib/snapd/snaps/wekan_807.snap /snap/wekan/807 -o ro,nodev,allow_other,suid
root        88     1  0 07:23 ?        00:00:08 squashfuse /var/lib/snapd/snaps/core_9066.snap /snap/core/9066 -o ro,nodev,allow_other,suid
root        90     1  0 07:23 ?        00:00:00 squashfuse /var/lib/snapd/snaps/core_8935.snap /snap/core/8935 -o ro,nodev,allow_other,suid
root        89     1  0 07:23 ?        00:00:38 squashfuse /var/lib/snapd/snaps/wekan_813.snap /snap/wekan/813 -o ro,nodev,allow_other,suid

I don't know much about snap, but maybe somebody knows how to avoid the freezes based on this?
Could it be related to ZFS storage replication issuing an lxc-freeze command?
 
Unfortunately, after updating to 6.2, the container hung again this morning.
The last action was an lxc-freeze, which PVE issued in preparation for a filesystem sync.

Killing the lxc-freeze doesn't help. Even issuing an lxc-stop <id> --kill hangs.
Any idea how to kill the container without having to reboot the node?
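
A rough sketch of what could be checked on the node before resorting to a reboot: the container's freezer state, and whether any processes are stuck in uninterruptible sleep. The cgroup path below assumes the cgroup v1 freezer layout with containers under `lxc/<vmid>`, which may differ on other setups; the function names are hypothetical. `lxc-unfreeze -n 121` is the regular tool for thawing, and the manual write is only a fallback to try if that hangs as well.

```shell
# Assumed cgroup v1 freezer location for containers; override if your layout differs.
FREEZER_ROOT="${FREEZER_ROOT:-/sys/fs/cgroup/freezer/lxc}"

# Print the freezer state (THAWED, FREEZING or FROZEN) of a container.
freezer_state() {
    cat "$FREEZER_ROOT/$1/freezer.state"
}

# Ask the kernel to thaw the container, then print the resulting state.
thaw_container() {
    echo THAWED > "$FREEZER_ROOT/$1/freezer.state"
    freezer_state "$1"
}

# List processes in uninterruptible sleep (state D); these ignore SIGKILL.
blocked_processes() {
    ps -eo pid,stat,wchan:32,args | awk 'NR == 1 || $2 ~ /^D/'
}

# Usage on the PVE node:
#   freezer_state 121
#   thaw_container 121
#   blocked_processes
```

If the state stays at FREEZING and `blocked_processes` lists processes in state D, neither thawing nor killing is likely to work, which would match the hang described above.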
 
Could you please post the output of `pveversion -v`?
Thanks!
 
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
The fixes mentioned are included in pve-container >= 3.1-6; you currently have 3.1-5 installed.
(For now, 3.1-6 is available only in the pve-no-subscription and pvetest repositories.)

Could you try installing 3.1-6 and see if this helps with the issue?
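
To check whether a node already has a pve-container version that includes the fix, a small comparison helper like the following could be used. This is only a sketch: `version_at_least` is a made-up name, and the `sort -V` fallback is an approximation of Debian version ordering for systems without dpkg.

```shell
# Return success if version $1 is at least version $2.
# Uses dpkg's comparison when available, otherwise falls back to GNU sort -V,
# which only approximates Debian version ordering.
version_at_least() {
    if command -v dpkg >/dev/null 2>&1; then
        dpkg --compare-versions "$1" ge "$2"
    else
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
    fi
}

# Usage on the PVE node:
#   installed=$(dpkg-query -W -f='${Version}' pve-container)
#   version_at_least "$installed" 3.1-6 && echo "fix included"
```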

Thanks!
 
Hi Stoiko,

This is a production system on a PVE subscription, so I don't want to risk installing the 3.1-6 packages.
I have disabled replication to avoid the lxc-freeze command and will wait for 3.1-6 to be released for subscription.
Thank you.
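
For reference, disabling a replication job from the command line could look roughly like this. The wrapper name is hypothetical, and the job ID `121-0` in the usage comment is an assumed example; the real ID is whatever `pvesr list` reports for the guest.

```shell
# Disable one replication job, then print the replication status.
# $1 is the job ID as reported by `pvesr list`.
disable_replication_job() {
    pvesr disable "$1" && pvesr status
}

# Usage on the PVE node (job ID is an assumed example):
#   pvesr list
#   disable_replication_job 121-0
# Re-enable later with:
#   pvesr enable 121-0
```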
 
OK, the packages will eventually arrive in the pve-enterprise repository.

However, I just talked with the author of the patch: the issue should not arise after a reboot, because your containers will then be started with the new lxc binaries, which avoids the incompatibility between the freeze and thaw commands.
So a reboot of the PVE nodes should also fix the issue.

I hope this helps!
 
Could it be that this has reappeared? The container freezes again, and the only resolution is to reboot the PVE host...

Code:
# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-5
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
There's always a potential for regression. However, as far as I can see there weren't many recent updates to that part of our code base, so it would be odd for the issue to reappear now.

When did you upgrade the host?

If the issue is reproducible, it could be helpful to start the container in debug mode and then check the debug logs after it freezes (and after the node has been rebooted):
https://pve.proxmox.com/pve-docs/chapter-pct.html#_obtaining_debugging_logs
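
As a sketch of what that looks like in practice, the container can be started in the foreground through `lxc-start` with debug logging written to a file. The wrapper name and the log path under /tmp are only examples; the container should be stopped first.

```shell
# Start a container in the foreground with debug-level logging.
# $1 is the container ID; the log file location is just an example.
start_container_debug() {
    lxc-start -n "$1" -F -l DEBUG -o "/tmp/lxc-$1.log"
}

# Usage on the PVE node:
#   pct stop 121
#   start_container_debug 121
# After the next freeze (and node reboot), inspect /tmp/lxc-121.log.
```

Note that /tmp may not survive a reboot on every setup, so a persistent log location might be the safer choice when the node has to be restarted to recover.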
 
Hi Stoiko,

The container has recently been upgraded from Ubuntu 18.04 LTS to 20.04 LTS.
It seems to freeze when it receives an lxc-freeze command issued by the Proxmox backup routine.
I can sometimes provoke it by trying to create a snapshot using the GUI. The underlying storage is ZFS.

As it's a production system, this is quite burdensome because I always have to reboot the node.

This seems to be exactly the same behaviour as before pve-container >= 3.1-6, so there may be a regression...