LXC container with Wekan/snapd freezes completely

rholighaus

We are running the Wekan snap package in an LXC container. It works well, but about once a day it freezes completely and stops responding to ping requests.
The only way to stop it is lxc-stop -n 121 --kill, after which we restart it using pct start 121.

The container's /var/log/syslog just stops at the time it stops responding, so no indication there...

On the PVE node, the syslog shows four errors at the time of the freeze, which could be a hint:

Code:
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/core/9066: Permission denied
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/wekan/807: Permission denied
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/core/8935: Permission denied
May 15 06:42:32 carrier-1 pvesr[58052]: failed to open /snap/wekan/813: Permission denied

Any idea what we can do to find out what's happening?

This is the container's config file (/etc/pve/lxc/121.conf):

Code:
arch: amd64
cores: 4
features: keyctl=1,nesting=1,fuse=1
hookscript: local:snippets/pve-hook
hostname: projekte
memory: 2048
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.100.1,hwaddr=B2:7D:AA:xx:xx:xx,ip=192.168.100.121/24,type=veth
onboot: 1
ostype: ubuntu
rootfs: rpool:subvol-121-disk-0,acl=1,size=32G
swap: 512
unprivileged: 1

pveversion --verbose results in:
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Attachment: Bild 15.05.20 um 10.31.jpg (65.2 KB)

Additional information: Looks like the "permission denied" error messages are related to these processes (found after restarting the container):

Code:
root        91     1  0 07:23 ?        00:00:00 squashfuse /var/lib/snapd/snaps/wekan_807.snap /snap/wekan/807 -o ro,nodev,allow_other,suid
root        88     1  0 07:23 ?        00:00:08 squashfuse /var/lib/snapd/snaps/core_9066.snap /snap/core/9066 -o ro,nodev,allow_other,suid
root        90     1  0 07:23 ?        00:00:00 squashfuse /var/lib/snapd/snaps/core_8935.snap /snap/core/8935 -o ro,nodev,allow_other,suid
root        89     1  0 07:23 ?        00:00:38 squashfuse /var/lib/snapd/snaps/wekan_813.snap /snap/wekan/813 -o ro,nodev,allow_other,suid

I don't know much about snap, but maybe somebody knows how to avoid the freezes based on this?
Could it be related to ZFS storage replication issuing an lxc-freeze command?
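
A minimal sketch for cross-checking that correlation, assuming the pvesr lines in the node's syslog look exactly like the four quoted earlier. `denied_paths` is a hypothetical helper name, not part of PVE or snapd; it only pulls the paths out of the log so they can be compared with the squashfuse mount points.

```shell
#!/bin/sh
# Read syslog lines on stdin and print the unique paths that pvesr
# reported as "Permission denied". The line format is assumed to match
# the log excerpt in this thread.
denied_paths() {
    sed -n 's/.*pvesr\[[0-9]*]: failed to open \(.*\): Permission denied$/\1/p' | sort -u
}
```

On the node this could be used as `grep pvesr /var/log/syslog | denied_paths`; if every path it prints is a squashfuse mount point from the ps output, the errors are most likely just the host being unable to look inside the container's FUSE mounts.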
 
Unfortunately, after updating to 6.2, the container hung again this morning.
The last action was an lxc-freeze, issued by PVE in preparation for a filesystem sync.

Killing the lxc-freeze doesn't help. Even issuing an lxc-stop <id> --kill hangs.
Any idea how to kill the container without having to reboot the node?
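
To see whether the container is really stuck in the freezer, one could look at the cgroup freezer state and at processes in uninterruptible sleep. The sketch below assumes a cgroup-v1 host; the path /sys/fs/cgroup/freezer/lxc/121/freezer.state is an assumption about where the container's freezer cgroup lives and should be verified on the node. `freezer_state` and `dstate_procs` are hypothetical helper names.

```shell
#!/bin/sh
# Print the cgroup-v1 freezer state stored in the given file
# (THAWED, FREEZING or FROZEN), or UNKNOWN if it cannot be read.
freezer_state() {
    cat "$1" 2>/dev/null || echo "UNKNOWN"
}

# List processes in uninterruptible sleep (state D); these ignore
# signals, including SIGKILL, until the kernel releases them.
dstate_procs() {
    ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
}

# Example (path is an assumption, see above):
#   freezer_state /sys/fs/cgroup/freezer/lxc/121/freezer.state
#   dstate_procs
```

If the state stays at FREEZING, writing THAWED to the same file is the cgroup-v1 way to thaw a group. Whether that resolves this particular hang is not something this thread confirms.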
 
Could you please post the output of `pveversion -v`?
Thanks!
 
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.34-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-1
pve-kernel-helper: 6.2-1
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.3
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-5
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
The fixes mentioned are included in pve-container >= 3.1-6; you currently have 3.1-5 installed.
(For now, 3.1-6 is available in the pve-no-subscription and pvetest repositories.)

Could you try installing 3.1-6 and see if this helps with the issue?
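
A quick way to check a version string against that threshold, as a rough sketch: `version_ge` is a hypothetical helper built on GNU `sort -V`, which orders these particular version strings correctly but is not a full implementation of Debian's version comparison rules.

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2,
# using GNU sort's version ordering (an approximation of dpkg's rules).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: the version reported above versus the fixed one.
if version_ge "3.1-5" "3.1-6"; then
    echo "pve-container already contains the fix"
else
    echo "pve-container is older than 3.1-6"
fi
```

On the node itself, `dpkg-query -W -f='${Version}' pve-container` prints the installed version, and `dpkg --compare-versions "$installed" ge 3.1-6` does the comparison with Debian's own rules.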

Thanks!
 
Hi Stoiko,

This is a production system on a PVE subscription, so I don't want to risk installing the 3.1-6 binaries.
I have disabled replication to avoid the lxc-freeze command and will wait for 3.1-6 to be released to the subscription repository.
Thank you.
 
Ok - the packages will eventually arrive at the pve-enterprise repository

However, I just talked with the author of the patch: the issue should not arise after you reboot, because your containers will then be started with the new lxc binaries, which avoids the incompatibility between the freeze and thaw commands.
So a reboot of the PVE nodes should also fix the issue.

I hope this helps!
 
Could it be that this has reappeared? The container freezes again, and the only resolution is to reboot the PVE host...

Code:
# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.3-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-5
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-10
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
There's always a potential for regression - however AFAICS there weren't many updates to that part of our code-base recently - so it would be odd for the issue to reappear now.

When did you upgrade the host?

If the issue is reproducible, it could help to start the container in debug mode and then check the debug logs after it freezes (and after the reboot of the node):
https://pve.proxmox.com/pve-docs/chapter-pct.html#_obtaining_debugging_logs
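
For reference, the debug start boils down to running lxc-start in the foreground with a DEBUG log level and a log file. The sketch below only prints that command for a given container ID so it can be reviewed before running it; `debug_start_cmd` is a hypothetical helper name, and the log path under /tmp is just a convention that can be changed.

```shell
#!/bin/sh
# Print (do not run) the lxc-start invocation that starts container $1
# in the foreground with debug logging written to /tmp/lxc-$1.log.
debug_start_cmd() {
    echo "lxc-start -n $1 -F -l DEBUG -o /tmp/lxc-$1.log"
}

debug_start_cmd 121
```

The container has to be stopped before it is started this way, and the log file should be collected after the freeze has been reproduced.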
 
Hi Stoiko,

The container has recently been upgraded from Ubuntu 18.04 LTS to 20.04 LTS.
It seems to freeze when it receives an lxc-freeze command issued by the Proxmox backup routine.
I can sometimes provoke it by trying to create a snapshot using the GUI. The underlying storage is ZFS.

As it's a production system, this is quite burdensome, because I always have to reboot the node.

It seems to be exactly the same behaviour as before pve-container >= 3.1-6, so there may be a regression...
 
