LXC won't start after upgrade to 6.1-8

hinchles

New Member
Apr 24, 2020
This upgrade to 6.1-8 has gone completely wrong and is causing me all kinds of problems, not least of which is that it looks like it has killed all the containers and their snapshots/backups, so I can't even restore the containers on a different machine.
It started with the ZFS storage not mounting, which I think I've fixed (it now mounts at boot in the correct path).
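
For reference, this is roughly how I've been checking that the pool actually mounts where it should (VMStorage is my pool, so adjust the name):
Code:
# is the pool dataset mounted at its expected mountpoint?
zfs get mounted,mountpoint VMStorage
# and did the import/mount units run cleanly at boot?
systemctl status zfs-import-cache.service zfs-mount.service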

The new problem is that any time I attempt to start a container, it segfaults. This is as much information as I can find; if you need anything else, let me know.
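
In case it's useful, this is roughly how I captured the debug output further down (I think these are the right lxc-start flags, but correct me if not):
Code:
# run CT 100 in the foreground with full debug logging to a file
lxc-start -n 100 -F -l DEBUG -o /tmp/lxc-100.log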

pveversion -v
Code:
proxmox-ve: 6.1-2 (running kernel: 5.3.18-3-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-9
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph: 14.2.9-pve1
ceph-fuse: 14.2.9-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Debug output of the start attempt:
Code:
lxc-start 100 20200426064505.531 INFO     seccomp - seccomp.c:parse_config_v2:789 - Processing "keyctl errno 38"
lxc-start 100 20200426064505.531 INFO     seccomp - seccomp.c:parse_config_v2:975 - Added native rule for arch 0 for keyctl action 327718(errno)
lxc-start 100 20200426064505.531 INFO     seccomp - seccomp.c:parse_config_v2:984 - Added compat rule for arch 1073741827 for keyctl action 327718(errno)
lxc-start 100 20200426064505.531 INFO     seccomp - seccomp.c:parse_config_v2:994 - Added compat rule for arch 1073741886 for keyctl action 327718(errno)
lxc-start 100 20200426064505.531 INFO     seccomp - seccomp.c:parse_config_v2:1004 - Added native rule for arch -1073741762 for keyctl action 327718(errno)
lxc-start 100 20200426064505.531 INFO     seccomp - seccomp.c:parse_config_v2:1008 - Merging compat seccomp contexts into main context
lxc-start 100 20200426064505.531 INFO     conf - conf.c:run_script_argv:372 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "100", config section "lxc"
lxc-start 100 20200426064506.540 DEBUG    conf - conf.c:run_buffer:340 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 100 lxc pre-start produced output: unable to detect OS distribution

lxc-start 100 20200426064506.555 ERROR    conf - conf.c:run_buffer:352 - Script exited with status 2
lxc-start 100 20200426064506.555 ERROR    start - start.c:lxc_init:897 - Failed to run lxc.hook.pre-start for container "100"
lxc-start 100 20200426064506.555 ERROR    start - start.c:__lxc_start:2032 - Failed to initialize container "100"
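
My guess is that the pre-start hook looks for something like /etc/os-release inside the rootfs (I may be wrong about exactly what it reads); I checked roughly like this:
Code:
pct mount 100
ls -la /var/lib/lxc/100/rootfs/etc/os-release
pct unmount 100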

If I try creating a container from scratch, I get the following error.
Code:
cannot mount '/VMStorage/subvol-106-disk-0': directory is not empty
TASK ERROR: unable to create CT 106 - zfs error: filesystem successfully created, but not mounted

However, the filesystem is showing as mounted and I can see the volumes in there:
Code:
root@kessel:/VMStorage# ls -lga
total 13
drwxr-xr-x 10 root  10 Apr 24 21:16 .
drwxr-xr-x 22 root 241 Apr 25 11:48 ..
drwxr-----  3 root   3 Apr 24 21:11 subvol-100-disk-0
drwxr-----  3 root   3 Apr 24 21:13 subvol-101-disk-0
drwxr-----  3 root   3 Apr 24 21:13 subvol-102-disk-0
drwxr-----  3 root   3 Apr 24 21:13 subvol-103-disk-0
drwxr-----  4 root   4 Apr 24 21:13 subvol-104-disk-0
drwxr-----  2 root   2 Nov 19 10:41 subvol-104-disk-1
drwxr-----  3 root   3 Apr 24 21:11 subvol-105-disk-0
drwxr-----  3 root   3 Apr 24 21:25 subvol-106-disk-0

But if I try "pct mount 100", for example, the rootfs only contains a dev folder, nothing else.
Code:
root@kessel:/VMStorage# pct mount 100
mounted CT 100 in '/var/lib/lxc/100/rootfs'
root@kessel:/VMStorage# cd /var/lib/lxc/100/rootfs/
root@kessel:/var/lib/lxc/100/rootfs# ls
dev
root@kessel:/var/lib/lxc/100/rootfs#

VMs seem to work; it's only containers that are affected. I've been able to create and run VMs even after a reboot.
 
Update: I went through https://forum.proxmox.com/threads/update-broke-lxc.59776/page-2 as others were having similar issues. Resetting the cache file and rebooting doesn't help. I was able to create a new container using a container ID I'd never used before (in this case 120) and it installed and started (from the ZFS pool); however, after rebooting Proxmox it joined the other containers in no longer wanting to run. The ZFS-related lines from the boot journal are below.
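
For clarity, by "resetting the cache file" I mean something along these lines (VMStorage is my pool; I'm only going off what I understood from that thread, so correct me if this isn't the right way):
Code:
# regenerate the zpool cache file for the pool and rebuild the initramfs
zpool set cachefile=/etc/zfs/zpool.cache VMStorage
update-initramfs -u -k all
reboot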

Code:
root@kessel:~# journalctl -b | grep -Ei 'zfs|zpool'
Apr 26 08:40:29 kessel systemd-modules-load[556]: Inserted module 'zfs'
Apr 26 08:40:29 kessel kernel: ZFS: Loaded module v0.8.3-pve1, ZFS pool version 5000, ZFS filesystem version 5
Apr 26 08:40:31 kessel systemd[1]: Starting Import ZFS pools by cache file...
Apr 26 08:40:36 kessel systemd[1]: Started Import ZFS pools by cache file.
Apr 26 08:40:36 kessel systemd[1]: Reached target ZFS pool import target.
Apr 26 08:40:36 kessel systemd[1]: Starting Wait for ZFS Volume (zvol) links in /dev...
Apr 26 08:40:36 kessel systemd[1]: Starting Mount ZFS filesystems...
Apr 26 08:40:37 kessel zfs[1604]: cannot mount '/VMStorage/subvol-102-disk-0':
Apr 26 08:40:37 kessel zfs[1604]: cannot mount '/VMStorage/subvol-102-disk-0': mount failed
Apr 26 08:40:37 kessel zfs[1604]: cannot mount '/VMStorage/subvol-105-disk-0': directory is not empty
Apr 26 08:40:37 kessel zfs[1604]: cannot mount '/VMStorage/subvol-100-disk-0': directory is not empty
Apr 26 08:40:37 kessel zfs[1604]: cannot mount '/VMStorage/subvol-103-disk-0': directory is not empty
Apr 26 08:40:37 kessel zfs[1604]: cannot mount '/VMStorage/subvol-101-disk-0': directory is not empty
Apr 26 08:40:37 kessel systemd[1]: Started Wait for ZFS Volume (zvol) links in /dev.
Apr 26 08:40:37 kessel systemd[1]: Reached target ZFS volumes are ready.
Apr 26 08:40:37 kessel systemd[1]: zfs-mount.service: Main process exited, code=killed, status=11/SEGV
Apr 26 08:40:37 kessel kernel: zfs[1662]: segfault at 0 ip 00007fea03052694 sp 00007fe9f9ff3420 error 4 in libc-2.28.so[7fea02ff8000+148000]
Apr 26 08:40:37 kessel kernel: zfs[1641]: segfault at 0 ip 00007fea03068554 sp 00007fea0277d478 error 4 in libc-2.28.so[7fea02ff8000+148000]
Apr 26 08:40:37 kessel systemd[1]: zfs-mount.service: Failed with result 'signal'.
Apr 26 08:40:37 kessel systemd[1]: Failed to start Mount ZFS filesystems.
Apr 26 08:40:38 kessel systemd[1]: Started ZFS Event Daemon (zed).
Apr 26 08:40:38 kessel systemd[1]: Starting ZFS file system shares...
Apr 26 08:40:38 kessel zed[1707]: ZFS Event Daemon 0.8.3-pve1 (PID 1707)
Apr 26 08:40:38 kessel systemd[1]: Started ZFS file system shares.
Apr 26 08:40:38 kessel systemd[1]: Reached target ZFS startup target.

Note there's no mention of it failing to mount subvol-104, but CT 104 won't start either.
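
This should give a per-dataset view of what is actually mounted, in case it helps (assuming I'm reading the zfs list output right):
Code:
zfs list -r -o name,mounted,mountpoint VMStorage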
 
I had a similar issue, and it turned out that, although I thought the zpool was correctly mounted, it wasn't. Also, the zpool.cache file had a size of 0, which wasn't right either. So I manually stopped all PVE services, and when I tried to export the zpool, something immediately imported it back again. When I took a closer look, it turned out to be the pvedaemon doing that.

After killing the pvedaemon, I was able to export the zpool and then inspect the mount point, which actually hadn't been empty. So I removed the mountpoint folder completely and imported the zpool again. That brought everything back to normal; the subsequent PVE host reboot went without a hitch and I was able to run my LXC containers again.
 
My cache file has a size, so it's not that. I've unmounted ZFS, deleted the mount point, then rebooted; it re-creates the mount point, but the end result is the same as before: the pool is apparently mounted and I can see the subvols (although they all appear to be empty apart from the dev folder?), but none of the CTs will start.

Surely I shouldn't need to export the pool, as it's already imported and the system is aware of it.
 
I wouldn't tie this to the size of the zpool.cache file… and you won't know what's going on with the mount folder unless you export the zpool. The log clearly showed an issue with the mount point, so why not give it a try? All I am saying is that I experienced the same behaviour until I manually exported the zpool and cleaned up the mountpoint. And in order to properly export the zpool, it seems the pvedaemon has to be shut down, because otherwise it will immediately re-import the zpool and then you'll be cleaning up the contents of the ZFS root folder for that pool, not the stale folder on the PVE local volume.
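
If it helps, one way to tell which of the two you're actually looking at is something like this (pool name taken from your output above, so adjust as needed):
Code:
# which filesystem backs the path: the ZFS dataset or the root filesystem?
findmnt -T /VMStorage
zfs get mounted VMStorage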
 
Can you give me the exact command chain to do it so I don't screw it up? I'm a software dev, not really a sysadmin, so this is all a little beyond my comfort/knowledge level.
 
Recalling from my session yesterday, this is what I did to be able to export my zpool properly:

Code:
systemctl stop pve-manager
systemctl stop zed
systemctl stop pve-storage
systemctl stop pvedaemon
systemctl stop pve-cluster
systemctl stop pveproxy
systemctl stop pvestatd
systemctl stop pve-ha-crm
systemctl stop pve-firewall
systemctl stop pvefw-logger
systemctl stop pve-ha-lrm
zpool export <ZPOOL>
<wait a couple of seconds>
zpool list
<check up on the mount point, which should be empty or non-existent at this point>

I'd wait a couple of seconds between both zpool commands to see if the exported zpool stays exported. Then you can examine the zpool mount point on / to see if there's anything there. Just keep in mind that ZFS will create the mount folder on its own, so once the zpool is exported you can also just delete any remains of that folder.
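
And roughly the cleanup and re-import that follow; I'm reconstructing this from memory, so double-check the pool name and that the export really succeeded before deleting anything:
Code:
# ONLY once the pool is confirmed exported (it no longer shows up in zpool list)
rm -rf /VMStorage              # remove the stale directory left on the root filesystem
zpool import VMStorage         # re-import; ZFS recreates the mountpoint on its own
zfs mount -a
zfs get mounted,mountpoint VMStorage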
 
OK, as soon as I exported it the mount point became empty, so to be safe I deleted the mount point /VMStorage and restarted the PVE services. This auto-mounted the ZFS pool, and now every container apart from one appears to boot; I think I can fix that one, or if nothing else I can run a new container and re-pull the code down from git.
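
For anyone following along, this is roughly what I'm checking after each change before calling it fixed (CT 100 just as an example):
Code:
zpool status VMStorage
zfs get mounted,mountpoint VMStorage
pct start 100 && pct status 100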

A few more tests to run, including a reboot, but it's looking OK so far. Thank you.


Any clue why an update would break ZFS mounting as badly as this? It never has on previous updates, including from 5.x to 6.x.
 
No, I actually don't know what caused this, but I have seen a couple of threads on this forum dealing with stuff that seemed to stem from this issue, which I think was introduced in 6.1-8. I can't tell for sure, since I only rarely restart my guests/containers. I had a weird PVE host reboot last Thursday, probably caused by a repeatedly invoked OOM killer from one of my containers, which is where I noticed this for the first time.
 
It's strange that only containers were affected; the VMs ran fine, and I could create containers fine too, it was just after a reboot that they were dead.
Still got one container not coming back up, but I think I can restore that from a backup from about 48 hours ago, so no major loss.
 
I think the issue with containers is that these are sub-volumes under the ZFS root mount point, and since that one got messed up, these sub-vols couldn't be mounted. ZVOLs, on the other hand, are not mounted but are treated as raw devices, hence none of my VM guests had any issue starting up and accessing their volumes.
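
You can see the difference on the pool itself with something like this (pool name taken from earlier in the thread):
Code:
# container rootfs = ZFS filesystem datasets, which have to be mounted
zfs list -t filesystem -r -o name,mountpoint,mounted VMStorage
# VM disks = zvols, exposed as block devices under /dev/zvol, no mounting involved
zfs list -t volume -r -o name,volsize VMStorage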
 
Ahh, so containers store their data on the ZFS pool differently to VMs? Nice to know :)

Also good news: rolling back to the latest snapshot of the broken container (Friday's) seems to have fixed it too, so the snapshots look to have been OK; just the latest state of it was broken.
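
(For the record, roughly what I did; the CT ID and snapshot name below are placeholders rather than my actual ones:)
Code:
pct listsnapshot <CTID>
pct rollback <CTID> <snapshot_name>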

Still haven't tested a host reboot yet :oops:
 
