Update broke LXC

nigi

Member
Jan 1, 2017
OK, thanks. I've added a few logfiles.
  1. dmesg
  2. /var/log/lxc/104.log
  3. zpool history rpool
  4. lxc-start -n 104 -F -l DEBUG -o ~/lxc-104.log
/var/log/messages doesn't show anything beyond dmesg. Is that helpful?
 

Attachments

  • lxc-104.log (6 KB)
  • zpool history rpool.txt (106 KB)
  • var_log_lxc_104.log (9.5 KB)
  • dmesg.txt (170.4 KB)

nigi

Member
Jan 1, 2017
Do you need any more logs? Or could you find some helpful information in these files?
I am at my wit's end... :-(
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
From the debug log of pct start 104:
Code:
lxc pre-start produced output: unable to detect OS distribution
This could indicate a problem with importing the ZFS filesystems.

As said before: check your journal since the last boot for messages from ZFS/zpool:
Code:
journalctl -b
(there you can search (case-sensitive) by pressing '/')

I hope this helps!
 

nigi

Member
Jan 1, 2017
OK, I've got an example extracted from the log:

Code:
Mär 14 09:47:38 vhost pvesh[3236]: Starting CT 108
Mär 14 09:47:38 vhost pve-guests[9043]: starting CT 108: UPID:vhost:00002353:00001CE4:5E6C9A2A:vzstart:108:root@pam:
Mär 14 09:47:38 vhost pve-guests[3237]: <root@pam> starting task UPID:vhost:00002353:00001CE4:5E6C9A2A:vzstart:108:root@pam:
Mär 14 09:47:38 vhost systemd[1]: Starting PVE LXC Container: 108...
Mär 14 09:47:39 vhost lxc-start[9048]: lxc-start: 108: lxccontainer.c: wait_on_daemonized_start: 865 No such file or directory - Failed to receive the container state
Mär 14 09:47:39 vhost lxc-start[9048]: lxc-start: 108: tools/lxc_start.c: main: 329 The container failed to start
Mär 14 09:47:39 vhost lxc-start[9048]: lxc-start: 108: tools/lxc_start.c: main: 332 To get more details, run the container in foreground mode
Mär 14 09:47:39 vhost lxc-start[9048]: lxc-start: 108: tools/lxc_start.c: main: 335 Additional information can be obtained by setting the --logfile and --logpriority options
Mär 14 09:47:39 vhost systemd[1]: pve-container@108.service: Control process exited, code=exited, status=1/FAILURE
Mär 14 09:47:39 vhost systemd[1]: pve-container@108.service: Failed with result 'exit-code'.
Mär 14 09:47:39 vhost systemd[1]: Failed to start PVE LXC Container: 108.
Mär 14 09:47:39 vhost pve-guests[9043]: command 'systemctl start pve-container@108' failed: exit code 1
Mär 14 09:47:39 vhost kernel: lxc-start[9053]: segfault at 50 ip 00007fb900c70f8b sp 00007ffc22bfc550 error 4 in liblxc.so.1.6.0[7fb900c17000+8a000]
Mär 14 09:47:39 vhost kernel: Code: 9b c0 ff ff 4d 85 ff 0f 85 82 02 00 00 66 90 48 8b 73 50 48 8b bb f8 00 00 00 e8 80 78 fa ff 4c 8b 74 24 10 48 89 de 4c 89 f7 <41> ff 56 50 4c 89 f7 48 89 de 41 ff 56 58 48 8b 83 f8 00 00 00 8b
Mär 14 09:47:39 vhost pvestatd[3103]: unable to get PID for CT 108 (not running?)
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
As said above: please check the journal for the complete boot for messages from _ZFS_. The lxc log is what points to a potential problem with the import of your zpool and the mounting of the datasets.
 

nigi

Member
Jan 1, 2017
OK, you said "complete"? Maybe I've been unteachable :( Sorry.
I've checked the log again, this time from the beginning and line by line. And there was an entry that the root mount point itself is not empty. I don't know why, but I'll find it out. Perhaps it's a result of an unclean shutdown.
After asking Google for help, the solution is pretty simple:
Code:
zfs set overlay=on rpool
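
For anyone hitting the same symptom, here is a minimal sketch (the helper name and logic are illustrative, not from this thread) of how to spot a mountpoint that already contains files, which is exactly what makes `zfs mount` fail unless overlay=on is set:

```shell
# Returns success (0) if the directory exists and is not empty,
# i.e. a plain `zfs mount` onto it would fail without overlay=on.
mountpoint_blocked() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# On a real host you could check every dataset before mounting, e.g.:
#   zfs list -H -o mountpoint | while read -r mp; do
#       mountpoint_blocked "$mp" && echo "$mp is not empty"
#   done
```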

Thank you very much for your help!!!
nigi
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
hmm - glad you found the workaround! However, the issue you're having is most likely due to a corrupt cache file, which is quite easily fixed as described by @oguz in a thread on the pve-devel mailing list (https://pve.proxmox.com/pipermail/pve-devel/2020-March/042054.html).

maybe try to follow the steps described there

OK, you said "complete"? Maybe I've been unteachable :( Sorry.
Sorry - I did not make myself clear - I basically wanted the output of:
`journalctl -b | grep -Ei 'zfs|zpool'`

but in any case - glad your issue is mitigated!
 

hvisage

Active Member
May 21, 2013
I've been seeing something similar for about the past 2 weeks.

It appears that there might either be a race condition, or something not mounting *before* the LXC's root directory and /dev get created.
I didn't do the overlay mount; I found that `rm -rf ${LXC_root_dir}` followed by `pct start ${LXC_id}` works.

I've found this to be problematic especially when the LXC is set to not be started at boot, and also specifically when the host got a reset/power failure (as happened today)... actually this happened on two different PVEs.
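
As a minimal sketch of the `rm`-then-`pct start` workaround above (the wrapper function and the example values for the placeholders are hypothetical, not from the post):

```shell
# Dry-run sketch: prints the two commands instead of running them.
print_workaround() {
    LXC_root_dir=$1   # the container's dataset mountpoint (example value below is hypothetical)
    LXC_id=$2         # the container's numeric id
    echo "rm -rf $LXC_root_dir"   # drop the stale, non-empty root directory
    echo "pct start $LXC_id"      # pct recreates it and the container starts
}

print_workaround /rpool/data/subvol-104-disk-0 104
```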
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
As written in this thread: please check the logs of the last boot (`journalctl -b`) for messages from the ZFS pool import and ZFS mount - that might indicate where the issue is. Also make sure you have a cachefile set and included in your initramfs.

I hope this helps!
 

hvisage

Active Member
May 21, 2013
As written in this thread: please check the logs of the last boot (`journalctl -b`) for messages from the ZFS pool import and ZFS mount - that might indicate where the issue is. Also make sure you have a cachefile set and included in your initramfs.

I hope this helps!
Yes, it seems pvestatd is way too quick out of the blocks, before the ZFS pools got mounted:

ie.:
Code:
root@blacktest:~# journalctl -b|grep -i zfs
Mar 23 14:12:25 blacktest kernel: Command line: initrd=\EFI\proxmox\5.3.18-2-pve\initrd.img-5.3.18-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Mar 23 14:12:25 blacktest kernel: Kernel command line: initrd=\EFI\proxmox\5.3.18-2-pve\initrd.img-5.3.18-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Mar 23 14:12:25 blacktest kernel: ZFS: Loaded module v0.8.3-pve1, ZFS pool version 5000, ZFS filesystem version 5
Mar 23 14:12:26 blacktest systemd[1]: Starting Import ZFS pools by cache file...
Mar 23 14:12:26 blacktest systemd[1]: Started Import ZFS pools by cache file.
Mar 23 14:12:26 blacktest systemd[1]: Reached target ZFS pool import target.
Mar 23 14:12:26 blacktest systemd[1]: Starting Mount ZFS filesystems...
Mar 23 14:12:26 blacktest systemd[1]: Starting Wait for ZFS Volume (zvol) links in /dev...
Mar 23 14:12:26 blacktest systemd[1]: Started Wait for ZFS Volume (zvol) links in /dev.
Mar 23 14:12:26 blacktest systemd[1]: Reached target ZFS volumes are ready.
Mar 23 14:12:26 blacktest systemd[1]: Started Mount ZFS filesystems.
Mar 23 14:12:27 blacktest systemd[1]: Started ZFS Event Daemon (zed).
Mar 23 14:12:27 blacktest systemd[1]: Starting ZFS file system shares...
Mar 23 14:12:27 blacktest systemd[1]: Started ZFS file system shares.
Mar 23 14:12:27 blacktest systemd[1]: Reached target ZFS startup target.
Mar 23 14:12:27 blacktest zed[3478]: ZFS Event Daemon 0.8.3-pve1 (PID 3478)
Mar 23 14:12:42 blacktest pvestatd[4666]: zfs error: cannot open 'hvHdd01': no such pool
Mar 23 14:12:46 blacktest pvestatd[4666]: zfs error: cannot open 'zNVME02': no such pool
Code:
root@blacktest:~# zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
hSSD01    199G  15.4G   184G        -         -     0%     7%  1.00x    ONLINE  -
hvHdd01  5.45T  65.8G  5.39T        -         -     0%     1%  1.00x    ONLINE  -
rpool      14G  2.68G  11.3G        -         -     9%    19%  1.00x    ONLINE  -
zNVME01   952G  44.6G   907G        -         -     3%     4%  1.00x    ONLINE  -
zNVME02    83G  4.88G  78.1G        -         -     0%     5%  1.00x    ONLINE  -
root@blacktest:~#
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
on a hunch - could you try to recreate the cachefile for all your pools and update the initramfs - i.e. for each pool run:
Code:
zpool set cachefile=/etc/zfs/zpool.cache $poolname
and finally:
Code:
update-initramfs -k all -u

(you can verify that all are set by running `strings /etc/zfs/zpool.cache`)

and finally reboot?
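
The per-pool step above can be sketched as a small loop; the DRY_RUN wrapper is my addition for safety - only the `zpool set cachefile=...` and `update-initramfs -k all -u` commands come from the post:

```shell
# Sets (or, with DRY_RUN=1, just prints) the cachefile for each pool given.
set_cachefiles() {
    for pool in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "zpool set cachefile=/etc/zfs/zpool.cache $pool"
        else
            zpool set cachefile=/etc/zfs/zpool.cache "$pool"
        fi
    done
}

# On a real host: set_cachefiles $(zpool list -H -o name) && update-initramfs -k all -u
DRY_RUN=1 set_cachefiles rpool hvHdd01 zNVME02
```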
 

kristoffer

Member
Feb 2, 2018
on a hunch - could you try to recreate the cachefile for all your pools and update the initramfs - i.e. for each pool run:
Code:
zpool set cachefile=/etc/zfs/zpool.cache $poolname
and finally:
Code:
update-initramfs -k all -u

(you can verify that all are set by running `strings /etc/zfs/zpool.cache`)

and finally reboot?
I have the same issue with LXC containers that can't start after updating PVE.

The proposed fix above doesn't work.
I still have to rm -r all mount points and /dev in the subvol, then remount ZFS, and finally start the LXC again.

Code:
~ journalctl -b|grep -i zfs
Mar 23 21:48:56 pve kernel: Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.3.18-2-pve root=ZFS=rpool/ROOT/pve-1 ro fbcon=rotate:3 rootdelay=15 quiet
Mar 23 21:48:56 pve kernel: Kernel command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-5.3.18-2-pve root=ZFS=rpool/ROOT/pve-1 ro fbcon=rotate:3 rootdelay=15 quiet
Mar 23 21:48:56 pve kernel: ZFS: Loaded module v0.8.3-pve1, ZFS pool version 5000, ZFS filesystem version 5
Mar 23 21:48:57 pve systemd[1]: Starting Import ZFS pools by cache file...
Mar 23 21:49:01 pve systemd[1]: Started Import ZFS pools by cache file.
Mar 23 21:49:01 pve systemd[1]: Reached target ZFS pool import target.
Mar 23 21:49:01 pve systemd[1]: Starting Wait for ZFS Volume (zvol) links in /dev...
Mar 23 21:49:01 pve systemd[1]: Starting Mount ZFS filesystems...
Mar 23 21:49:01 pve zfs[6106]: cannot mount '/tank': directory is not empty
Mar 23 21:49:01 pve zfs[6106]: cannot mount '/rpool':
Mar 23 21:49:01 pve zfs[6106]: cannot mount '/rpool': mount failed
Mar 23 21:49:01 pve systemd[1]: zfs-mount.service: Main process exited, code=killed, status=11/SEGV
Mar 23 21:49:01 pve kernel: zfs[6298]: segfault at 0 ip 00007fb1d79ee694 sp 00007fb1c77f6420 error 4 in libc-2.28.so[7fb1d7994000+148000]
Mar 23 21:49:01 pve systemd[1]: Started Wait for ZFS Volume (zvol) links in /dev.
Mar 23 21:49:01 pve systemd[1]: Reached target ZFS volumes are ready.
Mar 23 21:49:02 pve systemd[1]: zfs-mount.service: Failed with result 'signal'.
Mar 23 21:49:02 pve systemd[1]: Failed to start Mount ZFS filesystems.
Mar 23 21:49:02 pve systemd[1]: Started ZFS Event Daemon (zed).
Mar 23 21:49:02 pve systemd[1]: Starting ZFS file system shares...
Mar 23 21:49:02 pve zed[6494]: ZFS Event Daemon 0.8.3-pve1 (PID 6494)
Mar 23 21:49:02 pve containerd[6682]: time="2020-03-23T21:49:02.776986180+01:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
Mar 23 21:49:03 pve dockerd[6696]: time="2020-03-23T21:49:03.215345183+01:00" level=info msg="[graphdriver] using prior storage driver: zfs"
Mar 23 21:49:03 pve dockerd[6696]: time="2020-03-23T21:49:03.548879313+01:00" level=info msg="Docker daemon" commit=afacb8b7f0 graphdriver(s)=zfs version=19.03.8
Mar 23 21:49:17 pve systemd[1]: Started ZFS file system shares.
Mar 23 21:49:17 pve systemd[1]: Reached target ZFS startup target.
 

Stoiko Ivanov

Proxmox Staff Member
May 2, 2018
The proposed fix above doesn't work
please post the output of:
Code:
strings /etc/zfs/zpool.cache
zpool status

Thanks

Mar 23 21:49:02 pve containerd[6682]: time="2020-03-23T21:49:02.776986180+01:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
Mar 23 21:49:03 pve dockerd[6696]: time="2020-03-23T21:49:03.215345183+01:00" level=info msg="[graphdriver] using prior storage driver: zfs"
Mar 23 21:49:03 pve dockerd[6696]: time="2020-03-23T21:49:03.548879313+01:00" level=info msg="Docker daemon" commit=afacb8b7f0 graphdriver(s)=zfs version=19.03.8
On a sidenote: installing Docker on a PVE host is not really supported/well-tested.
 

ChESch

New Member
Sep 2, 2018
Hey guys,
I had the same issue: all my containers would not start anymore. Sometimes they threw errors, sometimes they just said starting OK but did not start. What worked for me were the following steps (involves a bit of downtime). (Thank you to tonci, you really helped me out!)

Let's assume our zpool is called "data_redundant".

  1. Unmount all datasets of the affected zpool (yes, every dataset of the zpool!).
    zfs unmount data_redundant/YOUR_SUBVOL
  2. If you are sure that everything is unmounted, delete the mounted folder at root level (e.g. if your dataset is called data_redundant, delete the folder /data_redundant).
    rm -rf /data_redundant
  3. Then you can restart the zfs-mount.service. If that starts successfully, you can clap your hands!
  4. As oguz suggested, I recreated the cachefile and updated the initramfs.
    zpool set cachefile=/etc/zfs/zpool.cache data_redundant
    update-initramfs -k all -u
  5. After a reboot, everything worked for me again; the containers started without a problem. Just deleting the dev folders in the mounted datastores fixed the issue only temporarily, and after the next reboot everything was back to the beginning.
Hope this works for everybody; if not, I would be happy to hear about your experiences!
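
The steps above can be sketched as a dry run (the wrapper function is illustrative and mine; YOUR_SUBVOL is the placeholder from step 1, left as-is):

```shell
# Dry-run sketch of the recovery sequence: prints each command instead of
# running it, so you can review before applying on a real host.
recover_pool() {
    pool=$1
    echo "zfs unmount $pool/YOUR_SUBVOL"                   # 1. unmount every dataset
    echo "rm -rf /$pool"                                   # 2. remove the stale folder
    echo "systemctl restart zfs-mount.service"             # 3. remount
    echo "zpool set cachefile=/etc/zfs/zpool.cache $pool"  # 4. refresh the cachefile
    echo "update-initramfs -k all -u"
    echo "reboot"                                          # 5. verify after reboot
}

recover_pool data_redundant
```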
 
