lxc container on zfs won't start after reboot of pve host

I figured it out:

I looped this and it worked after 3 tries:
Code:
systemctl restart zfs-mount
journalctl -xb   # check whether anything got created under /rpool/data/*

# if not mounted yet -- take care, you could delete container data!
rm -rf /rpool/data/*

Somehow Proxmox created /rpool/data/ct-id-something-blabla/rootfs/data in the empty data directory.
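
For anyone following the same route, a slightly safer variant is to confirm the dataset really is unmounted and only contains leftover stubs before deleting anything (just a sketch; rpool/data is the dataset from above, adjust to your setup):

Code:
# verify the dataset is NOT mounted before touching its directory
zfs get -H -o value mounted rpool/data       # should print "no"
ls -la /rpool/data                           # should only show leftover stub directories

# only then clear the stubs and retry the mount
rm -rf /rpool/data/*
systemctl restart zfs-mount.service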
 
If it's still true that the CT does not start with the new package versions because ZFS cannot mount the dataset, as the mountpoint directory isn't empty, then yes, this seems like another bug - albeit I find that weird.

Can you actually check what is in the directory that's hindering the mount? Something like:

Bash:
ls -lart "/$(zfs get -H -o name mountpoint  datapool-01/vm-crypt/subvol-108-disk-0)"
 
Hi,

I just did an apt-get upgrade and a reboot, but still no luck: the LXC won't start after the reboot.

Here is the output of the command:

Code:
root@pve01:~# ls -lart "/$(zfs get -H -o name mountpoint  datapool-01/vm-crypt/subvol-108-disk-0)"
total 1
drwxr-xr-x 3 root root 3 Sep 12 09:57 ..
drwxr----- 2 root root 2 Sep 12 09:57 .

I executed it ...
- once after the reboot
- again after zfs load-key
- again after the first attempt to start the lxc

The result was the same every time.
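
For reference, the key and mount state of the dataset can be checked in one go like this (a sketch, using the dataset name from above):

Code:
# keystatus should be "available" after zfs load-key, mounted should be "yes"
zfs get -H -o property,value keystatus,mounted datapool-01/vm-crypt/subvol-108-disk-0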

cheers
Michael
 
Hi,
Any news on this?
Can I do anything to help get this bug fixed?
Provide additional logs / tests?

cheers
Michael
 
I am also experiencing this issue, using pve-container 3.2-2. In my case, I do not even see a "/dev" directory in the mounting path. I am able to use the workaround described above (delete mounting path and use "zfs mount XXX"), but this is not a good long-term solution.

Here are the relevant logs for me (attached debug log, since it was too large to post):
Code:
Web console error:
TASK ERROR: zfs error: '/tank/vm-dev/subvol-100-disk-1': not a ZFS filesystem

Code:
root@pve1:~# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-15 (running version: 6.2-15/48bd51b6)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.4-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-6
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-4
pve-xtermjs: 4.7.0-2
pve-zsync: 2.0-3
qemu-server: 6.2-18
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

Code:
root@pve1:~# lxc-start -d -n 100 -F -l DEBUG -o /tmp/lxc-100.log
lxc-start: 100: conf.c: mount_autodev: 1074 Permission denied - Failed to create "/dev" directory
lxc-start: 100: conf.c: lxc_setup: 3238 Failed to mount "/dev"
lxc-start: 100: start.c: do_start: 1224 Failed to setup container "100"
lxc-start: 100: sync.c: __sync_wait: 41 An error occurred in another process (expected sequence number 5)
lxc-start: 100: start.c: __lxc_start: 1950 Failed to spawn container "100"
lxc-start: 100: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 100: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options
 

Attachments

  • lxc-100.log
After trying yet another workaround for this, I found simply mounting the ZFS subvolume allowed me to start the container:
Code:
zfs mount tank/vm-dev/subvol-100-disk-1
pct start 100
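
If more than one container is affected, a small loop along these lines should mount whatever is still unmounted (just a sketch; it assumes all container subvols live under tank/vm-dev):

Code:
# mount every ZFS filesystem under tank/vm-dev that is not mounted yet
zfs list -H -r -o name,mounted tank/vm-dev | awk '$2 == "no" {print $1}' | xargs -r -n1 zfs mount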
 
Figured it out - my zpool mountpoint path existed on the root file system, preventing the pool from mounting at all. To solve this once and for all, I issued this command (after stopping all containers/VMs):
Code:
zpool export tank && rm -rf /tank

This cleared out any existing directory structure on the pool's mount point and allowed the zpool to not only import but also to mount all necessary ZFS filesystems under /tank.
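
Spelled out, that recovery looks roughly like this (just a sketch; tank is the pool from above, and the import/mount step may also happen automatically on the next boot):

Code:
# stop all containers/VMs using the pool first
zpool export tank
rm -rf /tank            # removes only the stale directory structure left on the root fs
zpool import tank       # or reboot; the pool should now import cleanly
zfs mount -a            # mounts all datasets back under /tank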

You can check whether this issue affects your system by looking at the output of df on the pool's mount path. For example:

Example of "masked" zpool mount path ("Mounted on" = system root '/'):
Code:
root@pve1:~# df /tank
Filesystem           1K-blocks     Used Available Use% Mounted on
/dev/mapper/pve-root  59600812 37869940  18673620  67% /

Example of proper zpool mount path ("Filesystem" = pool name):
Code:
root@pve1:~# df /tank
Filesystem      1K-blocks   Used  Available Use% Mounted on
tank           6802108032 504064 6801603968   1% /tank
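
To run the same check across every imported pool in one go, a small loop like this should work (a sketch; it assumes the pools use regular mountpoints rather than "legacy" or "none"):

Code:
# flag any pool whose mountpoint is still backed by another filesystem (e.g. the root fs)
for p in $(zpool list -H -o name); do
    mp=$(zfs get -H -o value mountpoint "$p")
    src=$(df --output=source "$mp" | tail -n 1)
    [ "$src" = "$p" ] || echo "pool $p looks masked: $mp is backed by $src"
done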
 
Thinking back to how this might have happened in the first place, I vaguely recall creating a Directory storage on top of a ZFS file system, so that I could store backups, snippets and ISOs on a ZFS-backed file system. However, I think this can cause those paths to be created on the root file system if for any reason the pool doesn't import on boot, leaving a directory structure on the pool's mount point. Don't do that!
 
Yes, this is very likely the cause of the issue.

Having a directory storage on the zpool is not really recommended for exactly this reason: if the directory storage gets activated first during boot, the zpool won't be able to mount (directory not empty).
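
If a directory storage on a ZFS dataset is really needed, it should help to tell PVE that the path is a mountpoint so the directory isn't pre-created before the pool is mounted - if I remember the option names correctly (please double-check the storage.cfg/pvesm documentation); <dir-storage-id> is a placeholder for your storage ID:

Code:
# mark the directory storage as a mountpoint and don't auto-create its path
pvesm set <dir-storage-id> --is_mountpoint yes --mkdir 0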
 
Thanks for the confirmation. I ended up hosting ISOs, templates and snippets inside a container instead.
 
Hi Elliott,

just to make sure: do you have the same issue because you are also using an encrypted ZFS dataset as storage for the LXC?

Because I haven't been able to get the container running after a reboot.
My workaround is still to delete and restore the container from backup :-/

cheers
Michael
 
No, I'm not using an encrypted ZFS dataset. I just have the same issue of the container not starting after a reboot, with the same error message, "Permission denied - Failed to create "/dev" directory". I think the issue I had could also present itself with encrypted datasets.
 
OK, so the next thing I will try is to move my container to a storage that does not need to be unlocked after reboot (a network share or a local unencrypted ZFS pool).

Then I will see whether the storage being unavailable during the boot process of the PVE host is the reason...

Where exactly is your LXC hosted? On local storage or a network share?
 
My LXC is hosted on a local ZFS dataset.
 
Hm. I moved the container's disk to a new storage (CIFS).
After that, the container wouldn't start:

Code:
()
run_buffer: 323 Script exited with status 255
lxc_init: 797 Failed to run lxc.hook.pre-start for container "108"
__lxc_start: 1896 Failed to initialize container "108"
TASK ERROR: startup for container '108' failed

A regular VM on the same storage works fine


EDIT:

Deleted it and restored it to the same storage: no luck.
Deleted it and restored it to a local unencrypted ZFS pool -> working again.
Aren't containers supposed to work on shared CIFS storage?

Next step: reboot the host and see if it starts again after that.
 
