LXC container on ZFS won't start after reboot of PVE host

His.Dudeness

Hi @all,

after a reboot of my PVE host the LXC container (ID 108) will not start. It was the first reboot of the host since I created the container.

The disk of the container resides on an encrypted ZFS dataset that needs to be unlocked after boot.
All other regular VMs (Linux and Windows) that have their disks in the same ZFS parent dataset work fine.

Here are some logs, as requested:

Code:
pct config 108
arch: amd64
cores: 2
hostname: SYNC01
memory: 512
nameserver: 192.168.166.10
net0: name=eth0,bridge=vmbr0,hwaddr=EE:C2:99:1D:2C:AB,ip=192.168.168.10/24,tag=3,type=veth
net1: name=eth1,bridge=vmbr0,gw=192.168.167.1,hwaddr=5E:F0:18:B3:20:2C,ip=192.168.167.10/24,tag=2,type=veth
ostype: debian
rootfs: vm-crypt:subvol-108-disk-0,size=30G,mountoptions=noatime
swap: 512
unprivileged: 1


Code:
cat /etc/pve/storage.cfg
dir: local
        path /var/lib/vz
        content backup,iso,vztmpl

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

zfspool: vm-crypt
        pool datapool-01/vm-crypt
        content rootdir,images
        mountpoint /datapool-01/vm-crypt
        sparse 0

cifs: FN01_pve
        path /mnt/pve/FN01_pve
        server 192.168.166.18
        share PVE
        content images
        domain fn01
        maxfiles 1
        username pveuser

dir: FN01_pve_backup_monthly
        path /mnt/pve/FN01_pve/pve-backup/monthly
        content backup
        maxfiles 2
        shared 0

dir: FN01_pve_backup_weekly
        path /mnt/pve/FN01_pve/pve-backup/weekly
        content backup
        maxfiles 2
        shared 0

dir: FN01_pve_iso
        path /mnt/pve/FN01_pve/pve-iso
        content iso
        shared 0

dir: FN01_pve_lxctemplates
        path /mnt/pve/FN01_pve/pve-lxctemplates
        content vztmpl
        shared 0

Code:
zfs get all datapool-01/vm-crypt
NAME                  PROPERTY              VALUE                  SOURCE
datapool-01/vm-crypt  type                  filesystem             -
datapool-01/vm-crypt  creation              Sun Jun  7 18:45 2020  -
datapool-01/vm-crypt  used                  496G                   -
datapool-01/vm-crypt  available             364G                   -
datapool-01/vm-crypt  referenced            192K                   -
datapool-01/vm-crypt  compressratio         1.00x                  -
datapool-01/vm-crypt  mounted               no                     -
datapool-01/vm-crypt  quota                 none                   default
datapool-01/vm-crypt  reservation           none                   default
datapool-01/vm-crypt  recordsize            128K                   default
datapool-01/vm-crypt  mountpoint            /datapool-01/vm-crypt  default
datapool-01/vm-crypt  sharenfs              off                    default
datapool-01/vm-crypt  checksum              on                     default
datapool-01/vm-crypt  compression           off                    default
datapool-01/vm-crypt  atime                 on                     default
datapool-01/vm-crypt  devices               on                     default
datapool-01/vm-crypt  exec                  on                     default
datapool-01/vm-crypt  setuid                on                     default
datapool-01/vm-crypt  readonly              off                    default
datapool-01/vm-crypt  zoned                 off                    default
datapool-01/vm-crypt  snapdir               hidden                 default
datapool-01/vm-crypt  aclinherit            restricted             default
datapool-01/vm-crypt  createtxg             2929                   -
datapool-01/vm-crypt  canmount              on                     default
datapool-01/vm-crypt  xattr                 on                     default
datapool-01/vm-crypt  copies                1                      default
datapool-01/vm-crypt  version               5                      -
datapool-01/vm-crypt  utf8only              off                    -
datapool-01/vm-crypt  normalization         none                   -
datapool-01/vm-crypt  casesensitivity       sensitive              -
datapool-01/vm-crypt  vscan                 off                    default
datapool-01/vm-crypt  nbmand                off                    default
datapool-01/vm-crypt  sharesmb              off                    default
datapool-01/vm-crypt  refquota              none                   default
datapool-01/vm-crypt  refreservation        none                   default
datapool-01/vm-crypt  guid                  14726563484045534726   -
datapool-01/vm-crypt  primarycache          all                    default
datapool-01/vm-crypt  secondarycache        all                    default
datapool-01/vm-crypt  usedbysnapshots       0B                     -
datapool-01/vm-crypt  usedbydataset         192K                   -
datapool-01/vm-crypt  usedbychildren        496G                   -
datapool-01/vm-crypt  usedbyrefreservation  0B                     -
datapool-01/vm-crypt  logbias               latency                default
datapool-01/vm-crypt  objsetid              271                    -
datapool-01/vm-crypt  dedup                 off                    default
datapool-01/vm-crypt  mlslabel              none                   default
datapool-01/vm-crypt  sync                  standard               default
datapool-01/vm-crypt  dnodesize             legacy                 default
datapool-01/vm-crypt  refcompressratio      1.00x                  -
datapool-01/vm-crypt  written               192K                   -
datapool-01/vm-crypt  logicalused           384G                   -
datapool-01/vm-crypt  logicalreferenced     69K                    -
datapool-01/vm-crypt  volmode               default                default
datapool-01/vm-crypt  filesystem_limit      none                   default
datapool-01/vm-crypt  snapshot_limit        none                   default
datapool-01/vm-crypt  filesystem_count      none                   default
datapool-01/vm-crypt  snapshot_count        none                   default
datapool-01/vm-crypt  snapdev               hidden                 default
datapool-01/vm-crypt  acltype               off                    default
datapool-01/vm-crypt  context               none                   default
datapool-01/vm-crypt  fscontext             none                   default
datapool-01/vm-crypt  defcontext            none                   default
datapool-01/vm-crypt  rootcontext           none                   default
datapool-01/vm-crypt  relatime              off                    default
datapool-01/vm-crypt  redundant_metadata    all                    default
datapool-01/vm-crypt  overlay               off                    default
datapool-01/vm-crypt  encryption            aes-192-gcm            -
datapool-01/vm-crypt  keylocation           prompt                 local
datapool-01/vm-crypt  keyformat             passphrase             -
datapool-01/vm-crypt  pbkdf2iters           342K                   -
datapool-01/vm-crypt  encryptionroot        datapool-01/vm-crypt   -
datapool-01/vm-crypt  keystatus             available              -
datapool-01/vm-crypt  special_small_blocks  0                      default

Code:
lxc-start -n 108 -F -l DEBUG -o /tmp/lxc-ID.log
lxc-start: 108: conf.c: mount_autodev: 1074 Permission denied - Failed to create "/dev" directory
lxc-start: 108: conf.c: lxc_setup: 3311 Failed to mount "/dev"
lxc-start: 108: start.c: do_start: 1231 Failed to setup container "108"
lxc-start: 108: sync.c: __sync_wait: 41 An error occurred in another process (expected sequence number 5)
lxc-start: 108: start.c: __lxc_start: 1957 Failed to spawn container "108"
lxc-start: 108: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 108: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options
 
Hi Oguz,
thanks for your reply! :)

You can find the log file attached. :)

Regarding the workaround from the other thread: do you mean these two commands?

Code:
zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

update-initramfs -u -k all


I don't think the problem is that the pool doesn't get imported, because the KVM VMs in the same dataset work fine.
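For reference, whether the pool is actually imported after boot can be checked with something like this (using the pool name from this setup):

Code:
zpool list datapool-01
zpool status datapool-01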

What exactly do the two commands do, and would they affect my other VMs on the pool?

thanks a lot!

Michael
 

Attachments

  • lxc-start_log-1.txt
    22.7 KB · Views: 3
What exactly do the two commands do, and would they affect my other VMs on the pool?

The first sets the cachefile parameter for the pool. If you read that thread there's a more detailed explanation, but basically the cachefile sometimes doesn't get updated after adding a new pool, so the pool doesn't get picked up at boot until you activate it somehow.

The second just updates the initramfs for the next boot.
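Applied to the pool from this thread, it would look roughly like this (assuming datapool-01 is the pool that should end up in the cachefile):

Code:
zpool set cachefile=/etc/zfs/zpool.cache datapool-01
update-initramfs -u -k all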


In the log, the first thing I notice is:
Code:
lxc-start 108 20200903112049.267 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 108 lxc pre-start produced output: /etc/os-release file not found and autodetection failed, falling back to 'unmanaged'

This is a Debian container? Could you try running pct mount CTID and checking the mounted filesystem? The container falling back to 'unmanaged' could be causing issues.
 
Hi,

I tried to mount and unmount. That seems to work:

Code:
pct mount 108
mounted CT 108 in '/var/lib/lxc/108/rootfs'
pct unmount 108

You wrote:

so it doesn't get picked up at boot until you activate it somehow

The pool is imported at boot time, but the encrypted dataset is locked. To be able to start VMs I have to unlock it via zfs load-key.
So it's perfectly normal that the dataset is not available at boot time until I unlock it.

The KVM VMs that use zvols in the same dataset won't boot until I unlock the encrypted dataset either, which is logical: as long as the dataset is locked, PVE cannot access the zvols within it. It just seems that the LXC container with its sub-dataset struggles more than the KVM VMs with their zvols in the same parent dataset.
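For reference, the unlock sequence after a reboot looks roughly like this (a sketch based on the keyformat=passphrase / keylocation=prompt settings shown above):

Code:
zfs load-key datapool-01/vm-crypt     # prompts for the passphrase
zfs mount datapool-01/vm-crypt
zfs mount -a                          # also mount the child datasets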

If the LXC volumes are affected by a pool that is not imported at boot time because the cache file is corrupted, might they also be affected by a pool that is not available at boot time because it is locked?


cheers
Michael
 
If the LXC volumes are affected by a pool that is not imported at boot time because the cache file is corrupted, might they also be affected by a pool that is not available at boot time because it is locked?
Yes. Is the container set to boot automatically? If that's the case and it tries to start while the dataset is still locked, you can try setting a boot delay or ordering it to boot after another VM/CT.
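The start order and delay can be set on the container, for example (the values here are only illustrative):

Code:
# order controls the position in the boot sequence, up adds a delay (seconds) before the next guest starts
pct set 108 --startup order=2,up=60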

I tried to mount and unmount. That seems to work:
But is the /etc/os-release file there? Can you browse the mounted fs and check if everything looks normal? You can run something like tree or find.
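For example, something along these lines (just a sketch with the CT ID from this thread):

Code:
pct mount 108
ls -la /var/lib/lxc/108/rootfs/
cat /var/lib/lxc/108/rootfs/etc/os-release
find /var/lib/lxc/108/rootfs/ -maxdepth 2 | head -n 50
pct unmount 108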
 
The LXC container is not set to auto boot.
But I am not sure if I might have clicked manually on "Start" before unlocking the encrypted dataset. I guess I have to reboot the host again...

Where am I supposed to look for the file after the mount? "/var/lib/lxc/108/rootfs/"?
There is only an empty "dev" subfolder.
 
Where am I supposed to look for the file after the mount? "/var/lib/lxc/108/rootfs/"?
There is only an empty "dev" subfolder.

Yes, in the path it prints out when you mount it.

If there's only /dev, that's probably why the container isn't working. (Files are likely missing because of some problem with the volume?)
 
Hi,
well, I have just shut down all the VMs, rebooted the host, unlocked the dataset first and THEN tried to start the container -> no luck (log file attached).

If there is an issue with the volume, then I have no idea what could have caused it. As I wrote: the disks of the "regular" VMs reside inside the same dataset (as zvols) and they are working just fine.
Only the LXC container has the issue.

The only thing I did was to shut down everything and reboot the host.
 

Attachments

  • lxc-start_log-2.txt
    22.5 KB · Views: 6
Hm, I just removed the container and tried to restore from backup. Also no luck :confused:

Code:
cannot mount '/datapool-01/vm-crypt/subvol-108-disk-0': directory is not empty
TASK ERROR: unable to restore CT 108 - zfs error: filesystem successfully created, but not mounted

The output of zfs list shows that the subvol is gone after I remove the container and is created when I try to restore it.

Edit:
OK, I had to remove /datapool-01/vm-crypt/subvol-108-disk-0 and its "dev" subfolder manually. Now the restore worked and the container is starting again.
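For reference, the manual cleanup and restore can be done roughly like this (a sketch; the backup archive name below is only a placeholder, and the restore can just as well be done via the GUI):

Code:
rmdir /datapool-01/vm-crypt/subvol-108-disk-0/dev
rmdir /datapool-01/vm-crypt/subvol-108-disk-0
# placeholder backup filename; adjust to the actual archive
pct restore 108 /mnt/pve/FN01_pve/pve-backup/weekly/vzdump-lxc-108.tar.zst --storage vm-crypt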

The question is: what happens if I reboot the host again?
 
Hi,

I have just shut down all the VMs and the LXC container and rebooted the host.

Same result. The VMs are starting fine, the container is not.

So I guess this is a bug? Can I do anything to help fix this somehow?

cheers,
Michael
 
Could you try creating another container and see if this issue happens there as well? Try with the same basic settings.

Debug logs like in the previous posts are always helpful to post as well.
 
Could you try creating another container and see if this issue happens there as well? Try with the same basic settings.

Hi!
Yep, the result with a new LXC is the same:

Code:
root@pve01:~# lxc-start -n 111 -l DEBUG -o /tmp/lxc-111.log
lxc-start: 111: lxccontainer.c: wait_on_daemonized_start: 852 Received container state "ABORTING" instead of "RUNNING"
lxc-start: 111: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 111: tools/lxc_start.c: main: 311 To get more details, run the container in foreground mode
lxc-start: 111: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options

cheers
Michael

EDIT: One thing I noticed:
After the reboot of the PVE host, the first attempt to start the container takes a little bit longer. There is even the tiny green arrow at the container's icon in the web UI before it aborts.
The next attempt to start it aborts much more quickly, with no green arrow at the icon.
Most of the logs from before were from the second or third or 100th attempt to start ;-)
but this one right here is from the first attempt after the PVE reboot.
 

Attachments

  • lxc-111.log
    22.2 KB · Views: 2
The errors start around here:
Code:
lxc-start 111 20200909143713.175 INFO     conf - conf.c:mount_autodev:1059 - Preparing "/dev"
lxc-start 111 20200909143713.175 DEBUG    conf - conf.c:mount_autodev:1065 - Using mount options: size=500000,mode=755
lxc-start 111 20200909143713.175 ERROR    conf - conf.c:mount_autodev:1074 - Permission denied - Failed to create "/dev" directory
lxc-start 111 20200909143713.175 INFO     conf - conf.c:mount_autodev:1108 - Prepared "/dev"
lxc-start 111 20200909143713.175 ERROR    conf - conf.c:lxc_setup:3311 - Failed to mount "/dev"
lxc-start 111 20200909143713.175 ERROR    start - start.c:do_start:1231 - Failed to setup container "111"
lxc-start 111 20200909143713.175 ERROR    sync - sync.c:__sync_wait:41 - An error occurred in another process (expected sequence number 5)

So it looks like the container is failing to set up the /dev directory,
and it really seems like you're having the same issue I mentioned in the second post with the workaround.

Could you try running zfs mount -a and post the output here?

Also, please post the output of pveversion -v, because I cannot reproduce this on my testing machine.

I strongly suspect your issue is related to the ZFS volume activation.
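One thing worth checking is whether the container's subvol dataset actually gets mounted after you unlock the encryption (a sketch, using the dataset name from earlier in the thread):

Code:
zfs get mounted,keystatus datapool-01/vm-crypt/subvol-108-disk-0
# if it shows mounted=no, try mounting it by hand:
zfs mount datapool-01/vm-crypt/subvol-108-disk-0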


To rule out other problems, could you try creating another container, but this time on a different storage (for example local directory storage), and see if the issue occurs again?
 
Hi,
did you try it with an encrypted ZFS volume that has to be unlocked after boot?

As I wrote before: I read in other threads that there seems to be a problem with containers not starting after a PVE reboot if the ZFS pool fails to mount at boot time.
I don't have that issue; the pool gets imported at boot time.

BUT: due to the encryption, the dataset is also not available at boot time, so maybe the outcome is the same?
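For what it's worth, the key and mount state after unlocking can be checked with something like this (using the dataset from above):

Code:
zfs get keystatus,mounted,encryptionroot datapool-01/vm-crypt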

Anyway, here is the output you requested:

Code:
zfs mount -a
cannot mount '/datapool-01/vm-crypt': directory is not empty

Code:
 pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.44-1-pve)
pve-manager: 6.2-6 (running version: 6.2-6/ee1d7754)
pve-kernel-5.4: 6.2-3
pve-kernel-helper: 6.2-3
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-3
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-8
pve-cluster: 6.1-8
pve-container: 3.1-8
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-3
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
We fixed an issue with this in the new pve-container_3.2-1, so it should not happen again in the future.

Try rmdir /datapool-01/vm-crypt/dev, then retry the mount.

If that does not help, please post the output of ls -la /datapool-01/vm-crypt/.
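In other words, something like this (using the paths from this thread):

Code:
rmdir /datapool-01/vm-crypt/dev
zfs mount datapool-01/vm-crypt
# if the mount still fails, post this output:
ls -la /datapool-01/vm-crypt/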
 
Hi Guys,

I just tried the following:

- Upgrade pve to latest version
- Reboot pve host
- unlock zfs encryption
- Start the container → not working
- delete the container
- delete the folder "/datapool-01/vm-crypt/subvol-108-disk-0" and the subfolder "dev" manually
- restore the container from backup
- start the container successfully
- shutdown the container
- reboot pve host
- unlock zfs encryption
- try to start the container → same error as before

As before, the KVM VMs start successfully; the container does not.

cheers
Michael
 
- Upgrade pve to latest version

Which versions would those be? Can you please post pveversion -v?

Because the pve-container package moved to the no-subscription repository only recently and is not yet present on pve-enterprise (it will be soon).
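You can check the installed pve-container version and which repository provides it with something like:

Code:
pveversion -v | grep pve-container
apt-cache policy pve-container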
 
Code:
pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.60-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-12
pve-cluster: 6.1-8
pve-container: 3.2-1
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-1
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1
 
