Proxmox Containers Start First Time, Then Fail to Start After Shutdown Until Host is Rebooted

Mercurial_Mongoose
Aug 2, 2020
I’m at my wits’ end and would be incredibly grateful for any help fixing this. My LXC containers boot successfully only the first time, right after the Proxmox node starts up. Once you shut down (or reboot) a container, it simply will not boot again; instead, you must reboot the entire node to get the containers booting again. VMs, on the other hand, work as expected: I can shut them down, restart them, and reboot them without issue.

I have two containers: 101 (set to auto-start) and 999 (which I start manually). Both start fine the first time after the Proxmox node boots. But once the containers have been shut down, you cannot get them to start again unless you reboot the entire node.

(Note that I created the 999 container from scratch to see if the problem was somehow related to my existing 101 container. I thought having a brand new, fresh container might reveal that the problem was specific to one container rather than all containers in general, but alas, that doesn’t seem to be the case.)

So I found this thread, which looked like it might describe my problem. I tried the following:

zpool set cachefile=/etc/zfs/zpool.cache rpool

and

update-initramfs -k all -u

(If I understand correctly, I only have one zfs pool, which 'zpool list' tells me is 'rpool.')

I then rebooted. Didn’t solve the problem.
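
(In case it helps anyone reading along: I believe the property and the initramfs contents can be sanity-checked with something like the following, though I’m not certain these are the definitive checks.)
Code:
# show the cachefile property that was just set
zpool get cachefile rpool
# confirm the cache file made it into the current initramfs
lsinitramfs /boot/initrd.img-$(uname -r) | grep zpool.cache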

Based on another thread that looked promising, I edited the following file:
/lib/systemd/system/zfs-mount.service

changing the following line:
ExecStart=/sbin/zfs mount -a

to add the overlay option:
ExecStart=/sbin/zfs mount -O -a

After rebooting, however, the problem was unchanged, so I reverted the file.
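
(One thing I’m not sure about: if someone tries this without a full reboot, I believe systemd needs to re-read the edited unit, e.g.:)
Code:
# pick up the edited unit file and confirm the change took
systemctl daemon-reload
systemctl cat zfs-mount.service | grep ExecStart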

I then realized that maybe this wasn’t the best or only way to add the overlay option. So I tried:
zfs set overlay=on rpool

Also didn’t work, so I turned it back off.
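
(For the record, I believe this is how the property can be inspected and reverted to its default - please correct me if there’s a better way:)
Code:
# check the current value, then clear the locally set value
zfs get overlay rpool
zfs inherit overlay rpool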

I found another thread (hopefully not too old) that recommended editing:
/etc/pve/storage.cfg

to add the following options to the directory storage entries:
mkdir 0 is_mountpoint 1

It didn’t work either, so I reverted that change as well.
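
(If I understood that thread correctly, those options belong on a 'dir' storage entry in /etc/pve/storage.cfg - roughly like the hypothetical example below. The storage name and path are made up, and the path is assumed to be where a ZFS dataset is mounted.)
Code:
dir: mydir
    path /rpool/mydata
    content rootdir,images
    mkdir 0
    is_mountpoint 1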

Any thoughts? I’m really at a loss.

Also, if you want me to try something, spelling out the commands is very helpful. I’m trying hard to learn but am green.

I wonder if this is a solution? I’m not quite grasping what was done here or how I would go about doing it myself.

Finally, is this not a bug in Proxmox? I’ve done little beyond running updates in the containers and VMs. It seemed to be working fine for a few months, but now it has just (seemingly and suddenly) stopped. I note that there are a lot of recent forum posts where people are having problems getting LXC containers to start.

Here’s some debugging and configuration information that I’ve collected that may be helpful.
Code:
root@pve:~# lxc-start -n 101 -F -l DEBUG

lxc-start: 101: conf.c: run_buffer: 323 Script exited with status 255
lxc-start: 101: start.c: lxc_init: 804 Failed to run lxc.hook.pre-start for container "101"
lxc-start: 101: start.c: __lxc_start: 1903 Failed to initialize container "101"
lxc-start: 101: conf.c: run_buffer: 323 Script exited with status 1
lxc-start: 101: start.c: lxc_end: 971 Failed to run lxc.hook.post-stop for container "101"
lxc-start: 101: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 101: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options

The second container returns exactly the same output:
Code:
root@pve:~# lxc-start -n 999 -F -l DEBUG

lxc-start: 999: conf.c: run_buffer: 323 Script exited with status 255
lxc-start: 999: start.c: lxc_init: 804 Failed to run lxc.hook.pre-start for container "999"
lxc-start: 999: start.c: __lxc_start: 1903 Failed to initialize container "999"
lxc-start: 999: conf.c: run_buffer: 323 Script exited with status 1
lxc-start: 999: start.c: lxc_end: 971 Failed to run lxc.hook.post-stop for container "999"
lxc-start: 999: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 999: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options


For some reason, running the same command above but writing the output to a file yields a more verbose description:

Code:
root@pve:~# lxc-start -n 999 -F -l DEBUG -o debug.txt

lxc-start 999 20200801163911.590 INFO     confile - confile.c:set_config_idmaps:2051 - Read uid map: type u nsid 0 hostid 100000 range 65536
lxc-start 999 20200801163911.590 INFO     confile - confile.c:set_config_idmaps:2051 - Read uid map: type g nsid 0 hostid 100000 range 65536
lxc-start 999 20200801163911.592 INFO     lsm - lsm/lsm.c:lsm_init:29 - LSM security driver AppArmor
lxc-start 999 20200801163911.593 INFO     conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "999", config section "lxc"
lxc-start 999 20200801163956.769 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 999 lxc pre-start produced output: mount: /var/lib/lxc/.pve-staged-mounts/rootfs: can't read superblock on /dev/loop1.


lxc-start 999 20200801163956.794 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 999 lxc pre-start produced output: command 'mount /dev/loop1 /var/lib/lxc/.pve-staged-mounts/rootfs' failed: exit code 32

lxc-start 999 20200801163956.801 ERROR    conf - conf.c:run_buffer:323 - Script exited with status 255
lxc-start 999 20200801163956.801 ERROR    start - start.c:lxc_init:804 - Failed to run lxc.hook.pre-start for container "999"
lxc-start 999 20200801163956.801 ERROR    start - start.c:__lxc_start:1903 - Failed to initialize container "999"
lxc-start 999 20200801163956.801 INFO     conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxc/hooks/lxc-pve-poststop-hook" for container "999", config section "lxc"
lxc-start 999 20200801163957.816 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-poststop-hook 999 lxc post-stop produced output: umount: /var/lib/lxc/999/rootfs: not mounted


lxc-start 999 20200801163957.816 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-poststop-hook 999 lxc post-stop produced output: command 'umount --recursive -- /var/lib/lxc/999/rootfs' failed: exit code 1


lxc-start 999 20200801163957.872 ERROR    conf - conf.c:run_buffer:323 - Script exited with status 1
lxc-start 999 20200801163957.873 ERROR    start - start.c:lxc_end:971 - Failed to run lxc.hook.post-stop for container "999"
lxc-start 999 20200801163957.873 ERROR    lxc_start - tools/lxc_start.c:main:308 - The container failed to start
lxc-start 999 20200801163957.873 ERROR    lxc_start - tools/lxc_start.c:main:314 - Additional information can be obtained by setting the --logfile and --logpriority options


While trying to start, I see this in the Proxmox GUI:
[screenshot of the error shown in the GUI]


I tried mounting the container:
Code:
root@pve:~# pct mount 999

mount: /var/lib/lxc/999/rootfs: can't read superblock on /dev/loop1.
mounting container failed
command 'mount /dev/loop1 /var/lib/lxc/999/rootfs//' failed: exit code 32

I notice an odd double slash (‘//‘) at the end of the path above. Don’t know if that means anything.

Also, I notice there is NO /dev/loop1 device ... at least not when I look for it later. I see loop3, loop4, all the way up to loop7 when I list /dev manually. Does that mean anything?
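
(I later found out - I think - that the active loop devices and the files backing them can be listed with the command below; I don’t know whether the output right after a failed start would have been revealing:)
Code:
# list every configured loop device and its backing file
losetup -a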

[It looks like I'm out of space, so I'll paste more logs in subsequent comments...]
 
[CONTINUING]

A search of the journal shows:
Code:
root@pve:~# journalctl -b | grep -Ei 'zfs|zpool'

Aug 01 11:19:35 pve kernel: Command line: initrd=\EFI\proxmox\5.4.44-2-pve\initrd.img-5.4.44-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Aug 01 11:19:35 pve kernel: Kernel command line: initrd=\EFI\proxmox\5.4.44-2-pve\initrd.img-5.4.44-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs
Aug 01 11:19:35 pve kernel: ZFS: Loaded module v0.8.4-pve1, ZFS pool version 5000, ZFS filesystem version 5
Aug 01 11:19:36 pve systemd[1]: Starting Import ZFS pools by cache file...
Aug 01 11:19:36 pve zpool[1025]: no pools available to import
Aug 01 11:19:36 pve systemd[1]: Started Import ZFS pools by cache file.
Aug 01 11:19:36 pve systemd[1]: Reached target ZFS pool import target.
Aug 01 11:19:36 pve systemd[1]: Starting Wait for ZFS Volume (zvol) links in /dev...
Aug 01 11:19:36 pve systemd[1]: Starting Mount ZFS filesystems...
Aug 01 11:19:36 pve systemd[1]: Started Wait for ZFS Volume (zvol) links in /dev.
Aug 01 11:19:36 pve systemd[1]: Reached target ZFS volumes are ready.
Aug 01 11:19:36 pve systemd[1]: Started Mount ZFS filesystems.
Aug 01 11:19:36 pve systemd[1]: Started ZFS Event Daemon (zed).
Aug 01 11:19:36 pve systemd[1]: Starting ZFS file system shares...
Aug 01 11:19:36 pve systemd[1]: Started ZFS file system shares.
Aug 01 11:19:36 pve systemd[1]: Reached target ZFS startup target.
Aug 01 11:19:36 pve zed[1110]: ZFS Event Daemon 0.8.4-pve1 (PID 1110)
Aug 01 11:20:31 pve pmxcfs[1372]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve/local-zfs: -1
Aug 01 11:25:36 pve pmxcfs[1372]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve/local-zfs: -1
Aug 01 11:29:06 pve pmxcfs[1372]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve/local-zfs: -1
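
(The 'no pools available to import' line caught my eye. I believe the relevant import and mount units can be inspected with something like this, though I don’t know whether it matters when the root pool is already imported by the initramfs:)
Code:
# check how the ZFS import/mount units fared on this boot
systemctl status zfs-import-cache.service zfs-mount.service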

Surprisingly, the following didn’t produce an error. (Based on some threads I read, I had expected it would.)
root@pve:~# pct start 999


However, the GUI did report that the config was now locked (mounted). But the container still showed as not running.
[screenshot of the container’s status in the GUI]
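
(In case anyone else ends up with a stale 'mounted' lock like this, I believe it can be cleared with the command below - though I don’t know whether that’s advisable in every situation:)
Code:
# remove the lock entry from the container's config
pct unlock 999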
 
[CONTINUING]

A manual — and successful — start of the 999 container after rebooting the host produces:
root@pve:~# lxc-start -n 999 -F -l DEBUG -o debug.txt -- [See file attachment...]


Looking at zpool.cache with strings reveals:

Code:
root@pve:~# strings /etc/zfs/zpool.cache
rpool
version
name
rpool
state
pool_guid
errata
hostid
hostname
(none)
com.delphix:has_per_vdev_zaps
vdev_children
vdev_tree
type
root
guid
create_txg
children
type
disk
guid
path
/dev/disk/by-id/ata-TOSHIBA-TR150_66BB70K1K8XU-part3
whole_disk
metaslab_array
metaslab_shift
ashift
asize
is_log
create_txg
com.delphix:vdev_zap_leaf
com.delphix:vdev_zap_top
features_for_read
com.delphix:hole_birth
com.delphix:embedded_data
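
(I believe zdb can also dump the cached pool configuration in a more readable form than strings, though I’m not certain of the exact invocation:)
Code:
# with no pool name, -C prints the contents of /etc/zfs/zpool.cache
zdb -C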

zpool seems fine:
Code:
root@pve:~# zpool status

  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 00:01:41 with 0 errors on Sun Jul 12 00:25:42 2020
config:


    NAME                                    STATE     READ WRITE CKSUM
    rpool                                   ONLINE       0     0     0
      ata-TOSHIBA-TR150_66BB70K1K8XU-part3  ONLINE       0     0     0


errors: No known data errors

Here are the container config files, if they help:
Code:
root@pve:~# pct config 101

arch: amd64
cores: 4
hostname: plex
memory: 2048
mp0: /mnt/netfolder,mp=/mnt/extfolder
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=2A:69:E8:A3:F2:8A,ip=dhcp,type=veth
onboot: 1
ostype: ubuntu
rootfs: NAS:101/vm-101-disk-1.raw,size=24G
swap: 512
unprivileged: 1

and

Code:
root@pve:~# pct config 999

arch: amd64
cores: 1
hostname: random-DZAJUTpyWy
memory: 512
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=7E:5F:50:6D:00:E7,ip=dhcp,type=veth
ostype: ubuntu
rootfs: NAS:999/vm-999-disk-0.raw,size=8G
swap: 512
unprivileged: 1

Here’s the storage config file of the Proxmox host:
Code:
root@pve:~# cat /etc/pve/storage.cfg

dir: local
    path /var/lib/vz
    content iso,backup,vztmpl


zfspool: local-zfs
    pool rpool/data
    content rootdir,images
    sparse 1


cifs: NAS
    path /mnt/pve/NAS
    server 192.168.50.19
    share home
    content images,vztmpl,backup,rootdir,iso
    maxfiles 10
    username pve-storage

I should be running the latest of everything:
Code:
root@pve:~# pveversion -v

proxmox-ve: 6.2-1 (running kernel: 5.4.44-2-pve)
pve-manager: 6.2-10 (running version: 6.2-10/a20769ed)
pve-kernel-5.4: 6.2-4
pve-kernel-helper: 6.2-4
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.44-1-pve: 5.4.44-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-5
libpve-guest-common-perl: 3.1-1
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-9
pve-cluster: 6.1-8
pve-container: 3.1-12
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-11
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-11
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 

Attachments

  • debug.txt
Anyone? I would be so grateful for some help. It's so disheartening to spend the better part of two days troubleshooting and writing up lengthy documentation with logs, and then get no response. Surely Proxmox isn't supposed to be behaving this way?
 
Ok - both containers you're referencing here are not on ZFS as far as I can see (both are on the storage NAS) - so for now I would rule out ZFS as the source of the issue.

mount: /var/lib/lxc/999/rootfs: can't read superblock on /dev/loop1.
It seems the filesystem of the container is corrupted or has problems - does:
Code:
pct fsck 999

help?
(if not paste the output)

I hope this helps!
 
Thank you so much Stoiko!

Yes, running off the NAS was the issue. I'm new to Proxmox, and to Linux as well, and have been trying really hard to learn enough to be competent.

I must have inadvertently set up the container that way because Proxmox prompted me to use the NAS in the GUI for the rootfs. I also may be having this problem because the NAS is connected via CIFS instead of NFS. I already had CIFS enabled on my NAS because my Mac uses it, so I used it when setting up Proxmox too rather than turning on an additional service, NFS. (I am now aware that NFS is the more robust service and probably preferred on Linux, but the container seemed to work great running from CIFS as long as it booted.)

Some things you (or others) might find interesting:

The solution was to first destroy the existing container (container ID: 101).
pct destroy 101

Then restore from an existing backup, using these options to put the rootfs on local-zfs (on my host) instead of running off the NAS.
pct restore 101 vzdump-lxc-101-2020_07_31-15_36_53.tar.zst --rootfs local-zfs:24

Once that was done, I could shut down and restart the container with no problem.
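
(For anyone repeating this: it obviously only works if you already have a backup to restore from. I believe a fresh one can be taken beforehand with something like the command below - the storage name is from my setup, and the options may differ on yours:)
Code:
# stop the container and back it up to the NAS storage with zstd compression
vzdump 101 --storage NAS --mode stop --compress zstd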

Also, here are the results of running fsck on another container, as requested:
Code:
root@pve:~# pct fsck 999

fsck from util-linux 2.33.1
MMP interval is 10 seconds and total wait time is 42 seconds. Please wait...
/mnt/pve/NAS/images/999/vm-999-disk-0.raw: clean, 21869/524288 files, 254018/2097152 blocks

Everything seems fine, which I would expect because container 999 is a fresh Ubuntu server install with literally NOTHING installed or modified within it. I created it for the sole purpose of testing.

Here are my questions / suggestions:

  1. Was I doing anything unsupported? Like trying to run the containers off the NAS connected via CIFS?
  2. Can the Proxmox GUI be changed to prevent the problem I encountered? For example, can (or should) it default to using local storage over the NAS if CIFS is a problem? In fact, wouldn't local storage generally be safer/faster/less complex, thus justifying it being the default?
  3. Basically, I'm finding Proxmox to be wonderful. The GUI certainly helps beginners (and just makes things faster and more discoverable in many ways), but it seems Proxmox happily let me set up containers in such a way that they couldn't restart and I had no indication I was making bad choices when setting them up like this.
Thanks for the product and, doubly, thanks for the reply. I know you must give so much time responding to threads, but man, sometimes a poor soul like me just needs a little bit of expertise to get pointed in the right direction and solve the problem. Thank you, thank you, thank you!!!
 
I must have inadvertently setup the container that way because Proxmox prompted me to use the NAS in the GUI for the rootfs.
Probably due to the sorting of the storages - you can, however, click on the storage and select a different one (e.g. your zpool).

I also may be having this problem because the NAS is connected via CIFS instead of NFS. I already had CIFS enabled on my NAS because the Mac uses it, so I used it when setting up Proxmox too rather than turning on an additional service, NFS. (I am now aware that NFS is the more robust service and probably preferred in Linux, but the container seemed to work great running from CIFS as long as it booted.)
In general, both CIFS and NFS work reliably and well (for containers, VMs, backups, ...). In certain cases it can happen that one implementation on a particular NAS does not work well, and a switch to the other (e.g. from CIFS to NFS or vice versa) helps.

Everything seems fine, which I would expect because container 999 is a fresh Ubuntu server install with literally NOTHING installed or modified within it. I created it for the sole purchases of testing.
could you start container 999 afterwards?

Was I doing anything unsupported? Like trying to run the containers off the NAS connected via CIFS?
Not in particular - the config seems OK. Did you maybe at some point stop the NAS without stopping the containers first, or disrupt the network between the NAS and the PVE node? (This could lead to the problems you experienced.)

Can the Proxmox GUI be changed to prevent the problem I encountered? For example, can (or should) it default to using local storage over the NAS if CIFS is a problem? In fact, wouldn't local storage generally be safer/faster/less complex, thus justifying it being the default?
No, I wouldn't say so - remote storages work in most circumstances and quite a few users rely on them. I could not say that one is generally preferred over the other; they just provide different features for different use-cases.

In your case I would check the logs of the PVE system (if it happens again) - with `journalctl` - see `man journalctl` - this usually gives you some pointers to what went wrong.

I hope this helps!
 
OK. So I’ve done the following, which may prove helpful. I’m also happy to run further commands. I genuinely don’t understand why I’m having trouble running containers off the NAS, especially since they all happily start up the first time but then refuse to boot later.

Here are the steps I took, in the order I did them. Be sure to read to the bottom - there’s a new clue...

Booted up container 999 successfully.

Shut it down.

Ran:
pct fsck 999

Result is the same as above. Clean.

Booted up 999 a second time from the GUI. Surprise, surprise. It was successful. I swear on my mother’s grave that containers were not working the last few days after they had booted once.

Shut down container 999.

(Did not run fsck this time.)

Booted up 999 a third time. Again, a surprise: it failed and wouldn’t boot.

Ran fsck again, this time the output is slightly different but still clean:
Code:
root@pve:~# pct fsck 999

fsck from util-linux 2.33.1
MMP interval is 10 seconds and total wait time is 42 seconds. Please wait...
/mnt/pve/NAS/images/999/vm-999-disk-0.raw: recovering journal
/mnt/pve/NAS/images/999/vm-999-disk-0.raw: clean, 21901/524288 files, 256107/2097152 blocks

Finally, not knowing exactly what to look for, I ran the following and have attached the log:
journalctl | grep 999 > journalctl_999.log

And finally: I typed all this up this morning, intending to submit it. Got busy and didn’t. The attached log file was made at 12:19pm. Four hours later, I tried starting container 999 again. I had done NOTHING to it whatsoever beyond trying to start it (and failing) the last time. Lo and behold, it worked and started right up.

Five minutes later, I shut it down. Waited over a full two minutes more and tried starting it again. It failed.

So I’m wondering if there’s a timing problem somehow - like something isn’t getting released or reset when I shut down, which then causes the next boot to fail. Give it long enough (or maybe get lucky, like I did on my 2nd attempt at starting container 999 above) and it gets released/reset. This is wild speculation on my part, so hopefully the log helps.
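
(If it happens again, I suppose I could check right after a failed start whether anything was left over from the previous run - maybe something like the commands below, though I’m only guessing at what to look for:)
Code:
# any lingering mounts referencing the container?
findmnt | grep 999
# which loop device nodes currently exist?
ls -l /dev/loop*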

Anyway, if there’s anything else I can do to troubleshoot this and contribute in my own tiny way to making the product better, I’m happy to do so.

Cheers.
 

Attachments

  • journalctl_999.log
One other thing to mention... Shutdowns and lost networks shouldn't be a problem. I have my NAS, my router, and a NUC (the Proxmox host) all running off a UPS battery backup. If the power blinks or even goes out for 30 minutes, everything keeps running. I also get emails from the NAS any time it detects it has switched over to battery power or has shut itself down. I haven't had any power problems recently.
 
journalctl | grep 999 > journalctl_999.log
That way you miss quite a few potentially relevant messages - e.g. network issues with your NAS are not necessarily tagged with '999'.
Read the whole log, look for errors, and try pasting them into Google - that way you'll learn the most and probably solve your problem!
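For example, something along these lines (untested - adjust the filters and time window to your setup):
Code:
# warnings and errors from the current boot
journalctl -b -p warning
# or everything from the last few minutes, right after a failed start
journalctl --since "10 minutes ago"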
I hope this helps!
 
Sorry for bumping this thread.
I have the same issue when using CIFS to my NAS. The VM is created correctly but does not start; only after an fsck does it run.

Reboot: the same issue again.

Switching to NFS fixes it, but I want to find out why, because it should run stably on CIFS as well.
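
(In case it's useful to anyone: I think an NFS storage can be added alongside the CIFS one roughly like this - the storage name, server address, and export path below are just placeholders for my setup:)
Code:
# register an NFS export as an additional PVE storage
pvesm add nfs NAS-nfs --server 192.168.1.10 --export /export/home --content images,rootdir,backup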
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!