LXC will not start with root disk on CIFS

Oct 15, 2020
Hi. I recently installed Proxmox on my homelab cluster of 3 machines.
I use a server running Windows Server 2019 for storage, accessed via CIFS.
With this configuration, I have run into the following issue with LXC:

Containers that I create will not start with the root disk on CIFS storage (unless I run "pct fsck xxx" before each start).
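
To illustrate the workaround: this is roughly what I run from the node's shell before every start (104 being my test container's ID):
Code:
root:~# pct fsck 104
root:~# pct start 104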

Here is the output when running "lxc-start" and "lxc-stop" after fsck (where the container starts and works as expected, then is stopped):
Code:
root:~# lxc-start -n 104
root:~# lxc-stop -n 104

journalctl during "lxc-stop":
kernel: CIFS VFS: No writable handle in writepages rc=-9
kernel: EXT4-fs warning (device loop0): ext4_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.

When attempting to start the container again:
Code:
root:~# lxc-start -n 104
lxc-start: 104: lxccontainer.c: wait_on_daemonized_start: 841 No such file or directory - Failed to receive the container state
lxc-start: 104: tools/lxc_start.c: main: 308 The container failed to start
lxc-start: 104: tools/lxc_start.c: main: 311 To get more details, run the container in foreground mode
lxc-start: 104: tools/lxc_start.c: main: 314 Additional information can be obtained by setting the --logfile and --logpriority options

journalctl during "lxc-start":
kernel: blk_update_request: I/O error, dev loop0, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
kernel: Buffer I/O error on dev loop0, logical block 0, lost sync page write
kernel: EXT4-fs (loop0): I/O error while writing superblock
kernel: EXT4-fs (loop0): mount failed
pvestatd[1470]: unable to get PID for CT 104 (not running?)
pvestatd[1470]: status update time (41.062 seconds)
kernel: CIFS VFS: No writable handle in writepages rc=-9

And with additional logging:
Code:
root:~# lxc-start -n 104 -F -l DEBUG -o /tmp/lxc-104.txt
root:~# cat /tmp/lxc-104.txt
lxc-start 104 20201015145948.122 INFO     confile - confile.c:set_config_idmaps:2055 - Read uid map: type u nsid 0 hostid 100000 range 65536
lxc-start 104 20201015145948.122 INFO     confile - confile.c:set_config_idmaps:2055 - Read uid map: type g nsid 0 hostid 100000 range 65536
lxc-start 104 20201015145948.122 INFO     lsm - lsm/lsm.c:lsm_init:29 - LSM security driver AppArmor
lxc-start 104 20201015145948.122 INFO     conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxc/hooks/lxc-pve-prestart-hook" for container "104", config section "lxc"
lxc-start 104 20201015150033.517 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 104 lxc pre-start produced output: mount: /var/lib/lxc/.pve-staged-mounts/rootfs: can't read superblock on /dev/loop0.

lxc-start 104 20201015150033.538 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-prestart-hook 104 lxc pre-start produced output: command 'mount /dev/loop0 /var/lib/lxc/.pve-staged-mounts/rootfs' failed: exit code 32

lxc-start 104 20201015150033.546 ERROR    conf - conf.c:run_buffer:323 - Script exited with status 255
lxc-start 104 20201015150033.546 ERROR    start - start.c:lxc_init:797 - Failed to run lxc.hook.pre-start for container "104"
lxc-start 104 20201015150033.546 ERROR    start - start.c:__lxc_start:1896 - Failed to initialize container "104"
lxc-start 104 20201015150033.547 INFO     conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxcfs/lxc.reboot.hook" for container "104", config section "lxc"
lxc-start 104 20201015150034.488 INFO     conf - conf.c:run_script_argv:340 - Executing script "/usr/share/lxc/hooks/lxc-pve-poststop-hook" for container "104", config section "lxc"
lxc-start 104 20201015150034.399 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-poststop-hook 104 lxc post-stop produced output: umount: /var/lib/lxc/104/rootfs: not mounted

lxc-start 104 20201015150034.399 DEBUG    conf - conf.c:run_buffer:312 - Script exec /usr/share/lxc/hooks/lxc-pve-poststop-hook 104 lxc post-stop produced output: command 'umount --recursive -- /var/lib/lxc/104/rootfs' failed: exit code 1

lxc-start 104 20201015150034.406 ERROR    conf - conf.c:run_buffer:323 - Script exited with status 1
lxc-start 104 20201015150034.406 ERROR    start - start.c:lxc_end:964 - Failed to run lxc.hook.post-stop for container "104"
lxc-start 104 20201015150034.406 ERROR    lxc_start - tools/lxc_start.c:main:308 - The container failed to start
lxc-start 104 20201015150034.406 ERROR    lxc_start - tools/lxc_start.c:main:314 - Additional information can be obtained by setting the --logfile and --logpriority options

/etc/pve/storage.cfg
Code:
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

zfspool: local-zfs
        pool rpool/data
        content rootdir,images
        sparse 1

cifs: nas
        path /mnt/pve/nas
        server nas.lan
        share pve$
        content images,iso,rootdir,vztmpl,backup
        maxfiles 1
        username pve

pct config 104
Code:
arch: amd64
cores: 2
hostname: test
memory: 4096
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=06:01:5F:63:F9:EC,ip=dhcp,type=veth
ostype: ubuntu
rootfs: nas:104/vm-104-disk-0.raw,size=8G
swap: 0
unprivileged: 1

pvecm status
Code:
Cluster information
-------------------
Name:             homelab
Config Version:   3
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Thu Oct 15 16:45:43 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000003
Ring ID:          1.292
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.20.0.10
0x00000002          1 10.20.0.20
0x00000003          1 10.20.0.30 (local)

pveversion -v
Code:
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-9
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.1-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-1
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-3
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-15
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2

Any ideas on what I am doing wrong, or what I could try?

Thank you for your time.
 
CIFS is not a good choice for VM storage. If you must host your storage on the Windows Server across the network, I would suggest you install the Server for NFS role and create an NFS share. The performance of your VMs will probably not be very good unless you have fast storage and networking.
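
As a rough sketch only (the storage ID "nas-nfs" and the export path "/pve" below are placeholders for whatever you end up configuring on the Windows side), the NFS export could then be added as Proxmox storage with something like:
Code:
root:~# pvesm add nfs nas-nfs --server nas.lan --export /pve --content images,rootdir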
 

Thank you for your reply. What makes CIFS unsuitable as storage? I don't see any warnings on the wiki regarding CIFS.

Running KVM from the same storage works perfectly and so does LXC after I run fsck before starting. Performance is excellent.

My issue is that the root disk seems to get corrupted when stopping the container, causing a failure to start up afterwards. Running fsck fixes that temporarily, but I don't feel comfortable with that sort of workaround.

This seems like a bug to me, so I am hoping that someone is able to tell me why it happens, and where to look for a fix.
 
I installed the "Server for NFS" role on the NAS and it works, but it comes with a significant performance hit.
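
For anyone setting this up, the resulting entry in /etc/pve/storage.cfg ends up looking roughly like this (the storage ID and export path below are placeholders; the real values depend on how the share is configured on the NAS):
Code:
nfs: nas-nfs
        path /mnt/pve/nas-nfs
        server nas.lan
        export /pve
        content images,rootdir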

The performance hit was observed using fio:
Bash:
fio --loops=1 --size=2500m --filename=/test/fiotest.tmp --stonewall --ioengine=libaio --direct=1 \
  --name=4KQD32R --bs=4k --iodepth=32 --rw=randread \
  --name=4KQD32W --bs=4k --iodepth=32 --rw=randwrite

Results on two identical VMs, each running on shared storage (all-flash array with a 10G link):

           Read (4KQD32)              Write (4KQD32)
CIFS       IOPS=71.8k, BW=281MiB/s    IOPS=67.8k, BW=265MiB/s
NFS        IOPS=33.2k, BW=130MiB/s    IOPS=9935, BW=38.6MiB/s

These results are consistent across multiple tests.

For now I am running VMs instead of LXC due to this issue (VMs work perfectly on CIFS), but I would like to switch to LXC for the reduced overhead.

Does anyone have any suggestions on what to try next?
 
Same error message for me, different scenario.
I'm copying several thousand files from a local LVM on the Proxmox machine to a Samba share on a NAS.
On the main console, infrequently but randomly, I see "CIFS VFS: Close unmatched open", followed by ten repeats of "CIFS VFS: No writable handle in writepages".

I'm just using a plain recursive 'cp' command, and it does continue after the error, but I suspect somewhere a file has been clobbered.
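
In case it is useful, one way to check whether any files were actually clobbered is to compare checksums between source and destination after the copy; the two paths below are only placeholders for my LVM mount point and the Samba mount:
Code:
# /srv/source and /mnt/nas/dest are placeholders for the actual mount points
root:~# cd /srv/source && find . -type f -exec md5sum {} + | sort -k2 > /tmp/src.md5
root:~# cd /mnt/nas/dest && md5sum --quiet -c /tmp/src.md5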
 
May I add to this thread? I know it's a bit old, but I just set up a two-node cluster and have had local LXC working great for a long time. Now when I create or clone an LXC onto my new CIFS network storage, it will not start. I tried the OP's trick of running "pct fsck xxx" from the shell and then hitting Start in the GUI, and it works great with normal LXC performance. But as soon as I shut the LXC down, I can't start it again. What is the trick here that I'm missing? How is this not a bug?
 
Also stumbled onto this 'quirk'. Is there a solution to this problem, other than using NFS or local storage?

edit: this seems to have been temporary... after running
Code:
pct fsck 101
again, the image stays clean and I can now stop and start the container without issue. Let's see if this sticks...
 
