ZFS woes - cannot mount '/': directory not empty

Mar 19, 2018
Hi there proxmoxers,
I'm in a real bind. My usually rock-solid Proxmox install is completely on its knees right now.

I have an SSD plus a larger HDD, and only two VMs (pfSense, and a CentOS VM used to host Docker containers for various things).

There was a power failure here, and upon reboot my VMs wouldn't come back up. Logging into the physical host, I saw that ZFS mounting had failed (the zfs-mount.service unit).

Doing a zfs mount -a doesn't work. I get the error
Code:
cannot mount '/': directory not empty
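
In case it helps with diagnosis: a quick way to see which datasets claim / as their mountpoint and whether they actually got mounted (these are standard ZFS properties, nothing Proxmox-specific):
Code:
# which datasets want to mount at /, and did they actually mount?
zfs get mountpoint,canmount,mounted rpool rpool/ROOT/pve-1
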
Following another person who had a similar issue, I tried:

1. zfs mount -O -a - whilst it doesn't error, it still doesn't work; nothing mounts
2. adding
Code:
mkdir 0
and
Code:
is_mountpoint 1
to the relevant entry in storage.cfg also does nothing (see the example entry below)
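
For reference, the storage entry I was editing looks roughly like this - the path and content types here are the Proxmox defaults rather than my exact config:
Code:
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup
        mkdir 0
        is_mountpoint 1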

Output of zpool status:
Code:
root@proxmox:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sat Apr 13 12:23:07 2019
   261G scanned out of 334G at 53.7M/s, 0h23m to go
   0B repaired, 78.01% done
config:

   NAME        STATE     READ WRITE CKSUM
   rpool       ONLINE       0     0     0
     sda2      ONLINE       0     0     0
     sdb       ONLINE       0     0     0

errors: No known data errors

Output of zfs list:
Code:
root@proxmox:~# zfs list
NAME                                               USED  AVAIL  REFER  MOUNTPOINT
rpool                                              340G  2.41T    96K  /
rpool/ROOT                                        15.6G  2.41T    96K  /ROOT
rpool/ROOT/pve-1                                  15.6G  2.41T  15.6G  /
rpool/data                                         315G  2.41T    96K  /rpool/data
rpool/data/vm-100-disk-1                          4.62G  2.41T  2.73G  -
rpool/data/vm-100-state-Backup                     822M  2.41T   822M  -
rpool/data/vm-100-state-US_VPN_Added              1.05G  2.41T  1.05G  -
rpool/data/vm-100-state-pfsense_post_letsencrypt   832M  2.41T   832M  -
rpool/data/vm-100-state-pfsenseworking             473M  2.41T   473M  -
rpool/data/vm-100-state-preupgrade                 509M  2.41T   509M  -
rpool/data/vm-102-disk-1                          56.7G  2.41T  27.2G  -
rpool/data/vm-102-disk-2                           235G  2.41T   223G  -
rpool/data/vm-102-disk-4                            56K  2.41T    56K  -
rpool/data/vm-102-state-centos_letsencrypt        5.17G  2.41T  5.17G  -
rpool/data/vm-102-state-centosbackup              2.61G  2.41T  2.61G  -
rpool/data/vm-102-state-postgres_container        4.73G  2.41T  4.73G  -
rpool/data/vm-102-state-update                    2.69G  2.41T  2.69G  -
rpool/swap                                        8.50G  2.41T  3.15G  -

pvesm status:
Code:
root@proxmox:~# pvesm status
Name             Type     Status           Total            Used       Available        %
local             dir     active      2600781056        16374016      2584407040    0.63%
local-zfs     zfspool     active      2915206884       330799788      2584407096   11.35%

pveperf:
Code:
root@proxmox:~# pveperf
CPU BOGOMIPS:      25536.00
REGEX/SECOND:      3457446
HD SIZE:           2480.30 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND:     57.72
DNS EXT:           1004.50 ms
DNS INT:           1001.66 ms (seb)

The drives:
Code:
root@proxmox:~# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 111.8G  0 disk
├─sda1      8:1    0  1007K  0 part
├─sda2      8:2    0 111.8G  0 part
└─sda9      8:9    0     8M  0 part
sdb         8:16   0   2.7T  0 disk
├─sdb1      8:17   0   2.7T  0 part
└─sdb9      8:25   0     8M  0 part
zd0       230:0    0     8G  0 disk [SWAP]
zd16      230:16   0   8.5G  0 disk
zd32      230:32   0  32.5G  0 disk
zd48      230:48   0   8.5G  0 disk
zd64      230:64   0  16.5G  0 disk
zd80      230:80   0   8.5G  0 disk
zd96      230:96   0  16.5G  0 disk
zd112     230:112  0   1.2T  0 disk
zd128     230:128  0   8.5G  0 disk
zd144     230:144  0    80G  0 disk
├─zd144p1 230:145  0     1G  0 part
└─zd144p2 230:146  0    79G  0 part
zd160     230:160  0   8.5G  0 disk
zd176     230:176  0    10G  0 disk
├─zd176p1 230:177  0   512K  0 part
├─zd176p2 230:178  0   9.5G  0 part
└─zd176p3 230:179  0   512M  0 part
zd192     230:192  0   1.2T  0 disk
└─zd192p1 230:193  0   1.2T  0 part
zd208     230:208  0  16.5G  0 disk

The hardware is an i5-6500 with 32 GB of DDR4 RAM; Proxmox is installed on the SSD (sda2).

Is this recoverable? If so, how? I feel like I've tried everything Google suggests for people in similar situations, and I changed absolutely nothing config-wise before this happened, so I'm at a loss as to why it just stopped working.

Is there any other output I could provide here that may help someone point me in the right direction?

Please let me know if one of you Proxmox gurus could assist!

Thanks

Seb
 
It looks like everything is there. What exact error message do you get when starting the guests? (qm start ID)
 
Hi Dominik,
After walking away from the machine for a while and having another go today, I've managed to get things running again. I'm still seeing the error on boot, with ZFS mounting failing via systemd, and systemctl status zfs-mount.service shows the unit as failed - but, as you say, everything seems to be up. Are ZFS volumes mounted some other way before systemd gets to them? This install was done using the 5.2 ISO (and I upgraded to 5.4 after I got all this running, so I'm up to date as of now).
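
In case it's useful to anyone else, this is what I've been using to poke at the boot-time mount ordering (assuming the stock ZFS systemd units that ship with Proxmox/ZFS on Linux):
Code:
# which ZFS-related units ran, and how did they end up?
systemctl status zfs-import-cache.service zfs-mount.service zfs.target
# the failed mount attempt from the current boot
journalctl -b -u zfs-mount.service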

When I finally managed to get access (by physically plugging a monitor and keyboard into the host), qm status <vm_id> showed the VM as running. However, it wasn't until I could reach the GUI, and then the console of the VM in question (my pfSense guest), that I realised it was stuck in a kernel panic. A subsequent reboot got it working again, but whenever I reboot the host and it tries to auto-start that VM, it goes into a kernel panic again.
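
As a stop-gap I'm considering turning off auto-start for that guest and bringing it up by hand once the host has settled - something like this, assuming 100 is the pfSense VM's ID (it matches the disk names in the zfs list above):
Code:
# don't auto-start the VM at host boot
qm set 100 --onboot 0
# start it manually later
qm start 100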

I think one lesson here is that whilst pfSense works great in a VM with hardware NICs passed through, it's a pain when it goes down! I'm probably going to move pfSense to a separate physical machine after this experience.

I noticed a lot of entries like:
Code:
pvedaemon[5674]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
in my host logs, and in the VM's own logs there were CPU soft lockups, so I'm not sure if there is an issue with scheduling or timing.
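
Those guest-ping timeouts seem to point at the QEMU guest agent inside the VM, so I've been checking it from both sides - a rough sketch, assuming the agent package is installed in the CentOS guest:
Code:
# inside the CentOS VM: is the agent actually running?
systemctl status qemu-guest-agent
# on the Proxmox host: pull the qmp-related messages from this boot
journalctl -b -u pvedaemon | grep qmp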

There was a lot of network activity at the time (many separate downloads from different containers, plus access to an NFS share), but load on the VMs wasn't high and memory usage was fine. I did find that if the Plex container was transcoding video while multiple other containers were doing work on the CentOS VM, things would freeze up again.
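
One thing I might try is capping how much host CPU the CentOS VM can use during a transcode so it can't starve everything else - just a sketch, with 102 being the VM ID from above and the limit value picked arbitrarily:
Code:
# limit VM 102 to roughly two cores' worth of CPU time
qm set 102 --cpulimit 2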

I'll keep monitoring things; for now it's working, but I think deep down something still isn't quite right.
 
