[SOLVED] Proxmox reports VM disks don't exist only on startup

Mog

New Member
This is the second time I've installed a Proxmox host from scratch. This time I imported a ZFS pool from another Proxmox machine and restored all my VMs and containers from it. However, on startup, any container or VM configured to start at boot fails to start, reporting that its disk doesn't exist. For example:

Code:
TASK ERROR: volume 'prox-vm:103/vm-103-disk-0.qcow2' does not exist

Similarly for containers configured to start at boot:

Code:
TASK ERROR: volume 'prox-vm:100/vm-100-disk-1.raw' does not exist

These disks definitely do exist at that location. prox-vm is a directory storage referencing a path in a ZFS pool named prox-ZFS. The errors only occur at boot; once the system is up I can start the affected VMs and containers without any issue at all. Fiddling with the startup delays didn't seem to help, and googling only turns up cases where the VM disks are actually missing, which isn't the case here.

This same ZFS pool existed on a slightly older Proxmox install and didn't have any issues. Does anyone know what the problem is here?

storage.cfg:
Code:
dir: local
        path /var/lib/vz
        content iso,backup,vztmpl

zfspool: local-zfs
        pool rpool/data
        content images,rootdir
        sparse 1

zfspool: prox-ZFS
        pool prox-ZFS
        content images,rootdir
        mountpoint /prox-ZFS
        sparse 0

dir: prox-backups
        path /mnt/zfs/backup
        content backup
        prune-backups keep-all=1
        shared 0

dir: prox-vm
        path /mnt/zfs/vm
        content images,rootdir
        prune-backups keep-all=1
        shared 0

dir: prox-iso
        path /mnt/zfs/iso
        content iso,vztmpl
        prune-backups keep-all=1
        shared 0

LXC config:
Code:
arch: amd64
cores: 1
hostname: piHole
memory: 256
net0: name=eth0,bridge=vmbr0,firewall=1,hwaddr=9E:3D:1F:85:AA:76,ip=dhcp,type=veth
onboot: 1
ostype: ubuntu
rootfs: prox-vm:100/vm-100-disk-1.raw,size=4G
startup: up=25
swap: 512
unprivileged: 1

Associated disk:
Code:
ls -lrta /mnt/zfs/vm/images/100/
total 1748886
-rw-r-----  1 root root 4294967296 Aug  2  2020 vm-100-disk-0.raw
drwxr-----  2 root root          4 Jan  8 17:39 .
-rw-r-----  1 root root 4294967296 Jan 11 16:40 vm-100-disk-1.raw
drwxr-xr-x 10 root root         10 Jan 11 21:49 ..


VM config:
Code:
balloon: 1024
bootdisk: scsi0
cores: 2
ide2: none,media=cdrom
memory: 2048
name: Plex
net0: virtio=66:2E:C5:F4:71:99,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: prox-vm:103/vm-103-disk-0.qcow2,size=21G
scsihw: virtio-scsi-pci
smbios1: uuid=49e4b457-f598-4b4f-8d46-9574a77a24c0
sockets: 1
startup: up=30
vmgenid: 6d35fc36-1fc6-482d-9c17-4350de8e2619

Associated disk:
Code:
ls -lrta /mnt/zfs/vm/images/103/
total 13451226
drwxr-----  2 root root           3 Jan 10 16:55 .
drwxr-xr-x 10 root root          10 Jan 11 21:49 ..
-rw-r-----  1 root root 22552248320 Jan 16 18:47 vm-103-disk-0.qcow2
 
An update to my own thread. I have tried a lot of different options over about a week of googling. The only thing that has worked to some degree so far has been to disable the zfs-import-cache service and enable zfs-import-scan instead.
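For reference, the switch itself is just flipping the two systemd units that ship with OpenZFS on Proxmox, followed by a reboot; a minimal sketch:

Code:
# stop importing pools from the cache file at boot
systemctl disable zfs-import-cache.service
# import pools by scanning devices instead
systemctl enable zfs-import-scan.service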

This works and my containers and VMs start on boot, but zfs-import-scan also detects the ZFS pool on the disks attached to an HBA card passed through to a virtualized TrueNAS instance and tries to import that as well. This is not ideal, and I wasn't able to find a way to blacklist certain pools, or to allow only specific pools to be imported by the scan service. The usual advice on forums is simply "use the cache service instead".

My best guess is that the cache service fails because the cache file lives on my boot zpool. At the point the service runs, that pool isn't initialized/mounted, so the cache file can't be read and no pools are imported. The hole in my theory is that after boot completes and I log into the system, all my pools are loaded. The root pool is obviously loaded because the system is running, but the secondary pool I use for VM disks etc. is also loaded.

So something is eventually importing/mounting these pools, just not in time, and I cannot figure out what it is. If anyone has any guidance or suggestions, they are much appreciated. I will continue to update this thread if I get something working.
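If anyone wants to poke at the same thing on their own system, something along these lines should show which cache file each pool points at and which pools the cache file actually contains (standard ZFS tooling, nothing Proxmox-specific; adjust the pool names to your setup):

Code:
# cachefile property per pool
zpool get cachefile rpool prox-ZFS
# dump the pool configurations recorded in the cache file
zdb -C -U /etc/zfs/zpool.cache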


More context:
journalctl -b
Code:
Jan 24 22:44:31 pve1 pvestatd[3259]: unable to activate storage 'prox-backups' - directory is expected to be a mount point but is not mounted: '/mnt/zfs/backup'
Jan 24 22:44:31 pve1 pvestatd[3259]: zfs error: cannot open 'prox-ZFS': no such pool
Jan 24 22:44:31 pve1 zed[4359]: eid=7 class=config_sync pool='prox-ZFS'
Jan 24 22:44:31 pve1 zed[4361]: eid=8 class=pool_import pool='prox-ZFS'
Jan 24 22:44:31 pve1 zed[4419]: eid=10 class=config_sync pool='prox-ZFS'

systemctl status zfs-import-cache.service
Code:
● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; vendor preset: enabled)
     Active: active (exited) since Tue 2023-01-24 22:44:17 PST; 10min ago
       Docs: man:zpool(8)
    Process: 2677 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN $ZPOOL_IMPORT_OPTS (code=exited, status=0/SUCCESS)
   Main PID: 2677 (code=exited, status=0/SUCCESS)
        CPU: 5ms

Jan 24 22:44:17 pve1 systemd[1]: Starting Import ZFS pools by cache file...
Jan 24 22:44:17 pve1 zpool[2677]: no pools available to import
Jan 24 22:44:17 pve1 systemd[1]: Finished Import ZFS pools by cache file.

zpool list
Code:
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
prox-ZFS   928G   528G   400G        -         -    18%    56%  1.00x    ONLINE  -
rpool      464G  1.58G   462G        -         -     0%     0%  1.00x    ONLINE  -
 
Hi,
so your main problem here seems to be that the directory-based storage is not available when the service wants to start the VMs, because the ZFS mountpoint is not mounted yet.

Have you tried these steps: https://forum.proxmox.com/threads/second-zfs-pool-failed-to-import-on-boot.102409/post-441193
So basically recreating the cache file for all of your pools and then rebuilding the initramfs?
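In short, something along these lines, adjusting the pool names to your setup:

Code:
# rewrite the cache file entries for both pools
zpool set cachefile=/etc/zfs/zpool.cache rpool
zpool set cachefile=/etc/zfs/zpool.cache prox-ZFS
# rebuild the initramfs so it picks up the updated cache file
update-initramfs -u -k all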

I would recommend migrating the VM disks from the directory-based storage to a ZVOL.
You should be able to migrate the disks via the WebUI by clicking on the container, then Resources > <disk> > Volume Action > Move Storage. There you can select the ZFS storage as the target.
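If you prefer the CLI, the equivalent should be roughly the following, using the IDs from the configs you posted (double-check the exact syntax for your PVE version):

Code:
# move the container root disk to the ZFS storage
pct move_volume 100 rootfs prox-ZFS
# move the VM disk to the ZFS storage
qm move_disk 103 scsi0 prox-ZFS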
 
Thanks for the reply, Chris. I did try recreating the cache file that my pools are configured to use, and that didn't resolve this for me. However, I did try your suggestion of migrating the VM disks to the ZFS storage rather than the directory storage, and got some interesting results.

It seems that moving at least one CT disk (and no VM disk) onto the ZFS storage results in both containers and VMs being started at boot correctly. Putting at least one VM disk (but no CT disk) on the ZFS storage results in VMs being started correctly, but not containers.

I'm glad there's a better way to get this working than zfs-import-scan, but do you have any explanation for this behaviour? I don't see why having just some disks on the ZFS storage, and leaving the rest on the directory storage, results in all the pools being imported properly at boot time.
 
Have you tried setting a startup delay for the VM/CT so that the ZFS import can finish first?
I tried playing with the delays and couldn't get the VMs/CTs to start on boot. I assume this is because the startup delay only takes effect after a VM/CT has successfully started; since all the ones in the boot order fail, it never waits at all.

In the meantime I've just moved all my disks onto the ZFS storage rather than the directory storage, and it seems to be happy with that.
 
Thanks for this, I somehow didn't turn this up while looking for a way to delay VM/CT startup. I just tried it by moving all my disks back to the directory storage and adding a 30 s delay on the node, and things are looking good.

So it looks like in this case, where no ZFS or GRUB options fix VMs/CTs not starting at boot, there are three options:
  1. Use zfs-import-scan. This works, but it can pick up pools you don't want to import, and it requires good physical security of the machine since it will automatically import any pool it detects, including ones on USB devices. Or,
  2. Don't use directory storage for VM/CT disks. Keep them on the ZVOL-backed ZFS storage as Chris suggested. Or,
  3. Add a startup delay to the node to give the system more time to import and mount all the pools. In my case 30 s was more than enough. I ran this command on the node (see the snippet after this list): pvenode config set --startall-onboot-delay 30
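For completeness, here is option 3 as a copy-pasteable snippet; the get subcommand is just there to confirm the setting took (assuming it's available on your PVE version):

Code:
# give onboot guests a 30 second head start after boot
pvenode config set --startall-onboot-delay 30
# show the node configuration to verify
pvenode config get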
Hope this helps anyone else finding their way here via Google or a forum search. Since there are multiple ways to get this working, I will mark this thread as solved. Thanks folks!
 
