Proxmox not waiting for NFS shares at boot

UltraHKR

New Member
Apr 13, 2023
Ecuador
Hello, I have a problem:

I have 4 servers:
  • SC836: Proxmox, Supermicro X8DTH-iF, 2x X5650, 192GB RAM
  • Dell R710: Proxmox, 2x X5650, 192GB RAM (4TB of ZFS local storage)
  • Dell R610: Proxmox, 1x X5640, 96GB RAM
  • IBM X3650 M3: TrueNAS, 2x X5645, 192GB RAM
The Proxmox hosts (all on v8.3, the latest) are clustered and work fine. The TrueNAS Scale host serves NFS as VM storage for all hosts. Everything runs the latest or near-latest versions of BIOS, Proxmox and TrueNAS.
  • The Proxmox hosts boot in less than 6-7 minutes.
  • The TrueNAS host takes longer to boot: 10-11 minutes.
  • NOTE: It's a combination of IBM BIOS/UEFI slowness and time to bring up 3x ZFS pools with 54 HDD spinners.
Every time power comes back, the VMs with NFS-backed storage fail to come online because NFS is not available yet at that moment.

I tried editing pve-guests.service and adding a 240-second delay to its start, but it doesn't work; it fires right away...

How can I fix this?
EDIT: Note this is a homelab setup.
 
https://www.reddit.com/r/Proxmox/comments/1gybd6z/proxmox_not_waiting_for_nfs_shares_at_boot/

Thought I remembered seeing this.

> Every time power comes back, the VMs with NFS-backed storage fail to come online because NFS is not available yet at that moment

What do you mean "every time power gets back"? With 54 HD spinners you should be running everything on a pretty beefy UPS system - and not tolerating brownouts unless the power gets cut for more than 5 minutes or so; then you want a nice clean shutdown.

Additionally, with a rig like that you should arguably be running 24/7. Are you shutting things down on a daily basis or something?
 
> Additionally, with a rig like that you should arguably be running 24/7. Are you shutting things down on a daily basis or something?

We get 2-3 power cuts daily, 6-12 hours long in total, nationwide. Unless I add a genset or a big solar panel setup, running 24/7 is not sustainable...
(the beauty of third world countries /sarcasm)

yep, that's also me on reddit...
 
I would seriously consider colocation in a datacenter then. They *do* have AC and generators.

Trying to run a big spinner setup in a home environment where the power is incredibly unreliable is just going to damage your hardware.
 
@UltraHKR, based on your unique setup, it seems you're working with an edge-case configuration and environment, which is unlikely to have widespread developer interest. Given the specialized requirements, you’ll need to implement a custom solution.

For example, a script could ensure that:
  1. The system waits until the disks are visible to the OS.
  2. The NAS starts, and the script waits for it to be fully operational, including service health checks.
  3. Once the NAS is ready, the other VMs are started.
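
A minimal sketch of steps 2 and 3, assuming a plain TCP connect to the NFS port (2049) is an acceptable stand-in for a real health check, and that the affected guests are started with `qm start` rather than by Proxmox's own start-at-boot. The function names, the NAS address and the VM IDs are all made up for illustration; none of this comes from the thread.

```python
import socket
import subprocess
import time


def nfs_reachable(host, port=2049, timeout=3.0):
    """Return True if a TCP connection to the NFS port succeeds.

    Assumption: an open port is only a rough proxy for "exports are ready".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(check, max_wait=900.0, interval=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or max_wait seconds have elapsed."""
    deadline = clock() + max_wait
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def start_guests(vmids, run=subprocess.run):
    """Run `qm start <vmid>` for each VM and return the IDs that failed."""
    failed = []
    for vmid in vmids:
        result = run(["qm", "start", str(vmid)])
        if result.returncode != 0:
            failed.append(vmid)
    return failed


def main(nas_host, vmids):
    """Wait for the NAS, then start the guests; return a process exit code."""
    if not wait_until(lambda: nfs_reachable(nas_host)):
        return 1  # NAS never came up within max_wait
    return 1 if start_guests(vmids) else 0
```

Something like `main("192.0.2.10", [101, 102])` (hypothetical address and IDs) could then be called from a oneshot unit on each node. The 900-second default is only a guess sized to the 10-11 minute NAS boot mentioned earlier in the thread.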

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
@bbgeek17
I can write a script; that part would be easy. I have already done it with fstab mounts (x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=600s,_netdev), and they work perfectly...

But what I don't get is why Proxmox doesn't do that by itself; it just assumes storage is already up and running...

Surely I'm not the only one doing NFS/iSCSI/FC setups?

Maybe a PDU would solve half of this issue if they can do staggered outlets, but right now that's $$$

https://midnightreign.org/xenserver-xcp-ng-vm-autostart
Something like this, but for the whole cluster... more stuff to break/forget...

Another way could be to delay the whole cluster quorum somehow...
 
> But what I don't get is why Proxmox by itself doesn't do that, it just assumes storage is already up and running...

Because that IS how you should be setting up an operation. Your situation is outside the design criteria for a highly available resource.

> Maybe a PDU would solve half of this issue if they can do staggered outlets, but right now that's $$$

A PDU can't fix a power outage. Maybe size and procure a UPS?

> Another way could be to delay the whole cluster quorum somehow...

That's just a terrible idea.

Potential solutions to your situation, from my perspective: create a systemd daemon that monitors your filer's availability and turns the storage on/off in pvesm accordingly. That would be relatively trivial to write. Depending on what the storage is used for, it may also be possible to have a local disk act as a cachefilesd target, which would insulate pvesm from timeouts but can cause other issues.
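
A rough sketch of the monitoring-daemon idea, assuming the filer is judged "up" when its NFS TCP port (2049) accepts a connection, and that the storage is toggled with `pvesm set <storage> --disable 0|1`. The function names, polling interval, host and storage ID are placeholders, not anything given in this thread.

```python
import socket
import subprocess
import time


def filer_up(host, port=2049, timeout=3.0):
    """Return True if the filer accepts a TCP connection on the NFS port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def set_storage_enabled(storage_id, enabled, run=subprocess.run):
    """Enable or disable a storage entry via `pvesm set`; True on success."""
    flag = "0" if enabled else "1"
    result = run(["pvesm", "set", storage_id, "--disable", flag])
    return result.returncode == 0


def sync_once(storage_id, is_up, last_state, run=subprocess.run):
    """Toggle the storage only when availability changed.

    Returns the state now recorded; on a failed pvesm call the old state is
    kept so the next poll retries.
    """
    if is_up == last_state:
        return last_state
    if set_storage_enabled(storage_id, is_up, run=run):
        return is_up
    return last_state


def monitor(host, storage_id, interval=15.0):
    """Poll forever; meant to run as a long-lived systemd service."""
    state = None  # unknown at start, so the first poll always syncs
    while True:
        state = sync_once(storage_id, filer_up(host), state)
        time.sleep(interval)
```

Since the storage definition is shared across the cluster, running `monitor("192.0.2.10", "truenas-nfs")` (hypothetical values) on a single node is probably enough. The guests would still need something to start them once the storage is re-enabled.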