vm guest "startup on boot" fails when using shared NFSv3 storage

ymir

Renowned Member
Mar 18, 2016
While testing Proxmox VE 4.2 as an alternative virtualization solution for larger installations with shared storage, we ran into several issues, including one severe problem:

When booting a server with NFSv3 shares, the "start at boot" of KVM VM guests inconsistently* fails with "{hostname} pve-manager[2915]: storage '{nfs share}' is not online". *Inconsistently, because from time to time the start-up of the KVM guests is fine. When starting multiple KVM VM guests on shared storage the effect is completely random, since it is unpredictable whether or when any given VM guest will be able to reach its shared storage.

In our test scenario, all NFS shares are reached over a LACP bond connected to a dedicated storage LAN, attached to an OVS bridge with one OVS IntPort using a tagged VLAN. The storage servers are amd64 FreeNAS 9.3 and 9.10 systems with ZFS RAID-Z2, using mirrored SLC SSDs for the ZIL log.
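For context, the OVS side of the setup looks roughly like the following sketch in /etc/network/interfaces (interface names, VLAN tag and addresses are placeholders, not our actual values):

    # OVS bond (LACP) as uplink of the OVS bridge
    allow-vmbr1 bond1
    iface bond1 inet manual
        ovs_type OVSBond
        ovs_bridge vmbr1
        ovs_bonds eth2 eth3
        ovs_options bond_mode=balance-tcp lacp=active

    # OVS bridge for the storage LAN
    auto vmbr1
    iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports bond1 storage0

    # OVS IntPort on the tagged storage VLAN, used to reach the NFS servers
    allow-vmbr1 storage0
    iface storage0 inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=20
        address 192.168.20.10
        netmask 255.255.255.0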

So far, manually starting any VM guest has always been successful. All NFS connections work after a successful boot of the Proxmox VE host.

It appears as if the "start VM guests at boot time" procedure is not in sync with the (remote?) storage handling. Specifying start-up delays for the VM guests does not help either (unless the first started VM guest uses a local storage device? - this was not tested), since the delay apparently only takes effect after a previous VM guest has already started successfully.
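(For reference, the start-up delays mentioned above refer to the per-guest startup option, which can be set for example like this - VMID and values are just placeholders:

    qm set 101 --startup order=1,up=120

i.e. start this guest first and wait 120 seconds before starting the next one.)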

Any ideas on how to fix, or at least bypass, this critical problem (without creating a dummy VM guest or modifying the Perl source code :confused:)?
 
I've tested it here with the current 4.2 and it works fine: an Ubuntu VM on an NFSv3 share, with the share on a ZFS-backed Proxmox storage. Started and stopped the VM 20 times with no problems. Can you test it with another storage? We also use LACP on both sides here.
 
Thanks for the information :). Based on your feedback (it's not a general problem) I was able to isolate the cause of the unexpected behaviour of Proxmox (and find a sort of workaround o_O):

A usable workaround is to use a Linux bond, a VLAN interface attached to that bond, and a Linux bridge instead of an OVS bond, OVS bridge, and OVS IntPort with a tagged VLAN.

The "start-up vm guests at boot time" procedure is obviously not in sync with the remote storage handling, or to be more precise - with the network initialization. However, this seems to apply only if the remote storage is connected using Open vSwitch. This assumption is also indicated by the fact - evaluated by some testing - when starting a vm guest using only locally attached storage as first vm, utilizing a startup timeout, all subsequent vm guests with remote storage connected using OVS are starting without any problem.
(A short assumption without reviewing the source code here: maybe there is generally no sync checking implemented? But since the Linux bridge handling is much faster compared to OVS - syncing network initialization, the availability of remote storage and starting vm guests at boot time is typically not an issue :D?)

So, at least for the time being, the only option seems to be connecting the remote storage through a Linux bond and bridge instead of an OVS bond and bridge, to ensure all VM guests are started at boot time as expected. Of course, to use tagged VLANs on a Linux bridge with (LACP-)bonded interfaces, some manual editing of /etc/network/interfaces (see https://pve.proxmox.com/wiki/Vlans) is also required, roughly as sketched below.
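A minimal sketch of the working layout, assuming the same placeholder names, VLAN tag and addresses as above (adjust to your environment):

    # Linux LACP bond
    auto bond1
    iface bond1 inet manual
        slaves eth2 eth3
        bond_miimon 100
        bond_mode 802.3ad
        bond_xmit_hash_policy layer2+3

    # VLAN interface on top of the bond (requires the "vlan" package)
    auto bond1.20
    iface bond1.20 inet manual
        vlan-raw-device bond1

    # Linux bridge carrying the tagged storage VLAN, used to reach the NFS servers
    auto vmbr1
    iface vmbr1 inet static
        address 192.168.20.10
        netmask 255.255.255.0
        bridge_ports bond1.20
        bridge_stp off
        bridge_fd 0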
 
