VMs not starting at boot on 3-node Proxmox cluster

tommisan

We have a 3-node production cluster (enterprise repo) on Dell PowerEdge R740xd servers with both local and shared storage.
After a power loss in the datacenter, the VMs on the Proxmox cluster didn't start at boot, even though "Start at boot" is set to yes.

pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-3-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.11: 7.0-10
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-4-pve: 5.11.22-9

The servers are configured to power on automatically after power is restored.
The VMs start fine when powered on manually.


Any ideas?
Thanks
 
can you post the journal from such a boot where it does not work?
 
can you post the journal from such a boot where it does not work?
-- Journal begins at Mon 2021-09-20 16:22:32 CEST, ends at Mon 2022-02-28 12:17:01 CET. --
Feb 24 17:15:01 pve2 systemd[1]: Starting The Proxmox VE cluster filesystem...
Feb 24 17:15:01 pve2 pmxcfs[2218]: [quorum] crit: quorum_initialize failed: 2
Feb 24 17:15:01 pve2 pmxcfs[2218]: [quorum] crit: can't initialize service
Feb 24 17:15:01 pve2 pmxcfs[2218]: [confdb] crit: cmap_initialize failed: 2
Feb 24 17:15:01 pve2 pmxcfs[2218]: [confdb] crit: can't initialize service
Feb 24 17:15:01 pve2 pmxcfs[2218]: [dcdb] crit: cpg_initialize failed: 2
Feb 24 17:15:01 pve2 pmxcfs[2218]: [dcdb] crit: can't initialize service
Feb 24 17:15:01 pve2 pmxcfs[2218]: [status] crit: cpg_initialize failed: 2
Feb 24 17:15:01 pve2 pmxcfs[2218]: [status] crit: can't initialize service
Feb 24 17:15:02 pve2 systemd[1]: Started The Proxmox VE cluster filesystem.
Feb 24 17:15:07 pve2 pmxcfs[2218]: [status] notice: update cluster info (cluster name clusterdisia, version = 3)
Feb 24 17:15:07 pve2 pmxcfs[2218]: [dcdb] notice: members: 2/2218
Feb 24 17:15:07 pve2 pmxcfs[2218]: [dcdb] notice: all data is up to date
Feb 24 17:15:07 pve2 pmxcfs[2218]: [status] notice: members: 2/2218
Feb 24 17:15:07 pve2 pmxcfs[2218]: [status] notice: all data is up to date
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: members: 2/2218, 3/2197
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: starting data syncronisation
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: cpg_send_message retried 1 times
Feb 24 17:16:10 pve2 pmxcfs[2218]: [status] notice: node has quorum
Feb 24 17:16:10 pve2 pmxcfs[2218]: [status] notice: members: 2/2218, 3/2197
Feb 24 17:16:10 pve2 pmxcfs[2218]: [status] notice: starting data syncronisation
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: received sync request (epoch 2/2218/00000002)
Feb 24 17:16:10 pve2 pmxcfs[2218]: [status] notice: received sync request (epoch 2/2218/00000002)
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: received all states
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: leader is 2/2218
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: synced members: 2/2218, 3/2197
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: start sending inode updates
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: sent all (0) updates
Feb 24 17:16:10 pve2 pmxcfs[2218]: [dcdb] notice: all data is up to date
Feb 24 17:16:10 pve2 pmxcfs[2218]: [status] notice: received all states
Feb 24 17:16:10 pve2 pmxcfs[2218]: [status] notice: all data is up to date
Feb 24 17:16:11 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:16:11 pve2 pmxcfs[2218]: [dcdb] notice: members: 1/2227, 2/2218, 3/2197
Feb 24 17:16:11 pve2 pmxcfs[2218]: [dcdb] notice: starting data syncronisation
Feb 24 17:16:11 pve2 pmxcfs[2218]: [status] notice: members: 1/2227, 2/2218, 3/2197
Feb 24 17:16:11 pve2 pmxcfs[2218]: [status] notice: starting data syncronisation
Feb 24 17:16:12 pve2 pmxcfs[2218]: [dcdb] notice: received sync request (epoch 1/2227/00000002)
Feb 24 17:16:12 pve2 pmxcfs[2218]: [status] notice: received sync request (epoch 1/2227/00000002)
Feb 24 17:16:12 pve2 pmxcfs[2218]: [dcdb] notice: received all states
Feb 24 17:16:12 pve2 pmxcfs[2218]: [dcdb] notice: leader is 1/2227
Feb 24 17:16:12 pve2 pmxcfs[2218]: [dcdb] notice: synced members: 1/2227, 2/2218, 3/2197
Feb 24 17:16:12 pve2 pmxcfs[2218]: [dcdb] notice: all data is up to date
Feb 24 17:16:12 pve2 pmxcfs[2218]: [status] notice: received all states
Feb 24 17:16:12 pve2 pmxcfs[2218]: [status] notice: all data is up to date
Feb 24 17:16:12 pve2 pmxcfs[2218]: [status] notice: dfsm_deliver_queue: queue length 3
Feb 24 17:16:12 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:16:18 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:16:19 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:16:26 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:33:45 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:35:28 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:35:28 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:35:32 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:35:33 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:36:08 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:36:09 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 17:48:21 pve2 pmxcfs[2218]: [status] notice: received log
Feb 24 18:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 24 19:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 24 20:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 24 21:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 24 22:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 24 23:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 25 00:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 25 01:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 25 02:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
Feb 25 03:14:47 pve2 pmxcfs[2218]: [dcdb] notice: data verification successful
 
mhmm... that is not the whole journal since boot, i'd need the output of
Code:
journalctl -b -0
for example (the -0 is the current boot, -1 the one before that, etc.)
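if the full journal is too long to post, you could also narrow it down to the services involved in guest autostart and cluster/storage state (just a suggestion, assuming the standard unit names):
Code:
journalctl -b -0 -u pve-guests -u pve-cluster -u pvestatd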
 
it seems that the storage is not online during the boot (or not reachable), i see many lines like this:

Feb 24 17:16:31 pve2 pve-guests[2779]: storage 'dm5000h' is not online

can you show your storage config? (/etc/pve/storage.cfg)
 
it seems that the storage is not online during the boot (or not reachable), i see many lines like this:

Feb 24 17:16:31 pve2 pve-guests[2779]: storage 'dm5000h' is not online

can you show your storage config? (/etc/pve/storage.cfg)
The shared storage was also shut down uncleanly during the power loss, so it may have taken some time to boot and recover properly.

# cat /etc/pve/storage.cfg
dir: local
path /var/lib/vz
content iso,vztmpl,backup

lvmthin: local-lvm
thinpool data
vgname pve
content rootdir,images

dir: bck1
path /mnt/pve/bck1
content iso,rootdir,backup,snippets,images,vztmpl
is_mountpoint 1
nodes pve1

dir: bck2
path /mnt/pve/bck2
content vztmpl,images,snippets,backup,rootdir,iso
is_mountpoint 1
nodes pve2

dir: bck3
path /mnt/pve/bck3
content iso,rootdir,backup,snippets,images,vztmpl
is_mountpoint 1
nodes pve3

nfs: dm5000h
export /vm_nfs
path /mnt/pve/dm5000h
server 10.0.34.11
content images,rootdir
prune-backups keep-all=1


# pvesm status --storage dm5000h
Name Type Status Total Used Available %
dm5000h nfs active 4294967360 321790400 3973176960 7.49%
 
The shared storage was also shut down uncleanly during the power loss, so it may have taken some time to boot and recover properly.
if the necessary storage is not online when trying to start the vm, that cannot work...
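as a quick check you can also query the NFS server's exports directly from the node once it is back up (a sketch, using the server and export from your storage.cfg):
Code:
showmount -e 10.0.34.11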
 
if the necessary storage is not online when trying to start the vm, that cannot work...
I agree, but when different systems, such as servers and storage, start at the same time, they can come up several minutes apart.
How long after a PVE node boots does it keep retrying to mount the NFS storage and start the VMs?
 
the nfs will be retried every 10 seconds, but once the 'startall' task was run (even with errors) it will not be run again (until the next reboot)

you can set an 'onboot delay' for the startall, but that will then unconditionally happen every time (regardless of whether the storage is online or not)

Code:
pvenode config set -startall-onboot-delay <seconds>
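for example, to delay the autostart by two minutes (the 120 seconds are just an example value, pick whatever your storage needs) and to verify the node config afterwards:
Code:
pvenode config set -startall-onboot-delay 120
pvenode config get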
 
the nfs will be retried every 10 seconds, but once the 'startall' task was run (even with errors) it will not be run again (until the next reboot)

you can set an 'onboot delay' for the startall, but that will then unconditionally happen every time (regardless of whether the storage is online or not)

Code:
pvenode config set -startall-onboot-delay <seconds>
Could it also be set in a configuration file instead?
What is the default startall-onboot-delay?
Could startall-onboot-delay be set per storage instead of globally?
Thanks for your support
 
Could it also be set in a configuration file instead?
what do you mean? if you execute that command this is saved until you set it to a different value

What is the default startall-onboot-delay?
the default is no delay at all

Could startall-onboot-delay be set per storage instead of globally?
no, that's not how that works currently

before starting the vms, the 'startall' call has no idea which storages are needed; that is handled deeper in the code by the vm start itself
 
what do you mean? if you execute that command this is saved until you set it to a different value
I mean: can I edit the file (where is it?) in which the parameter is stored, and then reload it somehow, instead of setting it at runtime on the command line? Just so I remember that I set it, in case of a cluster upgrade/migration.
But the command line is fine too.


the default is no delay at all


no, that's not how that works currently

before starting the vms, the 'startall' call has no idea which storages are needed; that is handled deeper in the code by the vm start itself

Ok
 
I mean: can I edit the file (where is it?) in which the parameter is stored, and then reload it somehow, instead of setting it at runtime on the command line? Just so I remember that I set it, in case of a cluster upgrade/migration.
the file where it is saved is '/etc/pve/nodes/<NODENAME>/config'
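after setting the delay via 'pvenode config set' it should show up in that file as a simple key/value line, roughly like this (hypothetical example with a 120 second delay):
Code:
startall-onboot-delay: 120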
 
