pvestatd and pve-firewall don't start after reboot

Denis Kulikov

Recently we upgraded from 5.4 to 6.1 and replaced ifupdown with ifupdown2.
After these changes, pvestatd and pve-firewall no longer start after a reboot; the logs show only a timeout.

If 'systemctl start pvestatd' is invoked by hand, the service starts.

Can someone help us find the right direction to debug this?

root@pve1:/var/log# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-23-pve: 4.15.18-51
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.10.17-2-pve: 4.10.17-20
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1


Code:
Mar 23 22:22:42 pve1 systemd[1]: pvestatd.service: Start operation timed out. Terminating.
Mar 23 22:22:52 pve1 systemd[1]: pvestatd.service: Control process exited, code=killed, status=15/TERM
Mar 23 22:22:52 pve1 systemd[1]: pvestatd.service: Failed with result 'timeout'.

Code:
Mar 23 22:22:42 pve1 systemd[1]: pve-firewall.service: Start operation timed out. Terminating.
Mar 23 22:22:52 pve1 systemd[1]: pve-firewall.service: Control process exited, code=killed, status=15/TERM
Mar 23 22:22:52 pve1 systemd[1]: pve-firewall.service: Failed with result 'timeout'.
Mar 23 22:22:52 pve1 systemd[1]: Failed to start Proxmox VE firewall.
 
Please check the complete journal since boot (`journalctl -b`) - there are probably some hints about what times out around the messages from the two services.
Is this a clustered environment?
Is the cluster quorate, and is pmxcfs running at the time the services cannot start?
Also check dmesg for potential hints.

I hope this helps!
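A minimal set of commands along those lines (a sketch; the unit names are the standard PVE ones, adjust as needed):

Code:
# full journal for this boot, filtered to the relevant services
journalctl -b -u pvestatd -u pve-firewall -u pve-cluster -u corosync
# cluster quorum and service state
pvecm status
systemctl status pve-cluster corosync
# kernel messages that may hint at hardware or network delays
dmesg | tail -n 50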
 
Many thanks for the answer!
Is this a clustered environment?
Yes, we are doing a step-by-step upgrade of our 3 nodes to 6.
One node has been upgraded to 6 so far.

The cluster is quorate:

Code:
root@pve1:/etc/pve# pvecm status
Cluster information
-------------------
Name:             aaa
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar 31 17:03:37 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.209
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.168.7 (local)
0x00000002          1 192.168.168.6
0x00000004          1 192.168.168.3


Is the cluster quorate, and is pmxcfs running at the time the services cannot start?
Also check dmesg for potential hints.

You are absolutely right; the root cause is the delayed start of the Proxmox VE cluster filesystem (it cannot start correctly while boot is in progress, but afterwards it works correctly without user action).
But I cannot find the root cause of this:

Code:
mar 26 22:07:32 pve1 systemd[1]: Starting The Proxmox VE cluster filesystem...
mar 26 22:07:33 pve1 systemd[1]: Started LXC Container Initialization and Autoboot Code.

mar 26 22:07:37 pve1 pmxcfs[9250]: [quorum] crit: quorum_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [quorum] crit: can't initialize service
mar 26 22:07:37 pve1 pmxcfs[9250]: [confdb] crit: cmap_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [confdb] crit: can't initialize service
mar 26 22:07:37 pve1 pmxcfs[9250]: [dcdb] crit: cpg_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [dcdb] crit: can't initialize service
mar 26 22:07:37 pve1 pmxcfs[9250]: [status] crit: cpg_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [status] crit: can't initialize service

mar 26 22:07:38 pve1 systemd[1]: Started The Proxmox VE cluster filesystem.

[cut]

mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] Corosync Cluster Engine 3.0.3 starting up
mar 26 22:07:48 pve1 systemd[1]: Starting User Manager for UID 0...
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
mar 26 22:07:49 pve1 pmxcfs[9250]: [quorum] crit: quorum_initialize failed: 2
mar 26 22:07:49 pve1 pmxcfs[9250]: [confdb] crit: cmap_initialize failed: 2
mar 26 22:07:49 pve1 pmxcfs[9250]: [dcdb] crit: cpg_initialize failed: 2
mar 26 22:07:49 pve1 pmxcfs[9250]: [status] crit: cpg_initialize failed: 2
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.7 resolved as 192.168.168.7
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.7 resolved as 192.168.168.7
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.6 resolved as 192.168.168.6
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.3 resolved as 192.168.168.3
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] Please migrate config file to nodelist.
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] cpu.rt_runtime_us doesn't exists -> system without cgroup or with disabled CONFIG_RT_GROUP_SCHED
mar 26 22:07:55 pve1 pmxcfs[9250]: [quorum] crit: quorum_initialize failed: 2
mar 26 22:07:55 pve1 pmxcfs[9250]: [confdb] crit: cmap_initialize failed: 2
mar 26 22:07:55 pve1 pmxcfs[9250]: [dcdb] crit: cpg_initialize failed: 2
mar 26 22:07:55 pve1 pmxcfs[9250]: [status] crit: cpg_initialize failed: 2

I will try to migrate the corosync config to a nodelist and read https://forum.proxmox.com/threads/after-upgrade-to-5-2-11-corosync-does-not-come-up.49075/.
I hope this helps.
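For reference, the corosync warning refers to the legacy bindnetaddr option; a rough sketch of the migration, assuming a default PVE-generated config (on PVE, edit /etc/pve/corosync.conf so the change propagates to all nodes, and bump config_version):

Code:
# /etc/pve/corosync.conf - relevant part only; addresses are illustrative
totem {
  cluster_name: aaa
  config_version: 17   # must be increased on every edit
  version: 2
  interface {
    linknumber: 0
    # bindnetaddr: 192.168.168.0  <- legacy option triggering the warning;
    #                                remove it, the nodelist addresses are used
  }
}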
 
We tried changing /lib/systemd/system/pve-cluster.service:
changing Before=corosync.service
to After=corosync.service
and it works (as a workaround after reboot).
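For reference, the same ordering change can be made as a systemd drop-in instead of editing the packaged unit file (which a package upgrade would overwrite) - a sketch:

Code:
# 'systemctl edit pve-cluster.service' opens an override file; in it:
[Unit]
# an empty assignment clears the packaged Before= ordering,
# then we order pve-cluster after corosync instead
Before=
After=corosync.service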
 
I would not suggest keeping this change - the cluster stack (corosync) is there to keep all nodes in your cluster synchronized.
If you start pmxcfs locally before it can synchronize the state from the other nodes, this could lead to a split-brain situation (e.g. the node starts a VM which, according to the quorate part of the cluster, is already running on another node).

If possible, try to check why corosync needs so long to start up.
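A few commands that may help narrow down the slow start (a sketch; `networking.service` is the unit through which ifupdown2 brings the interfaces up):

Code:
# which units dominate boot time, and what corosync waits on
systemd-analyze blame | head -n 20
systemd-analyze critical-chain corosync.service
# corosync's and the network's startup messages for this boot
journalctl -b -u corosync
journalctl -b -u networking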
 
