pvestatd and pve-firewall don't start after reboot

Denis Kulikov

Recently we upgraded from 5.4 to 6.1 and replaced ifupdown with ifupdown2.
After these changes, pvestatd and pve-firewall no longer start after a reboot; the logs show only a timeout.

If 'systemctl start pvestatd' is invoked by hand, the service starts.

Can someone help us find the right direction to debug this?

root@pve1:/var/log# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-4.15: 5.4-12
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.13-3-pve: 5.3.13-3
pve-kernel-4.13: 5.2-2
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-23-pve: 4.15.18-51
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.15.18-19-pve: 4.15.18-45
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.13.16-4-pve: 4.13.16-51
pve-kernel-4.10.17-2-pve: 4.10.17-20
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve8
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1


Code:
Mar 23 22:22:42 pve1 systemd[1]: pvestatd.service: Start operation timed out. Terminating.
Mar 23 22:22:52 pve1 systemd[1]: pvestatd.service: Control process exited, code=killed, status=15/TERM
Mar 23 22:22:52 pve1 systemd[1]: pvestatd.service: Failed with result 'timeout'.

Code:
Mar 23 22:22:42 pve1 systemd[1]: pve-firewall.service: Start operation timed out. Terminating.
Mar 23 22:22:52 pve1 systemd[1]: pve-firewall.service: Control process exited, code=killed, status=15/TERM
Mar 23 22:22:52 pve1 systemd[1]: pve-firewall.service: Failed with result 'timeout'.
Mar 23 22:22:52 pve1 systemd[1]: Failed to start Proxmox VE firewall.
 
Please check the complete journal since boot (`journalctl -b`) - there are probably some hints about what times out around the messages from the two services.
Is this a clustered environment?
Is the cluster quorate, and is pmxcfs running at the time the services cannot start?
Also check dmesg for potential hints.

I hope this helps!
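A minimal set of commands along those lines (a sketch; the unit names are the standard PVE ones, adjust as needed):

Code:
# full journal for this boot, filtered to the relevant services
journalctl -b -u pvestatd -u pve-firewall -u pve-cluster -u corosync
# cluster quorum and service state
pvecm status
systemctl status pve-cluster corosync
# kernel messages that may hint at hardware or network delays
dmesg | tail -n 50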
 
Many thanks for the answer!
Is this a clustered environment?
Yes, we are doing a step-by-step upgrade of our 3 nodes to 6.
One node has been upgraded to 6 so far.

The cluster is quorate:

Code:
root@pve1:/etc/pve# pvecm status
Cluster information
-------------------
Name:             aaa
Config Version:   16
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Mar 31 17:03:37 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.209
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.168.7 (local)
0x00000002          1 192.168.168.6
0x00000004          1 192.168.168.3


Is the cluster quorate, and is pmxcfs running at the time the services cannot start?
Also check dmesg for potential hints.

You are absolutely right; the root cause is the delayed start of the Proxmox VE cluster filesystem (it cannot start correctly while boot is in progress, but afterwards it works correctly without user action).
But I cannot find the root cause of this:

Code:
mar 26 22:07:32 pve1 systemd[1]: Starting The Proxmox VE cluster filesystem...
mar 26 22:07:33 pve1 systemd[1]: Started LXC Container Initialization and Autoboot Code.

mar 26 22:07:37 pve1 pmxcfs[9250]: [quorum] crit: quorum_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [quorum] crit: can't initialize service
mar 26 22:07:37 pve1 pmxcfs[9250]: [confdb] crit: cmap_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [confdb] crit: can't initialize service
mar 26 22:07:37 pve1 pmxcfs[9250]: [dcdb] crit: cpg_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [dcdb] crit: can't initialize service
mar 26 22:07:37 pve1 pmxcfs[9250]: [status] crit: cpg_initialize failed: 2
mar 26 22:07:37 pve1 pmxcfs[9250]: [status] crit: can't initialize service

mar 26 22:07:38 pve1 systemd[1]: Started The Proxmox VE cluster filesystem.

[cut]

mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] Corosync Cluster Engine 3.0.3 starting up
mar 26 22:07:48 pve1 systemd[1]: Starting User Manager for UID 0...
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
mar 26 22:07:49 pve1 pmxcfs[9250]: [quorum] crit: quorum_initialize failed: 2
mar 26 22:07:49 pve1 pmxcfs[9250]: [confdb] crit: cmap_initialize failed: 2
mar 26 22:07:49 pve1 pmxcfs[9250]: [dcdb] crit: cpg_initialize failed: 2
mar 26 22:07:49 pve1 pmxcfs[9250]: [status] crit: cpg_initialize failed: 2
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.7 resolved as 192.168.168.7
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.7 resolved as 192.168.168.7
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.6 resolved as 192.168.168.6
mar 26 22:07:49 pve1 corosync[10975]:   [TOTEM ] totemip_parse: IPv4 address of 192.168.168.3 resolved as 192.168.168.3
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] Please migrate config file to nodelist.
mar 26 22:07:49 pve1 corosync[10975]:   [MAIN  ] cpu.rt_runtime_us doesn't exists -> system without cgroup or with disabled CONFIG_RT_GROUP_SCHED
mar 26 22:07:55 pve1 pmxcfs[9250]: [quorum] crit: quorum_initialize failed: 2
mar 26 22:07:55 pve1 pmxcfs[9250]: [confdb] crit: cmap_initialize failed: 2
mar 26 22:07:55 pve1 pmxcfs[9250]: [dcdb] crit: cpg_initialize failed: 2
mar 26 22:07:55 pve1 pmxcfs[9250]: [status] crit: cpg_initialize failed: 2

I will try to migrate the corosync config to a nodelist and read https://forum.proxmox.com/threads/after-upgrade-to-5-2-11-corosync-does-not-come-up.49075/.
I hope this helps.
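For reference, the corosync warning refers to the legacy bindnetaddr option; a rough sketch of the migration, assuming a default PVE-generated config (on PVE, edit /etc/pve/corosync.conf so the change propagates to all nodes, and bump config_version):

Code:
# /etc/pve/corosync.conf - relevant part only; addresses are illustrative
totem {
  cluster_name: aaa
  config_version: 17   # must be increased on every edit
  version: 2
  interface {
    linknumber: 0
    # bindnetaddr: 192.168.168.0  <- legacy option triggering the warning;
    #                                remove it, the nodelist addresses are used
  }
}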
 
We tried changing /lib/systemd/system/pve-cluster.service:
changing Before=corosync.service
to After=corosync.service
and it works (as a workaround after reboot).
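For reference, the same ordering change can be made as a systemd drop-in instead of editing the packaged unit file (which a package upgrade would overwrite) - a sketch:

Code:
# 'systemctl edit pve-cluster.service' opens an override file; in it:
[Unit]
# an empty assignment clears the packaged Before= ordering,
# then we order pve-cluster after corosync instead
Before=
After=corosync.service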
 
I would not suggest keeping this change - the cluster stack (corosync) is there to keep all nodes in your cluster synchronized.
If you start pmxcfs locally before it can synchronize the state from the other nodes, this could lead to a split-brain situation (e.g. the node starts a VM which, according to the quorate part of the cluster, is already running on another node).

If possible, try to check why corosync needs so long to start up.
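A few commands that may help narrow down the slow start (a sketch; `networking.service` is the unit through which ifupdown2 brings the interfaces up):

Code:
# which units dominate boot time, and what corosync waits on
systemd-analyze blame | head -n 20
systemd-analyze critical-chain corosync.service
# corosync's and the network's startup messages for this boot
journalctl -b -u corosync
journalctl -b -u networking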
 
