long shutdown : A stop job is running for PVE Local HA Resource Manager Daemon

realynot

Hello all,
I installed two new nodes today.
No problems, except after joining the cluster...
Each time I shut down one of the two new nodes, I get this message for several long minutes:

<< A stop job is running for PVE Local HA Resource Manager Daemon ... >>

I exclusively use Proxmox with ZFS and NFS, no CephFS or GlusterFS.

Any ideas?

# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-2-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-2
pve-kernel-helper: 6.1-2
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-10
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-2
pve-cluster: 6.1-3
pve-container: 3.0-18
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-4
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 

Are you sure there is no CephFS? I ask again because we have seen this symptom exclusively in combination with a CephFS mount. For the CephFS-related issue I made a fix just an hour ago; it should trickle to pvetest tomorrow if internal testing is OK.

Any ideas?

If you really have no CephFS, can you enable persistent journal with:
Code:
mkdir /var/log/journal
systemctl restart systemd-journald.service

Then note the current time and reboot. Once the wait is over and the node is back up, check the log of the previous boot:
journalctl -b-1

Scroll to the time where you started the reboot and check for anything weird/error-like. Especially things like "Found ordering cycle on" or similar sounding messages. Post those here please.
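
To narrow down a long journal, the search described above can be sketched as a small filter. This is only a rough sketch: the function name `find_shutdown_clues` is hypothetical, and the pattern list is an assumption based on the messages mentioned in this thread, not an official list.

```shell
# Hypothetical helper (not part of Proxmox VE): reads journal text on stdin
# and keeps only lines that often matter when a shutdown hangs.
# The patterns are assumptions taken from messages quoted in this thread.
find_shutdown_clues() {
    grep -E -i 'Found ordering cycle|stop job|timed out|Killing process'
}
```

On a node with a persistent journal it could be used as `journalctl -b -1 --no-pager | find_shutdown_clues`. Anything it prints still needs to be read in context, so treat it as a starting point rather than a diagnosis.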
 
I really don't use CephFS.

Log of the last boot:
Jan 30 12:12:30 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:32 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:33 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:35 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:36 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:37 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 300060 ms
Jan 30 12:12:38 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:39 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:41 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:42 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:44 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:45 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 307980 ms
Jan 30 12:12:45 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:47 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:48 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:50 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:51 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:53 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 315900 ms
Jan 30 12:12:53 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:55 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:56 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:58 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:12:59 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:01 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:01 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 323820 ms
Jan 30 12:13:02 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:04 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:05 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:07 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:08 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:09 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 331740 ms
Jan 30 12:13:10 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:11 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:13 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:14 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:16 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:17 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 339660 ms
Jan 30 12:13:17 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:19 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:20 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:22 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:23 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:25 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:25 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 347580 ms
Jan 30 12:13:26 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:28 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:29 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:31 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:32 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:32 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 355500 ms
Jan 30 12:13:34 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:35 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:37 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:38 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:39 zfs-proxmox-01 systemd[1]: pvestatd.service: State 'stop-final-sigterm' timed out. Killing.
Jan 30 12:13:39 zfs-proxmox-01 systemd[1]: pvestatd.service: Killing process 13478 (pvestatd) with signal SIGKILL.
Jan 30 12:13:39 zfs-proxmox-01 systemd[1]: pve-firewall.service: State 'stop-final-sigterm' timed out. Killing.
Jan 30 12:13:39 zfs-proxmox-01 systemd[1]: pve-firewall.service: Killing process 13480 (pve-firewall) with signal SIGKILL.
Jan 30 12:13:40 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
Jan 30 12:13:40 zfs-proxmox-01 corosync[1581]: [TOTEM ] Token has not been received in 363420 ms
Jan 30 12:13:41 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
 
ifupdown2: 2.0.1-1+pve2

There seems to be an issue with that version of ifupdown2, we located it and made a fix.
An updated version should become available soon, first through our pvetest repository as "ifupdown2 version 2.0.1-1+pve3"
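
A quick way to tell whether a node still has the affected package is to compare its ifupdown2 version against the fixed one named above. The following is a minimal sketch assuming a Debian-based node where `dpkg` is available; the function name `ifupdown2_needs_update` is hypothetical.

```shell
# Version named in this thread as containing the fix.
FIXED_VERSION='2.0.1-1+pve3'

# Hypothetical helper: exits 0 if the given version is older than the fix,
# non-zero otherwise. Uses dpkg's own Debian version comparison.
ifupdown2_needs_update() {
    dpkg --compare-versions "$1" lt "$FIXED_VERSION"
}
```

For example, `ifupdown2_needs_update "$(dpkg-query -W -f='${Version}' ifupdown2)" && echo "update pending"` would report a node that is still on 2.0.1-1+pve2.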
 

OK, thanks, I'll check the pvetest repo now!

And what about this recurring log message during shutdown?
Jan 30 12:12:30 zfs-proxmox-01 corosync[1581]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuousl
 
The first reboot after updating from the pvetest repo went fine.
I will try several more times!


Edit:

It's OK, OK, OK!

Thanks

Great Job :D
 
@t.lamprecht: I'm facing a similar issue running the latest 6.1-5 from the enterprise repository; we are using both CephFS and ifupdown2. A reboot of one of the Proxmox hosts got stuck at "A stop job is running for PVE Local HA Resource Manager Daemon" for about 20 minutes, and I ended up power cycling the server. Will the CephFS fix you mentioned be pushed to the enterprise repo soon?
 
Will the CephFS fix you mentioned be pushed to the enterprise repo soon?

Yes, we are in the process of uploading a selected set of package updates, which includes this fix.
Note, though:
  • The first reboot after the update can still trigger the CephFS issue, as it is still mounted with the old auto-generated dependencies. If it does not impact your setup, you could unmount it once after upgrading; pvestatd will then soon remount it with the correct ordering. E.g., for a CephFS storage named "cephfs", just run umount /mnt/pve/cephfs once.
  • The ifupdown2 issue should be fixed immediately on upgrading.
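
The one-time unmount from the first point could be wrapped in a small guard so it does nothing when the storage is not mounted. This is only a sketch: the function name `unmount_once` is hypothetical, and the default path assumes the example storage name "cephfs" used above.

```shell
# Hypothetical helper: unmount a storage path once, but only if it is
# currently a mount point. The default path assumes a storage named "cephfs".
unmount_once() {
    mnt=${1:-/mnt/pve/cephfs}
    if mountpoint -q "$mnt" 2>/dev/null; then
        umount "$mnt" && echo "unmounted $mnt"
    else
        echo "$mnt is not mounted, nothing to do"
    fi
}
```

Run as root after the upgrade, `unmount_once` would unmount /mnt/pve/cephfs; according to the note above, pvestatd should then remount it shortly afterwards. Only do this if nothing on the node is actively using the storage.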
 
