Homelab nodes won't reboot/shut down: "A Stop Job Is Running For..."

AllanM

Hello!

I've been running Proxmox on my 4-node homelab cluster of X9-generation Supermicro FatTwins for a couple of months now. Everything worked well until recently, but now the nodes won't reboot or shut down anymore. I'm not sure what triggered this; it might be the latest updates.

I get these "A stop job is running for..." messages with "no limit." After 30 minutes of this, this happens:

Jan 24 03:56:13 proxbox3 systemd[1]: poweroff.target: Job poweroff.target/start timed out.
Jan 24 03:56:13 proxbox3 systemd[1]: Timed out starting Power-Off.
Jan 24 03:56:13 proxbox3 systemd[1]: poweroff.target: Job poweroff.target/start failed with result 'timeout'.
Jan 24 03:56:13 proxbox3 systemd[1]: Forcibly powering off: job timed out
Jan 24 03:56:13 proxbox3 systemd[1]: Shutting down.
Jan 24 03:56:13 proxbox3 kernel: [ 7465.851881] printk: systemd-shutdow: 59 output lines suppressed due to ratelimiting

Basically, they don't reboot/shut down properly anymore. All 4 nodes do this whenever I attempt a reboot or shutdown, but the specific "stop job" called out isn't consistent. Sometimes it's a guest process, sometimes an HA process, sometimes a firewall process... I think the "a stop job is running for" message on the console is actually a red herring.

Jan 24 03:26:43 proxbox3 kernel: [ 5695.655549] libceph: mon3 (1)10.21.21.104:6789 session lost, hunting for new mon
Jan 24 03:26:43 proxbox3 kernel: [ 5695.655603] libceph: mon2 (1)10.21.21.103:6789 connect error
Jan 24 03:26:44 proxbox3 kernel: [ 5696.615615] libceph: mon2 (1)10.21.21.103:6789 connect error
.....
Jan 24 03:27:05 proxbox3 kernel: [ 5717.608345] libceph: mon3 (1)10.21.21.104:6789 connect error
Jan 24 03:27:09 proxbox3 kernel: [ 5721.768489] libceph: mon3 (1)10.21.21.104:6789 connect error
Jan 24 03:27:11 proxbox3 kernel: [ 5723.816517] ceph: mds0 caps stale
Jan 24 03:27:17 proxbox3 kernel: [ 5729.704745] libceph: mon3 (1)10.21.21.104:6789 connect error
Jan 24 03:27:26 proxbox3 kernel: [ 5738.409054] libceph: mon0 (1)10.21.21.101:6789 connect error
....
Jan 24 03:31:03 proxbox3 kernel: [ 5955.632707] libceph: mon1 (1)10.21.21.102:6789 connect error
Jan 24 03:31:05 proxbox3 kernel: [ 5957.616761] libceph: mon1 (1)10.21.21.102:6789 connect error
Jan 24 03:31:08 proxbox3 kernel: [ 5960.624528] ceph: mds0 hung
Jan 24 03:31:09 proxbox3 kernel: [ 5961.648942] libceph: mon1 (1)10.21.21.102:6789 connect error
Jan 24 03:31:17 proxbox3 kernel: [ 5969.841287] libceph: mon1 (1)10.21.21.102:6789 connect error
....



Looks to me like Ceph isn't hanging up the phone properly.
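To see at a glance which monitors the kernel client is failing to reach, the libceph lines above can be tallied per monitor. This is only a rough sketch: the regex is written against the line format shown in this thread, and the function name `summarize_mon_events` is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in this thread, e.g.
#   "... kernel: [ 5695.655603] libceph: mon2 (1)10.21.21.103:6789 connect error"
MON_LINE = re.compile(
    r"libceph: (?P<mon>mon\d+) \(\d+\)(?P<addr>[\d.]+:\d+) (?P<event>.+)$"
)


def summarize_mon_events(lines):
    """Count libceph monitor events per (monitor, address, event)."""
    counts = Counter()
    for line in lines:
        match = MON_LINE.search(line)
        if match:
            key = (match["mon"], match["addr"], match["event"].strip())
            counts[key] += 1
    return counts


# Three lines copied from the log excerpt above.
sample = [
    "Jan 24 03:26:43 proxbox3 kernel: [ 5695.655549] libceph: mon3 (1)10.21.21.104:6789 session lost, hunting for new mon",
    "Jan 24 03:26:43 proxbox3 kernel: [ 5695.655603] libceph: mon2 (1)10.21.21.103:6789 connect error",
    "Jan 24 03:26:44 proxbox3 kernel: [ 5696.615615] libceph: mon2 (1)10.21.21.103:6789 connect error",
]

for (mon, addr, event), n in summarize_mon_events(sample).items():
    print(f"{mon} {addr}: {event} x{n}")
```

If every monitor shows up with connect errors, that would point at the node having lost its network path to the Ceph cluster rather than at any single monitor.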


---------------

root@proxbox1:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-10
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: residual config
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-2
pve-cluster: 6.1-3
pve-container: 3.0-16
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2


Will add syslog in reply....

Thoughts?

Any help is greatly appreciated. I suspect I could "fix" this by going from node to node and performing a clean reinstall of Proxmox, rejoining the cluster, copy/pasting the network config, adding OSDs, etc., but I'm hoping there's an easier fix.

Thanks,
-Eric
 
Syslog from a shutdown sequence:

Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LVM event activation on device 259:0...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LVM event activation on device 8:16...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LVM event activation on device 8:3...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LVM event activation on device 8:32...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target Timers.
Jan 24 03:26:13 proxbox3 systemd[1]: apt-daily-upgrade.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Daily apt upgrade and clean activities.
Jan 24 03:26:13 proxbox3 systemd[1]: apt-daily.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Daily apt download activities.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LVM event activation on device 8:48...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LVM event activation on device 8:64...
Jan 24 03:26:13 proxbox3 systemd[1]: systemd-rfkill.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Closed Load/Save RF Kill Switch Status /dev/rfkill Watch.
Jan 24 03:26:13 proxbox3 systemd[1]: logrotate.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Daily rotation of log files.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Network initialization...
Jan 24 03:26:13 proxbox3 systemd[1]: man-db.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Daily man-db regeneration.
Jan 24 03:26:13 proxbox3 systemd[1]: Unmounting RPC Pipe File System...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Availability of block devices...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target Graphical Interface.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target Multi-User System.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping LXC Container Monitoring Daemon...
Jan 24 03:26:13 proxbox3 systemd[1]: postfix.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Postfix Mail Transport Agent.
Jan 24 03:26:13 proxbox3 systemd[42169]: run-rpc_pipefs.mount: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Postfix Mail Transport Agent (instance -)...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Regular background program processing daemon...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target Login Prompts.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Getty on tty1...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping PVE Qemu Event Daemon...
Jan 24 03:26:13 proxbox3 smartd[4007]: smartd received signal 15: Terminated
Jan 24 03:26:13 proxbox3 smartd[1185]: smartd received signal 15: Terminated
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Self Monitoring and Reporting Technology (SMART) Daemon...
Jan 24 03:26:13 proxbox3 smartd[1185]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Micron_M600_MTFDDAV128MBF-155211621D3A.ata.state
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target ZFS startup target.
Jan 24 03:26:13 proxbox3 smartd[4007]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Micron_M600_MTFDDAV128MBF-155211621D3A.ata.state
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target ZFS pool import target.
Jan 24 03:26:13 proxbox3 smartd[1185]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.SHGS31_500GS_2-MJ98N75481060AQ24.ata.state
Jan 24 03:26:13 proxbox3 systemd[1]: zfs-share.service: Succeeded.
Jan 24 03:26:13 proxbox3 smartd[4007]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.SHGS31_500GS_2-MJ98N75481060AQ24.ata.state
Jan 24 03:26:13 proxbox3 smartd[1185]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD40EZRZ_75GXCB0-WD_WCC7K4LULDKN.ata.state
Jan 24 03:26:13 proxbox3 smartd[4007]: Device: /dev/sdc [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD40EZRZ_75GXCB0-WD_WCC7K4LULDKN.ata.state
Jan 24 03:26:13 proxbox3 smartd[1185]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.SHGS31_500GS_2-MJ98N75481060AQ3P.ata.state
Jan 24 03:26:13 proxbox3 smartd[4007]: Device: /dev/sdd [SAT], state written to /var/lib/smartmontools/smartd.SHGS31_500GS_2-MJ98N75481060AQ3P.ata.state
Jan 24 03:26:13 proxbox3 smartd[1185]: Device: /dev/sde [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD1002FBYS_02A6B0-WD_WMATV6439842.ata.state
Jan 24 03:26:13 proxbox3 smartd[4007]: Device: /dev/sde [SAT], state written to /var/lib/smartmontools/smartd.WDC_WD1002FBYS_02A6B0-WD_WMATV6439842.ata.state
Jan 24 03:26:13 proxbox3 smartd[1185]: Device: /dev/nvme0, state written to /var/lib/smartmontools/smartd.Force_MP510-1946824200012918341A.nvme.state
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped ZFS file system shares.
Jan 24 03:26:13 proxbox3 smartd[4007]: Device: /dev/nvme0, state written to /var/lib/smartmontools/smartd.Force_MP510-1946824200012918341A.nvme.state
Jan 24 03:26:13 proxbox3 smartd[1185]: smartd is exiting (exit status 0)
Jan 24 03:26:13 proxbox3 smartd[4007]: smartd is exiting (exit status 0)
Jan 24 03:26:13 proxbox3 zed[1189]: Exiting
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping ZFS Event Daemon (zed)...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target ZFS volumes are ready.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping PVE guests...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped target RPC Port Mapper.
Jan 24 03:26:13 proxbox3 systemd[1]: pvesr.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Proxmox VE replication runner.
Jan 24 03:26:13 proxbox3 blkdeactivate[42340]: Deactivating block devices:
Jan 24 03:26:13 proxbox3 systemd[1]: Removed slice system-ceph\x2dvolume.slice.
Jan 24 03:26:13 proxbox3 systemd[1]: pve-daily-update.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Daily PVE download activities.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Kernel Samepage Merging (KSM) Tuning Daemon...
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-lvmpolld.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Closed LVM2 poll daemon socket.
Jan 24 03:26:13 proxbox3 systemd[1]: systemd-tmpfiles-clean.timer: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Daily Cleanup of Temporary Directories.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Session 13 of user root.
Jan 24 03:26:13 proxbox3 systemd[1]: smartmontools.service: Got notification message from PID 1185, but reception only permitted for main PID 4007
Jan 24 03:26:13 proxbox3 systemd[1]: qmeventd.service: Main process exited, code=killed, status=15/TERM
Jan 24 03:26:13 proxbox3 systemd[1]: qmeventd.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped PVE Qemu Event Daemon.
Jan 24 03:26:13 proxbox3 blkdeactivate[42340]: [SKIP]: unmount of pve-swap (dm-6) mounted on [SWAP]
Jan 24 03:26:13 proxbox3 blkdeactivate[42340]: [SKIP]: unmount of pve-root (dm-1) mounted on /
Jan 24 03:26:13 proxbox3 systemd[1]: zfs-zed.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped ZFS Event Daemon (zed).
Jan 24 03:26:13 proxbox3 systemd[1]: ksmtuned.service: Main process exited, code=killed, status=15/TERM
Jan 24 03:26:13 proxbox3 systemd[1]: ksmtuned.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Kernel Samepage Merging (KSM) Tuning Daemon.
Jan 24 03:26:13 proxbox3 systemd[1]: lxc-monitord.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LXC Container Monitoring Daemon.
Jan 24 03:26:13 proxbox3 pmxcfs[1317]: [status] notice: received log
Jan 24 03:26:13 proxbox3 postfix/postfix-script[42362]: stopping the Postfix mail system
Jan 24 03:26:13 proxbox3 postfix/master[1493]: terminating on signal 15
 
Jan 24 03:26:13 proxbox3 systemd[1]: getty@tty1.service: Main process exited, code=killed, status=15/TERM
Jan 24 03:26:13 proxbox3 systemd[1]: getty@tty1.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Getty on tty1.
Jan 24 03:26:13 proxbox3 systemd[1]: cron.service: Main process exited, code=killed, status=15/TERM
Jan 24 03:26:13 proxbox3 systemd[1]: cron.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Regular background program processing daemon.
Jan 24 03:26:13 proxbox3 systemd[1]: smartmontools.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-pvscan@259:0.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LVM event activation on device 259:0.
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-pvscan@8:16.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LVM event activation on device 8:16.
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-pvscan@8:3.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LVM event activation on device 8:3.
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-pvscan@8:32.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LVM event activation on device 8:32.
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-pvscan@8:48.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LVM event activation on device 8:48.
Jan 24 03:26:13 proxbox3 systemd[1]: lvm2-pvscan@8:64.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped LVM event activation on device 8:64.
Jan 24 03:26:13 proxbox3 systemd[1]: run-rpc_pipefs.mount: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Unmounted RPC Pipe File System.
Jan 24 03:26:13 proxbox3 systemd[1]: postfix@-.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Postfix Mail Transport Agent (instance -).
Jan 24 03:26:13 proxbox3 systemd[1]: session-13.scope: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Session 13 of user root.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping User Manager for UID 0...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Login Service...
Jan 24 03:26:13 proxbox3 systemd[1]: Removed slice system-postfix.slice.
Jan 24 03:26:13 proxbox3 systemd[42169]: Stopped target Default.
Jan 24 03:26:13 proxbox3 systemd[42169]: Stopped target Basic System.
Jan 24 03:26:13 proxbox3 systemd[42169]: Stopped target Timers.
Jan 24 03:26:13 proxbox3 systemd[42169]: Stopped target Sockets.
Jan 24 03:26:13 proxbox3 systemd[42169]: gpg-agent.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[42169]: Closed GnuPG cryptographic agent and passphrase cache.
Jan 24 03:26:13 proxbox3 systemd[42169]: gpg-agent-extra.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[42169]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jan 24 03:26:13 proxbox3 systemd[42169]: gpg-agent-ssh.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[42169]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jan 24 03:26:13 proxbox3 systemd[42169]: gpg-agent-browser.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[42169]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 24 03:26:13 proxbox3 systemd[42169]: Stopped target Paths.
Jan 24 03:26:13 proxbox3 systemd[1]: Removed slice system-lvm2\x2dpvscan.slice.
Jan 24 03:26:13 proxbox3 systemd[42169]: dirmngr.socket: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[42169]: Closed GnuPG network certificate management daemon.
Jan 24 03:26:13 proxbox3 systemd[42169]: Reached target Shutdown.
Jan 24 03:26:13 proxbox3 systemd[42169]: systemd-exit.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[42169]: Started Exit the Session.
Jan 24 03:26:13 proxbox3 systemd[42169]: Reached target Exit the Session.
Jan 24 03:26:13 proxbox3 systemd[1]: Removed slice system-getty.slice.
Jan 24 03:26:13 proxbox3 systemd[1]: user@0.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped User Manager for UID 0.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping User Runtime Directory /run/user/0...
Jan 24 03:26:13 proxbox3 systemd[1]: run-user-0.mount: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Unmounted /run/user/0.
Jan 24 03:26:13 proxbox3 systemd[1]: user-runtime-dir@0.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped User Runtime Directory /run/user/0.
Jan 24 03:26:13 proxbox3 systemd[1]: Removed slice User Slice of UID 0.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping Permit User Sessions...
Jan 24 03:26:13 proxbox3 systemd[1]: Stopping D-Bus System Message Bus...
Jan 24 03:26:13 proxbox3 systemd[1]: systemd-user-sessions.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Permit User Sessions.
Jan 24 03:26:13 proxbox3 systemd[1]: dbus.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped D-Bus System Message Bus.
Jan 24 03:26:13 proxbox3 blkdeactivate[42340]: [LVM]: deactivating Volume Group ceph-5fb35b6b-f12c-409c-bc6e-20b6e9dfb8d5... skipping
Jan 24 03:26:13 proxbox3 systemd[1]: systemd-logind.service: Succeeded.
Jan 24 03:26:13 proxbox3 systemd[1]: Stopped Login Service.
Jan 24 03:26:13 proxbox3 kernel: [ 5666.007379] vmbr0: port 1(eno1) entered disabled state
Jan 24 03:26:13 proxbox3 kernel: [ 5666.030170] vmbr0: port 1(eno1) entered disabled state
Jan 24 03:26:14 proxbox3 blkdeactivate[42340]: [LVM]: deactivating Volume Group ceph-4f909d92-61a6-4939-8ac6-e4f6d2ead8d6... skipping
Jan 24 03:26:14 proxbox3 blkdeactivate[42340]: [LVM]: deactivating Volume Group ceph-879cd629-93ad-4118-8ce4-247d864d6ed1... skipping
Jan 24 03:26:14 proxbox3 kernel: [ 5666.520064] vmbr1: port 1(eno2) entered disabled state
Jan 24 03:26:14 proxbox3 kernel: [ 5666.536194] device eno2 left promiscuous mode
Jan 24 03:26:14 proxbox3 kernel: [ 5666.536307] vmbr1: port 1(eno2) entered disabled state
Jan 24 03:26:14 proxbox3 blkdeactivate[42340]: [LVM]: deactivating Volume Group ceph-3ce225cd-e454-4043-856d-46f84ea9289f... skipping
Jan 24 03:26:14 proxbox3 kernel: [ 5666.815241] vmbr101: port 1(eno1.101) entered disabled state
Jan 24 03:26:14 proxbox3 kernel: [ 5666.830412] device eno1.101 left promiscuous mode
Jan 24 03:26:14 proxbox3 kernel: [ 5666.830553] vmbr101: port 1(eno1.101) entered disabled state
Jan 24 03:26:14 proxbox3 blkdeactivate[42340]: [LVM]: deactivating Volume Group ceph-7acc9c06-b228-4897-8b0e-87ff094f864d... skipping
Jan 24 03:26:14 proxbox3 systemd[1]: blk-availability.service: Succeeded.
Jan 24 03:26:14 proxbox3 systemd[1]: Stopped Availability of block devices.
Jan 24 03:26:15 proxbox3 kernel: [ 5667.167812] vmbr102: port 1(eno1.102) entered disabled state
Jan 24 03:26:15 proxbox3 kernel: [ 5667.183162] device eno1.102 left promiscuous mode
Jan 24 03:26:15 proxbox3 kernel: [ 5667.183270] vmbr102: port 1(eno1.102) entered disabled state
Jan 24 03:26:15 proxbox3 corosync[1523]: [TOTEM ] Token has not been received in 1725 ms
Jan 24 03:26:15 proxbox3 kernel: [ 5667.459537] vmbr103: port 1(eno1.103) entered disabled state
Jan 24 03:26:15 proxbox3 kernel: [ 5667.475326] device eno1.103 left promiscuous mode
Jan 24 03:26:15 proxbox3 kernel: [ 5667.475440] vmbr103: port 1(eno1.103) entered disabled state
Jan 24 03:26:15 proxbox3 pve-guests[42449]: <root@pam> starting task UPID:proxbox3:0000A61B:0008A5D0:5E2AC647:stopall::root@pam:
Jan 24 03:26:15 proxbox3 pve-guests[42523]: all VMs and CTs stopped
Jan 24 03:26:15 proxbox3 pve-guests[42449]: <root@pam> end task UPID:proxbox3:0000A61B:0008A5D0:5E2AC647:stopall::root@pam: OK
Jan 24 03:26:15 proxbox3 systemd[1]: pve-guests.service: Succeeded.
Jan 24 03:26:15 proxbox3 systemd[1]: Stopped PVE guests.
Jan 24 03:26:15 proxbox3 systemd[1]: Stopping PVE Status Daemon...
Jan 24 03:26:15 proxbox3 systemd[1]: Stopping Proxmox VE firewall...
Jan 24 03:26:15 proxbox3 systemd[1]: Stopping PVE SPICE Proxy Server...
Jan 24 03:26:15 proxbox3 systemd[1]: Stopping PVE Local HA Resource Manager Daemon...
Jan 24 03:26:15 proxbox3 kernel: [ 5667.859861] vmbr201: port 1(eno1.201) entered disabled state
Jan 24 03:26:15 proxbox3 kernel: [ 5667.876062] device eno1.201 left promiscuous mode
Jan 24 03:26:15 proxbox3 kernel: [ 5667.876173] vmbr201: port 1(eno1.201) entered disabled state
Jan 24 03:26:15 proxbox3 corosync[1523]: [TOTEM ] A processor failed, forming new configuration.
Jan 24 03:26:16 proxbox3 spiceproxy[2001]: received signal TERM
Jan 24 03:26:16 proxbox3 spiceproxy[2001]: server closing
Jan 24 03:26:16 proxbox3 spiceproxy[2002]: worker exit
Jan 24 03:26:16 proxbox3 spiceproxy[2001]: worker 2002 finished
Jan 24 03:26:16 proxbox3 spiceproxy[2001]: server stopped
Jan 24 03:26:16 proxbox3 kernel: [ 5668.119779] vmbr100: port 1(eno1.100) entered disabled state
Jan 24 03:26:16 proxbox3 kernel: [ 5668.136239] device eno1.100 left promiscuous mode
Jan 24 03:26:16 proxbox3 kernel: [ 5668.136242] device eno1 left promiscuous mode
Jan 24 03:26:16 proxbox3 kernel: [ 5668.136414] vmbr100: port 1(eno1.100) entered disabled state
Jan 24 03:26:16 proxbox3 pvestatd[1696]: received signal TERM
Jan 24 03:26:16 proxbox3 pvestatd[1696]: server closing
Jan 24 03:26:16 proxbox3 pvestatd[1696]: server stopped
Jan 24 03:26:16 proxbox3 pve-firewall[1697]: received signal TERM
Jan 24 03:26:16 proxbox3 pve-firewall[1697]: server closing
Jan 24 03:26:16 proxbox3 pve-firewall[1697]: clear firewall rules
Jan 24 03:26:16 proxbox3 pve-firewall[1697]: server stopped
 
Jan 24 03:26:16 proxbox3 pve-ha-lrm[2003]: received signal TERM
Jan 24 03:26:16 proxbox3 pve-ha-lrm[2003]: got shutdown request with shutdown policy 'migrate'
Jan 24 03:26:16 proxbox3 pve-ha-lrm[2003]: shutdown LRM, doing maintenance, removing this node from active list
Jan 24 03:26:16 proxbox3 systemd[1]: networking.service: Succeeded.
Jan 24 03:26:16 proxbox3 systemd[1]: Stopped Network initialization.
Jan 24 03:26:17 proxbox3 systemd[1]: spiceproxy.service: Succeeded.
Jan 24 03:26:17 proxbox3 systemd[1]: Stopped PVE SPICE Proxy Server.
Jan 24 03:26:17 proxbox3 systemd[1]: pvestatd.service: Succeeded.
Jan 24 03:26:17 proxbox3 systemd[1]: Stopped PVE Status Daemon.
Jan 24 03:26:17 proxbox3 systemd[1]: pve-firewall.service: Succeeded.
Jan 24 03:26:17 proxbox3 systemd[1]: Stopped Proxmox VE firewall.
Jan 24 03:26:17 proxbox3 systemd[1]: Stopping Proxmox VE firewall logger...
Jan 24 03:26:17 proxbox3 pvefw-logger[1144]: received terminate request (signal)
Jan 24 03:26:17 proxbox3 pvefw-logger[1144]: stopping pvefw logger
Jan 24 03:26:17 proxbox3 systemd[1]: pvefw-logger.service: Succeeded.
Jan 24 03:26:17 proxbox3 systemd[1]: Stopped Proxmox VE firewall logger.
Jan 24 03:26:23 proxbox3 corosync[1523]: [TOTEM ] Token has not been received in 9545 ms
Jan 24 03:26:28 proxbox3 corosync[1523]: [TOTEM ] Token has not been received in 14605 ms
Jan 24 03:26:33 proxbox3 corosync[1523]: [TOTEM ] Token has not been received in 19665 ms
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.101:6824 osd.0 since back 2020-01-24 03:26:09.627997 front 2020-01-24 03:26:13.728302 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.101:6809 osd.1 since back 2020-01-24 03:26:09.628096 front 2020-01-24 03:26:13.728388 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.102:6811 osd.2 since back 2020-01-24 03:26:09.628138 front 2020-01-24 03:26:13.728425 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.102:6813 osd.3 since back 2020-01-24 03:26:09.628164 front 2020-01-24 03:26:13.728363 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.104:6819 osd.6 since back 2020-01-24 03:26:09.628026 front 2020-01-24 03:26:13.728479 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.104:6808 osd.7 since back 2020-01-24 03:26:09.628193 front 2020-01-24 03:26:13.728437 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.101:6815 osd.9 since back 2020-01-24 03:26:09.627912 front 2020-01-24 03:26:13.728311 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.102:6821 osd.11 since back 2020-01-24 03:26:09.628201 front 2020-01-24 03:26:13.728463 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.104:6812 osd.13 since back 2020-01-24 03:26:09.628153 front 2020-01-24 03:26:13.728545 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.102:6818 osd.15 since back 2020-01-24 03:26:09.628227 front 2020-01-24 03:26:13.728400 (oldest deadline 2020-01-24 03:26:33.727881)
Jan 24 03:26:34 proxbox3 ceph-osd[1925]: 2020-01-24 03:26:34.498 7fb10e42b700 -1 osd.4 16582 heartbeat_check: no reply from 10.21.21.101:6824 osd.0 since back 2020-01-24 03:26:13.518459 front 2020-01-24 03:26:13.518530 (oldest deadline 2020-01-24 03:26:34.018203)

It produces thousands of these "heartbeat_check: no reply" log lines in the 30 minutes before the forced shutdown. It's as if Ceph never got the "we're shutting down" message.
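One way to check that impression is to list which daemons keep logging after systemd reports the network as stopped. The sketch below assumes the syslog format shown in this thread and treats "Stopped Network initialization." as the marker for the network going down. The function name is hypothetical.

```python
import re
from collections import Counter

# Syslog prefix as it appears in this thread:
#   "Jan 24 03:26:34 proxbox3 ceph-osd[1919]: message"
SYSLOG = re.compile(
    r"^\w{3}\s+\d+ [\d:]{8} \S+ (?P<tag>[^\[:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)
NETWORK_DOWN = "Stopped Network initialization."


def active_after_network_down(lines):
    """Count log lines per process tag that appear after networking stopped."""
    seen_down = False
    counts = Counter()
    for line in lines:
        m = SYSLOG.match(line)
        if not m:
            continue
        if seen_down:
            counts[m["tag"]] += 1
        elif m["tag"] == "systemd" and m["msg"] == NETWORK_DOWN:
            seen_down = True
    return counts


# Lines copied from the shutdown syslog above.
sample = [
    "Jan 24 03:26:15 proxbox3 systemd[1]: Stopped PVE guests.",
    "Jan 24 03:26:16 proxbox3 systemd[1]: Stopped Network initialization.",
    "Jan 24 03:26:23 proxbox3 corosync[1523]: [TOTEM ] Token has not been received in 9545 ms",
    "Jan 24 03:26:34 proxbox3 ceph-osd[1919]: 2020-01-24 03:26:34.478 7ff472205700 -1 osd.12 16582 heartbeat_check: no reply from 10.21.21.101:6824 osd.0 since back 2020-01-24 03:26:09.627997 front 2020-01-24 03:26:13.728302 (oldest deadline 2020-01-24 03:26:33.727881)",
    "Jan 24 03:26:34 proxbox3 ceph-osd[1925]: 2020-01-24 03:26:34.498 7fb10e42b700 -1 osd.4 16582 heartbeat_check: no reply from 10.21.21.101:6824 osd.0 since back 2020-01-24 03:26:13.518459 front 2020-01-24 03:26:13.518530 (oldest deadline 2020-01-24 03:26:34.018203)",
]

for tag, n in active_after_network_down(sample).most_common():
    print(f"{tag}: {n} line(s) after network stop")
```

In the excerpt, ceph-osd and corosync are both still running and complaining after the network is gone, which fits the picture of daemons that were never stopped before their network was taken away.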
 
At a loss...

I reinstalled Proxmox on one node, removed it from the cluster and rejoined it, then copied in my networking config and a few other odds and ends.

The cleanly installed node restarted fine until I installed Ceph on it. The problem came right back on the freshly installed and joined node as soon as Ceph was installed. It won't reboot or shut down, and gets stuck with endless "heartbeat_check: no reply from" events just like before.
 
Hi,
another user has a similar (possibly the same) issue. Here is the thread in German. Are you using CephFS? If so, this is likely caused by a bug in the Ceph MDS daemon.
 
Hello,
I have the same problem.
My servers take a long time to reboot or shut down. Without Ceph there is no problem.
Thanks.
 

We found the cause of the issue (it involves the CephFS mount and systemd) and will provide a fix soon.
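Until a fix is out, it may help to know whether a given node has a CephFS mount at all. The sketch below only detects such mounts in `/proc/mounts`-style text. It is not the fix, and it assumes the kernel client reports filesystem type `ceph` and ceph-fuse reports `fuse.ceph-fuse`. The sample mount line is made up for illustration.

```python
# Filesystem types assumed to indicate a CephFS mount
# (kernel client and ceph-fuse respectively).
CEPH_FSTYPES = {"ceph", "fuse.ceph-fuse"}


def ceph_mounts(mounts_text):
    """Return (mount_point, fstype) for each CephFS entry in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fstype = fields[1], fields[2]
        if fstype in CEPH_FSTYPES:
            found.append((mount_point, fstype))
    return found


# Hypothetical sample; on a real node, read the text from /proc/mounts instead.
sample = """\
/dev/mapper/pve-root / ext4 rw,relatime,errors=remount-ro 0 0
10.21.21.101,10.21.21.102:/ /mnt/pve/cephfs ceph rw,relatime,name=admin,acl 0 0
"""

for mount_point, fstype in ceph_mounts(sample):
    print(f"CephFS mounted at {mount_point} (type {fstype})")
```

On an affected node you would pass the contents of `/proc/mounts` rather than the sample string. If the function returns an empty list, the node has no CephFS mount, so this particular cause would not explain a hang there.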
 