Hello!
I've been running Proxmox on my 4-node homelab cluster of X9-generation Supermicro FatTwins for a couple of months now. Everything had been going pretty well until recently; now the nodes don't want to reboot or shut down anymore. I'm not sure what triggered this, maybe the latest updates?
I get these "A stop job is running for..." messages with "no limit" as the timeout. After 30 minutes of this, this happens:
Jan 24 03:56:13 proxbox3 systemd[1]: poweroff.target: Job poweroff.target/start timed out.
Jan 24 03:56:13 proxbox3 systemd[1]: Timed out starting Power-Off.
Jan 24 03:56:13 proxbox3 systemd[1]: poweroff.target: Job poweroff.target/start failed with result 'timeout'.
Jan 24 03:56:13 proxbox3 systemd[1]: Forcibly powering off: job timed out
Jan 24 03:56:13 proxbox3 systemd[1]: Shutting down.
Jan 24 03:56:13 proxbox3 kernel: [ 7465.851881] printk: systemd-shutdow: 59 output lines suppressed due to ratelimiting
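If I'm reading the log right, the 30 minutes lines up with systemd's default JobTimeoutSec=30min / JobTimeoutAction=poweroff-force on poweroff.target, and the "no limit" on the stop job itself just means that particular unit has TimeoutStopSec=infinity. I think both can be checked with something like this (pve-guests.service here is only an example of one of the units named in the stop jobs):

systemctl show poweroff.target -p JobTimeoutUSec -p JobTimeoutAction
systemctl show pve-guests.service -p TimeoutStopUSec

So systemd eventually gives up waiting and forces the power-off, which is what the log above shows.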
Basically, they don't reboot or shut down properly anymore. All 4 nodes do this when I attempt to reboot or shut down, but the specific "stop job" called out isn't consistent: sometimes it's a guest process, sometimes an HA process, sometimes a firewall process... I think the "a stop job is running for" message on the console is actually a red herring.
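One thing I haven't tried yet is catching the hang from a second SSH session while a node is stuck; assuming SSH stays up that long, something like this should show which jobs systemd is actually waiting on:

systemctl list-jobs

In the meantime, here's what the syslog shows while a node is hanging: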
Jan 24 03:26:43 proxbox3 kernel: [ 5695.655549] libceph: mon3 (1)10.21.21.104:6789 session lost, hunting for new mon
Jan 24 03:26:43 proxbox3 kernel: [ 5695.655603] libceph: mon2 (1)10.21.21.103:6789 connect error
Jan 24 03:26:44 proxbox3 kernel: [ 5696.615615] libceph: mon2 (1)10.21.21.103:6789 connect error
.....
Jan 24 03:27:05 proxbox3 kernel: [ 5717.608345] libceph: mon3 (1)10.21.21.104:6789 connect error
Jan 24 03:27:09 proxbox3 kernel: [ 5721.768489] libceph: mon3 (1)10.21.21.104:6789 connect error
Jan 24 03:27:11 proxbox3 kernel: [ 5723.816517] ceph: mds0 caps stale
Jan 24 03:27:17 proxbox3 kernel: [ 5729.704745] libceph: mon3 (1)10.21.21.104:6789 connect error
Jan 24 03:27:26 proxbox3 kernel: [ 5738.409054] libceph: mon0 (1)10.21.21.101:6789 connect error
....
Jan 24 03:31:03 proxbox3 kernel: [ 5955.632707] libceph: mon1 (1)10.21.21.102:6789 connect error
Jan 24 03:31:05 proxbox3 kernel: [ 5957.616761] libceph: mon1 (1)10.21.21.102:6789 connect error
Jan 24 03:31:08 proxbox3 kernel: [ 5960.624528] ceph: mds0 hung
Jan 24 03:31:09 proxbox3 kernel: [ 5961.648942] libceph: mon1 (1)10.21.21.102:6789 connect error
Jan 24 03:31:17 proxbox3 kernel: [ 5969.841287] libceph: mon1 (1)10.21.21.102:6789 connect error
....
Looks to me like Ceph isn't hanging up the phone properly.
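If that's the case, my guess is the CephFS kernel mount (and/or a mapped RBD) is still around when the network and monitors go away during shutdown, so libceph just keeps retrying. Next time, before rebooting, I can check what's still mounted or mapped with something like:

findmnt -t ceph
rbd showmapped

(That assumes the kernel client is what's in play here; I haven't confirmed that yet.)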
---------------
root@proxbox1:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-10
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: residual config
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-2
pve-cluster: 6.1-3
pve-container: 3.0-16
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
I'll add the full syslog in a reply....
Thoughts?
Any help is greatly appreciated. I suspect I could "fix" this by going from node to node and doing a clean reinstall of Proxmox, rejoining the cluster, copying/pasting the network config, re-adding the OSDs, etc., but I'm hoping there's an easier "fix."
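For example, one stopgap I might try (just a guess on my part, not a proper fix): stop the HA services and unmount the CephFS storage by hand before issuing the reboot, so nothing is left holding on to Ceph when the network goes down. Roughly:

systemctl stop pve-ha-lrm pve-ha-crm
umount /mnt/pve/<cephfs-storage-id>   # placeholder for whatever the CephFS storage is actually called
systemctl reboot

No idea yet whether that actually avoids the hang, so I'd still rather understand the root cause.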
Thanks,
-Eric