Ceph not working: monitors and managers lost

All nodes are identical in installed packages and are up to date.
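The list below is the pveversion -v output from one of the nodes. A quick way to double-check that the rest really match (the hostnames are just examples for my cluster):

Code:
# compare package versions across nodes; identical hashes mean identical versions
for n in stack1 node2 node3 node5 node7 node900; do
    echo "== $n =="
    ssh root@$n pveversion -v | md5sum
done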


Code:
proxmox-ve: 7.1-1 (running kernel: 5.13.19-4-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-helper: 7.1-12
pve-kernel-5.13: 7.1-7
pve-kernel-5.11: 7.0-10
pve-kernel-5.4: 6.4-4
pve-kernel-5.13.19-4-pve: 5.13.19-9
pve-kernel-5.13.19-1-pve: 5.13.19-3
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.4.124-1-pve: 5.4.124-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 16.2.7
ceph-fuse: 16.2.7
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve1
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-3
libpve-guest-common-perl: 4.1-1
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.1-1
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-2
proxmox-backup-client: 2.1.5-1
proxmox-backup-file-restore: 2.1.5-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-6
pve-cluster: 7.1-3
pve-container: 4.1-4
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-5
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.1-2
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
Just tried to create a manager on node1 (stack1):

(screenshots attached)

No managers will start... I don't see OSDs listed on any node, other than in the drive assignments themselves...

I'm wondering if the map is lost, and whether there is any way to restore the managers and metadata before I lose it from all the nodes.
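For reference, this is roughly what I am checking on each node (assuming the mon/mgr IDs match the hostnames, which is how the Proxmox GUI creates them):

Code:
# are the local mon/mgr daemons even running? (daemon IDs assumed to equal the hostname)
systemctl status ceph-mon@$(hostname) ceph-mgr@$(hostname) --no-pager
# recent log lines from the local mon, current boot only
journalctl -b -u ceph-mon@$(hostname) --no-pager | tail -n 50
# cluster status, wrapped in timeout because it hangs while the mons have no quorum
timeout 15 ceph -s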
 
So, for giggles, I rebooted node2 and am looking at the logs from the reboot:

Code:
Feb 28 21:23:36 node2 systemd[1]: Started The Proxmox VE cluster filesystem.
Feb 28 21:23:36 node2 systemd[1]: Started Ceph metadata server daemon.
Feb 28 21:23:36 node2 systemd[1]: Started Ceph cluster manager daemon.
Feb 28 21:23:36 node2 systemd[1]: Started Ceph cluster monitor daemon.
Feb 28 21:23:36 node2 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mon@.service instances at once.
Feb 28 21:23:36 node2 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mds@.service instances at once.
Feb 28 21:23:36 node2 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.
Feb 28 21:23:36 node2 systemd[1]: Starting Ceph object storage daemon osd.1...
Feb 28 21:23:36 node2 systemd[1]: Starting Corosync Cluster Engine...
Feb 28 21:23:36 node2 systemd[1]: Started Regular background program processing daemon.
Feb 28 21:23:36 node2 systemd[1]: Starting Proxmox VE firewall...
Feb 28 21:23:36 node2 cron[951]: (CRON) INFO (pidfile fd = 3)
Feb 28 21:23:36 node2 systemd[1]: Starting PVE Status Daemon...
Feb 28 21:23:36 node2 cron[951]: (CRON) INFO (Running @reboot jobs)
Feb 28 21:23:36 node2 corosync[950]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Feb 28 21:23:36 node2 corosync[950]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Feb 28 21:23:36 node2 corosync[950]:   [TOTEM ] Initializing transport (Kronosnet).
Feb 28 21:23:36 node2 kernel: sctp: Hash tables configured (bind 256/256)
Feb 28 21:23:36 node2 systemd[1]: Started Ceph object storage daemon osd.1.
Feb 28 21:23:36 node2 systemd[1]: Reached target ceph target allowing to start/stop all ceph-osd@.service instances at once.
Feb 28 21:23:36 node2 systemd[1]: Reached target ceph target allowing to start/stop all ceph*@.service instances at once.
Feb 28 21:23:36 node2 systemd[1]: Reached target PVE Storage Target.
Feb 28 21:23:36 node2 systemd[1]: Starting Proxmox VE scheduler...
Feb 28 21:23:36 node2 systemd[1]: Starting Map RBD devices...
Feb 28 21:23:36 node2 systemd[1]: Finished Map RBD devices.
Feb 28 21:23:36 node2 systemd[1]: Finished Availability of block devices.
Feb 28 21:23:36 node2 systemd[1]: ceph-volume@lvm-1-80a96de2-675b-4123-9296-6219994c2d11.service: Succeeded.
Feb 28 21:23:36 node2 systemd[1]: Finished Ceph Volume activation: lvm-1-80a96de2-675b-4123-9296-6219994c2d11.
Feb 28 21:23:36 node2 corosync[950]:   [TOTEM ] totemknet initialized
Feb 28 21:23:36 node2 corosync[950]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Feb 28 21:23:37 node2 corosync[950]:   [QB    ] server name: cmap
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Feb 28 21:23:37 node2 corosync[950]:   [QB    ] server name: cfg
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Feb 28 21:23:37 node2 corosync[950]:   [QB    ] server name: cpg
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Feb 28 21:23:37 node2 corosync[950]:   [WD    ] Watchdog not enabled by configuration
Feb 28 21:23:37 node2 corosync[950]:   [WD    ] resource load_15min missing a recovery key.
Feb 28 21:23:37 node2 corosync[950]:   [WD    ] resource memory_used missing a recovery key.
Feb 28 21:23:37 node2 corosync[950]:   [WD    ] no resources configured.
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Feb 28 21:23:37 node2 corosync[950]:   [QUORUM] Using quorum provider corosync_votequorum
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Feb 28 21:23:37 node2 corosync[950]:   [QB    ] server name: votequorum
Feb 28 21:23:37 node2 corosync[950]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Feb 28 21:23:37 node2 corosync[950]:   [QB    ] server name: quorum
Feb 28 21:23:37 node2 corosync[950]:   [TOTEM ] Configuring link 0
Feb 28 21:23:37 node2 corosync[950]:   [TOTEM ] Configured link number 0: local addr: 10.0.1.2, port=5405
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 3 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 3 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 3 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 4 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 4 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 4 has no active links
Feb 28 21:23:37 node2 systemd[1]: Started Corosync Cluster Engine.
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 6 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 6 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 6 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [QUORUM] Sync members[1]: 2
Feb 28 21:23:37 node2 corosync[950]:   [QUORUM] Sync joined[1]: 2
Feb 28 21:23:37 node2 corosync[950]:   [TOTEM ] A new membership (2.51c3) was formed. Members joined: 2
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 0)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 5 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 5 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 5 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 0)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 8 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 8 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 8 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 0)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 7 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [QUORUM] Members[1]: 2
Feb 28 21:23:37 node2 corosync[950]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 7 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 7 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 1 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 1 has no active links
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 28 21:23:37 node2 corosync[950]:   [KNET  ] host: host: 1 has no active links
Feb 28 21:23:37 node2 kernel: random: crng init done
Feb 28 21:23:37 node2 systemd[1]: Starting PVE API Daemon...
Feb 28 21:23:37 node2 pvestatd[1038]: starting server
Feb 28 21:23:37 node2 systemd[1]: Started PVE Status Daemon.
Feb 28 21:23:37 node2 pve-firewall[1039]: starting server
Feb 28 21:23:37 node2 systemd[1]: Started Proxmox VE firewall.
Feb 28 21:23:37 node2 pvescheduler[1072]: starting server
Feb 28 21:23:37 node2 systemd[1]: Started Proxmox VE scheduler.
Feb 28 21:23:37 node2 pvefw-logger[560]: received terminate request (signal)
Feb 28 21:23:37 node2 pvefw-logger[560]: stopping pvefw logger
Feb 28 21:23:37 node2 systemd[1]: Stopping Proxmox VE firewall logger...
Feb 28 21:23:37 node2 systemd[1]: pvefw-logger.service: Succeeded.
Feb 28 21:23:37 node2 systemd[1]: Stopped Proxmox VE firewall logger.
Feb 28 21:23:37 node2 systemd[1]: Starting Proxmox VE firewall logger...
Feb 28 21:23:37 node2 pvefw-logger[1092]: starting pvefw logger
Feb 28 21:23:37 node2 systemd[1]: Started Proxmox VE firewall logger.
Feb 28 21:23:38 node2 pvedaemon[1094]: starting server
Feb 28 21:23:38 node2 pvedaemon[1094]: starting 3 worker(s)
Feb 28 21:23:38 node2 pvedaemon[1094]: worker 1095 started
Feb 28 21:23:38 node2 pvedaemon[1094]: worker 1096 started
Feb 28 21:23:38 node2 pvedaemon[1094]: worker 1097 started
Feb 28 21:23:38 node2 systemd[1]: Started PVE API Daemon.
Feb 28 21:23:38 node2 systemd[1]: Starting PVE Cluster HA Resource Manager Daemon...
Feb 28 21:23:38 node2 systemd[1]: Starting PVE API Proxy Server...
Feb 28 21:23:38 node2 kernel: bnx2 0000:02:00.0 eno1: NIC Copper Link is Up, 1000 Mbps full duplex
Feb 28 21:23:38 node2 kernel: , receive & transmit flow control ON
Feb 28 21:23:38 node2 kernel: vmbr0: port 1(eno1) entered blocking state
Feb 28 21:23:38 node2 kernel: vmbr0: port 1(eno1) entered forwarding state
Feb 28 21:23:38 node2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Feb 28 21:23:38 node2 pve-ha-crm[1102]: starting server
Feb 28 21:23:38 node2 pve-ha-crm[1102]: status change startup => wait_for_quorum
Feb 28 21:23:38 node2 systemd[1]: Started PVE Cluster HA Resource Manager Daemon.
Feb 28 21:23:39 node2 pveproxy[1103]: starting server
Feb 28 21:23:39 node2 pveproxy[1103]: starting 3 worker(s)
Feb 28 21:23:39 node2 pveproxy[1103]: worker 1104 started
Feb 28 21:23:39 node2 pveproxy[1103]: worker 1105 started
Feb 28 21:23:39 node2 pveproxy[1103]: worker 1106 started
Feb 28 21:23:39 node2 systemd[1]: Started PVE API Proxy Server.
Feb 28 21:23:39 node2 systemd[1]: Starting PVE Local HA Resource Manager Daemon...
Feb 28 21:23:39 node2 systemd[1]: Starting PVE SPICE Proxy Server...
Feb 28 21:23:40 node2 spiceproxy[1110]: starting server
Feb 28 21:23:40 node2 spiceproxy[1110]: starting 1 worker(s)
Feb 28 21:23:40 node2 spiceproxy[1110]: worker 1111 started
Feb 28 21:23:40 node2 systemd[1]: Started PVE SPICE Proxy Server.
Feb 28 21:23:40 node2 pve-ha-lrm[1112]: starting server
Feb 28 21:23:40 node2 pve-ha-lrm[1112]: status change startup => wait_for_agent_lock
Feb 28 21:23:40 node2 systemd[1]: Started PVE Local HA Resource Manager Daemon.
Feb 28 21:23:40 node2 systemd[1]: Starting PVE guests...
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 1 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 8 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 5 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 7 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 8 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 6 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 4 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] rx: host: 3 link: 0 is up
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 5 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 7 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 6 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Feb 28 21:23:40 node2 corosync[950]:   [KNET  ] pmtud: Global data MTU changed to: 469
 
Ceph log on node2:

Code:
2022-02-21T11:23:39.813865-0600 osd.0 (osd.0) 4244643 : cluster [WRN] slow request osd_op(client.471789829.0:1612332 7.dc 7.2b14fadc (undecoded) ondisk+retry+write+known_if_redirected e967246) initiated 2022-02-20T22:30:29.742607-0600 currently delayed
2022-02-21T11:23:39.813869-0600 osd.0 (osd.0) 4244644 : cluster [WRN] slow request osd_op(client.471789829.0:1612332 7.dc 7.2b14fadc (undecoded) ondisk+retry+write+known_if_redirected e967374) initiated 2022-02-21T02:30:29.797312-0600 currently delayed
2022-02-21T11:23:39.813873-0600 osd.0 (osd.0) 4244645 : cluster [WRN] slow request osd_op(client.471789829.0:1612332 7.dc 7.2b14fadc (undecoded) ondisk+retry+write+known_if_redirected e967508) initiated 2022-02-21T06:30:29.844746-0600 currently delayed
2022-02-21T11:23:39.813876-0600 osd.0 (osd.0) 4244646 : cluster [WRN] slow request osd_op(client.471789829.0:1612332 7.dc 7.2b14fadc (undecoded) ondisk+retry+write+known_if_redirected e967645) initiated 2022-02-21T10:30:29.892240-0600 currently delayed
2022-02-21T11:23:40.212664-0600 mon.node2 (mon.0) 2000851 : cluster [INF] mon.node2 is new leader, mons node2,stack1,node7 in quorum (ranks 0,1,3)
2022-02-21T11:23:40.219565-0600 mon.node2 (mon.0) 2000852 : cluster [DBG] monmap e14: 4 mons at {node2=[v2:10.0.1.2:3300/0,v1:10.0.1.2:6789/0],node7=[v2:10.0.1.7:3300/0,v1:10.0.1.7:6789/0],node900=[v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0],stack1=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0]}
2022-02-21T11:23:40.219633-0600 mon.node2 (mon.0) 2000853 : cluster [DBG] fsmap cephfs:1 {0=node2=up:active} 2 up:standby
2022-02-21T11:23:40.219653-0600 mon.node2 (mon.0) 2000854 : cluster [DBG] osdmap e967678: 14 total, 4 up, 10 in
2022-02-21T11:23:40.220140-0600 mon.node2 (mon.0) 2000855 : cluster [DBG] mgrmap e649: stack1(active, since 5d), standbys: node2, node7
2022-02-21T11:23:40.228388-0600 mon.node2 (mon.0) 2000856 : cluster [ERR] Health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; mon node7 is very low on available space; mon stack1 is low on available space; 1/4 mons down, quorum node2,stack1,node7; 6 osds down; 1 host (7 osds) down; Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale; 138 slow ops, oldest one blocked for 61680 sec, osd.0 has slow ops
2022-02-21T11:23:40.228404-0600 mon.node2 (mon.0) 2000857 : cluster [ERR] [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
2022-02-21T11:23:40.228409-0600 mon.node2 (mon.0) 2000858 : cluster [ERR]     mds.node2(mds.0): 7 slow metadata IOs are blocked > 30 secs, oldest blocked for 78670 secs
2022-02-21T11:23:40.228413-0600 mon.node2 (mon.0) 2000859 : cluster [ERR] [ERR] MON_DISK_CRIT: mon node7 is very low on available space
2022-02-21T11:23:40.228416-0600 mon.node2 (mon.0) 2000860 : cluster [ERR]     mon.node7 has 1% avail
2022-02-21T11:23:40.228422-0600 mon.node2 (mon.0) 2000861 : cluster [ERR] [WRN] MON_DISK_LOW: mon stack1 is low on available space
2022-02-21T11:23:40.228428-0600 mon.node2 (mon.0) 2000862 : cluster [ERR]     mon.stack1 has 8% avail
2022-02-21T11:23:40.228432-0600 mon.node2 (mon.0) 2000863 : cluster [ERR] [WRN] MON_DOWN: 1/4 mons down, quorum node2,stack1,node7
2022-02-21T11:23:40.228437-0600 mon.node2 (mon.0) 2000864 : cluster [ERR]     mon.node900 (rank 2) addr [v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0] is down (out of quorum)
2022-02-21T11:23:40.228443-0600 mon.node2 (mon.0) 2000865 : cluster [ERR] [WRN] OSD_DOWN: 6 osds down
2022-02-21T11:23:40.228449-0600 mon.node2 (mon.0) 2000866 : cluster [ERR]     osd.8 (root=default,host=node900) is down
2022-02-21T11:23:40.228454-0600 mon.node2 (mon.0) 2000867 : cluster [ERR]     osd.9 (root=default,host=node900) is down
2022-02-21T11:23:40.228460-0600 mon.node2 (mon.0) 2000868 : cluster [ERR]     osd.10 (root=default,host=node900) is down
2022-02-21T11:23:40.228466-0600 mon.node2 (mon.0) 2000869 : cluster [ERR]     osd.11 (root=default,host=node900) is down
2022-02-21T11:23:40.228471-0600 mon.node2 (mon.0) 2000870 : cluster [ERR]     osd.12 (root=default,host=node900) is down
2022-02-21T11:23:40.228477-0600 mon.node2 (mon.0) 2000871 : cluster [ERR]     osd.13 (root=default,host=node900) is down
2022-02-21T11:23:40.228483-0600 mon.node2 (mon.0) 2000872 : cluster [ERR] [WRN] OSD_HOST_DOWN: 1 host (7 osds) down
2022-02-21T11:23:40.228488-0600 mon.node2 (mon.0) 2000873 : cluster [ERR]     host node900 (root=default) (7 osds) is down
2022-02-21T11:23:40.228527-0600 mon.node2 (mon.0) 2000874 : cluster [ERR] [WRN] PG_AVAILABILITY: Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale
2022-02-21T11:23:40.228534-0600 mon.node2 (mon.0) 2000875 : cluster [ERR]     pg 7.cd is stuck inactive for 21h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228539-0600 mon.node2 (mon.0) 2000876 : cluster [ERR]     pg 7.ce is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228544-0600 mon.node2 (mon.0) 2000877 : cluster [ERR]     pg 7.cf is stuck stale for 21h, current state stale+active+clean, last acting [6,3,8]
2022-02-21T11:23:40.228550-0600 mon.node2 (mon.0) 2000878 : cluster [ERR]     pg 7.d0 is stuck stale for 21h, current state stale+active+clean, last acting [12,2,6]
2022-02-21T11:23:40.228555-0600 mon.node2 (mon.0) 2000879 : cluster [ERR]     pg 7.d1 is stuck stale for 21h, current state stale+active+clean, last acting [9,1,2]
2022-02-21T11:23:40.228561-0600 mon.node2 (mon.0) 2000880 : cluster [ERR]     pg 7.d2 is stuck stale for 21h, current state stale+active+clean, last acting [3,9,2]
2022-02-21T11:23:40.228567-0600 mon.node2 (mon.0) 2000881 : cluster [ERR]     pg 7.d3 is stuck peering for 21h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228574-0600 mon.node2 (mon.0) 2000882 : cluster [ERR]     pg 7.d4 is stuck stale for 21h, current state stale+active+clean, last acting [8,6,1]
2022-02-21T11:23:40.228580-0600 mon.node2 (mon.0) 2000883 : cluster [ERR]     pg 7.d5 is stuck stale for 21h, current state stale+active+clean, last acting [13,6,7]
2022-02-21T11:23:40.228585-0600 mon.node2 (mon.0) 2000884 : cluster [ERR]     pg 7.d6 is stuck stale for 21h, current state stale+active+clean, last acting [11,1,3]
2022-02-21T11:23:40.228591-0600 mon.node2 (mon.0) 2000885 : cluster [ERR]     pg 7.d7 is stuck stale for 21h, current state stale+active+clean, last acting [8,2,6]
2022-02-21T11:23:40.228597-0600 mon.node2 (mon.0) 2000886 : cluster [ERR]     pg 7.d8 is stuck stale for 21h, current state stale+active+clean, last acting [11,7,6]
2022-02-21T11:23:40.228602-0600 mon.node2 (mon.0) 2000887 : cluster [ERR]     pg 7.d9 is stuck stale for 21h, current state stale+active+clean, last acting [2,6,11]
2022-02-21T11:23:40.228608-0600 mon.node2 (mon.0) 2000888 : cluster [ERR]     pg 7.da is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228613-0600 mon.node2 (mon.0) 2000889 : cluster [ERR]     pg 7.db is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228619-0600 mon.node2 (mon.0) 2000890 : cluster [ERR]     pg 7.dc is stuck peering for 21h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228624-0600 mon.node2 (mon.0) 2000891 : cluster [ERR]     pg 7.dd is stuck stale for 18h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228630-0600 mon.node2 (mon.0) 2000892 : cluster [ERR]     pg 7.de is stuck stale for 21h, current state stale+active+clean, last acting [2,3,10]
2022-02-21T11:23:40.228635-0600 mon.node2 (mon.0) 2000893 : cluster [ERR]     pg 7.df is stuck stale for 21h, current state stale+active+clean, last acting [3,1,14]
2022-02-21T11:23:40.228641-0600 mon.node2 (mon.0) 2000894 : cluster [ERR]     pg 7.e0 is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228646-0600 mon.node2 (mon.0) 2000895 : cluster [ERR]     pg 7.e1 is stuck peering for 21h, current state peering, last acting [0,1]
2022-02-21T11:23:40.228652-0600 mon.node2 (mon.0) 2000896 : cluster [ERR]     pg 7.e2 is stuck stale for 21h, current state stale+active+clean, last acting [9,2,6]
2022-02-21T11:23:40.228658-0600 mon.node2 (mon.0) 2000897 : cluster [ERR]     pg 7.e3 is stuck stale for 21h, current state stale+active+clean, last acting [7,9,1]
2022-02-21T11:23:40.228663-0600 mon.node2 (mon.0) 2000898 : cluster [ERR]     pg 7.e4 is stuck stale for 21h, current state stale+active+clean, last acting [8,7,1]
2022-02-21T11:23:40.228669-0600 mon.node2 (mon.0) 2000899 : cluster [ERR]     pg 7.e5 is stuck peering for 22h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228675-0600 mon.node2 (mon.0) 2000900 : cluster [ERR]     pg 7.e6 is stuck stale for 21h, current state stale+active+clean, last acting [3,11,6]
2022-02-21T11:23:40.228680-0600 mon.node2 (mon.0) 2000901 : cluster [ERR]     pg 7.e7 is down, acting [0,7]
2022-02-21T11:23:40.228685-0600 mon.node2 (mon.0) 2000902 : cluster [ERR]     pg 7.e8 is stuck stale for 21h, current state stale+active+clean, last acting [8,3,6]
2022-02-21T11:23:40.228691-0600 mon.node2 (mon.0) 2000903 : cluster [ERR]     pg 7.e9 is stuck stale for 21h, current state stale+active+clean, last acting [14,3,2]
2022-02-21T11:23:40.228697-0600 mon.node2 (mon.0) 2000904 : cluster [ERR]     pg 7.ea is stuck stale for 21h, current state stale+active+clean, last acting [12,1,7]
2022-02-21T11:23:40.228702-0600 mon.node2 (mon.0) 2000905 : cluster [ERR]     pg 7.eb is stuck stale for 17h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228709-0600 mon.node2 (mon.0) 2000906 : cluster [ERR]     pg 7.ec is stuck stale for 21h, current state stale+active+clean, last acting [14,1,6]
2022-02-21T11:23:40.228713-0600 mon.node2 (mon.0) 2000907 : cluster [ERR]     pg 7.ed is stuck stale for 20h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228718-0600 mon.node2 (mon.0) 2000908 : cluster [ERR]     pg 7.ee is stuck stale for 21h, current state stale+active+clean, last acting [6,1,12]
2022-02-21T11:23:40.228721-0600 mon.node2 (mon.0) 2000909 : cluster [ERR]     pg 7.ef is stuck peering for 21h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228725-0600 mon.node2 (mon.0) 2000910 : cluster [ERR]     pg 7.f0 is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228729-0600 mon.node2 (mon.0) 2000911 : cluster [ERR]     pg 7.f1 is stuck stale for 21h, current state stale+active+clean, last acting [6,3,13]
2022-02-21T11:23:40.228733-0600 mon.node2 (mon.0) 2000912 : cluster [ERR]     pg 7.f2 is stuck stale for 17h, current state stale+peering, last acting [0,14]
2022-02-21T11:23:40.228737-0600 mon.node2 (mon.0) 2000913 : cluster [ERR]     pg 7.f3 is stuck stale for 21h, current state stale+active+clean, last acting [14,7,2]
2022-02-21T11:23:40.228741-0600 mon.node2 (mon.0) 2000914 : cluster [ERR]     pg 7.f4 is stuck stale for 21h, current state stale+active+clean, last acting [14,6,3]
2022-02-21T11:23:40.228745-0600 mon.node2 (mon.0) 2000915 : cluster [ERR]     pg 7.f5 is stuck stale for 21h, current state stale+active+clean, last acting [6,10,1]
2022-02-21T11:23:40.228749-0600 mon.node2 (mon.0) 2000916 : cluster [ERR]     pg 7.f6 is stuck stale for 21h, current state stale+active+clean, last acting [2,3,11]
2022-02-21T11:23:40.228753-0600 mon.node2 (mon.0) 2000917 : cluster [ERR]     pg 7.f7 is stuck stale for 21h, current state stale+active+clean, last acting [7,12,6]
2022-02-21T11:23:40.228756-0600 mon.node2 (mon.0) 2000918 : cluster [ERR]     pg 7.f8 is down, acting [0,1]
2022-02-21T11:23:40.228760-0600 mon.node2 (mon.0) 2000919 : cluster [ERR]     pg 7.f9 is stuck stale for 21h, current state stale+active+clean, last acting [13,7,3]
2022-02-21T11:23:40.228764-0600 mon.node2 (mon.0) 2000920 : cluster [ERR]     pg 7.fa is stuck stale for 17h, current state stale+peering, last acting [0,14]
2022-02-21T11:23:40.228767-0600 mon.node2 (mon.0) 2000921 : cluster [ERR]     pg 7.fb is stuck stale for 21h, current state stale+active+clean, last acting [13,7,6]
2022-02-21T11:23:40.228771-0600 mon.node2 (mon.0) 2000922 : cluster [ERR]     pg 7.fc is stuck peering for 21h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228776-0600 mon.node2 (mon.0) 2000923 : cluster [ERR]     pg 7.fd is stuck peering for 21h, current state peering, last acting [0,1]
2022-02-21T11:23:40.228780-0600 mon.node2 (mon.0) 2000924 : cluster [ERR]     pg 7.fe is stuck stale for 21h, current state stale+active+clean, last acting [8,3,2]
2022-02-21T11:23:40.228783-0600 mon.node2 (mon.0) 2000925 : cluster [ERR]     pg 7.ff is stuck peering for 22h, current state peering, last acting [0,1]
2022-02-21T11:23:40.228788-0600 mon.node2 (mon.0) 2000926 : cluster [ERR] [WRN] SLOW_OPS: 138 slow ops, oldest one blocked for 61680 sec, osd.0 has slow ops
 
Holy heck! Lots of posts! I can't get to it tonight, but for today I will comment that the issue with reaching another node's shell seems too coincidental with the fact that it and stack1 are the only nodes with Ceph components showing online instead of degraded. I would wager that if you went directly to node 5 and looked at the Ceph menu, you would observe that it does not time out and instead shows some form of panic. For some reason your nodes are very unhappy, and given what you have described and demonstrated with your networking, I believe this is a software stack issue.

Cheers,

Tmanok
 
I rebooted them all one by one, doing apt update and upgrade, then pvecm updatecerts.

All but node900 now show systemctl status as running.

OK, node 3 is now down too...?

(screenshots attached)

All the others are looking OK.

Still no MDS server.
I still need to see if the data map exists anywhere so it can be restored.
No managers seem to stay up.
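To see why they keep dying, I am reading the daemon journals directly (daemon IDs again assumed to match the hostname):

Code:
# why did the local mgr / mds exit?
journalctl -b -u ceph-mgr@$(hostname) --no-pager | tail -n 100
journalctl -b -u ceph-mds@$(hostname) --no-pager | tail -n 100
# any failed ceph units on this node
systemctl --failed | grep -i ceph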
(screenshot attached)
 
(screenshots attached)

Looks like mon node7 went critical on space (MON_DISK_CRIT).
MON_DISK_LOW: mon stack1 is at 1% avail.

So a couple of them ran out of space... how?
Where are they filling up? Is the map stored on the PVE root partition and filling it up, or is it stored on the OSDs along with the OSD data?

I'm wondering how exactly Ceph stores data for the MDS and MONs, and where...?
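As far as I understand it (correct me if I am wrong): each MON keeps its own store (RocksDB) under /var/lib/ceph/mon/ on the node's root filesystem, the MDS keeps essentially nothing locally because CephFS metadata lives in the metadata pool on the OSDs, and OSD data sits on the dedicated disks. So the space warnings should be about whatever filesystem holds /var/lib/ceph/mon. Checking that on each mon node:

Code:
# size of the local monitor store and free space on the filesystem holding it
du -sh /var/lib/ceph/mon/*
df -h /var/lib/ceph/mon
# the journal can also eat the root partition
journalctl --disk-usage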
 
Tmanok said:
> I would wager that if you went directly to node 5 and looked at the Ceph menu, you would observe that it does not time out and instead shows some form of panic.
Node 5 still times out.

(screenshot attached)

ceph -s just hangs too

Also, node 5's HDD does not show the OSD anymore... as if the drive were initialized but not used as an OSD store...
(screenshot attached)
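To check whether that disk still actually holds an OSD, I can look at the LVM tags that ceph-volume writes; this works locally without the monitors (a sketch, run on node5):

Code:
# list OSDs known to ceph-volume on this node (reads LVM tags, no cluster connection needed)
ceph-volume lvm list
# raw view of disks and any ceph LVs on them
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT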

node5 systemctl status
Code:
root@node5:~# systemctl status
● node5
    State: running
     Jobs: 1 queued
   Failed: 0 units
    Since: Sun 2022-02-27 00:37:50 CST; 2 days ago
   CGroup: /
           ├─809 bpfilter_umh
           ├─user.slice
           │ └─user-0.slice
           │   ├─session-110.scope
           │   │ ├─961357 /bin/login -f
           │   │ ├─961382 -bash
           │   │ ├─962123 systemctl status
           │   │ └─962124 pager
           │   └─user@0.service …
           │     └─init.scope
           │       ├─961363 /lib/systemd/systemd --user
           │       └─961364 (sd-pam)
           ├─init.scope
           │ └─1 /sbin/init
           └─system.slice
             ├─pvescheduler.service
             │ └─1081 pvescheduler
             ├─systemd-udevd.service
             │ └─377 /lib/systemd/systemd-udevd
             ├─cron.service
             │ └─1021 /usr/sbin/cron -f
             ├─pve-firewall.service
             │ └─1041 pve-firewall
             ├─pve-lxc-syscalld.service
             │ └─609 /usr/lib/x86_64-linux-gnu/pve-lxc-syscalld/pve-lxc-syscalld --system /run/pve/lxc-syscalld.sock
             ├─spiceproxy.service
             │ ├─  1113 spiceproxy
             │ └─661844 spiceproxy worker
             ├─pve-ha-crm.service
             │ └─1104 pve-ha-crm
             ├─pvedaemon.service
             │ ├─  1096 pvedaemon
             │ ├─  1097 pvedaemon worker
             │ ├─  1098 pvedaemon worker
             │ ├─  1099 pvedaemon worker
             │ ├─961352 task UPID:node5:000EAB48:017A7957:621EE520:vncshell::root@pam:
             │ └─961353 /usr/bin/termproxy 5900 --path /nodes/node5 --perm Sys.Console -- /bin/login -f root
             ├─systemd-journald.service
             │ └─352 /lib/systemd/systemd-journald
             ├─ssh.service
             │ └─817 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
             ├─ceph-crash.service
             │ └─602 /usr/bin/python3.9 /usr/bin/ceph-crash
             ├─qmeventd.service
             │ └─617 /usr/sbin/qmeventd /var/run/qmeventd.sock
             ├─rrdcached.service
             │ └─896 /usr/bin/rrdcached -B -b /var/lib/rrdcached/db/ -j /var/lib/rrdcached/journal/ -p /var/run/rrdcached.pid -l unix:/var/run/rrdcache>
             ├─watchdog-mux.service
             │ └─616 /usr/sbin/watchdog-mux
             ├─pvefw-logger.service
             │ └─661840 /usr/sbin/pvefw-logger
             ├─rsyslog.service
             │ └─611 /usr/sbin/rsyslogd -n -iNONE
             ├─pveproxy.service
             │ ├─  1105 pveproxy
             │ ├─746029 pveproxy worker
             │ ├─747973 pveproxy worker
             │ └─749316 pveproxy worker
             ├─ksmtuned.service
             │ ├─   625 /bin/bash /usr/sbin/ksmtuned
             │ └─961874 sleep 60
             ├─lxc-monitord.service
             │ └─791 /usr/libexec/lxc/lxc-monitord --daemon
             ├─mnt-pve-ISO_store1.mount
             │ ├─962027 /bin/mount 10.0.1.1,10.0.1.2,10.0.1.5,10.0.1.6,10.0.1.7,10.0.90.0:/ /mnt/pve/ISO_store1 -t ceph -o name=admin,secretfile=/etc/p>
             │ └─962028 /sbin/mount.ceph 10.0.1.1,10.0.1.2,10.0.1.5,10.0.1.6,10.0.1.7,10.0.90.0:/ /mnt/pve/ISO_store1 -o rw name admin secretfile /etc/>
             ├─rpcbind.service
             │ └─600 /sbin/rpcbind -f -w
             ├─chrony.service
             │ ├─849 /usr/sbin/chronyd -F 1
             │ └─859 /usr/sbin/chronyd -F 1
             ├─lxcfs.service …
             │ └─608 /usr/bin/lxcfs /var/lib/lxcfs
             ├─corosync.service
             │ └─1020 /usr/sbin/corosync -f
             ├─system-postfix.slice
             │ └─postfix@-.service
             │   ├─  1013 /usr/lib/postfix/sbin/master -w
             │   ├─  1015 qmgr -l -t unix -u
             │   └─954095 pickup -l -t unix -u -c
             ├─smartmontools.service
             │ └─612 /usr/sbin/smartd -n
             ├─iscsid.service
             │ ├─811 /sbin/iscsid
             │ └─812 /sbin/iscsid
             ├─zfs-zed.service
             │ └─620 /usr/sbin/zed -F
             ├─pve-cluster.service
             │ └─905 /usr/bin/pmxcfs
             ├─dbus.service
             │ └─604 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
             ├─pve-ha-lrm.service
             │ └─1115 pve-ha-lrm
             ├─system-getty.slice
             │ └─getty@tty1.service
             │   └─834 /sbin/agetty -o -p -- \u --noclear tty1 linux
             ├─pvestatd.service
             │ ├─  1040 pvestatd
             │ └─962026 systemctl start mnt-pve-ISO_store1.mount
             ├─dm-event.service
             │ └─360 /sbin/dmeventd -f
             └─systemd-logind.service
               └─614 /lib/systemd/systemd-logind
(screenshots attached)
 

Looking at the logs, mon stack1 went critical on space... so I went to /var/lib/ceph and saw that the Proxmox setup puts /var on the root partition... but root isn't full... I'm kind of lost on why it would be reporting 1% available for mon.stack1.
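If the filesystem really isn't full, the only other things I can think of are that the warning is relative to Ceph's own thresholds (mon_data_avail_warn / mon_data_avail_crit, which I believe default to 30% and 5%), or that the mon store itself has bloated. A sketch of what I would check and try, only while the mon daemon is actually up and reachable:

Code:
# which filesystem does the mon store live on, and how full is it?
df -h /var/lib/ceph/mon/ceph-$(hostname)
du -sh /var/lib/ceph/mon/ceph-$(hostname)
# ask the mon to compact its store to reclaim space (hangs if the mon is down or unreachable)
ceph tell mon.$(hostname) compact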
 

systemd-timesyncd was gone after the update; chrony is installed and running now... that might have broken something... but why is no MDS server running? Is there any way to restore it? It's got to be something simple I am missing...

I posted all those logs and screenshots... the only issue I am seeing is the mons out of space, but I cannot confirm that is true, as the filesystems look OK... I would like to figure out how to resize the LVM layout on the 80 GB SSD in each machine and give more space to root, so this doesn't happen at every dist-upgrade, or whenever logs overfill the allocated space...

Using apt autoremove saved a little space.

But I do not see any node with root or /var full or anything...
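On the resizing: this is only a sketch of what I am considering, assuming the default Proxmox LVM layout (volume group pve, root LV pve/root on ext4) and that the volume group actually has free extents; I would not run it blindly:

Code:
# check for free extents in the volume group first
vgs pve
lvs pve
# grow root by 10G and resize the ext4 filesystem in one step
lvextend -r -L +10G pve/root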
 
No, it was running fine. I did a PVE update and upgrade and started having all sorts of issues with root out of space on the PVE drives... then the entire CephFS died, and I don't see any filesystem for Ceph left... the monitors are not talking and no MDS is up.

Within Proxmox I can see all the physical disks assigned as OSDs. They seem to still have data on them.
I can see syslog trying to mount the CephFS, such as:
(screenshot attached)

But I cannot figure out how to rescue this cluster... there is only one pool/filesystem I really need to rescue, if at all possible: CPool1.

The rest is just stored ISO files and data I can download again if needed.

There should be plenty of space left, but I think what happened is this: I rebooted node900, which has 7 drives dedicated to OSDs, and then rebooted the other nodes after the update before they had all come back up. While node900 was down, I think the other nodes went nuts trying to rebalance and recover, and that caused something else to fill up; then, when node900 came back online, it could not sync because everything was in a degraded state... who knows at this point, to be honest...
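For next time, I gather the usual way to avoid that rebalancing storm is to set the noout flag before rebooting an OSD node and clear it afterwards (which of course needs the mons in quorum, so it does not help me right now):

Code:
# before rebooting an OSD node: stop the cluster from marking its OSDs out and rebalancing
ceph osd set noout
# ...reboot the node, wait for its OSDs to rejoin...
ceph osd unset noout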

I just want to know if we can figure out how to rebuild the filesystem, get the monitors talking, and get an MDS server up that has the file map restored (undeleted if possible), or rebuilt from the OSD drives.
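The closest thing I have found is the "recovery using OSDs" procedure in the Ceph troubleshooting docs, which rebuilds the monitor store from the copies of the maps held by the OSDs. Very rough shape of it below; the paths and keyring location are my assumptions, ceph-monstore-tool may need an extra package, all Ceph daemons have to be stopped first, and I would read the official docs carefully before touching anything:

Code:
# on each node that hosts OSDs, with the ceph-osd daemons stopped
ms=/root/mon-store
mkdir -p "$ms"
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path "$ms"
done
# then on one mon node, rebuild the store from the collected maps
# (keyring path is an assumption; it needs the mon. and client.admin keys)
ceph-monstore-tool "$ms" rebuild -- --keyring /etc/pve/priv/ceph.client.admin.keyring
# back up the old store and swap in the rebuilt one
mv /var/lib/ceph/mon/ceph-$(hostname)/store.db /var/lib/ceph/mon/ceph-$(hostname)/store.db.bad
cp -r "$ms/store.db" /var/lib/ceph/mon/ceph-$(hostname)/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-$(hostname)/store.db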
 
