Hello, we have three nodes, each with one OSD, monitor, and manager. Suddenly one of them failed fatally; the monitors are in an unknown state and the managers are not showing.
ceph -s hangs
Post your /etc/pve/ceph.conf and check your Ceph network. Can you ping from one host to the others on the Ceph network? If you have jumbo frames enabled, can you ping with "ping -M do -s 8972 IP-NodeX"? What about time sync: do you have the correct time and a working chrony?
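For reference, the MTU test and the time check could look something like this from each node (just a sketch - substitute your own Ceph-network IPs; 8972 assumes a 9000-byte MTU):

# don't-fragment ping at jumbo size (8972 bytes payload + 28 bytes of headers = 9000)
for ip in 10.0.1.2 10.0.1.5 10.0.1.6; do
    echo "== $ip =="
    ping -M do -s 8972 -c 3 "$ip"
done

# confirm chrony is running and the clock is actually synchronised
systemctl status chrony
chronyc tracking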
Did you ever resolve this? I am having the same issue.
ceph -s just sits there and freezes.
The GUI shows a timeout (500) on the Ceph status page/dashboard.
The config shows all the correct hosts for the monitors and the correct node IPs.
Proxmox node-to-node connectivity is fine - just the Ceph MANAGERS are missing and no OSDs are shown.
[Three screenshots attached: attachments 34595, 34596, 34597]
The disks show the OSDs are there, but Ceph is dead to the world without a working manager... no clue what happened. I was thinking this might be a related post - sorry to hijack it - but no one responded to my other requests, so I'm hoping maybe you fixed your issue and can help me too.
Best practice for CEPH is a minimum of 4 nodes with adequate resources (4 GB of RAM per OSD and one CPU core per OSD) and redundant (aggregated) network interfaces for the CEPH public and monitor networks. If your CEPH is failing, consider rebooting each node, and make sure to back up guests frequently during operation, especially when not following best practices...
Tmanok
Sorry that you did not appreciate the format and approach in my response; I will put more effort in next time to avoid a tone where it seems that I am making assumptions. Your posts did not mention the number of nodes, and the most probable cause of a failing Ceph is underprovisioning or too few nodes.

#1 - I have 9 nodes
#2 - all nodes have plenty of resources
#4 - of course I rebooted all nodes...
This happened after the Octopus-to-Pacific update with the automatic update and upgrade script...
Aside from generic info, your answer has no value for my case as posted... Ceph just hangs, and the MDS and OSD data looks like it has been wiped out. I need to see if I can recover the manager/MDS data... the drives still show they have OSD stores on them, so I need to rebuild or find out why the managers on the nodes will not come up...
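Since this started right after the Octopus-to-Pacific upgrade, what I'm checking on each node (rough sketch, nothing here is specific to my cluster) is whether the installed packages, the daemon binary and the systemd units all line up:

# installed Ceph packages on this node
dpkg -l | grep -E '^ii +ceph'

# what systemd thinks of the local mon/mgr/osd/mds units
systemctl list-units --all 'ceph-mon@*' 'ceph-mgr@*' 'ceph-osd@*' 'ceph-mds@*'

# version of the daemon binary that would be started
ceph-mon --version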
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.0.1.1/16
fsid = cfa7f7e5-64a7-48dd-bd77-466ff1e77bbb
mon_allow_pool_delete = true
mon_host = 10.0.1.2 10.0.1.1 10.0.1.6 10.0.1.5 10.0.90.0 10.0.1.7
ms_bind_ipv4 = true
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.0.1.1/16
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.node2]
host = node2
mds_standby_for_name = pve
[mds.node7]
host = node7
mds_standby_for_name = pve
[mds.node900]
host = node900
mds_standby_for_name = pve
[mds.stack1]
host = stack1
mds_standby_for_name = pve
[mon.node2]
public_addr = 10.0.1.2
[mon.node5]
public_addr = 10.0.1.5
[mon.node6]
public_addr = 10.0.1.6
[mon.node7]
public_addr = 10.0.1.7
[mon.node900]
public_addr = 10.0.90.0
[mon.stack1]
public_addr = 10.0.1.1
[mon]
mon_mds_skip_sanity = true
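To rule out plain reachability to every monitor address in mon_host above, a bash-only port check can be run from each node (no extra tools needed; 6789/3300 are the default v1/v2 messenger ports):

for ip in 10.0.1.2 10.0.1.1 10.0.1.6 10.0.1.5 10.0.90.0 10.0.1.7; do
    for port in 6789 3300; do
        if timeout 2 bash -c "</dev/tcp/$ip/$port" 2>/dev/null; then
            echo "$ip:$port reachable"
        else
            echo "$ip:$port CLOSED or unreachable"
        fi
    done
done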
Feb 27 16:35:54 node900 pvedaemon[1804215]: got timeout
Feb 27 16:36:01 node900 mount[1808559]: mount error: no mds server is up or the cluster is laggy
Feb 27 16:36:01 node900 systemd[1]: mnt-pve-ISO_store1.mount: Mount process exited, code=exited, status=32/n/a
Feb 27 16:36:01 node900 systemd[1]: mnt-pve-ISO_store1.mount: Failed with result 'exit-code'.
Feb 27 16:36:01 node900 kernel: ceph: No mds server is up or the cluster is laggy
Feb 27 16:36:01 node900 systemd[1]: Failed to mount /mnt/pve/ISO_store1.
Feb 27 16:36:01 node900 pvestatd[1614]: mount error: Job failed. See "journalctl -xe" for details.
Feb 27 16:36:06 node900 pvestatd[1614]: got timeout
Feb 27 16:36:11 node900 pvestatd[1614]: got timeout
Feb 27 16:36:11 node900 pvestatd[1614]: status update time (82.828 seconds)
Feb 27 16:36:16 node900 pvestatd[1614]: got timeout
Feb 27 16:36:19 node900 pmxcfs[1348]: [status] notice: received log
Feb 27 16:36:21 node900 pvestatd[1614]: got timeout
Feb 27 16:36:21 node900 systemd[1]: Reloading.
Feb 27 16:36:21 node900 systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[the above KillMode=none warning is repeated six more times]
Feb 27 16:36:22 node900 systemd[1]: Mounting /mnt/pve/ISO_store1...
Feb 27 16:36:22 node900 kernel: libceph: mon3 (1)10.0.1.6:6789 socket closed (con state V1_BANNER)
Feb 27 16:36:22 node900 kernel: libceph: mon3 (1)10.0.1.6:6789 socket closed (con state V1_BANNER)
Feb 27 16:36:23 node900 kernel: libceph: mon3 (1)10.0.1.6:6789 socket closed (con state V1_BANNER)
Feb 27 16:36:24 node900 kernel: libceph: mon3 (1)10.0.1.6:6789 socket closed (con state V1_BANNER)
Feb 27 16:36:28 node900 kernel: libceph: mon5 (1)10.0.90.0:6789 socket closed (con state OPEN)
Feb 27 16:36:28 node900 pvedaemon[1803097]: got timeout
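The repeated "socket closed (con state V1_BANNER)" lines suggest the TCP connection to 10.0.1.6 is accepted but then dropped during the protocol banner exchange, so it is worth confirming on that node what is actually listening on the monitor ports and whether its mon service is up:

# on the node that hosts mon.node6 (10.0.1.6); unit name assumes the mon id from the config above
ss -tlnp | grep -E ':6789|:3300'
systemctl status ceph-mon@node6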
2022-02-21T11:23:40.212664-0600 mon.node2 (mon.0) 2000851 : cluster [INF] mon.node2 is new leader, mons node2,stack1,node7 in quorum (ranks 0,1,3)
2022-02-21T11:23:40.219565-0600 mon.node2 (mon.0) 2000852 : cluster [DBG] monmap e14: 4 mons at {node2=[v2:10.0.1.2:3300/0,v1:10.0.1.2:6789/0],node7=[v2:10.0.1.7:3300/0,v1:10.0.1.7:6789/0],node900=[v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0],stack1=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0]}
2022-02-21T11:23:40.219633-0600 mon.node2 (mon.0) 2000853 : cluster [DBG] fsmap cephfs:1 {0=node2=up:active} 2 up:standby
2022-02-21T11:23:40.219653-0600 mon.node2 (mon.0) 2000854 : cluster [DBG] osdmap e967678: 14 total, 4 up, 10 in
2022-02-21T11:23:40.220140-0600 mon.node2 (mon.0) 2000855 : cluster [DBG] mgrmap e649: stack1(active, since 5d), standbys: node2, node7
2022-02-21T11:23:40.228388-0600 mon.node2 (mon.0) 2000856 : cluster [ERR] Health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; mon node7 is very low on available space; mon stack1 is low on available space; 1/4 mons down, quorum node2,stack1,node7; 6 osds down; 1 host (7 osds) down; Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale; 138 slow ops, oldest one blocked for 61680 sec, osd.0 has slow ops
2022-02-21T11:23:40.228404-0600 mon.node2 (mon.0) 2000857 : cluster [ERR] [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
2022-02-21T11:23:40.228409-0600 mon.node2 (mon.0) 2000858 : cluster [ERR] mds.node2(mds.0): 7 slow metadata IOs are blocked > 30 secs, oldest blocked for 78670 secs
2022-02-21T11:23:40.228413-0600 mon.node2 (mon.0) 2000859 : cluster [ERR] [ERR] MON_DISK_CRIT: mon node7 is very low on available space
2022-02-21T11:23:40.228416-0600 mon.node2 (mon.0) 2000860 : cluster [ERR] mon.node7 has 1% avail
2022-02-21T11:23:40.228422-0600 mon.node2 (mon.0) 2000861 : cluster [ERR] [WRN] MON_DISK_LOW: mon stack1 is low on available space
2022-02-21T11:23:40.228428-0600 mon.node2 (mon.0) 2000862 : cluster [ERR] mon.stack1 has 8% avail
2022-02-21T11:23:40.228432-0600 mon.node2 (mon.0) 2000863 : cluster [ERR] [WRN] MON_DOWN: 1/4 mons down, quorum node2,stack1,node7
2022-02-21T11:23:40.228437-0600 mon.node2 (mon.0) 2000864 : cluster [ERR] mon.node900 (rank 2) addr [v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0] is down (out of quorum)
2022-02-21T11:23:40.228443-0600 mon.node2 (mon.0) 2000865 : cluster [ERR] [WRN] OSD_DOWN: 6 osds down
2022-02-21T11:23:40.228449-0600 mon.node2 (mon.0) 2000866 : cluster [ERR] osd.8 (root=default,host=node900) is down
2022-02-21T11:23:40.228454-0600 mon.node2 (mon.0) 2000867 : cluster [ERR] osd.9 (root=default,host=node900) is down
2022-02-21T11:23:40.228460-0600 mon.node2 (mon.0) 2000868 : cluster [ERR] osd.10 (root=default,host=node900) is down
2022-02-21T11:23:40.228466-0600 mon.node2 (mon.0) 2000869 : cluster [ERR] osd.11 (root=default,host=node900) is down
2022-02-21T11:23:40.228471-0600 mon.node2 (mon.0) 2000870 : cluster [ERR] osd.12 (root=default,host=node900) is down
2022-02-21T11:23:40.228477-0600 mon.node2 (mon.0) 2000871 : cluster [ERR] osd.13 (root=default,host=node900) is down
2022-02-21T11:23:40.228483-0600 mon.node2 (mon.0) 2000872 : cluster [ERR] [WRN] OSD_HOST_DOWN: 1 host (7 osds) down
2022-02-21T11:23:40.228488-0600 mon.node2 (mon.0) 2000873 : cluster [ERR] host node900 (root=default) (7 osds) is down
2022-02-21T11:23:40.228527-0600 mon.node2 (mon.0) 2000874 : cluster [ERR] [WRN] PG_AVAILABILITY: Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale
2022-02-21T11:23:40.228534-0600 mon.node2 (mon.0) 2000875 : cluster [ERR] pg 7.cd is stuck inactive for 21h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228539-0600 mon.node2 (mon.0) 2000876 : cluster [ERR] pg 7.ce is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228544-0600 mon.node2 (mon.0) 2000877 : cluster [ERR] pg 7.cf is stuck stale for 21h, current state stale+active+clean, last acting [6,3,8]
2022-02-21T11:23:40.228550-0600 mon.node2 (mon.0) 2000878 : cluster [ERR] pg 7.d0 is stuck stale for 21h, current state stale+active+clean, last acting [12,2,6]
2022-02-21T11:23:40.228555-0600 mon.node2 (mon.0) 2000879 : cluster [ERR] pg 7.d1 is stuck stale for 21h, current state stale+active+clean, last acting [9,1,2]
2022-02-21T11:23:40.228561-0600 mon.node2 (mon.0) 2000880 : cluster [ERR] pg 7.d2 is stuck stale for 21h, current state stale+active+clean, last acting [3,9,2]
2022-02-21T11:23:40.228567-0600 mon.node2 (mon.0) 2000881 : cluster [ERR] pg 7.d3 is stuck peering for 21h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228574-0600 mon.node2 (mon.0) 2000882 : cluster [ERR] pg 7.d4 is stuck stale for 21h, current state stale+active+clean, last acting [8,6,1]
2022-02-21T11:23:40.228580-0600 mon.node2 (mon.0) 2000883 : cluster [ERR] pg 7.d5 is stuck stale for 21h, current state stale+active+clean, last acting [13,6,7]
2022-02-21T11:23:40.228585-0600 mon.node2 (mon.0) 2000884 : cluster [ERR] pg 7.d6 is stuck stale for 21h, current state stale+active+clean, last acting [11,1,3]
2022-02-21T11:23:40.228591-0600 mon.node2 (mon.0) 2000885 : cluster [ERR] pg 7.d7 is stuck stale for 21h, current state stale+active+clean, last acting [8,2,6]
2022-02-21T11:23:40.228597-0600 mon.node2 (mon.0) 2000886 : cluster [ERR] pg 7.d8 is stuck stale for 21h, current state stale+active+clean, last acting [11,7,6]
2022-02-21T11:23:40.228602-0600 mon.node2 (mon.0) 2000887 : cluster [ERR] pg 7.d9 is stuck stale for 21h, current state stale+active+clean, last acting [2,6,11]
2022-02-21T11:23:40.228608-0600 mon.node2 (mon.0) 2000888 : cluster [ERR] pg 7.da is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228613-0600 mon.node2 (mon.0) 2000889 : cluster [ERR] pg 7.db is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228619-0600 mon.node2 (mon.0) 2000890 : cluster [ERR] pg 7.dc is stuck peering for 21h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228624-0600 mon.node2 (mon.0) 2000891 : cluster [ERR] pg 7.dd is stuck stale for 18h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228630-0600 mon.node2 (mon.0) 2000892 : cluster [ERR] pg 7.de is stuck stale for 21h, current state stale+active+clean, last acting [2,3,10]
2022-02-21T11:23:40.228635-0600 mon.node2 (mon.0) 2000893 : cluster [ERR] pg 7.df is stuck stale for 21h, current state stale+active+clean, last acting [3,1,14]
2022-02-21T11:23:40.228641-0600 mon.node2 (mon.0) 2000894 : cluster [ERR] pg 7.e0 is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228646-0600 mon.node2 (mon.0) 2000895 : cluster [ERR] pg 7.e1 is stuck peering for 21h, current state peering, last acting [0,1]
2022-02-21T11:23:40.228652-0600 mon.node2 (mon.0) 2000896 : cluster [ERR] pg 7.e2 is stuck stale for 21h, current state stale+active+clean, last acting [9,2,6]
2022-02-21T11:23:40.228658-0600 mon.node2 (mon.0) 2000897 : cluster [ERR] pg 7.e3 is stuck stale for 21h, current state stale+active+clean, last acting [7,9,1]
2022-02-21T11:23:40.228663-0600 mon.node2 (mon.0) 2000898 : cluster [ERR] pg 7.e4 is stuck stale for 21h, current state stale+active+clean, last acting [8,7,1]
2022-02-21T11:23:40.228669-0600 mon.node2 (mon.0) 2000899 : cluster [ERR] pg 7.e5 is stuck peering for 22h, current state peering, last acting [0,6]
2022-02-21T11:23:40.228675-0600 mon.node2 (mon.0) 2000900 : cluster [ERR] pg 7.e6 is stuck stale for 21h, current state stale+active+clean, last acting [3,11,6]
2022-02-21T11:23:40.228680-0600 mon.node2 (mon.0) 2000901 : cluster [ERR] pg 7.e7 is down, acting [0,7]
2022-02-21T11:23:40.228685-0600 mon.node2 (mon.0) 2000902 : cluster [ERR] pg 7.e8 is stuck stale for 21h, current state stale+active+clean, last acting [8,3,6]
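Worth noting in the health detail above: mon.node7 is reported at 1% free space and mon.stack1 at 8%. Monitors stop behaving once the filesystem holding their store gets critically low (the default critical threshold is 5% available), so checking and freeing space on those nodes is a sensible first step. A rough check, assuming the default data path:

# free space on the filesystem that holds the monitor store
df -h /var/lib/ceph/mon

# size of the mon store itself - store.db can balloon while the cluster is unhealthy
du -sh /var/lib/ceph/mon/ceph-*/store.db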
root@node7:/etc/pve# ceph auth ls
2022-02-27T17:17:55.616-0600 7f907359e700 0 monclient(hunting): authenticate timed out after 300
2022-02-27T17:22:55.618-0600 7f907359e700 0 monclient(hunting): authenticate timed out after 300
2022-02-27T17:27:55.615-0600 7f907359e700 0 monclient(hunting): authenticate timed out after 300
2022-02-27T17:32:55.617-0600 7f907359e700 0 monclient(hunting): authenticate timed out after 300
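Since every ceph command that needs quorum just hangs like this, the monitor daemons can still be queried directly over their local admin socket, bypassing the cluster (this assumes the mon process on the node is running, and relies on the PVE convention of naming the mon after the short hostname, as in the [mon.*] sections above):

# ask the local monitor for its own view, without needing quorum
ceph daemon mon.$(hostname -s) mon_status
# or with the socket path spelled out
ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname -s).asok quorum_status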
They talk to each other fine...
Monitors appear to be on - but no quorum.
Proxmox node-to-node connectivity is fine - just the Ceph MANAGERS are missing and no OSDs are shown.
Hi GoZippy,
Currently we only know that the PVE corosync talks fine. I would like to see the output of "ip a" and each node's Network page to help you verify that. Additionally, could you try the following pings on every node? (A quick loop to run them all is sketched after the list.)
- 10.0.1.2
- 10.0.1.5
- 10.0.1.6
- 10.0.1.7
- 10.0.90.0
- 10.0.1.1
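Something like this covers the whole list in one go from each node:

for ip in 10.0.1.2 10.0.1.5 10.0.1.6 10.0.1.7 10.0.90.0 10.0.1.1; do
    echo "== $ip =="
    ping -c 3 -W 1 "$ip"
done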
root@stack1:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:5c:ea:b8 brd ff:ff:ff:ff:ff:ff
altname enp2s0f0
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr1 state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:ea:b9 brd ff:ff:ff:ff:ff:ff
altname enp2s0f1
4: enp1s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr2 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:10 brd ff:ff:ff:ff:ff:ff
5: enp1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr3 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:11 brd ff:ff:ff:ff:ff:ff
6: enp1s0f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr4 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:12 brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr5 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:13 brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:0e:b6:5c:ea:b8 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.1/16 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe5c:eab8/64 scope link
valid_lft forever preferred_lft forever
9: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:ea:b9 brd ff:ff:ff:ff:ff:ff
inet 10.10.10.1/16 scope global vmbr1
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe5c:eab9/64 scope link
valid_lft forever preferred_lft forever
10: vmbr2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:10 brd ff:ff:ff:ff:ff:ff
11: vmbr3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:11 brd ff:ff:ff:ff:ff:ff
12: vmbr4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:12 brd ff:ff:ff:ff:ff:ff
13: vmbr5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:9d:13 brd ff:ff:ff:ff:ff:ff
root@node2:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:5c:9f:e0 brd ff:ff:ff:ff:ff:ff
altname enp2s0f0
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:9f:e1 brd ff:ff:ff:ff:ff:ff
altname enp2s0f1
4: enp1s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:63:d1 brd ff:ff:ff:ff:ff:ff
5: enp1s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:63:d2 brd ff:ff:ff:ff:ff:ff
6: enp1s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:63:d3 brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:63:d4 brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:0e:b6:5c:9f:e0 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.2/16 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe5c:9fe0/64 scope link
valid_lft forever preferred_lft forever
root@node2:~#
root@node3:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:5c:d5:90 brd ff:ff:ff:ff:ff:ff
altname enp2s0f0
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:d5:91 brd ff:ff:ff:ff:ff:ff
altname enp2s0f1
4: enp1s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b7 brd ff:ff:ff:ff:ff:ff
5: enp1s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b8 brd ff:ff:ff:ff:ff:ff
6: enp1s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b9 brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:ba brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:0e:b6:5c:d5:90 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.3/16 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe5c:d590/64 scope link
valid_lft forever preferred_lft forever
root@node3:~#
root@node4:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:51:3b:b0 brd ff:ff:ff:ff:ff:ff
altname enp2s0f0
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr1 state DOWN group default qlen 1000
link/ether 00:0e:b6:51:3b:b1 brd ff:ff:ff:ff:ff:ff
altname enp2s0f1
4: enp1s0f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr2 state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b2 brd ff:ff:ff:ff:ff:ff
5: enp1s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr3 state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b3 brd ff:ff:ff:ff:ff:ff
6: enp1s0f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr4 state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b4 brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr5 state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b5 brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:0e:b6:51:3b:b0 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.4/16 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe51:3bb0/64 scope link
valid_lft forever preferred_lft forever
9: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:51:3b:b1 brd ff:ff:ff:ff:ff:ff
inet6 fe80::20e:b6ff:fe51:3bb1/64 scope link
valid_lft forever preferred_lft forever
10: vmbr2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b2 brd ff:ff:ff:ff:ff:ff
11: vmbr3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b3 brd ff:ff:ff:ff:ff:ff
12: vmbr4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b4 brd ff:ff:ff:ff:ff:ff
13: vmbr5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:87:70:b5 brd ff:ff:ff:ff:ff:ff
root@node4:~#
root@node5:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:5c:a3:e8 brd ff:ff:ff:ff:ff:ff
3: enp2s0f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr3 state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:a3:e9 brd ff:ff:ff:ff:ff:ff
4: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr2 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b3 brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
5: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr1 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b4 brd ff:ff:ff:ff:ff:ff
altname enp1s0f1
6: enp1s0f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr4 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b5 brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master vmbr5 state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b6 brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:0e:b6:5c:a3:e8 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.5/16 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe5c:a3e8/64 scope link
valid_lft forever preferred_lft forever
9: vmbr1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b4 brd ff:ff:ff:ff:ff:ff
10: vmbr2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b3 brd ff:ff:ff:ff:ff:ff
11: vmbr3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:a3:e9 brd ff:ff:ff:ff:ff:ff
inet6 fe80::20e:b6ff:fe5c:a3e9/64 scope link
valid_lft forever preferred_lft forever
12: vmbr4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b5 brd ff:ff:ff:ff:ff:ff
13: vmbr5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:54:b6 brd ff:ff:ff:ff:ff:ff
root@node5:~#
root@node7:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:5c:a9:c8 brd ff:ff:ff:ff:ff:ff
3: enp2s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:5c:a9:c9 brd ff:ff:ff:ff:ff:ff
4: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:99:0a brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:99:0b brd ff:ff:ff:ff:ff:ff
altname enp1s0f1
6: enp1s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:99:0c brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:a8:99:0d brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 56:33:43:45:7e:c7 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.7/16 brd 10.0.255.255 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::5433:43ff:fe45:7ec7/64 scope link
valid_lft forever preferred_lft forever
root@node8:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 00:0e:b6:44:e7:c8 brd ff:ff:ff:ff:ff:ff
3: enp2s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:44:e7:c9 brd ff:ff:ff:ff:ff:ff
4: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:87:bb:6a brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
5: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:87:bb:6b brd ff:ff:ff:ff:ff:ff
altname enp1s0f1
6: enp1s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:87:bb:6c brd ff:ff:ff:ff:ff:ff
7: enp1s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:0e:b6:87:bb:6d brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 00:0e:b6:44:e7:c8 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.8/16 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::20e:b6ff:fe44:e7c8/64 scope link
valid_lft forever preferred_lft forever
root@node8:~#
root@node900:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP group default qlen 1000
link/ether 78:2b:cb:47:98:2d brd ff:ff:ff:ff:ff:ff
altname enp1s0f0
3: eno2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
link/ether 78:2b:cb:47:98:2e brd ff:ff:ff:ff:ff:ff
altname enp1s0f1
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 56:51:91:97:b0:64 brd ff:ff:ff:ff:ff:ff
inet 10.0.90.0/16 brd 10.0.255.255 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::5451:91ff:fe97:b064/64 scope link
valid_lft forever preferred_lft forever
5: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ea:17:4a:2b:30:0d brd ff:ff:ff:ff:ff:ff
inet6 fe80::e817:4aff:fe2b:300d/64 scope link
valid_lft forever preferred_lft forever
root@node900:~#
Not only is there no quorum, but the status is "unknown" - I wouldn't exactly call that "on". Double-check your screenshot: it states they are unknown, which normally happens when networking has failed, or worse, when the software stack is hitting an issue. Additionally, not all of the configured monitors are displayed in that same screenshot of your monitors.
OK, between your managers being missing and not all of your monitors showing in one of your screenshots, something tells me the issue is related to the software stack or the networking stack. By the way, if your monitors are not working (currently the case), your OSDs won't be displayed, and when the Ceph stack, e.g. the monitors and managers, is not working, the CRUSH algorithm will be halted. First verify connectivity, then try:
- What does "systemctl status" show for each of the ceph processes on each of the nodes? They may ask you to run journalctl for each process too. You don't have to post all of that here, but a pastebin link or attached TXT files might be helpful.
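Concretely, something along these lines on every node would do (the instance names assume the usual PVE convention of the short hostname for the mon/mgr id and the numeric id for each OSD):

systemctl status ceph-mon@$(hostname -s) ceph-mgr@$(hostname -s)
systemctl status 'ceph-osd@*'
# recent log for anything that shows failed/inactive, e.g. the manager
journalctl -b -u ceph-mgr@$(hostname -s) --no-pager | tail -n 100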