Ceph failing on new 3 Server Cluster.

Mr-Crain

New Member
Mar 3, 2025
Noob here. I've built a new 3-server cluster with 3 M630s (VRTX chassis). Right after the cluster is built, everything looks fine: Ceph is configured, the OSDs are working, the pools are created, and every status is green. But a couple of hours later Ceph begins freezing. I can no longer add a monitor, and I get no status reports and no OSD list, just "got timeout (500)".

I'm going to attach my build notes if anyone wouldn't mind taking a look. https://docs.google.com/document/d/...ouid=108617046164815634889&rtpof=true&sd=true

Thanks in advance and I'm down to run any checks you guys can think of, or hell even a complete rebuild if you guys think I did something incorrectly.
 
Check if the network configured for Ceph is working. Did you enable the firewall without configuring exceptions for Ceph (macro)?

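If you use a large MTU, test it with a ping payload large enough that, together with the ICMP header (8 bytes) and the IP header (20 bytes for IPv4), the packet is exactly the size of the MTU. The -M do option forbids fragmentation, so the ping fails if any hop on the path cannot carry a packet of that size.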
For example, with an MTU of 9000 and IPv4:
Code:
ping -M do -s 8972 {target host}
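To show where the 8972 comes from, here is a minimal sketch that derives the payload size from the MTU. The function name icmp_payload is made up for illustration; the header sizes (20 bytes for IPv4, 40 bytes for IPv6, 8 bytes for the ICMP echo header) are the standard fixed sizes without IP options.

```shell
#!/bin/sh
# icmp_payload MTU [IPVERSION]
# Hypothetical helper: prints the ping payload size that fills one packet.
# Header sizes: IPv4 = 20 bytes, IPv6 = 40 bytes, ICMP echo header = 8 bytes.
icmp_payload() {
    mtu=$1
    ipver=${2:-4}
    if [ "$ipver" = "6" ]; then
        echo $((mtu - 40 - 8))
    else
        echo $((mtu - 20 - 8))
    fi
}

icmp_payload 9000      # IPv4 with jumbo frames
icmp_payload 9000 6    # IPv6 with jumbo frames
```

The result can be passed straight to ping, for example: ping -M do -s "$(icmp_payload 9000)" {target host}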
 
The network for Ceph is indeed working: I just checked the port group, it's all up/up, and each server can ping the others on that subnet. My Datacenter and every node have the firewall disabled, or at least they are unchecked in the GUI. I would assume everything works with the firewall off? Or should I look into the Ceph (macro) steps you mentioned?
 
If ceph -s is timing out, and the network works, and the firewall is disabled (Datacenter -> Firewall -> Options), then check the state the MONs are in. The Ceph documentation explains how:
https://docs.ceph.com/en/latest/rad...hooting-mon/#using-the-monitor-s-admin-socket

The command you want to run is this one:
Code:
ceph --admin-daemon <full_path_to_asok_file> <command>
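As a rough illustration, the sketch below fills in the placeholders under two assumptions that may not hold on your setup: the admin socket lives at Ceph's default location (/var/run/ceph/ with the default cluster name "ceph"), and the monitor ID equals the node's hostname. The helper name mon_asok_path is made up; mon_status is one of the commands the linked documentation describes.

```shell
#!/bin/sh
# mon_asok_path [MON_ID]
# Hypothetical helper: prints the default admin socket path of a monitor.
# Assumes the default cluster name "ceph" and the default run directory.
mon_asok_path() {
    echo "/var/run/ceph/ceph-mon.${1:-$(hostname)}.asok"
}

sock=$(mon_asok_path)
if [ -S "$sock" ]; then
    # Talks to the local daemon directly, so it works without quorum.
    ceph --admin-daemon "$sock" mon_status
else
    echo "no admin socket found at $sock"
fi
```

If the socket is not found, ls /var/run/ceph/ on the node shows the actual file names.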

You can also check if you can establish connections to Ceph services on another host.

Code:
ss -tulpn | grep ceph
will list the IPs and ports on which the Ceph services are listening. Then you can use
Code:
nc -z {ip} {port}
to open a connection. If it does not return an error, it was able to open the connection.
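To repeat that check across all nodes, something like the following sketch could help. The function names are made up for illustration. Ports 3300 and 6789 are the default monitor ports; OSDs and other daemons use a port range that is best read from the ss output above.

```shell
#!/bin/sh
# check_port HOST PORT
# Hypothetical helper: reports whether a TCP connection can be opened.
check_port() {
    if nc -z -w 2 "$1" "$2" 2>/dev/null; then
        echo "OK   $1:$2"
    else
        echo "FAIL $1:$2"
    fi
}

# check_mon_ports HOST...
# Tries both default monitor ports on every host given.
check_mon_ports() {
    for host in "$@"; do
        for port in 3300 6789; do
            check_port "$host" "$port"
        done
    done
}
```

Run it from each node against the Ceph network IPs of the other nodes, for example: check_mon_ports {ip of node 2} {ip of node 3}. A FAIL on a port that ss shows as listening on the target points at filtering in between.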

Also, check the journals/logs of the Ceph services for anything that might indicate what is going on.
Code:
journalctl -u ceph-mon@$(hostname).service
journalctl -u ceph-osd@{OSD ID}.service
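Since the problem only shows up a few hours after the build, it may help to limit the journal to that time window. A small sketch, with made-up helper names and the assumption that the monitor ID is the hostname:

```shell
#!/bin/sh
# ceph_unit TYPE ID
# Hypothetical helper: builds the systemd unit name, e.g. ceph-mon@node1.service
ceph_unit() {
    echo "ceph-$1@$2.service"
}

# recent_ceph_logs TYPE ID [SINCE]
# Shows the journal of one Ceph daemon, by default for the last 3 hours.
recent_ceph_logs() {
    journalctl -u "$(ceph_unit "$1" "$2")" --since "${3:-3 hours ago}" --no-pager
}
```

For example: recent_ceph_logs mon "$(hostname)" or recent_ceph_logs osd {OSD ID} "1 hour ago"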
 
You, sir, are the man. I completely wiped and started over... I think it WAS the firewall stuff. Although I had the firewall disabled on my last go-around, this time I enabled it at the Datacenter level and put the macro AND a bunch of individual statements in there, and she has been stable as can be! Thank you much!