Ceph storage goes offline when first server in cluster reboots

Hi there

I have three identical nodes running a Proxmox cluster, each with 4 OSDs used in a Ceph storage.
PRX01, PRX02 and PRX03

When it comes to an update I sometimes have to reboot a node, especially when a kernel update is involved.

So I put the node into maintenance mode to let the VMs migrate to another node first.
After all VMs have been migrated, I reboot the node.
Once the node is up again, I disable maintenance mode for it and wait until the VMs have migrated back before I proceed with the other nodes one by one.
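Roughly what I do, sketched as CLI commands (I believe the ha-manager node-maintenance command is the CLI equivalent of the maintenance mode; the node name is just an example):
Code:
# put the node into maintenance mode so the HA manager migrates the VMs away
ha-manager crm-command node-maintenance enable prx02

# once all VMs have left the node
reboot

# after the node is back up, let the VMs migrate back
ha-manager crm-command node-maintenance disable prx02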

So far everything went fine.

Only when I need to reboot the node PRX01 does something go terribly wrong.
The whole Ceph cluster becomes unavailable until the reboot has finished.

Does anyone have an idea why?
What info from my configuration do you need to help me?
 
you'd need to provide more details about your setup and ideally log files..

how many monitors do you have? how is your pool set up (replication settings, anything you customized?)? how many OSDs are there, and how are they distributed across the nodes?

what does "ceph -s" say when the cluster works, and what does it say when it doesn't?
 
how many monitors do you have?
- I have three monitors, one on each node.

how is your pool set up (replication settings, anything you customized?)?
Pool #                  1                     2
Name                    .mgr                  VMPool
Size/min                3/2                   3/2
# of Placement Groups   1                     128
Opt. # PGs              1                     128
Autoscale Mode          on                    on
CRUSH Rule              replicated_rule (0)   replicated_rule (0)
Used [%]                44.45 MiB (0.00%)     9.41 TiB (49.60%)
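For completeness, the same settings can presumably also be read from the CLI with something like this (a sketch, using the VMPool name from the table above):
Code:
ceph osd pool get VMPool size
ceph osd pool get VMPool min_size
ceph osd pool get VMPool pg_num
ceph osd pool get VMPool crush_rule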

how many OSDs are there, and how are they distributed across the nodes?
- there are 12 OSDs available, 4 on each PVE node.

what does "ceph -s" say when the cluster works?
...
cluster:
id: e514f756-xxxxxxxx-aa96-9304de459fd1
health: HEALTH_OK

services:
mon: 3 daemons, quorum prx02,prx03,prx01 (age 42h)
mgr: prx02(active, since 42h), standbys: prx03, prx01
osd: 12 osds: 12 up (since 42h), 12 in (since 9M)

data:
pools: 2 pools, 129 pgs
objects: 844.08k objects, 3.2 TiB
usage: 9.4 TiB used, 12 TiB / 21 TiB avail
pgs: 129 active+clean

io:
client: 120 KiB/s rd, 7.7 MiB/s wr, 24 op/s rd, 139 op/s wr
....


See also my ceph.conf:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.xxx.1.0/24
fsid = e514f756-b1ce-4429-aa96-9304de459fd1
mon_allow_pool_delete = true
mon_host = 10.xxx.1.20 10.xxx.1.30 10.xxx.1.10
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.xxx.1.0/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mon.prx01]
public_addr = 10.xxx.1.10

[mon.prx02]
public_addr = 10.xxx.1.20

[mon.prx03]
public_addr = 10.xxx.1.30
 
in case you don't want to provoke another outage (understandable ;)) could you maybe provide the journal of one of the other nodes, starting slightly before you trigger the shutdown of the first node? anything particular about your network setup (going over a switch? full mesh? ... ?)?
 
Forgive me, my Linux know-how is rapidly growing, but not as fast as I wish :)

How to get the requested journal data?
journalctl --since "25025-05-13 16:05" --until "2025-05-13 16:15" thats where the reboot of prx01 happened

But what output format should I use, and how do I get the export so I can upload it?
 
journalctl --since "2025-05-13 16:05" --until "2025-05-13 16:15" > log.txt and then you can attach the log.txt file here (you can download it using scp for example)
 
Thanks for the hint :)

I've collected the logs from all three nodes.

PRX01 was rebooted; unfortunately I didn't find much in the journal of PRX02 or PRX03.
But as far as I understand the log from PRX01, Ceph was shut down on ALL nodes during the reboot.
 

Attachments

okay, that looks fine so far.. could you also post "ceph osd crush dump" (should be the same on all nodes) and /var/log/ceph/ceph.log of nodes 2 and 3 for the problematic reboot? the lines in that file start with a timestamp in unix epoch format, you can convert that with date: date --date=@XXXXXX, e.g.:
Code:
$ date --date=@1747381287
Fri May 16 09:41:27 AM CEST 2025
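if you want to convert a whole ceph.log in one go, something along these lines should also work (just a sketch):
Code:
# print each line with the leading epoch timestamp converted to a readable date
while read -r ts rest; do
    printf '%s %s\n' "$(date --date=@"${ts%%.*}" '+%F %T')" "$rest"
done < /var/log/ceph/ceph.log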
 
those look okay as well AFAICT..

The whole Ceph cluster becomes unavailable until the reboot has finished.

how exactly did you determine this? the ceph logs only show 1 mon and 4 osds going down, but other than the PGs being undersized (which is expected and okay, they remained active!) ceph doesn't complain about anything as a result..
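next time you have to reboot prx01, it might help to watch the cluster from one of the other nodes while it's down, something like (just a sketch):
Code:
# run on prx02 or prx03 during the reboot of prx01
watch -n 5 'ceph -s; ceph health detail'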
 