Disk died, OSD is down and out, how to repair?

rainer042

Hello,

Recently two disks on two different servers of a hyperconverged PVE cluster died. Ceph rebalanced and is healthy again. So I will get two new disks, insert them into the nodes, and then.....?

At the moment both OSDs are marked down and out in the output of ceph osd tree. Both are still part of the crush map.

My plan would be to run ceph-volume lvm create --bluestore --osd-id {original-id} --data /dev/sdx for each of the new, unused disks I see. Possibly they will be marked in afterwards; if this does not happen, I could tell Ceph to mark both in by running ceph osd in <osd_id>.
Afterwards a simple start of the OSD(s) in question should make them up again, and Ceph should start moving data to the new disks.
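Spelled out, the commands I have in mind would roughly be the following (OSD ID and device name are placeholders):

Code:
# recreate the OSD on the new disk, reusing the original ID
ceph-volume lvm create --bluestore --osd-id {original-id} --data /dev/sdx
# mark it "in" again if that does not happen automatically
ceph osd in {original-id}
# start the daemon so the OSD comes "up"
systemctl start ceph-osd@{original-id}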

Is this workflow OK to solve the problem, or which way should I take instead to get the OSDs up and running again?
 
Could I see your pve ceph status and your ceph osd tree please?
 
Here are the requested data:

Code:
# ceph -s
  cluster:
    id:     xyz
    health: HEALTH_WARN
            2 devices have ident light turned on

  services:
    mon: 3 daemons, quorum pve01,pve05,pve11 (age 5w)
    mgr: pve05(active, since 5w), standbys: pve11, pve01
    osd: 72 osds: 70 up (since 29h), 70 in (since 29h)

  data:
    pools:   3 pools, 2081 pgs
    objects: 4.35M objects, 16 TiB
    usage:   48 TiB used, 13 TiB / 61 TiB avail
    pgs:     2081 active+clean

  io:
    client:   24 KiB/s rd, 4.7 MiB/s wr, 2 op/s rd, 690 op/s wr

# ceph osd tree out
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         62.87695  root default                            
-7          5.23975      host pve03                          
16    ssd   0.87329          osd.16    down         0  1.00000
-17          5.23975      host pve08                          
46    ssd   0.87329          osd.46    down         0  1.00000


# ceph osd tree down
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         62.87695  root default                            
-7          5.23975      host pve03                          
16    ssd   0.87329          osd.16    down         0  1.00000
-17          5.23975      host pve08                          
46    ssd   0.87329          osd.46    down         0  1.00000

All other OSDs (in total: 12 hosts with 6 OSDs each) are in and up. All use the "bluestore" osd-objectstore.
 
You cannot use the OSD ID of an existing drive for replacing. You would get an error similar to
Code:
RuntimeError: The OSD ID <OSD-ID> is already in use or does not exist.

Is there a certain reason why you want the new disk to get the exact ID of the old one?

If not, I would advise you to replace the disks in the GUI with the following procedure (a rough CLI equivalent is sketched below):
  • Go to Ceph > OSD
  • Click "Manage Global Flags" and set norebalance, norecover and nobackfill
  • Click "Create: OSD" and add the disks
  • Select the old disks and click "Out" (if needed)
  • Select the old disks and click "More" > "Destroy"
  • IMPORTANT: Click "Manage Global Flags" and unset norebalance, norecover and nobackfill
If there is a specific reason why the OSD IDs have to match, you can first remove the old disks and then use your suggested command.
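For reference, roughly the same steps on the command line would be (OSD ID 16 and /dev/sdx are only examples, adjust to your nodes):

Code:
# set the flags so ceph does not start moving data during the swap
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill

# on the node with the new disk: create the OSD (it gets the next free ID)
pveceph osd create /dev/sdx

# mark the old OSD out (if needed) and destroy it
ceph osd out 16
pveceph osd destroy 16

# IMPORTANT: unset the flags again
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset nobackfill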
 
Well, the reason for my proposal is that this is the way I replace broken disks in a pure Nautilus cluster I run when an OSD has a disk failure. Then the ceph-volume call from above (and a new disk) are enough to "repair" the OSD; it is then "in" and "up".
The small difference here is that those OSDs usually stay in the state "in" even with a broken disk, whereas now they are "out", and I am unsure whether "out" is more than just telling Ceph that this OSD does currently not belong to the cluster, a setting that can be changed back to "in" without losing or lacking any other (meta)data belonging to the OSD.
 
I am not sure if I understand the question. "Out" just tells your cluster not to use the OSD.
 
Well, then you gave the answer :). If "osd out" does nothing more than make an OSD not part of the cluster, then I can use my initial solution and simply switch it to "in" again.

Have a nice day
Rainer
 
Sure. You should still set the flags before and unset them after. Otherwise Ceph will try to adapt to a very temporary situation and cause unnecessary traffic and disk wear.
 
