Died disk, osd is down and out, how to repair?

rainer042

Hello,

recently two disks on two different servers of a hyperconverged pve cluster died. ceph rebalanced and is healthy again. So I will get two new disks, insert them into the nodes and then.....?

At the moment both osds are marked down and out in the output ceph osd tree. Both are still part of the crush map.

My plan would be to run ceph-volume lvm create --bluestore --osd-id {original-id} --data /dev/sdx for each of the new, unused disks I see. Possibly they would be marked in afterwards. If this does not happen, I could tell ceph to mark both in by running ceph osd in <osd_id>.
Afterwards, simply starting the osd(s) in question should make them up again, and ceph should start moving data to the new disks.
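For clarity, the plan above could be sketched as a short command sequence. This is only a sketch of the proposed workflow, not a verified procedure; osd.16 and /dev/sdx are placeholders.

```shell
# Sketch of the proposed replacement workflow for one OSD (here osd.16).
# /dev/sdx is a placeholder for the new, unused disk.

# Recreate the OSD on the new disk, reusing the original ID:
ceph-volume lvm create --bluestore --osd-id 16 --data /dev/sdx

# If the OSD is not marked "in" automatically, do it by hand:
ceph osd in 16

# Start the daemon so the OSD comes "up" and backfilling begins:
systemctl start ceph-osd@16
```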

Is this workflow ok to solve the problem, or which way should I take instead to get the osds up and running again?
 
Could I see your pve ceph status and your ceph osd tree please?
 
Here are the requested data:

Code:
# ceph -s
  cluster:
    id:     xyz
    health: HEALTH_WARN
            2 devices have ident light turned on

  services:
    mon: 3 daemons, quorum pve01,pve05,pve11 (age 5w)
    mgr: pve05(active, since 5w), standbys: pve11, pve01
    osd: 72 osds: 70 up (since 29h), 70 in (since 29h)

  data:
    pools:   3 pools, 2081 pgs
    objects: 4.35M objects, 16 TiB
    usage:   48 TiB used, 13 TiB / 61 TiB avail
    pgs:     2081 active+clean

  io:
    client:   24 KiB/s rd, 4.7 MiB/s wr, 2 op/s rd, 690 op/s wr

# ceph osd tree out
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         62.87695  root default                            
-7          5.23975      host pve03                          
16    ssd   0.87329          osd.16    down         0  1.00000
-17          5.23975      host pve08                          
46    ssd   0.87329          osd.46    down         0  1.00000


# ceph osd tree down
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
-1         62.87695  root default                            
-7          5.23975      host pve03                          
16    ssd   0.87329          osd.16    down         0  1.00000
-17          5.23975      host pve08                          
46    ssd   0.87329          osd.46    down         0  1.00000

All other OSDs (total: 12 hosts, each with 6 OSDs) are in and up. All use the "bluestore" osd-objectstore.
 
You can not reuse the OSD ID of an existing drive for a replacement. You would get an error similar to
Code:
RuntimeError: The OSD ID <OSD-ID> is already in use or does not exist.

Is there a certain reason why you want the new disk to map the exact ID of the old one?

If not, I would advise you to replace the disks in the GUI with the following procedure:
  • Go to Ceph > OSD
  • Click "Manage Global Flags" and set norebalance, norecover and nobackfill
  • Click "Create: OSD" and add the disks
  • Select the old disks and click "Out" (if needed)
  • Select the old disks and click "More" > "Destroy"
  • IMPORTANT: Click "Manage Global Flags" and unset norebalance, norecover and nobackfill
If there is a specific reason why the OSD IDs have to match, you can first remove the old disks and then use your suggested command.
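The GUI steps above also have a CLI counterpart. The following is a rough sketch using the pveceph wrapper and the ceph flag commands; the device paths are examples, and the OSD IDs are the two dead ones from this thread.

```shell
# Pause data movement while swapping the disks:
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill

# Create OSDs on the new disks (Proxmox wrapper around ceph-volume);
# /dev/sdx and /dev/sdy are placeholders for the new devices:
pveceph osd create /dev/sdx
pveceph osd create /dev/sdy

# Mark the dead OSDs out (if needed) and destroy them:
ceph osd out 16
pveceph osd destroy 16
ceph osd out 46
pveceph osd destroy 46

# IMPORTANT: unset the flags so ceph can rebalance again:
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset nobackfill
```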
 
Well, the reason for my proposal is that this is the way I replace broken disks in a pure Nautilus cluster I run when an osd has a disk failure. There, the ceph-volume call from above (and a new disk) are enough to "repair" the osd. It's then "in" and "up".
The small difference here is that those OSDs usually stay in the state "in" even with a broken disk, whereas now they are "out", and I am unsure whether "out" is more than just telling ceph that this osd does not currently belong to the cluster, i.e. a setting that can be changed back to "in" without losing any other (meta)data belonging to the osd.
 
I am not sure if I understand the question. Out just tells your cluster not to use the OSD.
 
Well, then you gave the answer :). If osd out does nothing more than exclude an OSD from the cluster, then I can use my initial solution and simply switch it to in again.

Have a nice day
Rainer
 
Sure. You should still set the flags before and unset them after. Otherwise ceph will try to adapt to a very temporary situation and cause unnecessary traffic and disk wear.
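To make sure the flags are never left set by accident, the set/unset pairing can be wrapped in a small script. This is a minimal sketch, assuming a working ceph CLI on the node; the actual replacement steps go where the comment indicates.

```shell
#!/bin/bash
# Set the maintenance flags, run the disk-replacement steps, and
# guarantee the flags are unset again even if a step fails.
set -u
FLAGS="norebalance norecover nobackfill"

cleanup() {
    for f in $FLAGS; do
        ceph osd unset "$f"
    done
}
# Run cleanup on any exit, including errors and Ctrl-C:
trap cleanup EXIT

for f in $FLAGS; do
    ceph osd set "$f"
done

# ... replace the disks here (GUI or CLI) ...
```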
 