[SOLVED] CEPH OSD(s) failing to initialize after a hardware change

pille99 · May 28, 2026

The issue, as I see it, is that the OSD service doesn't start. You need to reset the failure counter:

systemctl reset-failed "servicename"
systemctl start "servicename"

I had the same issue last week, as the result of some network changes. I never got it fully up again. Don't waste time: create new OSDs and restore.

Then it seems you have a mapping issue too.
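
On a package-based Ceph install such as the one Proxmox ships, each OSD daemon normally runs as an instance of the ceph-osd@.service template, so the "servicename" above would look like ceph-osd@8.service. A minimal sketch, assuming OSD id 8 (used here only as an example) and two hypothetical helper names, osd_unit and restart_osd. By default it only prints the commands:

```shell
#!/bin/sh
# Build the systemd unit name for one OSD id (template unit ceph-osd@.service).
osd_unit() {
    printf 'ceph-osd@%s.service\n' "$1"
}

# Print the reset-and-start sequence for one OSD.
# Set DRY_RUN=0 to actually run the commands instead of printing them.
restart_osd() {
    unit=$(osd_unit "$1")
    for action in reset-failed start; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "systemctl $action $unit"
        else
            systemctl "$action" "$unit"
        fi
    done
}

restart_osd 8   # 8 is an example id, not a recommendation
```

If the unit fails again right away, journalctl -u ceph-osd@8.service -b should show why the daemon exited.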

ResidentDataHoarder · May 28, 2026

So, should I just wipe the OSDs on that node and re-add them? I just want to clarify before attempting that. I am at 2/3 for that setup, CephFS did come back up, and I am able to access everything, but those 15 OSDs are still down. So, to be completely clear: destroy them and re-add?

pille99 · May 28, 2026

I would strongly advise it. As you wrote yourself, you have been fiddling with it for days.
I had exactly the same experience. I played around for 3 days and nothing worked. Finally I reinstalled Proxmox completely. My cluster wasn't in production yet and, as I had a feeling, I had made a backup right before.
And my personal opinion: if you play around too much on a prod system, you can never be sure later what the server will do. I didn't trust that server anymore, so I reinstalled it (but I am not too mad, because I could fix an issue and increased the performance dramatically).

But wait for a staff member for advice on how to remove and re-add the node properly (only my advice).

guruevi · May 28, 2026

Yes, you typically do not recover a node partially. What you did is effectively replace the node but put the old hard drives in, and now you’re trying to recover those in a clustered file system. At this point the data on the ‘dead’ OSDs is out of sync. It does not matter anymore; forcefully trying to re-insert them will cause you more trouble.

So yes, wipe the disks. I would even say wipe the Proxmox install at this point, start fresh, and treat it as a new node.

Nemesiz · May 28, 2026

ResidentDataHoarder said:

Code:

May 27 07:29:29 pve5 kernel: mpt3sas_cm1: handle(0x27) sas_address(0x5000cca27a3838c9) port_type(0x1)
May 27 07:29:29 pve5 kernel: scsi 9:0:2:0: Direct-Access     HGST     HUH721212AL4200  AB01 PQ: 0 ANSI: 6
May 27 07:29:29 pve5 kernel: scsi 9:0:2:0: SSP: handle(0x0027), sas_addr(0x5000cca27a3838c9), phy(5), device_name(0x5000cca27a3838cb)
May 27 07:29:29 pve5 kernel: scsi 9:0:2:0: enclosure logical id (0x50030480186998ff), slot(5)
May 27 07:29:29 pve5 kernel: scsi 9:0:2:0: enclosure level(0x0000), connector name(     )
May 27 07:29:29 pve5 kernel: scsi 9:0:2:0: qdepth(254), tagged(1), scsi_level(7), cmd_que(1)
May 27 07:29:29 pve5 kernel: sd 9:0:2:0: Attached scsi generic sg6 type 0
May 27 07:29:29 pve5 kernel: sd 9:0:2:0: [sdc] 2929721344 4096-byte logical blocks: (12.0 TB/10.9 TiB)
May 27 07:29:29 pve5 kernel: sd 9:0:2:0: [sdc] Write Protect is off
May 27 07:29:29 pve5 kernel:  end_device-9:0:2: add: handle(0x0027), sas_addr(0x5000cca27a3838c9)
May 27 07:29:29 pve5 kernel: sd 9:0:2:0: [sdc] Mode Sense: f7 00 10 08
May 27 07:29:29 pve5 kernel: sd 9:0:2:0: [sdc] Write cache: disabled, read cache: enabled, supports DPO and FUA
May 27 07:29:29 pve5 kernel: sd 9:0:2:0: [sdc] Attached SCSI disk
...
May 27 07:29:29 pve5 lvm[10465]: PV /dev/sdc online, VG ceph-5909443b-69ac-4d66-899c-f6d7345db3e5 is complete.
...
May 27 07:29:29 pve5 systemd[1]: Started lvm-activate-ceph-5909443b-69ac-4d66-899c-f6d7345db3e5.service - [systemd-run] /usr/sbin/lvm vgchange -aay --autoactivation event ceph-5909443b-69ac-4d66-899c-f6d7345db3e5.
...
May 27 07:29:29 pve5 lvm[10473]:   1 logical volume(s) in volume group "ceph-5909443b-69ac-4d66-899c-f6d7345db3e5" now active
May 27 07:29:29 pve5 systemd[1]: lvm-activate-ceph-5909443b-69ac-4d66-899c-f6d7345db3e5.service: Deactivated successfully.

One of your OSDs should start up. Try 'systemctl restart ceph-osd.target'.

ResidentDataHoarder · May 28, 2026

It should, but unfortunately it doesn't. Though it's not the preferred option, I may just go the scorched earth route on that node at this point.

Nemesiz · May 28, 2026

After inserting the sdc disk and LVM activation, what do you see? What does pvdisplay show?
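
Besides pvdisplay, the LVM tags on the logical volume are worth a look, because ceph-volume records the OSD identity there (for example ceph.osd_id and ceph.osd_fsid). A rough sketch, assuming a comma-separated tag string as printed by lvs. The helper name osd_id_from_tags and the sample tag values are made up for illustration:

```shell
#!/bin/sh
# On the node, the raw data would come from something like:
#   pvdisplay /dev/sdc
#   lvs --noheadings -o lv_name,vg_name,lv_tags
#   ceph-volume lvm list /dev/sdc

# Extract the OSD id from a comma-separated LVM tag string.
osd_id_from_tags() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^ceph\.osd_id=//p'
}

# Sample tag string (illustrative values, not taken from this cluster).
sample_tags='ceph.cluster_fsid=aaaa,ceph.osd_fsid=bbbb,ceph.osd_id=8,ceph.type=block'
osd_id_from_tags "$sample_tags"
```

If the function prints nothing for a volume in a ceph-* volume group, the tags are missing and ceph-volume has no way to map that volume back to an OSD.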

ResidentDataHoarder · May 28, 2026

So after all that wiping and re-adding, I am still getting the same error on node 5.
It does not want to let me add OSDs:

Code:

command 'ceph-volume lvm create --crush-device-class hdd --data /dev/sdc' failed: exit code 1

This is a fresh install. I went through and destroyed any mappings and ran wipefs on every OSD disk for that node:

Code:

sgdisk --zap-all /dev/sdc && wipefs -a /dev/sdc

I have to be missing something somewhere.
I am on Ceph Squid at the moment; I don't want to upgrade until after getting everything back in order.

jsterr · May 28, 2026

Are there any differences in the kernel between those nodes, the ones that work and the one that does not? Do you have any errors on the node side regarding the disks, not Ceph related but storage-device related (host bus adapter etc.)?

ResidentDataHoarder · May 28, 2026

Everything is on the latest version, with no apparent errors from the HBA or anything adjacent.
If you have any specific commands you want me to try, please let me know. I'll check them as soon as I get back; I have to go to my uncle's for a bit, 30 minutes tops.

ResidentDataHoarder · May 28, 2026

Could it potentially be this? The device type shows as unknown in LVM. As mentioned, the wiring to the backplane changed: instead of using 4 individual SAS cables, one to each port on the backplane, I'm using the stock Supermicro cables that take 1 link from the HBA and split it (I think).

ResidentDataHoarder · May 28, 2026

I am back. I forgot to mention that I already tried this workaround:

Thread 'Ceph OSD creation error'

Nov 21, 2023

Setting up ceph on a three node cluster, all three nodes are fresh hardware and installs of PVE. Getting an error on all three nodes when trying to create the OSD either via GUI or CLI.

create OSD on /dev/sdc (bluestore)
wiping block device /dev/sdc
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.52456 s, 400 MB/s
--> UnboundLocalError: cannot access local variable 'device_slaves' where it is not associated with a value
TASK ERROR: command 'ceph-volume lvm create --cluster-fsid 4691d91a-6fd9-42b1-bab9-5f9042b21925 --crush-device-class ssd --data /dev/sdc'...

janus57 · May 28, 2026

Hi,

ResidentDataHoarder said:
the wiring changed to the backplane, instead of using 4 individual sas cables to each port on the backplane I'm using the stock supermicro cables that take 1 link from the HBA and split it (i think)?

You should take a look at the manual of your hardware because if your backplane doesn't have an expander, then your single cable may not work like you think.

If you have doubts, you can post the part number.

Best regards,

ResidentDataHoarder · May 28, 2026

I am losing my sanity with this.

I noticed the MDS was still there for pve5.
If I try to purge everything related to "pve5" in Ceph, it tells me it doesn't exist. Then if I try to go back and re-add it, it errors out saying it's already in the Ceph config:

Code:

MDS 'pve5' already referenced in ceph config, abort!
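
The wording of that abort message suggests the check is made against a section in the cluster-wide ceph.conf (on Proxmox normally /etc/pve/ceph.conf) rather than against the live cluster maps. That is an assumption, not something confirmed in this thread. A small sketch for listing the sections that still name a host; the helper name sections_for_host is hypothetical:

```shell
#!/bin/sh
# List section headers in a ceph.conf-style file that end in ".<host>]",
# for example [mds.pve5] or [mon.pve5], with their line numbers.
sections_for_host() {
    # $1 = config file, $2 = host name
    grep -n '^\[.*\]' "$1" | grep -F -- ".$2]"
}

# On the node this might be called as (path assumed, see above):
#   sections_for_host /etc/pve/ceph.conf pve5
```

Any section it reports can then be reviewed by hand before the MDS is created again.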

ResidentDataHoarder · May 28, 2026

I had assumed that if the kernel can see them, it should be good. If that assumption is incorrect, please tell me so I know not to make it in the future. Also, LVM sees them after the wipe and reinstall.

The chassis model is CSE-847BE1C-R1K28LPB. I'll look up the SAS3 backplane right now.

ResidentDataHoarder · May 28, 2026

Supermicro's page for that chassis doesn't give the exact backplane model number; it just states the following:
Front: 24-Port 4U SAS3 12Gbps single-expander backplane
Rear: 12-Port 2U Backplane with Expander support up to 8 x 3.5" SAS3/SATA3 HDD/SSD and 4 x SAS3/SATA3/NVMe Storage Devices, HF, RoHS/REACH

https://www.supermicro.com/en/products/chassis/4U/847/SC847BE1C4-R1K23LPB

--edit--
Looks like Supermicro has several models and revisions. I'll have to open the chassis and check at this point.

ResidentDataHoarder · May 28, 2026

I feel like there has to be an orphaned configuration somewhere. It keeps telling me there are references to pve5, but I don't see any in the WebUI or in the places I know to look in the terminal.

ResidentDataHoarder · May 28, 2026

If anyone knows how to cover all my bases for this, here is what I would like to try:

1. Purge all references to pve5 from the cluster
2. Remove Ceph and any configurations from pve5 only
3. Reinstall Ceph from scratch on pve5

I can handle 2 and 3 on my own; I'm just unsure of an efficient way to purge any pve5 references.
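
For the first step, one possible approach is to purge each leftover OSD id and then remove the now-empty host bucket from the CRUSH map. The sketch below only prints the commands so they can be reviewed first. The helper name purge_plan is hypothetical, and OSD id 8 is just the example id that shows up elsewhere in this thread:

```shell
#!/bin/sh
# Print a cleanup plan for one host and its leftover OSD ids.
# Nothing is executed; review the output before running any of it.
purge_plan() {
    host=$1
    shift
    for id in "$@"; do
        # "ceph osd purge" removes the OSD from the CRUSH map, the auth
        # database and the OSD map in one step.
        echo "ceph osd purge $id --yes-i-really-mean-it"
    done
    # Removing the host bucket only works once it holds no OSDs.
    echo "ceph osd crush rm $host"
}

purge_plan pve5 8
```

Running ceph osd tree before and after shows whether the pve5 bucket and its OSDs are really gone.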

ResidentDataHoarder · May 28, 2026

Small update: there are old references in the auth list. Here is just one for reference:

JSON:

        {
            "entity": "osd.8",
            "key": "[REDACTED]",
            "caps": {
                "mon": "allow profile osd",
                "osd": "allow *"
            }
        },
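
To find every auth entry like this in one pass, the osd.N entities in the auth list can be compared with the ids the cluster still knows about. A minimal sketch, assuming two plain text files prepared beforehand; the helper name stale_osd_auth and the file names in the comments are made up:

```shell
#!/bin/sh
# Inputs could be collected roughly like this:
#   ceph auth ls 2>/dev/null | grep '^osd\.' > auth_entities.txt
#   ceph osd ls > osd_ids.txt

# Print osd.N entities from file $1 whose id is not listed in file $2.
stale_osd_auth() {
    ids_tmp=$(mktemp)
    # Keep only lines of the exact form osd.<number> and strip the prefix.
    sed -n 's/^osd\.\([0-9][0-9]*\)$/\1/p' "$1" | sort > "$ids_tmp"
    # comm -23 keeps the lines that appear only in the first input.
    sort "$2" | comm -23 "$ids_tmp" - | sed 's/^/osd./'
    rm -f "$ids_tmp"
}
```

Each entity it prints could then be removed with ceph auth del, after checking that the OSD really no longer exists.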

Don't know if this helps, but I'm adding this:

Code:

root@pve5:~# ceph-volume lvm create --data /dev/sdc --bluestore --crush-device-class hdd
--> Incompatible flags were found, some values may get ignored
--> Cannot use None (None) with --bluestore (bluestore)
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 21b9dd41-1eca-46f5-96ee-a5bbe4dd08b4
Running command: vgcreate --force --yes ceph-b92d354b-f77f-449a-af18-2b5f37bf512b /dev/sdc
 stdout: Physical volume "/dev/sdc" successfully created.
 stdout: Volume group "ceph-b92d354b-f77f-449a-af18-2b5f37bf512b" successfully created
--> Was unable to complete a new OSD, will rollback changes
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.8 --yes-i-really-mean-it
 stderr: purged osd.8
--> No OSD identified by "8" was found among LVM-based OSDs.
--> Proceeding to check RAW-based OSDs.
No OSD were found.
root@pve5:~# CEPH_VOLUME_DEBUG=1 ceph-volume --log-level debug lvm create --data /dev/sdc --bluestore --crush-device-class hdd 2>&1 | tee /tmp/ceph-volume-create-debug.log
--> Incompatible flags were found, some values may get ignored
--> Cannot use None (None) with --bluestore (bluestore)
Traceback (most recent call last):
  File "/usr/sbin/ceph-volume", line 33, in <module>
    sys.exit(load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')())
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line 62, in __init__
    self.main(self.argv)
    ~~~~~~~~~^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line 174, in main
    terminal.dispatch(self.mapper, subcommand_args)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
    ~~~~~~~~~~~~~^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/main.py", line 47, in main
    terminal.dispatch(self.mapper, self.argv)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
    ~~~~~~~~~~~~~^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/create.py", line 80, in main
    self.args = parser.parse_args(self.argv)
                ~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/usr/lib/python3.13/argparse.py", line 1912, in parse_args
    args, argv = self.parse_known_args(args, namespace)
                 ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/argparse.py", line 1922, in parse_known_args
    return self._parse_known_args2(args, namespace, intermixed=False)
           ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/argparse.py", line 1951, in _parse_known_args2
    namespace, args = self._parse_known_args(args, namespace, intermixed)
                      ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/argparse.py", line 2202, in _parse_known_args
    start_index = consume_optional(start_index)
  File "/usr/lib/python3.13/argparse.py", line 2126, in consume_optional
    take_action(action, args, option_string)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/argparse.py", line 2012, in take_action
    argument_values = self._get_values(action, argument_strings)
  File "/usr/lib/python3.13/argparse.py", line 2532, in _get_values
    value = self._get_value(action, arg_string)
  File "/usr/lib/python3.13/argparse.py", line 2565, in _get_value
    result = type_func(arg_string)
  File "/usr/lib/python3/dist-packages/ceph_volume/util/arg_validators.py", line 90, in __call__
    return self._format_device(self._is_valid_device())
                               ~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/lib/python3/dist-packages/ceph_volume/util/arg_validators.py", line 99, in _is_valid_device
    raise RuntimeError("Device {} has a filesystem.".format(self.dev_path))
RuntimeError: Device /dev/sdc has a filesystem.

Additional context: the filesystem that the log says already exists is the one that Ceph is attempting (and failing) to create.
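
In the first run above, vgcreate succeeded on /dev/sdc before the rollback, so the disk likely still carries an LVM physical volume signature. That alone could explain why the second run refuses the device. A hedged sketch of a check before retrying; the helper name has_signature is hypothetical, and the sample value stands in for real lsblk output:

```shell
#!/bin/sh
# Return success if the given FSTYPE string is non-empty after trimming.
# On the node the value would come from: lsblk -dno FSTYPE /dev/sdc
has_signature() {
    [ -n "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

fstype="LVM2_member"   # sample value for illustration only

if has_signature "$fstype"; then
    echo "leftover signature: $fstype"
    # Printed, not executed: zap removes the LVM metadata and wipes the disk.
    echo "ceph-volume lvm zap /dev/sdc --destroy"
fi
```

Note that ceph-volume lvm zap with --destroy is destructive, so the device path is worth double-checking before running it.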

ResidentDataHoarder · May 28, 2026

Also, I don't remember if I mentioned this already, and I'm not sure if it matters, but this is for my pool using CephFS. The other 2 pools do not use CephFS.
