Unexpected Ceph Outage - What did I do wrong?

LordDongus

New Member
Sep 17, 2024
Hello,

I have recently started a side project. I have a requirement for "cold storage" of old ESXi virtual machines, and I thought this would be a good excuse to pitch the boss on reusing some older HCI hardware for a Proxmox + Ceph cluster. My thinking was that, unlike something like ZFS on a single box, this would NOT leave me with a single-server point of failure.

This is my setup:
- 3x UCS HX240-M5L servers, each with 12x 8TB 3.5" HDDs and 1x 3.2TB SSD cache drive
- Redundant 40G NICs to the Fabric Interconnects (FIs)
- 1x OSD per HDD, with a 240GB DB/WAL for each HDD on the cache drive (see the example below)
- min_size is 2, default_size is 3
- Ceph 18.2.4 / PVE kernel 6.8.12.2
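
For context, each OSD was created along these lines (device names are placeholders; this is a sketch from memory, not the exact commands I ran):
# pveceph osd create /dev/sdX --db_dev /dev/sdY --db_dev_size 240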

Things have been stable while idling, but I wanted to test failure scenarios before putting anything on the cluster: simulating a drive failure, a node failure, unexpected power outages, etc. For the first test, I physically removed two drives from node 2 (one at a time) to see how fault tolerance is handled. However, after the first drive removal Ceph did not react. The cluster stayed HEALTH_OK even after 15 minutes, and all OSDs were online. Curious, I reinserted that drive and tried removing another one. Same thing - no reaction even after 10-15 minutes, and the OSDs were still showing as in the cluster. I put the drive back in. "lsblk" showed the drives were there, but their partition(s) were missing. I ended up following Red Hat's guidance on failed drives and was going to recreate those OSDs, so I used "osd destroy" on those two OSDs.
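
For reference, the destroy step was along these lines (the OSD IDs are placeholders and the exact invocation is from memory):
# ceph osd destroy osd.X --yes-i-really-mean-it
# ceph osd destroy osd.Y --yes-i-really-mean-it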

Then the entire cluster freaked out. It took all of node 2 offline, then storage became inaccessible across the cluster. It started rebuilding, then it put all the OSDs across all nodes into a down state. The Datacenter -> Ceph page doesn't even show info anymore; it's completely blank. So I guess it's pretty much hosed at this point. I'm ok with rebuilding the cluster, as this is a lab/testing setup to help train me and build understanding. I'm clearly missing something; my understanding was that 2 OSD failures could have been tolerated.

My questions are:
1) Is this expected behavior? (Reseating a drive causes the partition to not be recognized, then the loss of 2 OSDs plus a destroy takes the entire cluster offline)
2) If not, what noob mistake did I make? I'm motivated to learn Ceph as I have an enterprise storage background, but the learning curve is definitely steep, so I don't doubt I missed something.
 
This is not expected.

Was there any IO going on after you physically removed the drive?
The OSD process will only die once it cannot access the drive, which it will only try to do when there is IO.
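
One way to provoke that while testing is to generate a bit of IO against a pool, e.g. (the pool name is a placeholder):
# rados bench -p testpool 10 write --no-cleanup
# rados -p testpool cleanup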

As for the complete failure, there is far too little information here to debug it.
How many MONs did you deploy?
 
Hi LordDongus

I tested this scenario and I can give you some details.

What happens after an HDD is removed 'accidentally':

* The OSD process will not notice it if there is no active IO
* The LVM and LUKS layers will still be standing as they were

After removing the disk, an active OSD will log error messages and crash. Systemd will try to restart it a few times, but because osd-0/block is missing it will fail and stop trying to bring it back.
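
You can see this in the unit state and journal, e.g. (the OSD ID is a placeholder):
# systemctl status ceph-osd@X
# journalctl -u ceph-osd@X -n 50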

What should you do?

If you are using LUKS (encrypted OSDs), you must close it first.

1. Get LUKS name
# ls -la /var/lib/ceph/osd/ceph-X/block
lrwxrwxrwx 1 ceph ceph 50 Oct 29 23:10 /var/lib/ceph/osd/ceph-X/block -> /dev/mapper/10Mumd-IQUS-W2NA-7IDe-Pq5J-Rkw7-m1f5hv

2. Get LVM name
# cryptsetup status 10Mumd-IQUS-W2NA-7IDe-Pq5J-Rkw7-m1f5hv
...
device: /dev/mapper/ceph--059903c9--2652--4a70--92b2--92b091839d9e-osd--block--5b3f39e9--5557--4f18--8269--2abfe40581a0
...

3. Close them both and unmount the tmpfs
# cryptsetup close 10Mumd-IQUS-W2NA-7IDe-Pq5J-Rkw7-m1f5hv
# cryptsetup close ceph--059903c9--2652--4a70--92b2--92b091839d9e-osd--block--5b3f39e9--5557--4f18--8269--2abfe40581a0
# umount /var/lib/ceph/osd/ceph-X

4. Now you are ready to put the HDD back in. Rescan and activate the LVM mapping, then trigger ceph-volume activation via systemd:
# vgchange -ay
# systemctl start ceph-volume@lvm-X-UUID.service

If the OSD is still not kicking in, check whether the tmpfs is mounted at /var/lib/ceph/osd/ceph-X and whether block links to an existing mapping.
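
For example:
# findmnt /var/lib/ceph/osd/ceph-X
# ls -la /var/lib/ceph/osd/ceph-X/block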

If you put the disk back without closing LUKS, the kernel will give the disk another sdX name and LVM will fail to activate because a mapping with the same ID already exists in the system.


If your OSD is not encrypted with LUKS, then on an IO error LVM should release the mapping automatically.
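
You can confirm that with something like:
# dmsetup ls | grep osd
# lvs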



In some crash situations, don't put all OSDs into down. Just stop Ceph from rebalancing (set the nobackfill and norebalance flags), fix the problem with the OSD, remove the flags, and let Ceph do its job.
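
For example:
# ceph osd set nobackfill
# ceph osd set norebalance
... fix the OSD ...
# ceph osd unset norebalance
# ceph osd unset nobackfill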
 
This is not expected.

Was there any IO going on after you physically removed the drive?
The OSD process will only die once it cannot access the drive, which it will only try to do when there is IO.

As for the complete failure, there is far too little information here to debug it.
How many MONs did you deploy?
No IO was active at the time. There was only a single, powered-off VM at the time of testing. My goal was simply to use these "failure" tests as a way to learn more about Ceph.

There are 3 MONs, one on each node. Happy to provide any insight or further details, just let me know what you'd like to see. Sorry, I'm new to Ceph, so I'm not sure what else would be helpful information.
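
If it helps, this is the kind of output I can grab once the cluster responds again (just the standard status commands, nothing specific to my setup):
# ceph -s
# ceph health detail
# ceph osd tree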
 
