Hello,
I have recently started a side project. I have a requirement for "cold storage" for old ESXi virtual machines, and I thought this would be a good excuse to the boss for me to reuse some older HCI hardware for a Proxmox + Ceph cluster. My thinking was that, unlike a single server running something such as ZFS, a Ceph cluster would not have a single-server point of failure.
This is my setup:
- 3x UCS HX240-M5L servers, each with 12x 8TB 3.5" HDDs and 1x 3.2TB SSD cache drive
- Redundant 40G NICs to Fabric Interconnects (FIs)
- 1x OSD per HDD, with a 240GB DB/WAL partition for each HDD carved from the cache SSD (rough creation command sketched after this list)
- Pool replication: size is 3 (osd_pool_default_size), min_size is 2
- Ceph 18.2.4 / PVE kernel 6.8.12.2
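For reference, each OSD was created roughly like this via pveceph (I may have clicked through the GUI for some of them, which should end up equivalent; the device names below are placeholders, not the actual paths):

# one OSD per HDD, with its DB/WAL carved out of the shared cache SSD
# /dev/sdX = one of the 8TB HDDs, /dev/sdY = the 3.2TB SSD (placeholders)
pveceph osd create /dev/sdX --db_dev /dev/sdY --db_dev_size 240

(repeated for each of the 12 HDDs per node)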
Things have been stable/idling, but I wanted to test failure scenarios before putting anything on the cluster. This would include simulating a drive failure, node failure, unexpected power outages, etc. For the first test, I physically removed two drives in node 2 (one at a time) to start testing how fault tolerance behaves. However, after the first drive removal Ceph did not react. The cluster was HEALTH_OK, even after 15 minutes. All OSDs were online. Curious, I reinserted that drive and tried removing another one. Same thing - no reaction even after 10-15 minutes - the OSDs still showed as up/in in the cluster. I put the drive back in. "lsblk" showed the drives were there, but the partition(s) were missing. I ended up following Red Hat's guidance on failed drives: I was going to recreate those OSDs, so I ran "ceph osd destroy" on those two OSDs.
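For reference, the checks and the destroy step looked roughly like this (the OSD IDs here are examples, not necessarily the real ones):

# cluster health and the up/down, in/out state of each OSD after pulling a drive
ceph -s
ceph osd tree
# check whether the block device and its partitions are still visible to the OS
lsblk
# what I ran on the two OSDs once I decided to recreate them
ceph osd destroy 12 --yes-i-really-mean-it
ceph osd destroy 13 --yes-i-really-mean-it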
Then the entire cluster freaked out. It took all of node 2 offline, then storage became inaccessible across the cluster. It started rebuilding, then it put all the OSDs across all nodes into a down state. The Datacenter -> Ceph page doesn't even show info anymore; it's completely blank. So, I guess it's pretty much hosed at this point. I'm OK with rebuilding the cluster, as this is a lab/testing setup to help me train and understand Ceph. I'm clearly missing something; my understanding was that two OSD failures could have been tolerated.
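In case it helps with diagnosis, these are the kinds of checks I can still run from a node's shell (assuming the monitors respond at all) - happy to post the output if useful:

# does the cluster answer, and is there still a monitor quorum?
ceph -s
ceph quorum_status
# state of the Ceph daemons on this node ($(hostname) expands to the node name)
systemctl status ceph-mon@$(hostname) ceph-mgr@$(hostname)
systemctl list-units 'ceph-osd@*'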
My questions are:
1) Is this an expected failure mode? (Reseating a drive causes its partitions not to be recognized, and then a loss of 2 OSDs plus "osd destroy" takes the entire cluster offline.)
2) If not, what noob mistake did I make? I'm motivated to learn Ceph as I have an enterprise storage background, but the learning curve is definitely steep, so I don't doubt I missed something.