Ceph OSD failure causing Proxmox node to crash

Perhaps I have the wrong idea about the minimum size. What is it exactly?

What I understand as of now is that this is the minimum number of replicas at which the cluster will acknowledge a write operation, so with a min size of 2 the cluster ensures that there are always 2 replicas.
Hi,
like you wrote: with min size 1 your cluster stays operable with two failed OSDs. "min size 2" can't ensure that there are always 2 replicas - it only means that you can't write to a cluster with two failed disks.
With min size 1, if a single copy dies somehow, as in my case, there are no replicas left to rebuild from.
That means a third OSD has failed - so you have a big data loss!
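For reference, the size/min_size of a pool can be checked and changed like this (the pool name "rbd" is only an example):
Code:
# show the current replica settings of the pool
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# keep 3 copies, but allow I/O as long as at least 2 copies are available
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2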

If an OSD fails, this OSD is removed from the crush map (or rather its weight goes to 0). E.g. all primary PGs (and after that also the secondary ones) are rebuilt on the remaining OSDs as the new crush map dictates. For the ex-primaries, a secondary PG copy is taken to rebuild the new primary. If now another OSD fails, the crush map is recalculated and the whole game starts again from the beginning.
min size 2 only prevents writes (and changes) during a rebuild of a cluster with two failed disks. Whether you need (or think you need) that depends on how big your disks are and how fast your recovery is.
But then you should also not use "osd max backfills = 1" + "osd recovery max active = 1", because this slows down the recovery process!
Perhaps it's a better idea to monitor the cluster and, if a second OSD fails, switch max_backfills and max_active back to their normal values.
In this case the VMs are running, but very slowly - so it's better than "min size 1" (where nothing runs), and a rebuild will be much faster!
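To switch them back at runtime, something like this should work (the values are the Firefly defaults, as far as I remember):
Code:
ceph tell osd.* injectargs '--osd_max_backfills 10 --osd_recovery_max_active 15'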

I am thinking of a couple of other clusters we are managing which are destined to grow up to a petabyte. With so many OSDs and nodes, a min size greater than 1 makes sense. They are sitting on Ubuntu but are planned to move to Proxmox.
AFAIK, the configs (or parts of them) from very big Ceph installations (e.g. CERN) are available online.

Udo
 
Hello Udo and Wasim,
Wasim, if I understand correctly, you have 3 nodes with 4 OSDs per node?
Is it necessary to use a replica of 3 rather than 2?
Hi Konstantinos,
with Firefly a replica count of 3 is the default. Also default is the distribution of copies across hosts - this means you need a minimum of three OSD hosts, but you are then able to reboot one for maintenance.
If two OSDs fail at the same time, does that mean you lose data??
yes - with replica 2 you will lose many (or all) of your VMs if two OSDs fail.
As I remember from the Ceph documentation, it is necessary to manually remove these disks, replace them with brand new disks and put them back into production.
First, your cluster should be big enough to resync to a healthy state without interaction - meaning you have enough remaining OSDs and the disks are not full before the disaster begins (they should also not be full after the rebuild!! VERY important!!).
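A quick way to keep an eye on that (the default nearfull/full ratios are 85%/95%):
Code:
# overall and per-pool usage
ceph df
# crush tree with the weights of every OSD
ceph osd tree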
By the way, is there anything else that I must know?
I ask because I am preparing to convert my entire data center to Ceph storage.
Thanks to you both.
Before you start to switch all of your data center storage, I would highly suggest playing with real hardware, not only three virtual hosts.

Udo
 
In theory this can't happen... but as you see, it happens. With replica 3, two OSDs can die and the cluster can still rebuild.
Perhaps you should not erase your pool too fast. Try to find help on the ceph-users mailing list (OK, often there is not much response on that list).
Udo

There may be light at the end of the tunnel after all! I became a somewhat permanent resident of the Ceph IRC channel seeking help for this issue, in case somebody knew how to tackle it exactly. Finally, just now, somebody suggested a "not so common" solution that worked for me. After trying it for a few minutes, 1 out of the 3 bad PGs is now cured!! I will continue fixing until the cluster is healthy again and post the details here. He pretty much suggested searching for the bad PG directory on ALL OSDs in the cluster, not just the ones the PG map says, then removing those directories completely, then force-creating the PG and restarting all OSDs. After repeating it twice, one of the inactive PGs is now active and clean.
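For anyone trying the same, the idea is roughly this (assuming FileStore OSDs mounted under /var/lib/ceph/osd/; the PG id 2.1f is only a placeholder, and the find has to be run on every node):
Code:
# which OSDs does the PG currently map to?
ceph pg map 2.1f
# look for leftover copies of that PG in every local OSD data directory
find /var/lib/ceph/osd/*/current -maxdepth 1 -type d -name '2.1f_*'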
 
Udo! I am happy to inform you that my Ceph cluster is now back to HEALTH_OK!!

This is a time of celebration. :) Look how pretty

ceph-ok.png

After 27 pages of documentation during the battle to fix this issue, it is now done. As soon as the cluster recreated those 3 PGs, the "blocked requests > 32 sec" issue simply vanished and I can access the RBD storage from Proxmox again.

Step #1: Delete the entire defective PG directory on all nodes. You must check all nodes, not just the ones the PG map says.
Code:
# rm -rf /var/lib/ceph/osd/ceph-<osd_id>/current/<pg_folder>

Step #2: Force-create the PG.
Code:
# ceph pg force_create_pg <pgid>

Step #3: Restart all OSDs.

Keep in mind: only do this when all else fails, as it will delete the defective PGs for good and recreate them fresh.
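For reference, the PGs that need this treatment can be listed beforehand with standard Ceph commands, nothing Proxmox-specific:
Code:
ceph health detail
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean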

Udo, I cannot thank you enough for all the info you provided. It was extremely valuable.

Although the cluster is healthy again, we will not continue using the pool. So we will move the leftover good VM images to new pools (rough sketch below). What PG count do you usually use for your pools, Udo?
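The move itself is roughly this (pool names, PG count and image name are only examples):
Code:
# create the replacement pool and copy the good images over
ceph osd pool create rbd-new 1024
rbd cp rbd/vm-100-disk-1 rbd-new/vm-100-disk-1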
 
What PG count do you usually use for your pools, Udo?
Hi Wasim,
glad to hear good news ;-)

For the PGs there is a new PG calculator online: https://ceph.com/pgcalc/

If you have too few PGs, the data distribution is not so good (and many VMs end up on the same PGs); too many PGs have a memory impact, and with more than 300 PGs on one OSD you will get a warning with the current Ceph version!
You can expand the number of PGs later (which puts stress on the cluster), but never reduce it!
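The rule of thumb behind the calculator is roughly (OSDs * 100) / replica count, rounded to a power of two; raising the PG count of an existing pool then looks like this (pool name and numbers are only an example):
Code:
# e.g. 12 OSDs * 100 / 3 replicas = 400 -> 512 PGs
ceph osd pool set rbd pg_num 512
ceph osd pool set rbd pgp_num 512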

Udo
 
For the PGs there is a new PG calculator online: https://ceph.com/pgcalc/

So far in all Ceph clusters I have been using a 1024 PG count per pool. That seems to be OK based on the PG calc, but I never took "% of data" into consideration. What is it exactly? If I have 3 pools, is it how much of the overall data each pool is going to hold?

For example, cluster-wide pool 1 will hold 50% of the data, pool 2 25% and pool 3 25%.
Or
pool 1 100%, pool 2 100% and pool 3 100% of its own data in its respective pool?
 
So far in all Ceph clusters I have been using a 1024 PG count per pool. That seems to be OK based on the PG calc, but I never took "% of data" into consideration. What is it exactly? If I have 3 pools, is it how much of the overall data each pool is going to hold?

For example, cluster-wide pool 1 will hold 50% of the data, pool 2 25% and pool 3 25%.
Or
pool 1 100%, pool 2 100% and pool 3 100% of its own data in its respective pool?

Hi,
the first interpretation is the right one.

See also the description on the website:
Code:
This value represents the approximate percentage of data which will be contained in this pool for that specific OSD set. Examples are pre-filled below for guidance.
If you select "OpenStack" as the use case, you see a good example.
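A rough worked example of how the percentage feeds into the calculation (all numbers purely illustrative):
Code:
# total PG budget:      12 OSDs * 100 / 3 replicas = 400
# pool 1 (50% of data): 400 * 0.50 = 200 -> nearest power of two: 256 PGs
# pool 2 (25% of data): 400 * 0.25 = 100 -> nearest power of two: 128 PGs
# pool 3 (25% of data): 400 * 0.25 = 100 -> nearest power of two: 128 PGs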

Udo
 
Udo, or anybody else, using journal-less Ceph yet? Our current version is Firefly, but we are not taking advantage of journal-less Ceph as we are still using a journal.
Since this particular cluster is getting retrofitted again, it seems like a good time to do a Firefly journal-less experiment on it. :)
Any idea how I can convert the existing cluster? Anybody having issues with fully journal-less Ceph?
 
Udo, or anybody else, using journal-less Ceph yet? Our current version is Firefly, but we are not taking advantage of journal-less Ceph as we are still using a journal.
Hi Wasim,
no, I'm using Intel DC S3700 SSDs as journals for the 60 OSDs (in a few days, 84 OSDs).
But due to a new EC pool I also have an SSD cache tier, and will soon try to use the SSD cache tier for production as well.
If this runs well, I will partly switch the journal OSDs to a partition-based journal on the OSD HDD itself, in order to move the journal SSDs into the cache tier pool.
Since this particular cluster is getting retrofitted again, it seems like a good time to do a Firefly journal-less experiment on it. :)
Any idea how I can convert the existing cluster? Anybody having issues with fully journal-less Ceph?
From all I have heard, "journal-less" - I think you mean the leveldb backend? - isn't very fast and also not so stable?!

Udo
 
From all I have heard, "journal-less" - I think you mean the leveldb backend? - isn't very fast and also not so stable?!
Udo
That's what I meant. Technically it should perform faster than journal-based Ceph since it cuts out half of the I/O. I don't have enough data on stability since I don't know many people who are using it. I think we are better off waiting a few more releases before it becomes the norm. The existing journal-based Ceph is mature enough, stable and working just fine as long as we keep out human errors. :)


After this cluster settles down from all the data transfers/rebalancing/shuffling/OSD additions etc., I will do some benchmarking and performance tuning.
 
On Ceph IRC somebody suggested that, instead of marking an OSD down/out first to remove it, you should do the CRUSH remove first so the cluster does not go through rebalancing twice. Good idea? It seems like there might be an issue with removing the OSD that abruptly.
 
On Ceph IRC somebody suggested that, instead of marking an OSD down/out first to remove it, you should do the CRUSH remove first so the cluster does not go through rebalancing twice. Good idea? It seems like there might be an issue with removing the OSD that abruptly.
Hi,
this doesn't help against rebalancing twice. If you remove the OSD from CRUSH, the content is "moved" to other OSDs. After recreating the OSD, the new crush map looks like the old one and the data is moved again.

But it helps if something goes wrong (if your data is already on the OSD) - then you can add the OSD again, start it, and Ceph will find the data again?!

I have changed all 60 OSDs in my cluster from XFS to ext4 (many of them were also moved to different hosts, to sort the OSD numbers again).
For OSDs that stay on the same host, I use "reweight 0" to empty the OSD, stop the OSD, zap the disk, remove the OSD - also from the crush map (rebalancing starts) - and create the disk again with ceph-deploy (ceph-deploy automatically reuses the OSD number which was deleted before).
Like this:
Code:
# empty the OSD (the data migrates to the other OSDs)
ceph osd reweight 3 0
# when the rebalancing has finished
service ceph stop osd.3
umount /var/lib/ceph/osd/ceph-3
ceph-deploy disk zap ceph-01:sde
ceph auth del osd.3
ceph osd crush remove osd.3
ceph osd rm 3
# recreate the OSD again, this time with ext4
ceph-deploy --overwrite-conf disk --fs-type ext4 prepare ceph-01:sde:/dev/disk/by-partlabel/journal-3
The journal path will look different for other installations!! If you omit ":/dev/disk/by-partlabel/journal-3", ceph-deploy creates the journal on the OSD disk itself.

Udo
 
We added our 6th node to the Ceph cluster today and started converting all XFS-based OSDs to ext4. We are doing one OSD at a time: step it down by reweighting, remove the OSD, then recreate the OSD with ext4, then reweight it back up again.
Anything I should watch out for with ext4, Udo?
 
I am intending to move the OSD journals to SSD. I found the following steps on the Ceph list:
Code:
# create N partitions on your SSD for your N OSDs, then for each OSD:
ceph osd set noout
sudo service ceph stop osd.$ID
ceph-osd -i $ID --flush-journal
rm -f /var/lib/ceph/osd/<osd-id>/journal
ln -s /dev/<ssd-partition-for-your-journal> /var/lib/ceph/osd/<osd-id>/journal
ceph-osd -i $ID —mkjournal
sudo service ceph start osd.$ID
ceph osd unset noout

Should this work, or am I better off removing the OSD and then recreating it with the --journal option on the CLI using pveceph?
 
I am intending to move the OSD journals to SSD. I found the following steps on the Ceph list:
Code:
# create N partitions on your SSD for your N OSDs, then for each OSD:
ceph osd set noout
sudo service ceph stop osd.$ID
ceph-osd -i $ID --flush-journal
rm -f /var/lib/ceph/osd/<osd-id>/journal
ln -s /dev/<ssd-partition-for-your-journal> /var/lib/ceph/osd/<osd-id>/journal
ceph-osd -i $ID —mkjournal
sudo service ceph start osd.$ID
ceph osd unset noout

Should this work, or am I better off removing the OSD and then recreating it with the --journal option on the CLI using pveceph?
Hi Wasim,
no, that works without trouble (if you have a different osd_journal path defined in ceph.conf, you must take care of this). Normally it's very fast to do this (and no data will be moved). You need approximately one minute per OSD.
One line did not come through right in the copy - you need a double dash (--):
Code:
ceph-osd -i 22 --mkjournal
This command gives a warning, because the partition doesn't match, but it works.

BTW: I named the partitions with parted and use the by-partlabel path:
Code:
Model: ATA INTEL SSDSC2BA20 (scsi)
Disk /dev/sda: 200GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name        Flags
 1      1049kB  15,0GB  15,0GB               journal-67
 2      15,0GB  30,0GB  15,0GB               journal-68
 3      30,0GB  45,0GB  15,0GB               journal-69
 4      45,0GB  60,0GB  15,0GB               journal-70
 5      60,0GB  75,0GB  15,0GB               journal-71
 6      75,0GB  90,0GB  15,0GB               journal-72
 7      90,0GB  105GB   15,0GB               journal-73
 8      105GB   120GB   15,0GB               journal-74
 9      120GB   135GB   15,0GB               journal-75
10      135GB   150GB   15,0GB               journal-76
11      150GB   165GB   15,0GB               journal-77
12      165GB   180GB   15,0GB               journal-78

ls -l /var/lib/ceph/osd/ceph-70/journal
lrwxrwxrwx 1 root root 33 Jan 22 12:58 /var/lib/ceph/osd/ceph-70/journal -> /dev/disk/by-partlabel/journal-70
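For completeness, such labelled GPT partitions can be created with parted roughly like this (device and sizes are only an example):
Code:
parted /dev/sda mklabel gpt
parted /dev/sda mkpart journal-67 1MiB 15GiB
parted /dev/sda mkpart journal-68 15GiB 30GiB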

Be sure that your SSD is the right one for journaling ;-) http://www.sebastien-han.fr/blog/20...-if-your-ssd-is-suitable-as-a-journal-device/

Udo
 
Thanks for the info Udo!

I would have gone with SSD journals from the very beginning for all our clusters. The only thing that kept me from it is the risk of mass OSD damage due to a single SSD failure. All these clusters are destined to grow to over 150 OSDs. I did not want to get stuck with SSD-based journals.
Udo, I see all your OSD journals are on one SSD. If they are not on RAID, what would the scenario look like if the SSD in one node fails completely?
Simply replace the SSD, partition it, create the journals and life goes on? Or will some sort of data loss occur?

I am thinking the death of one SSD will be like the entire node being down, in which case the rest of the nodes and OSDs will start rebalancing. What will happen to the data that was on the SSD and did not get transferred to any OSDs?
 
Udo, I see all your OSD journals are on one SSD. If they are not on RAID, what would the scenario look like if the SSD in one node fails completely?
Hi,
no, it's not a RAID1 (you should not use an SSD RAID1 for journaling). This is the reason why I use monitoring to check the SSD health (lifetime) and use a DC-grade SSD.
Normally you should use one SSD per 4 OSDs, but I started with just one SSD (per node), and write performance is not the problem (single-thread read performance is more of an issue).
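The kind of check I mean is roughly this (assuming smartmontools is installed and the journal SSD is /dev/sda; the attribute names are Intel/Samsung examples):
Code:
smartctl -A /dev/sda | grep -E 'Media_Wearout_Indicator|Wear_Leveling_Count'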
Simply replace the SSD, partition it, create the journals and life goes on? Or will some sort of data loss occur?

I am thinking the death of one SSD will be like the entire node being down, in which case the rest of the nodes and OSDs will start rebalancing. What will happen to the data that was on the SSD and did not get transferred to any OSDs?
If the journal SSD fails, the whole node is down. But this can happen with other issues too (network/power supply/mainboard/kernel panic). That is why I use such a nice redundant system ;-)
Data loss should not happen - writes are only acknowledged if all copies are written to the journals. If one node fails, the other replicas are still fine.

Of course, you should be quick to replace the SSD.

Udo
 
Do I need to set "filestore xattr use omap = true" now that I have OSDs on ext4?
This config option does not appear to exist any more. In this bug tracker issue someone pointed out it would be taken out: http://tracker.ceph.com/issues/7408

But the documentation still says to set it to true for ext4. I tried to find it in the cluster, but nothing comes up.
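For reference, this is the kind of check I mean - asking a running OSD directly over its admin socket (osd.0 is only an example; run it on the node that hosts the OSD):
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -i xattr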
 
