Beginner Ceph Questions

Feb 29, 2016
Hi Everyone


So I have been struggling on with my test Proxmox cluster. I now have it installed with Proxmox VE 4.2 on 3 x servers. The corosync network has been set up on its own NICs and VLAN, Linux bonds have been set up for Ceph, and Ceph has been installed on the cluster.

Now I have gone to mount the default pool from Ceph as storage and noticed that I cannot use it for anything other than Images or Containers. Is that correct? And if so then why can't I use it for ISOs? I really want a replicated and centralised ISO store so I know they are always up to date.

Now I just have to get my head around creating the Ceph pools as well, so anyone with any pointers there would be really helpful.

Each server is using 6 x 1TB Samsung 850 Pro SSDs and there are 3 x servers, so a total of 18 OSDs. (I am aware of the issues with these disks and Ceph that have been documented)

I was hoping to have 3 x simple pools

1. Small of about 100GB for ISO files
2. (50% of space minus ISOs) for VMs
3. (50% of space minus ISOs) for Backups

I have looked at the PGCalc page on the Ceph website but I am clearly missing something and am totally stuck. When I look at the status page of the Ceph tab in Proxmox it tells me that I have 16.6TB of space available, but is that for the default rbd pool, as that page shows nothing used and 0%? If the replication is set at 3, surely it should show me a third of that space? And where does it show available space per pool?

Any help would again be greatly received


Thank you
 
Based on your use description, allow me to suggest not bothering with multiple pools. Since all your disks are the same, it doesn't really buy you anything.

In order to make a shared filesystem for ISOs, you have 3 options:
1. Set aside space on a non-Ceph resource, mount it, and share it via NFS (a minimal sketch follows this list). The downside is that if the node sharing the ISOs is down, so is the share.
2. Create a VM with 100GB, and use it to share the space.
3. Configure CephFS and share it. This is outside the scope of Proxmox and you'll need to do some reading.
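
To illustrate option 1, here is a minimal sketch; the export path, subnet, and storage name are made-up examples:

Code:
# /etc/exports on the node holding the ISO directory
/srv/iso 192.168.1.0/24(ro,no_subtree_check)

# reload the export table
exportfs -ra

# /etc/pve/storage.cfg entry so PVE mounts it as ISO storage
nfs: iso-share
        server 192.168.1.10
        export /srv/iso
        path /mnt/pve/iso-share
        content iso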

As for backups - remember that Ceph already provides you with copies of your data, but a copy on the same machine is not a backup. Feel free to use snapshots for version control, but you really want a separate device for backup, since the whole point is recovery if your entire cluster is lost.

As for your PG plan - 100-200 PGs per OSD is the rule of thumb, and round up your result to the nearest power of 2. In your example, 18 x 100 = 1800, rounded up to 2048.
 
Thanks for the speedy reply, alexskysilk

That makes good sense all around. I could go with the HA VM for the ISOs and then share that out via iSCSI and mount it as storage that way.

Backups on a separate system again sounds good. I'd need to get another one for this.

The PG number you give me is different to the one I get from the Ceph PGCalc page, http://ceph.com/pgcalc/

I have attached a screenshot of what that gives me:

Screen Shot 2016-06-21 at 14.44.27.png

Obviously, with the default rbd pool that is set up when installing Ceph on Proxmox, the PG count is all wrong and I get errors. So do I just create a second pool with the correct settings and then delete the original one? Or can I edit the original one to fix it?

These are the options I get when creating a new pool:

Screen Shot 2016-06-21 at 14.47.41.png

I thought size would be 3 for 3 copies of it? Is that right? Or would 3 be the original plus 3 replicas giving a total of 4?

Min. Size appears to be how many replicas you must have to keep it running? Is that correct too? In which case I'd probably put 2 for that. It could happily run on just one copy, but then you are not getting any replication at all.

And then obviously the pg_num is my PG count

Would I have to delete the old pool first?


Many Thanks
 
The calculator suggests accounting for PGs for data blocks only, so with a 2-replica pool the PG count would be 18 OSDs x 100 PGs / 2 replicas = 900, or 1024 rounded up. (Rounded up, 3 replicas would still yield the same number.)

Don't share your ISOs via iSCSI; when you need them, you'll need them network-accessible and not via block. Use NFS or SMB to share them out.

As for the default pool, delete the old one and create a new one. For the life of me I don't know why the devs have it created by default... just remember to copy your keyring.

Re: replicas - the number listed under "size" represents the total number of replicas. The number in "min" represents the minimum number of replicas that must be available for the pool to keep serving I/O; 1 is OK here, but be aware that this allows the filesystem to run without fault tolerance (as you pointed out).
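
In command-line terms that looks something like this (a sketch; the new pool name is just an example, and the PG count is the 1024 from above):

Code:
# delete the default pool (destructive, hence the safety flag)
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it

# create a replacement: name, pg_num, pgp_num
ceph osd pool create vm-pool 1024 1024

# 3 copies total; serve I/O as long as at least min_size copies exist
ceph osd pool set vm-pool size 3
ceph osd pool set vm-pool min_size 1

# shows usage and available space per pool
ceph df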
 
Hi alexskysilk


Thanks again for your reply and assistance.

I think my brain just isn't getting the whole PG count thing properly. I also followed the calculation listed on the website manually and didn't get anywhere near what the website was telling me. I was using a "Target PGs per OSD" of 200 because we have an extra 8 bays in each server and could easily double the storage.

Just so I can get it clear in my head as well: if I have a size of 3, then that is the original data plus 2 copies, but it will be referred to as 3 replicas?

I guess I could let the minimum size be 1, it won't be fault tolerant then as we discussed but may get me out of a hole if 2 of the 3 cluster servers have to be shut down for anything.

When I installed Ceph I didn't have to do anything with a keyring, I'll have to read up on that bit. Won't the GUI just do that as part of the job it performs? Or is this a case of use the GUI then have to dive into command line and do some extra bits?
 
The calculation the Ceph PG calculator uses is:

(Target PGs per OSD) * (# of OSDs) * (%Data)
--------------------------------------------
(Size)

Let's go through the calculation given your values:

OSDs = 18
Target PGs per OSD = 200
%Data = 100% (this one pool holds all of the cluster's data)
Size = 3 replicas

Therefore:

Pool PGs = (200 * 18 * 1.00) / 3 = 1200
1200 rounded up to the next power of 2 = 2048.

(Dividing by a Size of 3 is the same as saying only a third, roughly 33%, of the raw PG budget goes to data.)
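
If you want to sanity-check the arithmetic, here is the same formula as a quick shell sketch (the variable names are mine, not PGCalc's):

Code:
# PGCalc formula: (target PGs per OSD * OSDs * data fraction) / size,
# then round up to the next power of 2
osds=18; target=200; size=3
raw=$(( osds * target / size ))   # = 1200
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
echo "$pg"                        # prints 2048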

Bear in mind that Ceph PG/pool strategy is not an exact science: a larger number of PGs requires more RAM and CPU, while a smaller number yields cruder granularity and decreased small-block performance. YMMV.

The GUI does not deal with the Ceph keyring (a bug as far as I'm concerned, but one that the devs have not addressed to date). In order to mount Ceph resources you must add the keyring to your cluster's Ceph configuration manually. You can find instructions here:
https://pve.proxmox.com/wiki/Ceph_Server#Ceph_Pools
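
The short version, per that wiki page (the storage ID at the end is an example - the file name must match the ID of the RBD storage you define in Proxmox):

Code:
mkdir -p /etc/pve/priv/ceph
cp /etc/ceph/ceph.client.admin.keyring /etc/pve/priv/ceph/my-ceph-storage.keyring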
 
Hi alexskysilk

Thanks once again for your replies and assistance.

That makes perfect sense now you have explained the calculation. I wasn't dividing by the 3 replicas, so I was effectively using 100% as the %Data instead of 33%. That totally explains why I was getting such weird numbers :)

Thanks for the link about the keyring as well. I will work my way through that and see if I can get it all working. Then I need to start doing some speed tests on the virtual hard drives of some VMs and see how well it behaves.

I completely agree that the GUI not dealing with the keyring should be regarded as a bug. In my mind there is not much point in creating a GUI to do something if it doesn't do all of it and you still have to keep diving down to the command line.
 
I see what you mean now about how the PG count is not an exact science. I still have a health warning, but now because I have too many PGs per OSD.

Screen Shot 2016-06-22 at 10.46.39.png

Is it possible to change the PG count at a later date if I add more storage? I guess it would have to be via the command line. If so then I'll set it using a Target PGs of 100 for now and recalculate later when adding more storage. I only used the 200 number originally to try and future-proof myself a bit.
 
You can increase the PG count using the Ceph command-line tools, but you cannot decrease it.

I understand why this is true (Ceph is designed for horizontal scaling and assumes storage always grows). I also understand enough about how it really works to know that shrinking the PG count is much harder than growing it. But because it is easy to get into the situation you are in, there really should be a way to "degrow" placement groups.

Note that this is not a Proxmox issue - it is a Ceph issue. Proxmox GUI can only expose capabilities that actually exist in Ceph.
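
For reference, the increase itself is two commands; a sketch assuming the pool is still named rbd:

Code:
# raise pg_num first, then pgp_num so the data actually rebalances
ceph osd pool set rbd pg_num 2048
ceph osd pool set rbd pgp_num 2048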
 
Hi,
you can create a new pool with fewer PGs and use the storage migration from PVE*... once the too-many-PGs pool is empty you can drop it.

It needs a little bit of work + time and space, but it isn't impossible.

Udo

* AFAIK you can do the same with the Ceph tools, but not with running VMs
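
A sketch of that workflow (the VM ID, disk name, and storage ID are examples):

Code:
# move one VM disk onto the storage backed by the new pool (works on a running VM)
qm move_disk 100 virtio0 new-ceph-storage
# repeat per disk; once the old pool is empty, drop it
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it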
 
OK, so I appear to have gotten Ceph up and running nicely.

I have created a VM with a simple 20GB hard drive running CentOS 7 on there and done some speed tests. They really are not great at all. I am aware of the posts and threads surrounding the Samsung SSD 850 Pro disks, but had to prove this out.

We have 8 x 1TB of these Samsung disks that run on the Supermicro (AVAGO) MR9361-8i RAID card, in each of the three servers. Each server uses 2 x 1TB drives in a RAID1 for the Proxmox install. The other 6 x 1TB disks are used by Ceph as OSDs with the journal on the same disk as well. The RAID controller is set to JBOD for the other 6 x 1TB disks.

So here is a speed test run from the RAID1 array that runs the Proxmox and Debian install.

Screen Shot 2016-06-23 at 16.26.55.png

Not too shabby at all running at 423MB/s. I think they advertise the drives running at 500MB/s read and write.

So that is all good.

Now onto the test VM I created with CentOS 7 on the Ceph shared storage.

Screen Shot 2016-06-23 at 16.27.15.png

That speed is terrible!

So I think there are two main factors at work here, and I would appreciate input from other Ceph users please.

1) The Samsung SSD 850 Pro disks are just not good enough for writing the journals to and therefore cannot keep up any sort of decent speed, as per this link that Proxmox cite in their Ceph installation instructions (the journal test from that article is sketched at the end of this post):
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/

2) I am having to use 1Gb ethernet links for running the Ceph replication. Although I am using 6 x 1GbE connections in a bond, I know a single stream will still never get more than the speed of one of those 1GbE links, which has a maximum theoretical speed of only 125MB/s.

In conclusion, I would guess that I need better disks so the journals can write quicker, and also a fatter pipe (10GbE) for the Ceph replication between the servers as well?

Any thoughts on this are welcome please :)
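
For reference, the journal test from the article linked under point 1 is a synchronous 4K write with fio, something like this (the device name is an example, and this writes directly to the raw device, so only run it on an empty disk):

Code:
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test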
 
> So here is a speed test run from the RAID1 array that runs the Proxmox and Debian install (View attachment 3944)... Not too shabby at all running at 423MB/s. I think they advertise the drives running at 500MB/s read and write. So that is all good.
No - that looks good, but it isn't a real test... you are measuring the RAID cache!
On a RAID1 you must write to both disks, so you can't get a higher speed than the write speed of one disk (this is the reason why a RAID cache is important).
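
To take the RAID cache out of the measurement, force direct, synced writes; for example (the file path is an example and can be deleted afterwards):

Code:
# bypass the page cache and flush before dd reports the rate
dd if=/dev/zero of=/root/ddtest bs=1M count=4096 oflag=direct conv=fdatasync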
> Now onto the test VM I created with CentOS 7 on the Ceph shared storage (View attachment 3945)... That speed is terrible!
In this case your RAID cache doesn't help... and having the journal on the same disk is no winner either: the write is acknowledged after all copies are written to the journals, but after that the content is written to the disk again (HDD head movement and so on).

> 2) I am having to use 1Gb ethernet links for running the Ceph replication... In conclusion, I would guess that I need better disks so the journals can write quicker, and also a fatter pipe (10GbE) for the Ceph replication between the servers as well?
I guess with 3 servers your 1Gb connections are not the biggest bottleneck (10GbE has the big advantage of lower latency).
Your write speed will be much higher with good journal SSDs (like the Intel DC S3700). I use 1 SSD for six journals, but the recommendation is only 4 journals per SSD.
Read speed is the next problem ;-)

Udo
 
Thanks for the reply.

All of the disks are SSDs, so there shouldn't be any head movement or re-spinning to slow it down. The guys at Proxmox told me to put the journal on the same disk as the OSD because with an SSD it would be OK.

So both the OSD and the journal for each is on the same Samsung SSD 850 Pro disk.

How would you suggest I do the disk tests then, please, to get better readings?
 
I see a couple of things.

1. With only 1Gb Ceph cluster connections, you will have a hard time exceeding 80-90MB/s PER I/O REQUEST. Depending on what you're actually trying to do, this may not really be a problem.
2. You have an 18-disk pool with 2048 PGs. Have you given any thought to how much compute/memory you have available for CRUSH processing?
3. You have an LSI RAID HBA. Assuming your SSDs are attached to it, you would probably have better results if you use it with IT firmware, although I'm not sure if such exists for that particular card. RAID firmware is not conducive to Ceph.
4. (More of a continuation of 1.) You want to test the subsystem in a meaningful manner - have you defined what that is? If all you're trying to do is move bits using dd from one initiator to one target, you may be better off using a SAS device and replicating it. (A more Ceph-native benchmark than dd is sketched below.)
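
If you do want Ceph-level numbers that bypass the guest entirely, Ceph ships its own benchmark; a sketch, assuming your pool is named rbd:

Code:
# 60-second write test, keeping the objects for the read test that follows
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq
# remove the benchmark objects afterwards
rados -p rbd cleanup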
 
