PVE5 Ceph Storage

troycarpenter

After a year of using FreeNAS with LVM over iSCSI across 30 nodes and almost 100 VMs, that solution finally revealed its shortcomings. As a stopgap to keep the VMs running, I moved the VM images back onto each node's local hard drives. I'm now looking to set up a three-node Ceph cluster running on the three most powerful nodes I have. I am using PVE 5 with the Luminous test repository.

I have created two of the nodes, each with four 2-TB hard drives. I created four monitors and the eight OSDs. For the OSDs I put the journals on partitions of a separate SSD. I then created a pool with a size/min of 2/2 and 512 PGs. The cluster shows as "HEALTH_OK".
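For reference, the rough CLI equivalent of what I did, from memory (I actually used the GUI for most of it, and the device names and pool name below are just placeholders for my disks):

  # create a monitor on the current node
  pveceph createmon

  # one OSD per 2-TB data disk, with the journal on a partition of the shared SSD
  pveceph createosd /dev/sdc --journal_dev /dev/sdb
  pveceph createosd /dev/sdd --journal_dev /dev/sdb
  pveceph createosd /dev/sde --journal_dev /dev/sdb
  pveceph createosd /dev/sdf --journal_dev /dev/sdb

  # pool with size/min_size 2/2 and 512 placement groups
  pveceph createpool vm-pool --size 2 --min_size 2 --pg_num 512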

I need to start using this storage so I can move the VM images off the third Ceph node, bring it down to configure its hardware, and add it to the cluster. Even after reading the documentation, I'm not sure of the following things:

  1. Should I have created the pool with size/min of 3/2, looking forward to the full number of OSDs (PGCalc says it would still be 512 PGs)?
  2. What is the danger of data unavailability or loss on the two-node cluster if one of those nodes becomes inaccessible for any reason (network, reboot, software failure, etc.) while I'm getting the third node online?
  3. When the third node is ready, how do I add the OSDs to the existing pool? I'm guessing that would involve bumping the size from 2 to 3.
  4. Once the third node is online and the cluster is healthy, I will need to take down the original two nodes one at a time to reconfigure the SSD drive where the journals are kept.
  5. I haven't found any good advice as to what size the journal partitions should be. It looks like Proxmox is creating those partitions. Any thoughts on moving the OS from the current separate HD to the same SSD used for the journals (in order to free up another disk slot in the chassis)?
Thanks for the advice.
 
I added another monitor per your suggestion, but since I only have two nodes operational right now, I don't see the point of having three replicas yet. You say it's risky, but when I used a NAS server, that was a single point of failure too, so what I asked is whether two replicas are enough to allow one of the two current nodes to go down until the third node is added. What are the risks other than having two copies rather than three? I would expect the system to distribute the two copies between the nodes so that one of them can still go down.

That raises the question of whether the replication unit is an OSD or a node. If it's an OSD, the second copy could end up on another OSD in the same node, which won't help if the node goes down. I'm guessing Ceph is smarter than that and would put the replica on an OSD in the second node. If that's the case, I can live with that until the third node is built and added to the Ceph cluster.
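In case it helps, I think this can be checked directly by looking at which CRUSH rule the pool uses and what its failure domain is; the default replicated rule should show a chooseleaf step with type "host", meaning the copies always land on different hosts (vm-pool and replicated_rule below are placeholders for my pool and the default rule name):

  # which CRUSH rule the pool uses
  ceph osd pool get vm-pool crush_rule

  # dump the rule; a chooseleaf step with type "host" means
  # replicas are spread across hosts, not just across OSDs
  ceph osd crush rule dump replicated_rule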
 
Hi,

Ceph needs a minimum of 3 nodes, and this is a must, not a nice-to-have.

I created four monitors
Do you really have 4 monitors? That would mean two on one node?

If so, run only one per node and, as jeffwadsworth says, you need an odd number of monitors and at least 3.

1.)
Size 3/2 is required because with size 2/1 there is an over 7% probability that you will lose your data.
The calculation was done by some people on the Ceph users mailing list.
2.)
As I said, there is no such thing as a two-node cluster with Ceph; it is only a question of time until you lose your data.
3.)
You can add a node the same way you created the cluster, but keep in mind that if there is already data in your cluster you will force a rebalance, which results in massive network and disk load (see the sketch after this list for keeping an eye on it).
4.)
Why? You should only take the OSDs down one by one.
5.)
The journal does not need to be large because it is only a write cache, so 5 GB is fine.
Never put the MON on the same disk as the journal.
You can use one enterprise SSD for at most 4 OSD journals.
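For 3.), the rebalance after adding the new OSDs can be watched with the read-only status commands, roughly:

  # overall health plus recovery/backfill progress
  ceph -s

  # per-OSD utilization, grouped by host
  ceph osd df tree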
 
Hi,

Ceph needs a minimum of 3 nodes, and this is a must, not a nice-to-have.

Do you really have 4 monitors? That would mean two on one node?

If so, run only one per node and, as jeffwadsworth says, you need an odd number of monitors and at least 3.

Perhaps I wasn't as specific as I could have been. The Ceph cluster is using the same hardware as my 13-node PVE 5 cluster. I currently have only two storage nodes (which are also PVE nodes), but I will be adding new hard drives to one of the PVE nodes to create a third Ceph storage node.

The monitors are currently running on the three storage nodes, as well as on two other nodes in the PVE cluster; no two monitors are running on the same machine. I am doing most of the Ceph configuration through the PVE interface, and I don't think it allows configuring more than one monitor per host. I don't know whether that makes this a 5-node Ceph cluster or a 13-node one, since Ceph is initialized on all the nodes in the PVE cluster.
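For what it's worth, the monitor membership can be checked directly, which should settle whether this counts as a 5-monitor quorum:

  # lists the monitors and shows which are in quorum
  ceph mon stat

  # the status summary also shows the mon quorum, the mgr, and the OSDs
  ceph -s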

1.) Size 3/2 is required because with size 2/1 there is an over 7% probability that you will lose your data.
The calculation was done by some people on the Ceph users mailing list.
2.) As I said, there is no such thing as a two-node cluster with Ceph; it is only a question of time until you lose your data.
So once the new storage node is ready, I will add the OSDs and then change the size to 3/2 (I still don't know how to do that; I imagine it can be done from the command line, since it is not in the web GUI). Rebalancing shouldn't take too long, since the cluster is currently only storing a single 80GB file.
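(For the record, it looks like this can be done on the live pool from the command line, something along these lines, where vm-pool is a placeholder for whatever the pool is actually called:)

  ceph osd pool set vm-pool size 3
  ceph osd pool set vm-pool min_size 2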

3.) You can add a node the same way you created the cluster, but keep in mind that if there is already data in your cluster you will force a rebalance, which results in massive network and disk load.
4.) Why? You should only take the OSDs down one by one.
The journal disk is currently a 300GB SAS drive, but I want to replace that with a 250GB SSD (which sounds like overkill, by the way). I'm sure there is a procedure somewhere online that explains how to change the OSD journal.
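(What I've pieced together so far for swapping a filestore journal, per OSD; the OSD id and the new journal partition are placeholders and I haven't tried this yet, so treat it as a sketch rather than a recipe:)

  # keep Ceph from rebalancing while the OSD is briefly down
  ceph osd set noout
  systemctl stop ceph-osd@3

  # flush the old journal, point the OSD at the new partition, recreate the journal
  ceph-osd -i 3 --flush-journal
  ln -sf /dev/disk/by-partuuid/NEW-JOURNAL-PARTUUID /var/lib/ceph/osd/ceph-3/journal
  ceph-osd -i 3 --mkjournal

  systemctl start ceph-osd@3
  ceph osd unset noout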

5.) The journal does not need to be large because it is only a write cache, so 5 GB is fine.
Never put the MON on the same disk as the journal.
You can use one enterprise SSD for at most 4 OSD journals.
Good to know. So I should not move the OS onto the 250GB SSD and use it for both the OS and the journal partitions; I'll keep the journals off the OS disk, since that's where the MON usually gets created.
 
I'm sure there is a procedure somewhere online that explains how to change the OSD journal.
Ceph has good online documentation; you should read it.

250GB SSD (which sounds like overkill, by the way).
Yes, but you need the endurance. Keep in mind that if you have journals for two OSDs on it, everything that is written to those OSDs is first written to the journal.
 
Thanks for all the help. Two more questions now that I have the third storage node up and added to the cluster and healthy:
  1. When I created the OSDs with Bluestore as the backend, it only created 1 GB block.db partitions on my SSD, probably because the system didn't correctly detect the SSD and failed to use the SSD default of 3 GB. Is there any way to change that after the fact without deleting the OSD?
  2. I keep getting the following error when I select the OSD menu item from the left pane of the PVE GUI:
    mon_command failed - this command is obsolete (500)
    What does that mean? I can usually get it to go away by switching between hosts on the left side.
 
1.)
Do you use a RAID controller?
Do you use only SSDs, or is the DB on the SSD and the data on spinning HDDs?

2.)
Please check whether all Ceph managers are running. If yes, try to restart them.
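Something along these lines on each node that has a manager (the unit name is usually the node's hostname):

  systemctl status ceph-mgr@$(hostname)
  systemctl restart ceph-mgr@$(hostname)

  # the active and standby managers also show up here
  ceph -s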
 
We do not test the migration path from the Ceph Luminous packages,
and to give a serious answer I would have to test every new package against our packages.
 
We do not test the migration path from the Ceph Luminous packages,
and to give a serious answer I would have to test every new package against our packages.
As far as I understand, Proxmox uses Ceph only to create and remove RBD volumes. I'm sure that after the upgrade it will work well. Are there any other concerns? Is there anything else that is important?
 
1.)
Do you use a RAID controller?
Do you use only SSDs, or is the DB on the SSD and the data on spinning HDDs?

2.)
Please check whether all Ceph managers are running. If yes, try to restart them.

Yes, there is a RAID controller on all three hosts, which is why the SSD shows up as a regular hard drive. The RAID controller says it is optimizing things for the SSD behind the scenes. There is an SSD on sdb, and the OSDs are sdc, sdd, sde, and sdf. When I configured an OSD, I selected the data drive, then set the journal as /dev/sdb. On the SSD there are now four 1-GB partitions.
 
RAID cards cause problems with Ceph 95% of the time, and I mean problems that are hard to find.
Only IT/HBA mode is OK.
If your SSD is reported as an HDD by the controller, you will for sure end up with 1 GB for the DB.
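As far as I know, the DB partition size the tooling creates can be set in the [osd] section of /etc/pve/ceph.conf before the OSDs are (re)created; it does not change OSDs that already exist. For example, for 3 GB:

  [osd]
  bluestore_block_db_size = 3221225472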
 
