HA and multiple nodes

@udo could you please share some details about your infrastructure? Hardware/software/network?
Hi,
5 pve-nodes (different models), 8 osd-nodes (E5-1650v2, 48GB RAM), each with 2x 10Gb SFP+ NICs (pve-nodes: one for VLANs, one for cluster traffic (+ceph); osd-nodes: one private, one for ceph).
Each osd-node has 11x 4TB disks + 1x 8TB disk + 2 journal SSDs (Intel DC S3600 200GB, with spare space for a cache pool) + 1 SSD for the system.

Ceph is version 0.94.7 (Hammer) and PVE is 4.2 (plus one part still on PVE 3.4, due to drbd8+OpenVZ).

Udo
 
Did you install everything via Proxmox, or did you install Ceph manually?

How many VMs are you running with this configuration?

Why one 8TB disk? I'm curious.
 
Did you install everything via Proxmox, or did you install Ceph manually?
Hi,
I installed the Ceph cluster directly, because I started before Proxmox supported Ceph mon/OSD nodes. Besides, Ceph doesn't recommend using an OSD node for other things... though that advice is aimed more at large installations.
How many VMs are you running with this configuration?
On the pve-nodes there are approx. 80 running VMs, but not all of them use Ceph storage (most do, though).
The number of VMs doesn't say much - it depends on what the VMs do.
Why one 8TB disk? I'm curious.
We have a lot of "cold" archive data. When we were at 7 osd-nodes we needed more space, so we expanded to an eighth node and at the same time swapped one 4TB disk for an 8TB disk in each node. With a primary affinity of 0.5, the 8TB HDD is used as primary (i.e. for reads) no more often than the 4TB HDDs.
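For reference, primary affinity can be set per OSD from the CLI (a sketch; osd.42 stands in for the real OSD id, and on Hammer the mons must be told to allow it first):

Code:
# let the mons accept primary-affinity changes (needed on Hammer)
ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'
# halve the chance that this OSD is chosen as primary for its PGs
ceph osd primary-affinity osd.42 0.5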

Udo
 
Ok, I could try with 3 OSD nodes (used only as OSDs), 3 MONs/RGWs and a bunch of PVE nodes.
What I'm unable to estimate is the read/write speed. I'm not a storage expert; I don't know how to calculate the current read/write speed of local storage with 8x 300GB 15K SAS in hardware RAID6, or the equivalent in Ceph with SATA disks and no RAID but replica 3.

Any advice/guide/paper on this?
 
Ok, I could try with 3 OSD nodes (used only as OSDs), 3 MONs/RGWs and a bunch of PVE nodes.
What I'm unable to estimate is the read/write speed. I'm not a storage expert; I don't know how to calculate the current read/write speed of local storage with 8x 300GB 15K SAS in hardware RAID6, or the equivalent in Ceph with SATA disks and no RAID but replica 3.

Any advice/guide/paper on this?
Hi,
compare "local storage raid6 with SAS-15K" to "three node Sata ceph" can ceph never ever win!

There are many factors that matter for IO, but the important ones are latency and throughput.

a) latency
With Ceph you get much higher latency (network stack, software stack).
You can use a low-latency network like 10Gb SFP+ Ethernet, but this is still not perfect. Infiniband currently only works via IP (IPoIB) with Ceph, so that is better - but not by much.
The latency added by the software is not small either - there have been a lot of improvements over the last years, but the same applies: still not perfect.

b) throughput
Write: Ceph acknowledges a write once all replicas have written the data to the journal (with sync! - which slows consumer SSDs down). In other words: more replicas, less write speed. With good (and enough) journal SSDs the write speed is fine.
But again: you don't want a replica count of 2!! (That is not HA - that is dangerous!)
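A common way to check whether an SSD can handle journal writes is a single-threaded sync write test with fio (a sketch; /dev/sdX is a placeholder, and the test overwrites the device, so only run it on an empty disk):

Code:
# 4k writes with a sync after each write, as the ceph journal issues them
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test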

Read: in a healthy state, Ceph reads the data from the primary only (not from the replicas) - and the primary is a different OSD for each PG. That means with 3 nodes all data is on all nodes, but reads still go to all nodes (one of them may be local; for the other two the data crosses the network).
For a read cache you need RAM on the OSD nodes. If the data is already in RAM, reads are fast - SATA disks are not.

Ceph has the advantage that with every expansion (more OSD nodes) the storage gets faster. 3 nodes are not fast!
Our 8-node cluster is OK.
Ceph is good with many parallel reads - i.e. many VMs using the storage. Single-thread performance is not that high (to put it plainly: it is low)!

The best thing is to try such a config.
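For such a test, Ceph ships a simple benchmark tool (a sketch; "testpool" is a throwaway pool created just for the benchmark):

Code:
ceph osd pool create testpool 128                # 128 PGs, adjust to cluster size
rados bench -p testpool 60 write --no-cleanup    # 60s write test
rados bench -p testpool 60 seq                   # sequential read test on the same objects
rados -p testpool cleanup                        # remove the benchmark objects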

Udo
 
3 nodes is just the start. We plan to add 2 more nodes a month after going into production, and up to 15 nodes over the next years.

Writes: if Ceph acknowledges the write once all replicas have written to the journal, using faster SSDs could improve performance a lot, as the only limiting factor would be the network and not the SATA disks (whose writes, fed from the journal, would be sequential rather than random). Using a RAID1 for the journal SSDs could also improve resiliency, right?

Reads: it's not clear to me how Ceph reads data. Usually 100 PGs are created per OSD (1 OSD per spinning disk), so in a 15-node cluster with 12 spinning disks each I'd have about 100*15*12/3 = 6000 placement groups; rounded to the nearest power of 2 that is 2^13 = 8192 placement groups.
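That rule of thumb as a quick shell check (a sketch of the usual pg-count formula):

Code:
# target PGs = (OSDs * 100) / replicas, rounded up to a power of two
osds=$((15 * 12)); replicas=3
target=$((osds * 100 / replicas))   # 6000
pgs=1; while [ "$pgs" -lt "$target" ]; do pgs=$((pgs * 2)); done
echo "$pgs"                         # 8192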

Is PVE storing VMs in the same RBD pool, and are thus all reads made from the same master node? Isn't it possible to spread reads across all nodes?
RAM is cheap; having 64GB per OSD node is not an issue and should leave a lot of room for cache (Inktank suggests 1GB of RAM per OSD disk, so in my case 12GB would be the suggested amount, with the rest for the operating system and read cache).

Why do you suggest Ceph over Gluster?
Gluster is way easier to configure and also much cheaper on hardware: no need for MONs and so on.

Currently we have about 100 VMs across 7 XenServer nodes with local storage (all RAID6); we would like to move everything to Ceph and PVE.
 
Another of Ceph's drawbacks is that it is almost impossible to take a backup for disaster recovery.
The crushmap, pools and so on are impossible to back up without taking a MON down and saving its files in a consistent state. And what if you rebalance data or add/remove disks after the backup? You'd have a backup of a different cluster: when restored, you'd be referring to a stale cluster map whose data has physically moved (after the rebalancing). Thus these backups are inconsistent and would lead to data loss, not data recovery.

Gluster, having no metadata, databases or similar to maintain, is easier to back up. Just rsync (preserving xattrs, -X flag for rsync) the whole brick directory on each node and you are done. You could also easily create incremental backups with rsnapshot, as everything is a file.
Snapshot the brick and you are done.
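The rsync approach in concrete form (a sketch; the brick path and backup host are placeholders):

Code:
# -a archive mode, -A preserve ACLs, -X preserve the xattrs gluster relies on
rsync -aAX /data/brick1/ backuphost:/backup/brick1/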

How do you safely back up your Ceph cluster?
 
Another of Ceph's drawbacks is that it is almost impossible to take a backup for disaster recovery.
The crushmap, pools and so on are impossible to back up without taking a MON down and saving its files in a consistent state. And what if you rebalance data or add/remove disks after the backup? You'd have a backup of a different cluster: when restored, you'd be referring to a stale cluster map whose data has physically moved (after the rebalancing). Thus these backups are inconsistent and would lead to data loss, not data recovery.

Gluster, having no metadata, databases or similar to maintain, is easier to back up. Just rsync (preserving xattrs, -X flag for rsync) the whole brick directory on each node and you are done. You could also easily create incremental backups with rsnapshot, as everything is a file.
Snapshot the brick and you are done.

How do you safely back up your Ceph cluster?

I'll leave other solutions to the Ceph experts, but I'd simply back up the data that is on Ceph. In case something dies, take down the OSD and replace it. If your whole Ceph cluster burns down (i.e., a really catastrophic event of one kind or another), rebuild it according to your (hopefully existing?) documentation.

Having good journal SSDs will improve Ceph write performance a lot, that is correct.

Is PVE storing VMs in the same RBD pool, and are thus all reads made from the same master node? Isn't it possible to spread reads across all nodes?

The premise is right (by default one pool is created and used for everything, though you can set up more than one if you want), but the conclusion is wrong.

Like Udo said, reads are done from the primary node of a placement group, not of a pool. Reads are by default spread across all nodes that hold replicas of the data you want to read (because the primaries are spread over the nodes). This is in my opinion also the only (small) drawback of Ceph - it will read data (from other nodes) over the network even if it has it available locally. For most use cases this is not a problem at all (hint: read cache in RAM).

If you are more comfortable using GlusterFS, feel free to use it.
 
I'll leave other solutions to the Ceph experts, but I'd simply back up the data that is on Ceph. In case something dies, take down the OSD and replace it. If your whole Ceph cluster burns down (i.e., a really catastrophic event of one kind or another), rebuild it according to your (hopefully existing?) documentation.

Backing up a petabyte cluster (Ceph is designed for this scale) is impossible for anyone.

Additionally, you don't have a filesystem to use for rsync; you have to back up (how?) tons of small distributed chunks that must be reconstructed during the backup.
If the VM disk is chunked into 100,000 pieces of 4MB across the cluster, how do you back up all of these?

Like Udo said, reads are done from the primary node of a placement group, not of a pool. Reads are by default spread across all nodes that hold replicas of the data you want to read (because the primaries are spread over the nodes). This is in my opinion also the only (small) drawback of Ceph - it will read data (from other nodes) over the network even if it has it available locally. For most use cases this is not a problem at all (hint: read cache in RAM).

The same would happen with Gluster with sharding enabled: you split the image file into small chunks, so you have to read everything over the network, as you don't have a local copy.

If you are more comfortable using GlusterFS, feel free to use it.

I'm not more comfortable with Gluster, as I'm using neither Gluster nor Ceph.
I'm trying to figure out why Ceph is better than Gluster, or vice versa.
Three huge advantages of Gluster over Ceph are:

1. 5 minutes to set up, 5 minutes to explain how it works to any tech employee who has to do the maintenance. With Ceph you need at least a week full of tests.

2. A filesystem to use for backups. I can easily use rsnapshot to back up the whole cluster (even with sharding, as each shard is a file).

3. Half the servers needed on small clusters. No need for 3 dedicated MON machines; just 3 servers for everything, as "everything" in Gluster means the storage server. With Ceph you need at least 6 servers (MONs are better standalone and not on the OSD nodes).
 
Backing up a petabyte cluster (Ceph is designed for this scale) is impossible for anyone.

Additionally, you don't have a filesystem to use for rsync; you have to back up (how?) tons of small distributed chunks that must be reconstructed during the backup.
If the VM disk is chunked into 100,000 pieces of 4MB across the cluster, how do you back up all of these?
Hi,
it looks like you don't fully understand Ceph... (which is no problem, because an object store is quite different from a normal filesystem).
Why would you take a backup of the 4MB chunks? On their own they are useless! And normal Ceph clusters are so big that you can't use things like rsync anyway...
It's a bit like backing up the content of a single disk from a large (proprietary) RAID device... you get a lot of data, but you can't do anything with it (although in Ceph's case that is not completely true - you can reassemble a VM disk from the 4MB chunks, if you know how the VM disk is named on the Ceph storage).
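For illustration, rbd does that reassembly for you (a sketch; pool and image names are placeholders):

Code:
rbd -p rbd ls                                            # list the VM disks in the pool
rbd export rbd/vm-100-disk-1 /backup/vm-100-disk-1.raw   # reassemble to a raw file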

What do you want to back up? For how long? Against which scenario?
Ceph gives you multiple choices - you can define where copies of the data must live (or, put the other way round, what may die without trouble): OSD, host (the normal behaviour), rack, room, datacenter...
So it should be possible to keep a copy of all data in another DC... but that is an extra copy - not a backup!
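Such a placement policy is a one-liner (a sketch, assuming the crushmap already contains datacenter buckets; the pool name is a placeholder):

Code:
# replicate across datacenters instead of hosts
ceph osd crush rule create-simple replicated_dc default datacenter
ceph osd pool set mypool crush_ruleset 1   # take the ruleset id from 'ceph osd crush rule dump'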
But you can also sync data through a RADOS gateway to another Ceph cluster... I don't use it myself, but I think you could build a backup solution on top of snapshots.
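The snapshot idea could look like this (a sketch; names are placeholders, and the incremental diff assumes the image already exists on the target cluster):

Code:
rbd snap create rbd/vm-100-disk-1@backup1
rbd export-diff rbd/vm-100-disk-1@backup1 - | \
    ssh backup-cluster rbd import-diff - rbd/vm-100-disk-1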
For me the strategy is a backup of the client disks with Bareos to another storage medium (LTO cartridges), independent of the underlying storage (DRBD, Ceph...), plus a backup of the important Ceph files (which I deploy with Puppet to the Ceph nodes anyway).
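Those "important Ceph files" could be collected like this (a sketch; which files count as important is my assumption, and /backup is a placeholder):

Code:
ceph osd getcrushmap -o /backup/crushmap.bin   # current crushmap
ceph mon getmap -o /backup/monmap.bin          # current monmap
ceph auth list > /backup/ceph.auth             # keys and caps
cp /etc/ceph/ceph.conf /backup/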

Udo
 
Rsync was just an example (anyway, with Gluster you could also use plain rsync).

How do you back up the whole Ceph cluster with Bareos?
 
Rsync was just an example (anyway, with Gluster you could also use plain rsync).

How do you back up the whole Ceph cluster with Bareos?
Hi,
I don't back up the whole cluster (374 TB size, 189 TB used) - and that is a small Ceph cluster.
I back up the VMs, and all archive data written to the Ceph cluster gets one copy backed up to tape... so if everything breaks (which should not happen) I can re-read all the tapes to get the data back.

Udo
 
What do you mean by "archive-data"?

So are you making backups from "inside" the virtual machine and not directly from Ceph?
That is, you are backing up the VM filesystem and not the VM disk image.
 
What do you mean by "archive-data"?

So are you making backups from "inside" the virtual machine and not directly from Ceph?
That is, you are backing up the VM filesystem and not the VM disk image.

You can also back up the disks (RBD images) individually.
 
I mean what I said - you can access the block devices ("mapped rbd images") and back them up at the block-device level. Why you would want to do that manually I don't know, but you asked for it.
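For example (a sketch; the image name is a placeholder, and rbd map prints the actual device path):

Code:
rbd map rbd/vm-100-disk-1                             # exposes the image as /dev/rbdX
dd if=/dev/rbd0 of=/backup/vm-100-disk-1.raw bs=4M    # plain block-level copy
rbd unmap /dev/rbd0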
 
Alessandro, I do the backup inside Proxmox, from Ceph to an NFS storage. You can select which VM disks should be backed up and which VMs (regardless of where the VM is running), so if the VM switches from node1 to node2 the backup still runs at the given time.
The data backup is done inside the VM with rsync to an iSCSI storage.
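That kind of scheduled backup maps to vzdump (a sketch; the VMID and the storage name are placeholders):

Code:
# snapshot-mode backup of VM 100 to an NFS-backed storage
vzdump 100 --storage nfs-backup --mode snapshot --compress lzo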

The installation of Ceph takes no longer than an hour if you use the interface; only if you want SSD journals does it need to be done by hand. But even without SSD journals it's not a big task to get Ceph started.
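The GUI steps correspond roughly to these commands (a sketch; the network and device paths are placeholders):

Code:
pveceph install                                     # install the ceph packages on the node
pveceph init --network 10.0.0.0/24                  # write the initial ceph.conf
pveceph createmon                                   # create a monitor on this node
pveceph createosd /dev/sdf -journal_dev /dev/sdb    # the by-hand SSD-journal part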

If a disk goes down, who cares - you have 2 or 3 replicas! Only if a PG goes out of sync do you have to repair it, but that's also no big deal if everything works as expected. :)
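Repairing an inconsistent PG is a one-liner once you know its id (a sketch; 2.1f is a placeholder id):

Code:
ceph health detail   # lists the inconsistent PGs, e.g. 2.1f
ceph pg repair 2.1f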

You can run Ceph and the VMs on the same machines; no need for a dedicated storage cluster.

But 3 nodes with replica 3 means every node stores all the data, so you can't expect a lot of speed with that overhead.
But it is a start, and you can expand it easily; you boost the storage _and_ the compute power for the VMs, because you can migrate/spread them to the new (faster) nodes.

But I have only seen Gluster once, and the speed was not pretty, so I can't say anything about Gluster! If you like it more, you should use it - because if something goes wrong, you are the one in charge! :)
 
