PVE cluster, PXE and Ceph

jdl

Hello,
I'm building a 7-node cluster:
- 4 VM nodes
- 3 Ceph nodes (using pveceph), no VMs.
- All nodes in the same cluster.
I don't have full control over the hardware; server choices are limited by our hosting provider.
The physical configuration of the Ceph nodes is the following: 2x Intel(R) Xeon(R) CPU D-1540 @ 2.00GHz (8 cores), 64GB RAM, 2x Intel DC S3300 300GB SSD, 2x 2TB SATA, no RAID.
To maximize the storage capacity of the Ceph nodes, I didn't install PVE directly on the servers; instead, the Ceph nodes boot PVE from PXE/NFS.
The NFS server hosting the PVE-Ceph system is already HA-enabled.
Basically I used debootstrap on the NFS server and installed PVE on top of Jessie following this: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Jessie.
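Roughly, the steps on the NFS server looked like this (paths are examples; the complete procedure, including /etc/hosts and the additional packages, is on the wiki page above):
Code:
# create the Jessie root that the Ceph nodes will boot from
debootstrap jessie /srv/nfs/pve-root http://ftp.debian.org/debian
# inside the chroot: add the PVE repository and install Proxmox VE
chroot /srv/nfs/pve-root /bin/bash
echo "deb http://download.proxmox.com/debian jessie pve-no-subscription" > /etc/apt/sources.list.d/pve.list
wget -O- "http://download.proxmox.com/debian/key.asc" | apt-key add -
apt-get update && apt-get install proxmox-ve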
On each Ceph node, I created two OSDs configured as follows: Intel 300GB SSD for the journal, 2TB SATA for data.
The network is 10Gbps, and the rados bench results are around 250MB/s.
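For reference, the OSDs were created with pveceph and the numbers come from a plain rados bench; the device names and pool name below are only examples:
Code:
# SATA disk as data device, SSD as journal device (run on each Ceph node)
pveceph createosd /dev/sdc -journal_dev /dev/sda
pveceph createosd /dev/sdd -journal_dev /dev/sdb
# quick throughput check
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq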
Live migration works, and HA also works correctly, but the downtime before a VM comes up on another node is 3-4 minutes.
Is such a configuration considered tolerable practice, or is it a FrankenConfig?
For the Ceph nodes, would you recommend installing PVE on one of the SSDs and using the remaining SSD as journal for both SATA drives, instead of booting via PXE?
Should I expect better performance? (I know there are a lot of factors that come into play regarding performance...)
Thank you
 
Hi,
some remarks:
I would not boot the OSD nodes (and the other PVE nodes) via PXE (SPOF, even with an HA NFS server).
Do I understand right: only 3 Ceph nodes (OSD+MON?) and 4 nodes for VMs only?
Ceph gets better speed with more nodes!

I suggest:
a) use all 7 nodes for OSDs (3 MONs)
b) install the PVE system on an SSD, but don't use the full space; perhaps you can use one small partition for one journal (like 10GB), see the sketch below.
But normally this should not be necessary; for two HDDs, one SSD with two journals should be fine.
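Something like this would be a possible layout (only a sketch; device name and sizes are examples, and I haven't checked whether ceph-disk is happy with pre-created journal partitions):
Code:
# ~40GB for the PVE system, two ~10GB journal partitions, rest unused
sgdisk -n 1:0:+40G -c 1:"pve-root" /dev/sda
sgdisk -n 2:0:+10G -c 2:"journal-osd0" /dev/sda
sgdisk -n 3:0:+10G -c 3:"journal-osd1" /dev/sda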

BTW, I don't know how reliable the S3300 is... I prefer the S3700 for journaling. Don't forget to monitor your SSDs!
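For the monitoring, plain smartctl is enough to keep an eye on the wear level (device name is an example; the exact attribute names depend on the SSD model):
Code:
# SMART health plus wear/written counters of the journal SSD
smartctl -a /dev/sda | grep -i -E 'health|wear|written'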

Do you have one or two 10Gb networks?

Udo
 
Hi,
Yes, 4 nodes for the VMs (so CPU+RAM), and 3 nodes dedicated to Ceph, OSD+MON.
I can add more nodes to Ceph, but the 4 current VM nodes don't have the same hardware as the Ceph nodes (they are configured with 3x 600GB 15k SAS drives + an 80GB SSD cache (CacheCade)).
This is not necessarily the definitive architecture; I can drop the SAS nodes. I'm trying to find the best combination given our constraints: some products need the features commonly associated with shared storage (mostly HA), while others don't require reliability but need a lot of IOPS, etc.
I have two 10Gb networks, but only one is usable for the infrastructure: one NIC is connected to the public network (Internet) and is artificially limited to 3Gb between the physical servers, while the other NIC is connected to a private, dedicated 10Gb network. So I have to use the same network for both Ceph and VM traffic.
However, each kind of traffic is isolated using VLANs.
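The separation is just plain VLAN sub-interfaces on the private NIC in /etc/network/interfaces, roughly like this (interface names, VLAN IDs and addresses are examples):
Code:
# Ceph traffic on VLAN 100, VM traffic bridged on VLAN 200
auto eth1.100
iface eth1.100 inet static
        address 10.10.100.11
        netmask 255.255.255.0

auto vmbr1
iface vmbr1 inet manual
        bridge_ports eth1.200
        bridge_stp off
        bridge_fd 0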
I could use the 10/3Gb network for VM traffic, but then some kind of tunneling (with or without encryption) would be needed, and it's probably an over-complicated approach.
About your point b), I thought I had to use the whole disk for the journal (dedicated to the journal, not a partition). Thank you for the info; I will try to set up a server according to your recommendations (PVE system on the SSD, two small partitions for journals).
Also, I can use different kinds of hardware for storage: full SSD, either 2x 480GB SSD with no RAID, or 4x 800GB SSD with hardware RAID (but each disk can be exposed as a single disk, without any array). Without PXE, 2x 480GB is probably useless; how would you set up the 4x 800GB?
Thank you
 
> Hi,
> Yes, 4 nodes for the VMs (so CPU+RAM), and 3 nodes dedicated to Ceph, OSD+MON.
> I can add more nodes to Ceph, but the 4 current VM nodes don't have the same hardware as the Ceph nodes (they are configured with 3x 600GB 15k SAS drives + an 80GB SSD cache (CacheCade)).
Hi,
3 SAS drives as RAID-5 with a cache SSD to avoid the poor write performance of RAID-5?

Perhaps you can use DRBD9 for IOPS (but with PVE 4.1 it's not really usable?! I hope DRBD9 will be usable for production with PVE 4.2).
With DRBD (network RAID-1) you can use just two SAS disks in RAID-0 for good speed, and then you have one bay free for an OSD ;-)
With this config, your VMs on the DRBD storage will fail if one disk of the RAID-0 dies. But you can start the VMs on the other node again...
RAID-10 would of course be the much better solution.
...
> Also, I can use different kinds of hardware for storage: full SSD, either 2x 480GB SSD with no RAID, or 4x 800GB SSD with hardware RAID (but each disk can be exposed as a single disk, without any array). Without PXE, 2x 480GB is probably useless; how would you set up the 4x 800GB?
> Thank you
Hmm,
it depends on your usage... what about a mix of two smaller SSDs in RAID-1 (PVE system and DRBD space) and two 800GB SSDs as OSDs?

Udo
 
Hi,
> 3 SAS drives as RAID-5 with a cache SSD to avoid the poor write performance of RAID-5?

Yes, and it works quite well. It's a good compromise, providing a better capacity/price ratio and better read/write IOPS than the SAS disks alone. I didn't set up the config myself; it's provided out-of-the-box by the hosting provider.

> Perhaps you can use DRBD9 for IOPS (but with PVE 4.1 it's not really usable?! I hope DRBD9 will be usable for production with PVE 4.2).
> With DRBD (network RAID-1) you can use just two SAS disks in RAID-0 for good speed, and then you have one bay free for an OSD ;-)
> With this config, your VMs on the DRBD storage will fail if one disk of the RAID-0 dies. But you can start the VMs on the other node again...
> RAID-10 would of course be the much better solution.

In fact, I already tried DRBD9 before moving to Ceph; it was my first attempt (4 nodes). The setup experience was quite... mixed. Not really painful, but not very nice either. There are three "show stoppers" for me about DRBD9 currently:
- First, data is replicated rather than shared, resulting in a capacity loss on each node; this was not the desired behavior, at least for us.
- Second, I did some testing (abruptly taking down a node) and was unable to easily/transparently restore the cluster to a clean state after the "failed" node rebooted (it's of course possible that I don't have enough experience with this solution). Also, some data apparently deleted through PVE was in fact not deleted and "impossible" to delete.
- Third, and most importantly, our system team doesn't have any experience with DRBD, and I can't afford the risk of a huge downtime when a disk crash occurs a few months/years after the initial setup (and everybody has forgotten how the system was set up :) ).
I was not aware of the IOPS issue.

> Hmm,
> it depends on your usage... what about a mix of two smaller SSDs in RAID-1 (PVE system and DRBD space) and two 800GB SSDs as OSDs?
Unfortunately, I currently don't have full control over the hardware; I have to find a good combination within the constraints of the hosting provider. I will try a full-SSD setup with the 800GB SSDs as OSDs, as you said.
I'm also trying LVM over iSCSI on a 4x 10Gb network at another hosting provider :)
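On the PVE side that's just the standard iSCSI + LVM storage definitions in /etc/pve/storage.cfg, roughly like below (portal, target and VG names are examples, and the volume group has to be created on the LUN first):
Code:
iscsi: san0
        portal 10.20.0.1
        target iqn.2016-01.com.example:storage.lun0
        content none

lvm: san0-lvm
        vgname vg_san0
        content images
        shared 1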
Many thanks for your help
 
