PVE cluster, PXE and Ceph

jdl

Hello,
I'm building a 7-node cluster:
- 4 VM nodes
- 3 Ceph nodes (using pveceph), no VMs.
- All nodes in the same cluster.
I don't have full control over the hardware; server choices are limited by our hosting provider.
The physical configuration of the Ceph nodes is the following: 2x Intel(R) Xeon(R) CPU D-1540 @ 2.00GHz (8 cores), 64GB RAM, 2x Intel DC S3300 300GB SSD, 2x 2TB SATA, no RAID.
To maximize the storage capacity of the Ceph nodes, I didn't install PVE directly on the servers; instead, the Ceph nodes boot PVE from PXE/NFS.
The NFS server hosting the PVE-Ceph system is already HA-enabled.
Basically I used debootstrap on the NFS server and installed PVE on top of Jessie following this: https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Jessie.
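Roughly, the steps on the NFS server looked like this (paths are examples; the complete procedure, including /etc/hosts and the additional packages, is on the wiki page above):
Code:
# create the Jessie root that the Ceph nodes will boot from
debootstrap jessie /srv/nfs/pve-root http://ftp.debian.org/debian
# inside the chroot: add the PVE repository and install Proxmox VE
chroot /srv/nfs/pve-root /bin/bash
echo "deb http://download.proxmox.com/debian jessie pve-no-subscription" > /etc/apt/sources.list.d/pve.list
wget -O- "http://download.proxmox.com/debian/key.asc" | apt-key add -
apt-get update && apt-get install proxmox-ve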
On each Ceph node, I created two OSDs configured as follows: Intel 300GB SSD for the journal, 2TB SATA for data.
The network is 10Gbps, and the rados bench results are around 250MB/s.
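For reference, the OSDs were created with pveceph and the numbers come from a plain rados bench; the device names and pool name below are only examples:
Code:
# SATA disk as data device, SSD as journal device (run on each Ceph node)
pveceph createosd /dev/sdc -journal_dev /dev/sda
pveceph createosd /dev/sdd -journal_dev /dev/sdb
# quick throughput check
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq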
Live migration works, and HA also works correctly, but the downtime before a VM comes up on another node is 3-4 minutes.
Is such a configuration considered tolerable practice, or is it a FrankenConfig?
For the Ceph nodes, would you recommend installing PVE on one of the SSDs and using the remaining SSD as journal for both SATA drives, instead of booting via PXE?
Should I expect better performance? (I know there are a lot of factors that come into play regarding performance...)
Thank you
 
Hi,
some remarks:
I would not boot the OSD nodes (and the other PVE nodes) via PXE (SPOF, even with an HA NFS server).
Do I understand right: only 3 Ceph nodes (OSD+MON?) and 4 nodes for VMs only?
Ceph gets better speed with more nodes!

I suggest:
a) use all 7 nodes for OSDs (3 MONs)
b) install the PVE system on an SSD, but don't use the full space; perhaps you can use one small partition for one journal (like 10GB), see the sketch below.
But normally this should not be necessary; for two HDDs, one SSD with two journals should be fine.
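Something like this would be a possible layout (only a sketch; device name and sizes are examples, and I haven't checked whether ceph-disk is happy with pre-created journal partitions):
Code:
# ~40GB for the PVE system, two ~10GB journal partitions, rest unused
sgdisk -n 1:0:+40G -c 1:"pve-root" /dev/sda
sgdisk -n 2:0:+10G -c 2:"journal-osd0" /dev/sda
sgdisk -n 3:0:+10G -c 3:"journal-osd1" /dev/sda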

BTW, I don't know how reliable the S3300 is... I prefer the S3700 for journaling. Don't forget to monitor your SSDs!
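For the monitoring, plain smartctl is enough to keep an eye on the wear level (device name is an example; the exact attribute names depend on the SSD model):
Code:
# SMART health plus wear/written counters of the journal SSD
smartctl -a /dev/sda | grep -i -E 'health|wear|written'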

Do you have one or two 10Gb networks?

Udo
 
Hi,
Yes, 4 nodes for the VMs (so CPU+RAM), and 3 nodes dedicated to Ceph, OSD+MON.
I can add more nodes to Ceph, but the 4 current VM nodes don't have the same hardware as the Ceph nodes (they are configured with 3x 600GB 15k SAS drives + an 80GB SSD cache (CacheCade)).
This is not necessarily the definitive architecture; I can drop the SAS nodes. I'm trying to find the best combination given our constraints: some products need the features commonly associated with shared storage (mostly HA), while others don't require reliability but need a lot of IOPS, etc.
I have two 10Gb networks, but only one is usable for the infrastructure: one NIC is connected to the public network (Internet) and is artificially limited to 3Gb between the physical servers, while the other NIC is connected to a private, dedicated 10Gb network. So I have to use the same network for both Ceph and VM traffic.
However, each kind of traffic is isolated using VLANs.
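The separation is just plain VLAN sub-interfaces on the private NIC in /etc/network/interfaces, roughly like this (interface names, VLAN IDs and addresses are examples):
Code:
# Ceph traffic on VLAN 100, VM traffic bridged on VLAN 200
auto eth1.100
iface eth1.100 inet static
        address 10.10.100.11
        netmask 255.255.255.0

auto vmbr1
iface vmbr1 inet manual
        bridge_ports eth1.200
        bridge_stp off
        bridge_fd 0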
I could use the 10/3Gb network for VM traffic, but then some kind of tunneling (with or without encryption) would be needed, and it's probably an over-complicated approach.
About your point b), I thought I had to use the whole disk for the journal (dedicated to the journal, not a partition). Thank you for the info; I will try to set up a server according to your recommendations (PVE system on the SSD, two small partitions for journals).
Also, I can use different kinds of hardware for storage: full SSD, either 2x 480GB SSD with no RAID, or 4x 800GB SSD with hardware RAID (but each disk can be exposed as a single disk, without any array). Without PXE, 2x 480GB is probably useless; how would you set up the 4x 800GB?
Thank you
 
> Hi,
> Yes, 4 nodes for the VMs (so CPU+RAM), and 3 nodes dedicated to Ceph, OSD+MON.
> I can add more nodes to Ceph, but the 4 current VM nodes don't have the same hardware as the Ceph nodes (they are configured with 3x 600GB 15k SAS drives + an 80GB SSD cache (CacheCade)).
Hi,
3 SAS drives as RAID-5 with a cache SSD to avoid the poor write performance of RAID-5?

Perhaps you can use DRBD9 for IOPS (but with PVE 4.1 it's not really usable?! I hope DRBD9 will be usable for production with PVE 4.2).
With DRBD (network RAID-1) you can use just two SAS disks in RAID-0 for good speed, and then you have one bay free for an OSD ;-)
With this config, your VMs on the DRBD storage will fail if one disk of the RAID-0 dies. But you can start the VMs on the other node again...
RAID-10 would of course be the much better solution.
...
> Also, I can use different kinds of hardware for storage: full SSD, either 2x 480GB SSD with no RAID, or 4x 800GB SSD with hardware RAID (but each disk can be exposed as a single disk, without any array). Without PXE, 2x 480GB is probably useless; how would you set up the 4x 800GB?
> Thank you
Hmm,
it depends on your usage... what about a mix of two smaller SSDs in RAID-1 (PVE system and DRBD space) and two 800GB SSDs as OSDs?

Udo
 
Hi,
> 3 SAS drives as RAID-5 with a cache SSD to avoid the poor write performance of RAID-5?

Yes, and it works quite well. It's a good compromise, providing a better capacity/price ratio and better read/write IOPS than the SAS disks alone. I didn't set up the config myself; it's provided out-of-the-box by the hosting provider.

> Perhaps you can use DRBD9 for IOPS (but with PVE 4.1 it's not really usable?! I hope DRBD9 will be usable for production with PVE 4.2).
> With DRBD (network RAID-1) you can use just two SAS disks in RAID-0 for good speed, and then you have one bay free for an OSD ;-)
> With this config, your VMs on the DRBD storage will fail if one disk of the RAID-0 dies. But you can start the VMs on the other node again...
> RAID-10 would of course be the much better solution.

In fact, I already tried DRBD9 before moving to Ceph; it was my first attempt (4 nodes). The setup experience was quite... mixed. Not really painful, but not very nice either. There are three "show stoppers" for me about DRBD9 currently:
- First, data is replicated rather than shared, resulting in a capacity loss on each node; this was not the desired behavior, at least for us.
- Second, I did some testing (abruptly taking down a node) and was unable to easily/transparently restore the cluster to a clean state after the "failed" node rebooted (it's of course possible that I don't have enough experience with this solution). Also, some data apparently deleted through PVE was in fact not deleted and "impossible" to delete.
- Third, and most importantly, our system team doesn't have any experience with DRBD, and I can't afford the risk of a huge downtime when a disk crash occurs a few months/years after the initial setup (and everybody has forgotten how the system was set up :) ).
I was not aware of the IOPS issue.

> Hmm,
> it depends on your usage... what about a mix of two smaller SSDs in RAID-1 (PVE system and DRBD space) and two 800GB SSDs as OSDs?
Unfortunately, I currently don't have full control over the hardware; I have to find a good combination within the constraints of the hosting provider. I will try a full-SSD setup with the 800GB SSDs as OSDs, as you said.
I'm also trying LVM over iSCSI on a 4x 10Gb network at another hosting provider :)
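On the PVE side that's just the standard iSCSI + LVM storage definitions in /etc/pve/storage.cfg, roughly like below (portal, target and VG names are examples, and the volume group has to be created on the LUN first):
Code:
iscsi: san0
        portal 10.20.0.1
        target iqn.2016-01.com.example:storage.lun0
        content none

lvm: san0-lvm
        vgname vg_san0
        content images
        shared 1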
Many thanks for your help
 
