Ideal Proxmox Setup

Nizam

New Member
Jun 15, 2017
Hello,

We are currently selling OpenVZ/KVM VPS to our clients using SolusVM, and it's time for us to upgrade our offering to high-availability VPS with built-in incremental backups for our customers. 100% uptime is very important for us. If we move to Proxmox, we will narrow our setup down to purely KVM hypervisors, as we sell a mix of Windows and CentOS VPS. We will consider the ModulesGarden WHMCS module for Proxmox as the front end for our clients to manage their VPS.

We own and operate our setup in 2 data centers, including network and hardware, so we are flexible in choosing the switches and hardware needed for Proxmox. It's been a few weeks that I have been reading through all the related threads, and I have narrowed our storage choices down to Ceph or ZFS. However, it would be great to hear from you what will work best for a production storage environment. Would Ceph or ZFS be the better option? And what hardware specs and networking gear would you suggest for either choice?

We currently cater to a specific set of clients who require large-memory VPS with moderate SSD storage, for example 4 vCPU / 8GB memory / 80GB SSD and 8 vCPU / 16GB memory / 150GB SSD. What would you suggest for the configuration of a hypervisor node that hosts 10 to 15 such VPS? Would it be better to choose moderate hypervisor nodes with 2x six-core CPUs and 64GB memory and put 5 VPS on each, or heavier nodes with 2x eight-core CPUs and 256GB memory and put 10 to 15 VPS on the same?

Thank you.

Regards,
K.Nizam
 
Hi, I have just a couple of comments; hopefully others will chime in as well.
-- ZFS, I believe, won't give you distributed, fault-tolerant HA storage (aka a 'converged hypervisor' environment), but rather a ZFS feature-rich local storage pool on each physical node. So if a physical node goes down, the VMs trapped on that storage are offline until you get that node online again (i.e. not HA). Also, I believe ZFS has its own design considerations in terms of RAM and CPU requirements purely to make the filesystem operational (i.e. perform dedup, compression, and possibly replication). Those requirements in effect reduce the resources on the hardware node that Proxmox can give to the VMs running on the host.
-- Ceph, I believe, will be happier with a reasonably viable number of spindles, possibly in the domain of 24 disks of block storage for a single Ceph cluster. That might be 4 Proxmox nodes with 6 block disks each, or 3 x 8, or something like that. Clearly, for a serious Ceph deployment you need a fast interconnect for the Ceph data replication network (i.e. 10gig, not 1-gig or trunked 1-gig). So to some extent your optimal Proxmox 'CPU/RAM density' point may be a balance between your preferred server chassis, the number of disks it supports, and then roughly planning out your 'minimum viable Ceph cluster' as a unit of growth. When scaling out, you always build in units of that minimum viable cluster and keep adding new parallel independent clusters, although possibly you will first grow each cluster incrementally to its optimal maximum size (i.e. start with 4 nodes in a Ceph Proxmox cluster and scale up to maybe 8-10 nodes as your 'top of cluster size'; then, if you need more capacity, build more parallel clusters, starting again at 4 nodes and going from there). It is always possible with Ceph builds to be quite asymmetric (i.e. some nodes are heavier disk contributors than others). Feedback here from a few real-world deployments would be very helpful.
-- if possible, it will be very instructive to look at the data you already have regarding client trends for VM size and VM utilization (i.e. do people in your client base tend to use CPU ~10% of the time, 50%, 100%?) to get insight into how much CPU oversubscription, if any, you can reasonably do without hitting performance issues. Similarly, maybe you have an idea of how much RAM oversubscription, if any, you can get away with (i.e. due to KSM-type memory sharing when many similar OS VMs run on the same physical box; see the sketch after this list).
-- ultimately it is beneficial to have the 'resource balance' fairly optimized, so that you don't end up with machines servicing, for example, RAM-hungry clients who then leave many CPU cores idle on that box.
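
If you already have a KVM node running, one rough way to gauge how much RAM KSM is actually reclaiming (a quick sketch; assumes a Linux host with KSM active and 4 KiB pages):

Code:
# pages_sharing counts guest pages that KSM has deduplicated into shared ones
awk '{ printf "KSM is saving about %.1f GiB of RAM\n", $1 * 4096 / 2^30 }' /sys/kernel/mm/ksm/pages_sharing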

Just a few initial thoughts that spring to mind.

Tim
 
Thanks Tim.

I was under the impression that we could do an HA setup using ZFS?

If we choose Ceph, we are certain to go with 2x 40GbE (bonded) for each node. When choosing Ceph, are there any suggestions on the type of SSDs to use? We were planning on Samsung Pro SSDs.

Regards,
K.Nizam
 
Hi,

For any HA config, you need 'shared storage' which permits concurrent access to the storage from all nodes. During an HA failover event, when one node is confirmed 'dead' by the remaining nodes, all VMs that were configured for HA and running on the failed node are auto-started on other nodes. For this to happen, all nodes must have access to the underlying shared storage where the VM disks reside. (And of course, as part of this, all the VMs under HA control effectively experience a brief "shut down uncleanly, then restart" as part of the automated HA restart of failed VMs.)
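
For reference, marking a VM as HA-managed on the Proxmox side is a one-liner (a sketch; VM ID 100 is a placeholder):

Code:
# add the VM to HA management, then check the cluster-wide HA state
ha-manager add vm:100
ha-manager status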

If you are talking about ZFS as local storage on each host, then that is non-shared storage and not HA-capable, I believe.
If you are talking about ZFS as a "shared ZFS filer" (i.e. one node with lots of disks which exports storage to all the Proxmox RAM/compute nodes, which then don't need local disk for VMs), then it is HA-capable, since you have shared storage. I believe this "ZFS filer" mode implies (a) a single point of failure on the ZFS filer, and (b) NFS or iSCSI exports from the ZFS pool to the Proxmox nodes. I don't think there are other export options, unless maybe you have fibre connectivity and are running a "filer distro" which permits you to create SAN LUN exports via fibre(?). It may also be possible to set up a pair of ZFS NAS filers in an "HA config" (heartbeat between them, and a failover IP that moves from active to passive on a filer-fail event). I've done something like this in a test deploy using a non-free filer distro which simplifies the setup of the "HA filer pair". I'm not sure there is a free HA-capable filer distro out there, though I know there are non-trivial walkthroughs that in theory describe how to set this up.
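
To make the "ZFS filer" variant concrete, the wiring on the Proxmox side is just an NFS (or iSCSI) storage entry. A minimal sketch, assuming a filer at 10.0.0.5 exporting a dataset tank/vmstore (IDs, paths and addresses are placeholders):

Code:
# on the ZFS filer: export the dataset over NFS
zfs set sharenfs=on tank/vmstore

# on a Proxmox node: register the export as shared VM storage
pvesm add nfs zfs-filer --path /mnt/pve/zfs-filer --server 10.0.0.5 --export /tank/vmstore --content images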

Possibly someone with lots of hands-on ZFS experience can comment on this thread to help clarify, in case I'm missing something.

For Ceph: I have a feeling 2x 40gig bonded will be plenty of bandwidth! It may even permit you to scale out to a bigger 'maximum cluster size', since you have such large bandwidth available(?). Again, it would be nice if someone with hands-on Ceph experience replied to comment. :)


Tim
 
When choosing Ceph, are there any suggestions on the type of SSDs to use?

Yes, see here for how to check whether an SSD is suitable as an OSD journal device:

http://www.sebastien-han.fr/blog/20...-if-your-ssd-is-suitable-as-a-journal-device/
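
The gist of that post is to test sustained small synchronous writes, which is exactly what a filestore journal does. A sketch of such a test with fio (not the post's exact command; /dev/sdX is a placeholder, and the test destroys the device's contents):

Code:
# WARNING: writes directly to the raw device; all data on it will be lost
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test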

I was under the impression that we could do an HA setup using ZFS?

Yes, there are ZFS implementations as SAN solutions that support HA:

http://www.zeta.systems/zetavault/high-availability/

There was an article series about ZFS HA in one of the BSD magazines, but that was more a proof-of-concept kind of architecture.

What about a "real" SAN solution, e.g. 16 Gb FC on a flash-only SAN with off-site replication (or a cheaper multi-tier SAS HDD/SSD one)? I'd go with dual 20+ core machines in 1U, 256 GB or more of memory, and 10 GbE to the outside world. Alternatively iSCSI, but I have never personally experienced an iSCSI setup that was fast compared to an FC one.
 
Thank you guys for your feedback. We have now decided to go with Ceph. When it comes to Proxmox with Ceph, do you suggest we keep Ceph and the VMs on the same nodes, or should we have separate nodes for Ceph and for the VMs?

How many networks would we need in total? We would go with 2x 40GbE for Ceph storage and 2x 1Gig for VM networking. Should there be a management network as well? If yes, should it be as good as the storage network, and what tasks would it handle?

Regards,
K.Nizam
 
Hi,

The management network needs multicast support (assuming this is how the Proxmox cluster nodes identify one another; it will typically be supported inherently by your average local dedicated hardware), and it should ideally be nice and reliable (i.e. to prevent false alarms where HA 'detects' a failure). I believe this network is also how the 'magical' /etc/pve cluster filesystem is synced between nodes, but that traffic is tiny, so a 100meg network is probably sufficient here.
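
You can verify multicast works between the nodes with omping, along the lines the Proxmox documentation suggests (node names are placeholders; run the same command on every node at the same time):

Code:
# ~10 seconds of rapid multicast pings between all cluster nodes;
# loss should be near 0% for corosync to be happy
omping -c 10000 -i 0.001 -F -q node1 node2 node3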

If you had share-nothing storage, then VM migrations would have to squeeze through this pipe; but since you are using Ceph and have 'Ceph SAN storage', there is no such bandwidth pig for you :)

Possibly you would want trunked interfaces across a pair of switches (i.e. to avoid a single-point-of-failure topology) if you are terribly keen. But to be honest, I can count on approximately zero fingers the number of times I've seen a switch failure cause this kind of sadness (human error will kill the cluster much more often, I would guess).

Tim
 
Possibly you would want trunked interfaces across a pair of switches (i.e. to avoid a single-point-of-failure topology) if you are terribly keen. But to be honest, I can count on approximately zero fingers the number of times I've seen a switch failure cause this kind of sadness (human error will kill the cluster much more often, I would guess).

Just for the record: I totally agree with you. I witnessed a UPS failure resulting in a switch power-down once. In general, it is good practice to build your system so that anything can fail (Murphy says that it eventually will). If you're bound by some law, like the 'BSI Grundschutzverordnung für kritische Infrastrukturen' here in Germany, then you have to have everything at least twice.
 
:) Yes! It is true, Murphy will find a way. For a serious deployment which supports many clients, I think this is a very good design philosophy. ("Baseline Protection Regulation", if Google Translate is right about "Grundschutzverordnung".) Most of my client deployment projects are tiny by comparison, so their sites are non-redundant more or less top-to-bottom for network and firewall (i.e. small offices of 15 people or fewer). Generally the server hardware has baseline fault protection (i.e. redundant power and disks), but beyond that I find small sites are OK with such a config.

T
 
Thank you for your comments, guys. We will definitely go with everything redundant: power feeds, virtual chassis for the switches, power supplies for the servers, and even the network cards on the hypervisors.

One thing I seriously need suggestions on: in a production environment with 500+ VMs, can Ceph and the VMs be on the same nodes, or should we have separate nodes for Ceph and for the VMs?

Also, in regard to networking: apart from the appliance network (internet for the VMs) and the Ceph network, is there any other separate network we need, like a management network?

Can someone tell me, in the real world, with 2x 40GbE bonded connections to each hypervisor and 8x 512GB enterprise SSDs, how much read/write speed we should expect?

Regards,
K.Nizam
 
One thing I seriously need suggestions on: in a production environment with 500+ VMs

The number of VMs in and of itself is not indicative of load. In my experience, "general purpose" VMs can be safely overprovisioned approximately 3-4x in core count, so assuming 1 core/VM, you'd need ~120-150 physical cores, which should be achievable with 5 dual-socket nodes or so, plus one for failover. If the VMs are busier, heavier on core count, or lighter on utilization, this number can change.

can Ceph and the VMs be on the same nodes, or should we have separate nodes for Ceph and for the VMs?
They can. Add Ceph into your load considerations: when Ceph is operating normally it doesn't consume much CPU, but during rebuilds, rebalancing, or other heavy I/O it can and will impact your systems.

Also, in regard to networking: apart from the appliance network (internet for the VMs) and the Ceph network, is there any other separate network we need, like a management network?
1 network for Proxmox cluster traffic
1 network for Ceph cluster traffic (2 if you have the pipes)
1+ networks for VM traffic; management should really be on its own VLAN, but it could ride here.

If you have 2x 40GbE links, bond them and use VLANs for the first two networks; a sketch of that layout follows below. Ideally each connection should go to a separate switch, which means you'll likely want to use balance-alb if LACP is not available. Active-passive with alternate masters per VLAN is an option as well.
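
A minimal sketch of the bonded-plus-VLANs layout in /etc/network/interfaces (Debian ifupdown syntax as used by Proxmox; interface names, VLAN IDs and addresses are placeholders, and 802.3ad assumes your switches support LACP/MLAG):

Code:
auto bond0
iface bond0 inet manual
    bond-slaves ens1f0 ens1f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4

# VLAN 10: Proxmox cluster (corosync) traffic
auto bond0.10
iface bond0.10 inet static
    address 10.10.10.11
    netmask 255.255.255.0

# VLAN 20: Ceph traffic
auto bond0.20
iface bond0.20 inet static
    address 10.10.20.11
    netmask 255.255.255.0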

Can someone tell me, in the real world, with 2x 40GbE bonded connections to each hypervisor and 8x 512GB enterprise SSDs, how much read/write speed we should expect?

Too many variables, BUT: if you'll have 5 nodes (40 OSDs) with a 3-replica pool and sufficiently fast CPUs, 1GB/s is doable; more, maybe.
HOWEVER, you're asking the wrong question: in this configuration you'll have awesome latency, which matters more. Have a look here: http://www.mellanox.com/blog/2016/02/making-ceph-faster-lessons-from-performance-testing/
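
Once the cluster is up, you can also measure rather than guess, using the benchmark tool that ships with Ceph (pool name and PG count here are placeholders):

Code:
# create a throwaway pool, benchmark 60s of writes then sequential reads, clean up
ceph osd pool create bench 128
rados bench -p bench 60 write --no-cleanup
rados bench -p bench 60 seq
rados -p bench cleanup
ceph osd pool delete bench bench --yes-i-really-really-mean-it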
 
The number of VMs in and of itself is not indicative of load. In my experience, "general purpose" VMs can be safely overprovisioned approximately 3-4x in core count, so assuming 1 core/VM, you'd need ~120-150 physical cores, which should be achievable with 5 dual-socket nodes or so, plus one for failover. If the VMs are busier, heavier on core count, or lighter on utilization, this number can change.

When I asked about a production environment with 500+ VMs, I didn't mean all 500 VMs on a single server; I meant on the whole Proxmox cluster.

If we have good CPUs and enough memory on the nodes, there won't be an issue running Ceph and the VMs on the same nodes, correct?

Regards,
K.Nizam
 
Hello Everyone,

Our final specs for Proxmox + Ceph cluster will be as below.

2x E5-2660 v2
128GB memory
1x Intel DC P3700 400GB NVMe SSD (for Proxmox OS / Ceph monitors)
8x 400GB Intel DC S3700 SSDs (OSDs)
LSI 9211-8i controller in JBOD mode
2x 40Gig bonded for the Ceph storage cluster
2x 40Gig bonded for the management network
2x 1Gig bonded for the public network

However I have some questions, it will be great if you can give some inputs here.

Does the number of drives matter for performance? If I choose 16x 200GB SSDs instead of 8x 400GB, will it perform better, or does it not make any difference when it comes to Ceph? Logically, more drives should give more IOPS. Or is it suggested to go for a smaller number of higher-capacity drives?

Also, is it a must for the NVMe SSD to be bootable? Can we install Proxmox on a bootable Intel DC S3700 SSD and put the Ceph monitors on the P3700 400GB NVMe SSD?

Basically, I am still confused on the storage part. I need to understand what the NVMe SSD is suggested for and whether it must be bootable. Also, is it good to keep the Proxmox OS on RAID 1, or should a single drive be fine? As far as I've read, the OSD drives should not be on RAID and should be in JBOD mode.

Regards,
K.Nizam
 
Does the number of drives matter for performance? If I choose 16x 200GB SSDs instead of 8x 400GB, will it perform better, or does it not make any difference when it comes to Ceph? Logically, more drives should give more IOPS. Or is it suggested to go for a smaller number of higher-capacity drives?

Also, is it a must for the NVMe SSD to be bootable? Can we install Proxmox on a bootable Intel DC S3700 SSD and put the Ceph monitors on the P3700 400GB NVMe SSD?

Hi,

It depends on your load, but Ceph really loves parallelism. That said, as you probably know, larger-capacity disks most of the time have better performance, especially on IOPS.
I would say the more OSDs you have, the better: if you lose an OSD, rebalancing the data should take less time. But of course, with a higher number of disks, the chance of losing one goes up.

The OS doesn't need a lot of IOPS, so IMHO a RAID 1 of SAS or SSD drives should be sufficient, and it's easy to replace (hot-plug); with a P3700 you have to stop the node.

What kind of switch will you use for bonding 2x 40G? :)

Antoine
 
Hello Antoine,

We plan to use 2x Arista DCS-7050QX-32S-R in MLAG.

I was referring to the Proxmox/Ceph install video, which clearly recommends fast NVMe SSD drives for the OS, with the Ceph monitors/journals installed there. Also, I read somewhere that it's better to have a maximum of 8 OSDs per server, and I believe it's always advised to have one OSD per drive. So I was assuming 8 SSDs should be the way to go for each node?

Regards,
K.Nizam
 
Hi,

For Ceph journals, you're right: you need to use fast SSD/NVMe drives. For the OS I don't think so (but you could use 2 hot-plug SSDs without any problem). The highest IOPS will be related to Ceph and the journals.
If you had SATA or SAS spinning drives (7.2k, 10k or 15k) for the OSDs (you're not concerned here), the Ceph journals would have to be placed on an SSD drive (at most about 4-5 SATA/SAS OSDs per journal SSD).
But in your case you only have SSD drives, so the journal for each OSD should be placed directly on the SSD itself, for example:

osd.1 => on SSD1 (Ceph will create 2 partitions: one for the filesystem, one for the journal (5GB by default))
osd.2 => on SSD2 (Ceph will create 2 partitions: one for the filesystem, one for the journal (5GB by default))
and so on ...

If you had SATA drives instead:

osd.1 => on sata1 (only one partition, for the filesystem)
osd.2 => on sata2 (only one partition, for the filesystem)
...
The SSD drive would then contain x partitions holding the journals of osd.1, osd.2, etc.
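
For what it's worth, on the Proxmox side both layouts come down to how the OSD is created. A sketch with the filestore-era pveceph tool (PVE 4.x/5.x, as in this thread; device names are placeholders):

Code:
# all-SSD layout: data and the 5GB journal partition both land on the same device
pveceph createosd /dev/sdb

# spinner-plus-journal-SSD layout: place the journal partition on the fast device
pveceph createosd /dev/sdc -journal_dev /dev/nvme0n1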

I know people who have more than 24 OSDs (one per drive) per node!

Hope it's clear enough ;)

Antoine
 
Thanks Antoine, so the drive configuration below should work fine?

2x Samsung SM863a (RAID 1) for the OS, where I believe the Ceph monitors are also installed?
8x Samsung SM863a (JBOD mode) for the Ceph OSDs and journals

Regards,
K.Nizam
 
That would work fine.
However, IMHO, if you have a RAID controller, present the OSD drives as single-drive RAID 0 vdisks: 8 'RAID 0' vdisks.
You would benefit from the RAID cache for common read and write tasks, while for the journals Ceph uses the O_DIRECT and O_DSYNC flags, so the cache is bypassed automatically where needed.

Monitors are just daemons; they consume a lot of RAM but don't write that much.

One monitor per 15 OSDs is the most you should expect to need.
 
But in many threads I've read that RAID with Ceph is not a good option? That's why I was going the JBOD route.
 
AFAIK, RAID is not recommended, except single-drive RAID 0 vdisks.
As for JBOD, I don't know; I don't use it.
 
