Ideal Proxmox Setup

I was referring to the Proxmox/Ceph install video, which clearly recommends fast NVMe SSDs for the OS, with the Ceph monitors/journals installed there as well. Is that correct?

Fast NVMes are always nice; if you have the available slots and don't care about the cost, sure, go nuts. In my experience, the only load generated on the boot device is logging. Do not mix journal and boot devices; that's an invitation for disaster. I'd use the NVMes for journals and wouldn't lose any sleep over the boot devices; just mirror whatever you pick.

You only need 8-16GB for boot devices. I don't know how many slots you have available in your chassis, but I'd use 2 for boot devices (32GB or smaller SSDs would suit you fine, and they're cheap enough that you can pick up a few cold spares too). Use the NVMes for journals, with no more than 5 OSD SSDs per journal disk, although this isn't a hard rule; I've seen configs with 12 OSDs per journal used without issue. You'll need to benchmark your setup to find the optimal config. Plan on roughly 5GB of journal space per OSD. Fill the rest of your slots with OSDs. Since your load is hypervisor storage, you will most likely run a 3-replica pool, so your usable space will be your total OSD space / 3; plan your hardware accordingly.
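To make the journal layout concrete, here is a minimal sketch of how the OSDs might be created against a shared NVMe journal device. The device names (/dev/sdc and friends, /dev/nvme0n1) are placeholders and the exact flags depend on your Proxmox/Ceph version:

# one journal partition per OSD is carved out of the NVMe, roughly 5GB each
pveceph createosd /dev/sdc -journal_dev /dev/nvme0n1
pveceph createosd /dev/sdd -journal_dev /dev/nvme0n1
pveceph createosd /dev/sde -journal_dev /dev/nvme0n1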

As for the HBA: don't use a RAID controller for Ceph, full stop. Marking disks as single-drive RAID0 volumes is a kludge and will noticeably impact performance, because the RAID controller bridges the block size from disk-native to RAID-volume-native. Even when the block sizes are the same, it still has a performance impact and can result in unpredictable OSD behavior. I'd suggest https://www.broadcom.com/products/storage/host-bus-adapters/sas-9300-8i as a SAS HBA; it's not only faster but much better attuned to modern SSDs.
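If you want to sanity-check that a controller really passes the disks through, a quick (and rough) way is to look at what the OS sees; with a plain HBA the drives show up under their own model names rather than as RAID volumes. Device names here are just examples:

lsblk -o NAME,MODEL,SIZE,ROTA    # MODEL should be the actual drive, not a PERC/virtual-disk volume
smartctl -i /dev/sda             # SMART should be readable directly, without needing -d megaraid,N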
 
Hello Alex,

I have the option of two hardware configurations.

Dell R620 with 10x 2.5" drive bays, or Cisco UCS C240 M3 with 24x 2.5" drive bays. Can you suggest a drive configuration and what each drive should be used for?

Regards,
K.Nizam
 
You're asking the wrong question. I mentioned some metrics to consider further up the thread, but it's worth drilling down a bit further.

1. What is your desired failure domain for your cluster? In other words, what level of fault is allowed before you lose service(s)? In your case you mentioned you have two data centers: is it OK for a data center to go dark, or do you want to be able to continue normal operation? How many chassis can drop before you lose the cluster? And so on (see the CRUSH sketch after this list).
2. What is the usable storage requirement for each location? Do you have a minimum IOPS requirement?
3. What are the power and space constraints in each of your locations? In other words, how much power can you burn before you exceed your allotment or trip your breakers?
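For the first question, the answer eventually gets encoded in the CRUSH map. A minimal sketch, using Luminous-era syntax and made-up bucket/pool names, of telling Ceph to spread replicas across data centers instead of just across hosts:

# describe the physical layout to CRUSH (names are placeholders)
ceph osd crush add-bucket dc1 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move node1 datacenter=dc1
# replicate across data centers rather than hosts, depending on your answer to question 1
ceph osd crush rule create-replicated rep_by_dc default datacenter
ceph osd pool set rbd crush_rule rep_by_dc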

Answers without questions are not worth much.
 
Alex,

We plan to set up Proxmox in just one datacenter for now. Our plan is to provide high-availability VMs, starting with 5 nodes and all-SSD storage. We would like to create 16 VMs with 150GB of disk space each on every node. I assume that if we use Proxmox with KVM we cannot overcommit disk space? Our current OnApp setup permits overcommitting disk space, and we have seen our clients use only 50%-60% of their allocated disk space.

As far as power feeds, PDUs, switches and network cards on the nodes are concerned, they are all redundant. We are just stuck on choosing the right type and number of drives for each node.

Regards,
K.Nizam
 
I assume if we use Proxmox with KVM we cannot overcommit disk space?
Of course you can :) Ceph RBDs are thin provisioned by default. The better question is how much space you want to have at deployment.
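A quick way to see the thin provisioning in action (pool and image names are just examples):

rbd create vm-disks/test --size 150G    # allocates nothing up front
rbd du vm-disks/test                    # PROVISIONED shows 150GiB; USED only grows as the guest writes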

Assuming Dell R620s, you will have 8 drive bays for OSDs per node. 5x8=40 OSDs; at 400GB each your raw capacity is 16TB. Using 3-replica pools, your usable capacity would be about 5.33TB. Since you were speccing out 16 VMs per node, you'll have 80 VMs, which gives you an average of roughly 66GB of usable space per VM. That's probably pretty close to on the money for 150GB thin provisioned.

Assuming the Cisco UCS C240 M3, you will have 22 drive bays for OSDs per node (I don't know whether the Cisco NVMe option occupies a drive bay or not). 5x22=110 OSDs; at 400GB each your raw capacity is 44TB. Using 3-replica pools, your usable capacity would be about 14.67TB. Since you were speccing out 16 VMs per node, you'll have 80 VMs, which gives you an average of roughly 183GB of usable space per VM... massive overkill, I think.
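Once a cluster like this is up, the same arithmetic can be checked live; the pool name below is a placeholder:

ceph df                            # raw capacity and per-pool usable space
ceph osd pool get vm-disks size    # confirms the replica count the usable figure is divided by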
 
The client we want to set these VMs up for is currently using only 50% of the allotted resources, so with that in mind we plan to set up the nodes with the configuration below. Each VM will get 4 cores, 16GB RAM and 150GB of disk space, and as said earlier we will run 16 VMs per node with that config.

Dell PowerEdge R620 10x SFF
2x E5-2660v2
128GB Memory
2x 200GB Intel DC S3700 Raid 1 (OS/Ceph Monitors)
4x 960GB Samsung PM853T SSDs

The server currently has a PERC H710 with 512MB NV cache, but as you suggested, since RAID 0 is not recommended for the 4x 960GB drives, we can go with an LSI 9211-8i or LSI 9207-8i that supports JBOD.
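For what it's worth, a quick back-of-the-envelope check of what this configuration would yield, assuming 4 OSDs per node, 5 nodes and 3 replicas:

echo "raw:    $((5 * 4 * 960)) GB"     # 5 nodes x 4 OSDs x 960GB = 19200 GB
echo "usable: $((5 * 4 * 960 / 3)) GB" # about 6400 GB at 3 replicas, before overhead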

Regards,
K.Nizam
 
As for the HBA: don't use a RAID controller for Ceph, full stop. Marking disks as single-drive RAID0 volumes is a kludge and will noticeably impact performance, because the RAID controller bridges the block size from disk-native to RAID-volume-native. Even when the block sizes are the same, it still has a performance impact and can result in unpredictable OSD behavior. I'd suggest https://www.broadcom.com/products/storage/host-bus-adapters/sas-9300-8i as a SAS HBA; it's not only faster but much better attuned to modern SSDs.

Interesting...
For test purposes, I switched a 3-node Dell R430 cluster with 4 Bluestore OSDs per node from RAID 0 to non-RAID.
Performance is at least halved inside the VMs:

with raid-0 :
dd if=/dev/zero of=test bs=100M count=5 oflag=direct
=> ~400 Mbytes/s
dd if=/dev/zero of=test bs=4k count=10k oflag=direct
=> ~40 Mbytes/s

with non-raid :
dd if=/dev/zero of=test bs=100M count=5 oflag=direct
=> ~200 Mbytes/s
dd if=/dev/zero of=test bs=4k count=10k oflag=direct
=> ~20 Mbytes/s
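A side note on methodology: single-stream dd from /dev/zero is a fairly blunt instrument for RBD, since Ceph latency hurts single-threaded direct writes the most (which is also where a RAID controller's writeback cache flatters the numbers). Something like the following, with the fio flags being just one possible set, gives a more representative picture at higher queue depths:

fio --name=seqwrite --filename=test --rw=write --bs=4M --size=1G --direct=1 --ioengine=libaio --iodepth=16
fio --name=randwrite --filename=test --rw=randwrite --bs=4k --size=1G --direct=1 --ioengine=libaio --iodepth=32 --runtime=30 --time_based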

Antoine
 
Nice problem to solve... and hard to find the optimum solution. My own thoughts about this (not a solution):
- I would separate the Proxmox nodes from the storage (if one HDD is failing... you need to stop that node)
- SSDs are very good in terms of IOPS, but are not very reliable
- Watch your clients... they have usage patterns: some of them use CPU/RAM/IOPS at certain periods of the day and leave the resources idle the rest of the time. Given that, you can run two different kinds of storage (one fast, one slower), so if client X needs top IOPS at 12:00, you can live-migrate his VM from the slow storage to the fast storage for, say, 30 minutes, then move it back (see the sketch after this list)
- Maybe some clients do not need top CPU/RAM, so why not create a special zone... using toys like the Raspberry Pi/ODROID-C2 (20 of them can serve many clients, and the ODROID can use KVM, so it can be part of Proxmox)
- To be or not to be... that is the question, i.e. ZFS or Ceph? Each of them has its pros and cons.
- For sure, I do not know a lot about Ceph
- But I do have some knowledge of ZFS
- You say that you have 2 different locations, so ZFS is arguably better in this case (zfs send/receive)
- Some of the people responding to you said that ZFS cannot be used in an active/active setup, and I know that is not entirely true (maybe they did not go into detail)
- With other software on top you can have active/active
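Two of these ideas map to fairly simple commands; a rough sketch, with the VM ID, storage names and ZFS dataset names made up:

# move a busy VM's disk to the fast pool for a while, then back (works online on current Proxmox)
qm move_disk 101 scsi0 fast-ssd-pool
qm move_disk 101 scsi0 slow-bulk-pool

# incremental ZFS replication between the two locations (assumes @yesterday already exists on both sides)
zfs snapshot tank/vm-101-disk-0@today
zfs send -i @yesterday tank/vm-101-disk-0@today | ssh backup-site zfs receive tank/vm-101-disk-0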
 