Cluster with 30+ nodes?

dmora

New Member
Oct 17, 2016
In the docs it states:
"A Proxmox VE Cluster consists of several nodes (up to 32 physical nodes, probably more, dependent on network latency)."

Given ample network capacity, is there a hard limit on how many nodes can participate in the cluster?
I'll have about 35 hosts, I'm trying to figure out if I should just run 2-3 smaller clusters.

Anything I need to be particularly mindful of, when running that many hosts?

Thanks in advance! Proxmox seems to have come a long way since I used it 2-3 years ago on V2. Well done.
I like the html5 console, no more java BS.
 
My guess is that the "limitation" is based on every host maintaining a copy of the "datacenter" database and the latency you may experience with 30+ of these in a cluster. I for one would be very interested in how such a configuration works out for you. Yes, Proxmox has made incredible advancements over the last few years.
 
Network capacity ≠ latency. While I have no idea if there is a hard limit, unless you have a very fast (NOT merely high-bandwidth, but low-latency) network, a very high number of nodes can cause delays long enough to fence nodes, which could range from a nuisance to a disaster.

ymmv.
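To put a rough number on how membership size stretches corosync's timings: with corosync 2.x the totem token timeout grows with node count via token_coefficient. The defaults below (token = 1000 ms, token_coefficient = 650 ms) are assumptions taken from the corosync.conf man page, so verify against your own config:

```shell
# Effective totem token timeout for an N-node cluster (corosync 2.x):
#   effective_token = token + (N - 2) * token_coefficient
# token/coefficient values are the documented defaults; check corosync.conf.
nodes=35
token=1000          # ms
coefficient=650     # ms
effective=$(( token + (nodes - 2) * coefficient ))
echo "${effective}" # ms -- prints 22450 for 35 nodes
```

So a 35-node cluster tolerates a longer token round trip than a small one, but each token rotation also has to traverse every node, which is why latency matters more than raw bandwidth here.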
 
I have a pair of Juniper EX4550-32Ts being shipped in which will be running stacked.
http://www.juniper.net/us/en/products-services/switching/ex-series/ex4550/

Guess we'll see. I'm not sure how sensitive the watchdog function is to latency; that's something I'll have to test out before all the gear arrives. Maybe I can hack it about to make it less sensitive to delays, if latency becomes an issue. Thanks guys.
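For that testing, the stock tools make it easy to keep an eye on membership and ring health while adding nodes (a hedged sketch; both commands ship with Proxmox VE / corosync):

```shell
# Show quorum state, vote counts, and cluster membership
pvecm status

# Show the status of each corosync ring on this node
corosync-cfgtool -s
```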

I'll be running Ceph on the backend with about ~30TB of Samsung 840 1TB drives. I'm still tuning the Ceph portion, as I seem to have hit a storage limit of around 2k random write IOPS despite all OSDs being SSD, with a separate SSD for the journaling. In my testing the separate journal hasn't improved performance; it's likely because of the all-SSD configuration. The Samsungs have no problem doing sustained sequential writes at 400MB/s, yet even with 4-5 journals pointed at a single SSD, the most throughput I've seen under load is 60MB/s on a journaling disk.

I'm also seeing some pretty bad sequential write performance from within the VMs when compared to running local storage; benching the RBD image confirms this as well. Ceph Hammer doesn't seem to do so well with a high level of sequential writes, but again, at this point it's early and I'm largely running a default Ceph config.

Storage traffic will be VLAN'd off away from the cluster traffic. I'm surprised there isn't a requirement to have a separate "cluster" network that the nodes communicate over to cut down on crosstalk. Hyper-V has a model like this; I think VMware does as well.

Any input on ceph tuning would be greatly appreciated on top of my own research/googling.
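One thing worth checking: the Ceph journal writes synchronously, and many consumer SSDs are far slower at sync writes than their headline sequential numbers, which could explain a 60MB/s ceiling on a 400MB/s-rated drive. A hedged fio sketch that approximates the journal write pattern (the filename is a placeholder; never point fio at a device holding data, as it will destroy its contents):

```shell
# Synchronous sequential 4k writes, similar to Ceph journal behavior.
# /tmp/synctest.bin is a placeholder target on the disk under test.
fio --name=journal-sync-test --filename=/tmp/synctest.bin --size=256M \
    --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=30 --time_based
```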

Thanks guys. If y'all are interested I'll pop in from time to time and maybe throw around some benchmark numbers and production results/ailments.
 
Hi!

Yes, please keep us posted. I'm also interested in scaling big ;)

BTW, the separate cluster network isn't a hard requirement, but it's mentioned in some places in the wiki. At least on the Ceph page there is a hint to separate traffic.

The advice is to use a dedicated link for each of these:

- 1 Cluster Interconnect (vmbr0)
- 1 VM Connect
- 1 Ceph Cluster Interconnect
 
What do you gain by having the journal on an SSD when all OSDs are SSD-only?
Have you tried benchmarking without a separate journal?

In my case, I gained nothing lol, or something so minuscule it didn't jump out as a significant improvement. I've seen more significant gains in IOPS and throughput by adding more disks (OSDs) than I have by messing with journal placement.

In fact, just yesterday I augmented my test cluster from 11 to 15 SSDs and saw read performance jump by 400 IOPS and writes by 200 IOPS.

I need to wait till the network gear and production hosts come in; I'm being limited here because the SAS interface is only running the SSDs at SATA 1.5Gbps.

What I don't get is why these guys are able to get 40k IOPS @ 4k random write. I have some hardware limitations, being 1Gbps network throughput and 1.5Gbps SATA, but I'm not getting anywhere near maxing those out.
https://www.google.com/url?sa=t&rct...CRKUawQPA&sig2=jP2RFsNmX5tt-63Ahvq7Mw&cad=rja

Code:
    [root@localhost fio-2.0.9]# while true; do ./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --directory=/ceph --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 ; done
    test: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
    fio-2.0.9
    Starting 1 process
    Jobs: 1 (f=1): [m] [100.0% done] [15592K/5382K /s] [3898 /1345 iops] [eta 00m:00s]
    test: (groupid=0, jobs=1): err= 0: pid=2711: Mon Oct 17 19:13:35 2016
    read : io=3068.5MB, bw=26568KB/s, iops=6642 , runt=118265msec
    write: io=1027.6MB, bw=8896.9KB/s, iops=2224 , runt=118265msec
    cpu : usr=5.42%, sys=26.26%, ctx=471426, majf=0, minf=17
    IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
    submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued : total=r=785530/w=263046/d=0, short=r=0/w=0/d=0

    Run status group 0 (all jobs):
    READ: io=3068.5MB, aggrb=26568KB/s, minb=26568KB/s, maxb=26568KB/s, mint=118265msec, maxt=118265msec
    WRITE: io=1027.6MB, aggrb=8896KB/s, minb=8896KB/s, maxb=8896KB/s, mint=118265msec, maxt=118265msec

    Disk stats (read/write):
    dm-0: ios=785271/262977, merge=0/0, ticks=1594234/5882938, in_queue=7479788, util=100.00%, aggrios=785530/263089, aggrmerge=0/1, aggrticks=1599983/5886523, aggrin_queue=7486129, aggrutil=100.00%
 
I have some hardware limitations being 1Gbps network
This is your bottleneck. I suspect you use jumbo frames?
Apart from that, the only way to get higher IOPS and throughput is to use 10 Gb Ethernet or InfiniBand, which comes at speeds of 10-80 Gb, for your storage network.
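As a back-of-envelope check: raw link bandwidth alone would allow far more 4k IOPS than the ~2k observed, which suggests per-operation round-trip latency (plus replication hops) is the real limiter for small random writes, not throughput. A quick sketch of the bandwidth-only ceiling:

```shell
# Bandwidth-only ceiling for 4k IOPS on a given link; ignores replication
# traffic and per-op latency, which dominate small random writes in practice.
link_mbps=1000                          # 1 GbE
bytes_per_sec=$(( link_mbps * 1000000 / 8 ))
echo $(( bytes_per_sec / 4096 ))        # ~30k raw 4k ops/sec for 1 GbE
```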
 
I don't think you really want to use separate journal for an all SSD cluster. You won't gain any speed (the journal write and the final commit have to be serialized, so there is no threading gain and both journal and data disks are the same speed). Worse - you actually increase your risk profile since two things can now fail that would take out the OSD (the journal and the data drive).

I think you might gain if the journal drive were NVMe x4 (and fast). I do not believe Ceph will gain anything from multiple journals.

Your bottleneck is clearly the 1gbe LAN. After you fix that, take the drives you are using for journals and create more OSDs. Should fly.
 
This is your bottleneck. I suspect you use jumbo frames?
Apart from that the only way to get higher IOPS and throughput is to use 10 Gb ethernet or infiniband which comes at speed 10 - 80 Gb for your storage network.

Yeah, my 10gig gear hasn't arrived yet. The test cluster is to assess Ceph's ability to take a kick in the balls and recover.
Storage performance even untuned will work just fine, even using a 1Gbps network and an LSI SAS card that's limiting my SSD throughput to 1.5Gbps....

HOWEVER! I just managed to gain another 1000 IOPS on my 4k random write performance just now.
I'm finishing up the read and simultaneous read/write tests now.
 
I don't think you really want to use separate journal for an all SSD cluster. You won't gain any speed (the journal write and the final commit have to be serialized, so there is no threading gain and both journal and data disks are the same speed). Worse - you actually increase your risk profile since two things can now fail that would take out the OSD (the journal and the data drive).

I think you might gain if the journal drive were NVMe x4 (and fast). I do not believe Ceph will gain anything from multiple journals.

Your bottleneck is clearly the 1gbe LAN. After you fix that, take the drives you are using for journals and create more OSDs. Should fly.

Yeah, it doesn't appear that way. Actually, I saw a loss in IOPS when I split off the journals; I think that recommendation is for HDD + SSD combos. I have like 100-200 Samsung 840 1TB EVOs laying around. I ran out of 3.5 -> 2.5 converters and had to have more ordered in.

I'll have 2x 6 NVMe drives coming in for the OS drives, I believe; I think my boss just wants to play with NVMe. The OS drive really doesn't do anything but sit there. Might be good for corosync operations.

As far as losing the journal drive itself: yeah, I thought about that, and the docs mention not putting all journals on one drive. Then I purposely blew out the journal drive while in operation. You end up losing all OSDs that were journaling to that drive. You can recover them with ceph-osd -i X --mkjournal, since the data from the OSDs is still on disk.
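For reference, the recovery dance looks roughly like this per affected OSD (a hedged sketch: the OSD id 3 is a placeholder, service names vary by release, and since the journal is gone, any writes that were only in the journal are lost):

```shell
# Stop the affected OSD, rebuild its journal on the configured device,
# then bring it back up; repeat for each OSD that journaled to the dead drive.
systemctl stop ceph-osd@3      # placeholder OSD id
ceph-osd -i 3 --mkjournal
systemctl start ceph-osd@3
```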
 
So what did you do to get increase in 4k random write IOPS?
[attached screenshot: benchmark result table, upload_2016-10-18_17-37-46.png]

Tests:
Code:
    # r/w
    ./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --directory=/ceph --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
    # 4k read
    ./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --directory=/ceph --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randread
    # 4k write
    ./fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --directory=/ceph --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

Strange what happened there in the second set of tests, where I ran the separate journals: you can see my performance overall tanked in the r/w test compared to my earlier post.

The last test, to the far right, shows the gains.
For the combined r+w test: ~+1000 read IOPS and +200-300 write IOPS.
4k random write alone skyrocketed, to 3.2k.

I'm really still just testing different hardware configs. I've not even really started tuning the software, except for enabling tcmalloc with a 128MB thread cache. http://ceph.com/planet/the-ceph-and-tcmalloc-performance-story.
I'm not sure if I did something wrong here, but I was expecting 5x the IOPS... I didn't really see it do anything.
https://bugzilla.redhat.com/show_bug.cgi?id=1297502
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
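For what it's worth, on Debian-based packaging that setting normally lives in /etc/default/ceph and needs a daemon restart to take effect (hedged; the exact file location can vary by release):

```shell
# /etc/default/ceph -- 128 MB tcmalloc thread cache (134217728 bytes)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
```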

I wish Proxmox would move to Jewel soon, which has binary support for jemalloc. I really don't care to recompile Ceph from source; plus, that makes upgrading a pain in the butt.
 
So what did you do to get increase in 4k random write IOPS?

Oh, I didn't answer your question.
I blew out all the OSDs that I had split the journals off from, and set it up so that each SSD in the cluster is in the OSD+journal layout.
Other than that, I adjusted my pgp_num to 400; for some reason I had it set to 0 while my pg count was set to 400. Not sure if that had a performance effect or not. I should have benched before I changed that.
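In case it helps anyone else: pgp_num controls how many PGs are actually considered for placement, so when it lags behind pg_num, the extra PGs never get redistributed across OSDs; normally the two are kept equal. A hedged sketch (the pool name 'rbd' is a placeholder):

```shell
# Raise pg_num first, then pgp_num to match, so data actually rebalances.
ceph osd pool set rbd pg_num 400
ceph osd pool set rbd pgp_num 400
```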
 
It seems like the more SSDs I keep adding, the faster this thing keeps getting. Once I get more 3.5->2.5 converters in, I'll slap a couple more in there and re-test without tuning the software.
 
