Newbie needs your input

aychprox

Oct 27, 2015
Hi all,

I am new to Proxmox and really impressed by the PVE 4.0 HA and live migration of KVM VMs after testing with 3 nodes.
Unfortunately my management is not going to put additional capital into a full set of new hardware, and we currently have the following hardware ready:

2 x Dell C6100 4 nodes server (Dual Xeon L5640)
24GB RAM each node
2 x 120GB SSD each node
4 x 1TB 2.5" Seagate SSHD each node
2 x Dlink Gigabit switch (LAG capable)

I am planning to deploy Ceph as the storage architecture to store VMs and images; backups will go to an NFS node, which is a separate 1U server.
I understand that with plain gigabit interfaces the bottleneck will be network bandwidth. I am thinking of adding a quad-port gigabit network card to each node and bonding to increase throughput, so in total I would have 6 gigabit ports per node:

2 ports bonded to 2Gb/s for NFS backup on switch 1
3 ports bonded to 3Gb/s for VM <-> Ceph storage on switch 2
1 port for the public network
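
Roughly what I have in mind for /etc/network/interfaces on each node - a sketch only, interface names and addresses are placeholders, and the LACP bonds would need matching LAGs on the D-Link switches:

Code:
# sketch only - eth0..eth5 and addresses are placeholders
auto bond0
iface bond0 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        bond-slaves eth1 eth2
        bond-mode 802.3ad
        bond-miimon 100
# 2Gb/s bond for NFS backup, on switch 1

auto bond1
iface bond1 inet static
        address 10.10.20.11
        netmask 255.255.255.0
        bond-slaves eth3 eth4 eth5
        bond-mode 802.3ad
        bond-miimon 100
# 3Gb/s bond for VM <-> Ceph storage, on switch 2

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11
        netmask 255.255.255.0
        gateway 192.168.1.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0
# public network / VM bridge

I am aware that a single TCP connection will still only use one 1Gb link; the bond only helps aggregate throughput across multiple connections.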

My concerns are:
1) Will an SSD journal further increase the storage performance of the SSHDs?
2) Will an add-on quad-port gigabit network card actually help increase network throughput?

I would like to seek your opinion: is this configuration workable, and is there any other area I need to take into consideration before deploying all eight nodes to a production environment?

Thanks in advance.
 
I am very interested in the responses you might receive; I have a somewhat similar configuration. However, I have set up LACP bonds on some of my connections. Like you, I have separated my management (1 port), storage (an active/backup bond: 2 ports in LACP as the primary, plus 1 port to limp along on if my other switch goes out) and public interfaces (2 ports in an active/backup bond). My switches do not do multicast, so I have had to attempt setting up unicast. I'm still working on getting my configuration right and ensuring the correct traffic uses the correct ports. I haven't seen a lot of docs on manually steering certain traffic to certain ports - still looking.

What is the backplane speed of the DLINK GB Switches?
 
themilo said:
What is the backplane speed of the DLINK GB Switches?
Thanks for your reply.
It is a DGS-1210 (48 ports); the backplane speed should be 104Gbps.

The reason for doing it this way is that we don't have 10GbE interfaces on either the switches or the servers. To my knowledge network bonding does help with throughput, but maybe I am wrong here.
 
I was just curious whether you actually benchmarked it to see by how much it is faster. The reason is as follows:

Yes, 1 SSD is faster than 1 SSHD. But is 1 SSD faster than 4 SSHDs?
I am assuming you will be using 1 SSD for the OS and 1 SSD for the journal?



So you would need to benchmark the following cases:

1.) At how many "--numjobs" does your SSD max out?
^^ Tells you how many journals you can put on it and how much speed/IOPS you'd get. Keep increasing it to at least "--numjobs=4".


2.) What is the result of "--numjobs=2" on your SSHD (1x journal, 1x OSD)?
^^ Tells you what the speed and IOPS of a single SSHD are if you put the journal on the same disk.

3.) Are the speed and IOPS of "--numjobs=4" on your SSD greater than the aggregated speed and IOPS of "--numjobs=2" on your 4x SSHDs?
^^ Tells you whether it makes sense to use the SSD for the 4x SSHDs, or whether you'd bottleneck your SSHDs.
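
In case it is unclear what kind of run I mean: the usual journal-style test is 4k O_DSYNC writes straight to the raw device with fio, roughly along these lines (sdX is a placeholder, and this overwrites data on that disk, so only run it on an empty one):

Code:
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=4 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test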


I can not even begin to guess there, as I have no baseline for SSHDs as journals from any benchmarks on the net. Point being: you need to benchmark it to be sure.




On a personal level I'm quite interested, as we have a 4-node cluster at work (each node = 5x SSD, 20x HDD) and we are thinking of getting 6 more nodes early next year. Maybe it would then make sense to use SSHDs.

edit:
What type of SSD are we talking about ?
 
What type of SSD are we talking about ?
>>> Samsung EVO 850

I am assuming you will be using 1 SSD for OS and 1 SSD for journal ?
>>> Yes, correct.


1.) at how many "--numjobs" does your SSD max out?

--numjobs=1 bw=25884KB/s, iops=6471
--numjobs=2 bw=42705KB/s, iops=10676
--numjobs=3 bw=59071KB/s, iops=14767
--numjobs=4 bw=66876KB/s, iops=16719
--numjobs=5 bw=71388KB/s, iops=17846
--numjobs=6 bw=77177KB/s, iops=19294 <<<< Should be max ?
--numjobs=7 bw=76303KB/s, iops=19075

2.) What is the result of "--numjobs=2" on your SSHD (1x Journal, 1x OSD)
>>>> Sorry, no spare single disk on hand.

Update:

Single SSHD without hardware RAID

--numjobs=1 bw=448914 B/s, iops=109
--numjobs=2 bw=2347.7KB/s, iops=586
--numjobs=3 bw=3246.3KB/s, iops=811
--numjobs=4 bw=4262.3KB/s, iops=1065
--numjobs=5 bw=4028.7KB/s, iops=1007
--numjobs=6 bw=4242.5KB/s, iops=1060
--numjobs=7 bw=4472.1KB/s, iops=1118

3.) Are the speed and IOPS of "--numjobs=4" on your SSD greater than the aggregated speed and IOPS of "--numjobs=2" on your 4x SSHDs?
^^ Tells you whether it makes sense to use the SSD for the 4x SSHDs, or whether you'd bottleneck your SSHDs.
>>>> I just did a quick test on 4x SSHD with an LSI MegaRAID 9260 in a RAID10 setup. Maybe it is not relevant or doesn't make sense, but just sharing some info here:

--numjobs=1 bw=63462KB/s, iops=15865
--numjobs=2 bw=50160KB/s, iops=12540
--numjobs=3 bw=84969KB/s, iops=21242
--numjobs=4 bw=124261KB/s, iops=31065
--numjobs=5 bw=151194KB/s, iops=37798
--numjobs=6 bw=163376KB/s, iops=40843


Let's see what you can interpret from these results.
 
Regarding 1.)


--numjobs=4 bw=66876KB/s, iops=16719 is what you will most likely get when sticking 4 journals on it, at least according to the Ceph journal benchmark thread and how I understood it. You would be able to run 6 journals on that SSD; at least that's what it looks like from your benchmark.



Regarding 2.)

RAID 10 is RAID 0 on top of RAID 1, so your benchmark values should be about the same as for 2 SSHDs.
Interpolated, that would mean:

RAID10 SSHD
--numjobs=2 bw=50160KB/s, iops=12540
Since it basically RAID-0's across 2 mirrors, you can assume more or less that 1 SSHD gets you:
--numjobs=2 bw=25160KB/s, iops=6540



Regarding 3.)

--numjobs=4 bw=66876KB/s, iops=16719 for the SSD

4x single SSHD
--numjobs=2 bw=25160KB/s, iops=6540
That would be:
bw = 100640 KB/s, iops = 26160.
That's roughly 50% more bandwidth and about 55% more IOPS than the SSD.

A benchmark of a single SSHD with --numjobs=2 would be needed to confirm this, as it is "interpolated" based on my understanding of RAID10 and the assumption that RAID10 performance values on SSHDs behave analogously to RAID10 on HDDs.

If that holds true, I'd either use the 2nd SSD for the OS (ZFS RAID 1) or put a 5th SSHD in instead of it. Or, if you're doing e.g. erasure-coded pools, you could use your 4x 1 SSD (one spare per node) as a Ceph cache tier for your "slow" pools.



4.)

Something I have not mentioned before (it never made sense before seeing those benchmark numbers):

journals on the SSHDs would be more failure-proof.
If you lose your SSD, you lose the OSDs attached to it.


  • So if your 4-node cluster with dedicated journals loses one journal (SSD), you lose 25% of your OSDs (4 of 16).
  • If your 4-node cluster without dedicated journals loses one journal (SSHD), you lose 6.25% of your OSDs (1 of 16).

Some more reading on this:
http://forum.proxmox.com/threads/24...-you-loose-a-journal-disk?p=121378#post121378


Hope that helps.
For me it is definitely intriguing.


PS, regarding bonds:

  1. If your switches support VLANs, I'd use Open vSwitch: bond all 6 eths into one OVS bridge, then separate the 3 networks via OVSIntPorts (VLANs), roughly as sketched below.
  2. If your switches support jumbo frames, even better: go for the maximum frame size on the Ceph network while keeping normal MTUs for e.g. your public network.
Compare the examples in this wiki: https://pve.proxmox.com/wiki/Open_vSwitch
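
A sketch only of what that could look like in /etc/network/interfaces (VLAN tags, port names and addresses are made up; the NFS network would be a third IntPort on its own VLAN, and jumbo frames would additionally be configured as in the wiki examples):

Code:
allow-vmbr0 bond0
iface bond0 inet manual
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_bonds eth0 eth1 eth2 eth3 eth4 eth5
        ovs_options bond_mode=balance-tcp lacp=active

auto vmbr0
allow-ovs vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 mgmt ceph

# management / public network on VLAN 10
allow-vmbr0 mgmt
iface mgmt inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=10
        address 192.168.10.11
        netmask 255.255.255.0

# Ceph storage network on VLAN 20
allow-vmbr0 ceph
iface ceph inet static
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=20
        address 192.168.20.11
        netmask 255.255.255.0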
 
Hi,

Thanks for your input.

The whole setup is almost the same as the Proxmox Ceph test setup, but I'm using 4 x 1TB SSHD (no hardware RAID, just individual disks) for Ceph storage on each of the 4 nodes, with the SSDs basically for journal and OS.
Based on your description, do you recommend putting the journals on the SSHD pool itself instead of on the SSD, for better failure tolerance?
 
Since the SSHD case is not as clear-cut as with normal HDDs (as your benchmarks show),
I'd benchmark it.

Once you have set up your cluster, first benchmark a single SSHD (to be sure the values I interpolated are not a fluke).

If they come in around the same as above, then set up your OSDs and pools, and benchmark those pools with rados bench.


Code:
rados bench -p Pool1 300 write --no-cleanup
rados bench -p Pool1 300 seq
rados -p Pool1 cleanup --prefix bench
sleep 60
rados -p Pool1 cleanup --prefix bench
sleep 120
rados bench -p PoolX 300 write --no-cleanup
rados bench -p PoolX  300 seq
rados -p PoolX  cleanup --prefix bench
sleep 60
rados -p PoolX  cleanup --prefix bench


First you write for 5 minutes, then read the data back, then delete it from the pool.


If those results are not satisfactory, wipe the pools, wipe the OSDs and set everything up again with the SSD journals. Then run the benchmarks again.

Then you know for sure what is best, and can build from there.


What type of pools were you thinking of using?

Replicated (size?)
Erasure-coded (k=?, m=?)
A cache tier on the 4x SSDs (if you go with SSHD journals) for any of the other pools?

Because that has a bearing on your failure decisions as well.
 
Hi,
some remarks:
With multiple rados bench runs like that you partly measure caching (network speed, because the OSD data is still in the hosts' RAM) - OK, not for the whole 5 minutes, but still.
You must drop the caches on all nodes before the next run.
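For example, something like this on every node between runs:

Code:
sync; echo 3 > /proc/sys/vm/drop_caches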

For performance reasons I would not use EC pools - only for cold data.

Don't use an Evo drive as journal! It will die after a short time, and often the performance will drop after a while.

Udo
 
Getting more confused about Ceph storage now.
Some references say Ceph makes for the best storage architecture, but it seems it is susceptible to data-integrity problems as well.

Loss of a journal may lead to data loss, so what are the advantages of Ceph distributed block storage compared to setting up a pair of NFS servers?

What is the optimum setup with the hardware that I have in hand?
Sorry, really new to this.

* P/S: I would not go above 6 disks per node (since I only have 6 x 2.5" slots per node).
 
Let's simplify this.

Let's say you create a pool that keeps e.g. 3 replicas.
If your CRUSH rule replicates over type "host", then there will be a copy on 3 out of your 4 nodes.
If your CRUSH rule replicates over type "osd", then there will be a copy on 3 of your 16 OSDs.
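
For example, creating and sizing such a replicated pool looks roughly like this (pool name and PG count are just placeholders):

Code:
ceph osd pool create testpool 512 512 replicated
ceph osd pool set testpool size 3        # number of replicas
ceph osd pool set testpool min_size 2    # minimum replicas for I/O to continue
ceph osd pool set testpool crush_ruleset 0   # which CRUSH rule (and thus failure domain) the pool uses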

Side note: what are those "types"?
The types you can define yourself via the crushmap. To give you an example, here is a screenshot from my single-node test machine at home:

[Screenshot: custom CRUSH types, splitting the SSD OSDs from the HDD OSDs via a custom hook]
If you e.g. have hosts in different racks, rows, rooms, buildings, ..., you can have Ceph take care of making sure the data is replicated across e.g. different buildings.
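
The failure domain comes from the chooseleaf step of the rule in the (decompiled) crushmap - swapping "host" for "osd" (or "rack", "room", ...) is all that changes. A sketch:

Code:
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}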




Back to your question:
Now let's assume 4 of your OSDs use the same journal. If that one journal fails, you could lose 4 OSDs. So far so good.
But what if, by chance, the 3 copies of your data are all located on OSDs using the same journal? Then that data would be lost.

So if you keep 4 OSDs on the same journal, you need to make sure that your replication size is "<numberOfOsdsOnJournal> + 1"; in the case assumed above, that would be 5.

Now it needs to be understood that losing an OSD because of a lost journal is not the same as losing an OSD because of a defective HDD hosting said OSD. That is because you can rebuild journals. However, as long as your journal is not repaired (as in: moved to another SSD), you will not be able to access those OSDs and, by implication, the data on them. Compare http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-journal-failure/
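
From that post, recovery essentially boils down to recreating the journal on a replacement SSD, roughly like this (OSD id 0 is a placeholder; see the link for the full procedure and its caveats):

Code:
# stop the affected OSD daemon first, point its journal
# symlink/partition at the new SSD, then recreate the journal:
ceph-osd -i 0 --mkjournal
# then start the OSD daemon again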





PS: I was asking earlier what type of pools you wanted to use.

Replicated (size?)
Erasure-coded (k=?, m=?)
A cache tier on the 4x SSDs (if you go with SSHD journals) for any of the other pools?

Because that has a bearing on your failure decisions as well.


PPS: I might as well ask how much data you have to store, because that might have a bearing on how, and how much, you replicate as well.
 
Is that for Samsung Evo 850s only, or for all MLC SSDs?
Hi,
for the journal you should use an SSD which is able to store many TB before its lifetime ends... this is not especially MLC-related (it is more about how much reserve the SSD keeps).

On the ceph-users mailing list some people have written about bad experiences with Samsung Evo (and plain) SSDs - they die without warning. But they are not built to run as journal disks!
With the DC series from Samsung some people have had good experiences... but I prefer the Intel DC S3700 (which also uses MLC).

Udo
 
for the journal you should use an SSD which is able to store many TB before its lifetime ends...


The way I understand the journals, they only hit the SSD with writes when writes actually get written to the OSDs housed on that journal, right? Or are there a lot more writes happening for journals? I am basically asking whether the journal has more writes happening than the OSDs do, and if so by what factor (for planning purposes).


In any case, good to know that the Samsung Evo 850s are not suitable as journal drives.
 
Hi,
right - the journal receives all writes before they are written to the OSDs. If you use 1 SSD as journal for 4 OSDs, this SSD will write all the data for the 4 HDDs - and normally multiple times, because every time you change your Ceph cluster (expanding OSDs/nodes/...), the CRUSH algorithm will move data to a new place and the journal must write the data "again".
This also happens if one OSD dies - the data will move to other OSDs...

I also started with cheap SSDs as journal, but ended up after a short time with the S3700s.

Udo
 
That is definitely good to know.


So aychprox would be better off putting his journals on the SSHDs, since the benchmarks suggest his single SSD as a journal would be slower, and udo suggests that the SSD will burn itself out fairly soon because it receives many times more writes. On top of that, he reduces his failure risk by one failure domain.
 
