Proxmox + Ceph limitation

itvietnam

Renowned Member
Aug 11, 2015
Hi,

Has anybody tried Proxmox + Ceph storage?

We tried 3 nodes, each with:

- Dell R610
- PERC H310 RAID controller (supports JBOD for hot-swap SSDs)
- 3x Crucial MX200 500GB SSDs (1 MON + 2 OSDs per node)
- Gigabit for WAN and a dedicated Gigabit link for Ceph replication

When I run a dd speed test in a VM stored on Ceph, I only get an average of 47-50 MB/s.

My staff also ran the dd test on multiple VMs at the same time (simultaneously); the speed per VM still sits at 47-50 MB/s.

When I test on the local SSD, the speed is much better: 1 GB/s.
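The test is roughly the following (the path and size here are just examples, not our exact command):

Code:
# sequential write inside the VM; fdatasync makes dd flush to disk before reporting
dd if=/dev/zero of=/root/testfile bs=1M count=1024 conv=fdatasync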

Has anyone else faced this issue? Is this a limitation of Ceph storage as handled by Proxmox?


Sent from my iPhone using Tapatalk
 
You need to take network latency into account.

It also depends on the queue depth and the block size.

You can reduce overhead by turning off cephx authentication and debug logging in ceph.conf.


(I'm able to reach 400,000 IOPS 4k randread at qd=32 from one VM, and 70,000 IOPS per virtio disk + iothread.)
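For example, to see the effect of queue depth and block size, something like this fio run inside the VM (the disk path and runtime are just examples):

Code:
# 4k random read at queue depth 32 against the VM's data disk
fio --name=randread-test --filename=/dev/vdb --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --runtime=60 --time_based --group_reporting

A plain dd is effectively queue depth 1, so it measures per-request latency more than what the cluster can deliver.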
 
I guess you need more SSDs per node to get better Ceph performance.

May I know how many disks are enough? I have read many articles saying that more disks are better and recommending a 10 Gbps SAN for storage, but I haven't seen any article clearly state the disk requirements for a production environment.


Sent from my iPhone using Tapatalk
 
You need to take network latency into account.

It also depends on the queue depth and the block size.

You can reduce overhead by turning off cephx authentication and debug logging in ceph.conf.

(I'm able to reach 400,000 IOPS 4k randread at qd=32 from one VM, and 70,000 IOPS per virtio disk + iothread.)

Thanks for your answer.

May I know what queue depth means in this case? Is it at the network level or the storage level?

For contrast, with the same setup on Parallels Cloud Server (Odin), the dd test does much better: 1 GB/s on the 3 nodes above.

Regarding "I'm able to reach 400,000 IOPS 4k randread at qd=32 from one VM, and 70,000 IOPS per virtio disk + iothread": if you don't mind, can you share your system configuration for this case?


Sent from my iPhone using Tapatalk
 
I forgot to say: if the low speed is for writes, check that your SSDs are fast for sync writes, because Ceph needs that for the journal.

http://www.sebastien-han.fr/blog/20...-if-your-ssd-is-suitable-as-a-journal-device/

(As far as I remember, the MX200 is really slow for sync writes.)

I recommend you use enterprise-grade SSDs for Ceph (at least for the journals).
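The quick check is a small direct sync-write test along the lines of that article (WARNING: writing to a raw device destroys its data; /dev/sdX is just a placeholder):

Code:
# 4k O_DIRECT + O_DSYNC writes, the same pattern Ceph uses for its journal
dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

Enterprise journal SSDs sustain tens of thousands of these 4k sync writes per second; many consumer drives drop to a few hundred.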

Thanks for the interesting article.

Can I use one Intel DC S3700 as the journal per node and multiple MX200s to store the data? I haven't tried this yet.


Sent from my iPhone using Tapatalk
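(For reference, pveceph on Proxmox of that era could place the journal on a separate device when creating an OSD; a sketch with hypothetical device names, assuming the -journal_dev option:)

Code:
# data on an MX200, journal on the DC S3700
pveceph createosd /dev/sdc -journal_dev /dev/sdb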
 
I wonder how you can reach 1GB/s when using a single GBit network?
There are 2 cases:

- With local storage you can get this result.

- On a PCS system, all data is replicated over the network to sync from the local node to every node in the cluster, but every VM reads from the local SSD on its own node. So in their design the network is only used for replication, and 1 Gbps is enough in a production environment.

If you need to publish iSCSI from their storage, i.e. separate nodes for Ceph and the hypervisor, then you will need a 10 Gbps SAN. Otherwise you can start with 1 Gbps.


Sent from my iPhone using Tapatalk
 
- On a PCS system, all data is replicated over the network to sync from the local node to every node in the cluster, but every VM reads from the local SSD on its own node. So in their design the network is only used for replication, and 1 Gbps is enough in a production environment.

So you benchmarked a special case where all data is available locally. Ceph is designed and optimized for larger setups.
 
I can reach around 70,000 IOPS 4k read and 10,000 IOPS 4k write with one SSD per OSD daemon (SSD: Intel S3500, CPU: Xeon E5 3.1 GHz).

Also, use the Hammer release (Firefly is not optimized for SSDs).

Using high CPU frequencies also improves latency.

Here is my ceph.conf tuning:

Code:
# disable cephx authentication (must be consistent cluster-wide)
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
filestore_xattr_use_omap = true
osd_pool_default_min_size = 1
# silence all debug logging
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_journaler = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
# thread and sharding tuning for SSD OSDs
osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
# skip message CRCs, signing, and dispatch throttling
ms_nocrc = true
ms_dispatch_throttle_bytes = 0
cephx_sign_messages = false
cephx_require_signatures = false
throttler_perf_counter = false
ms_crc_header = false
ms_crc_data = false


[osd]
# lift the client message caps (0 = unlimited) and disable the op tracker
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
 
I can reach around 70,000 IOPS 4k read and 10,000 IOPS 4k write with one SSD per OSD daemon (SSD: Intel S3500, CPU: Xeon E5 3.1 GHz).
...

Thanks for sharing this information and following up.

I forgot to say that we are using 2x Xeon X5650 per node.

I will retest when I'm back at our HO. I'll contact you via your site, Odiso.com, about consulting services.

Thanks again,
Nghia,
 
So you benchmarked a special case where all data is available locally. Ceph is designed and optimized for larger setups.

Hi,

From the BA perspective, people think about how to get better ROI for the same investment. IMHO their storage is quite well optimized for the cost: they use one SSD per node for processing data and then store the processed data on HDDs (SATA or SAS). I'm still looking for a better way to improve performance at the same or lower cost. That's why I'm looking at Proxmox combined with Ceph, and at other platforms like OpenStack, OpenNebula, CloudStack..., as a replacement.

As I mentioned before, I don't see any documented Ceph requirements for larger setups. How many SSDs, and how much network, is enough?

Thanks for your reply,
 
A small footnote to the thread: all the Ceph discussion I have read suggests that "serious" Ceph use more or less mandates a 10-gig connection between Ceph cluster members for the Ceph storage traffic (i.e., not 1 gig). That is, "1 gig works, but the performance will not be the same as 10 gig" is my understanding. However, I haven't actually played with a Ceph deployment yet, so my experience is ~zero, and possibly the value of this comment is as well :-)

Tim
 
...
As I mentioned before, I don't see any documented Ceph requirements for larger setups. How many SSDs, and how much network, is enough?
Hi,
it depends on how you define "larger setup".
I have an 8-node Ceph cluster with 110 OSDs and 370 TB of space (96 HDDs + 14 SSDs for an EC cache pool).
Each node has two DC S3700s for journaling (and the EC cache pool) and 2x 10 Gb networking (SFP+).
The speed is OK for most things (file storage), but databases run on DRBD because of latency.

Write speed inside a VM:
Code:
# dd if=/dev/zero of=/data/cephstore/bigfile bs=1M count=4096 conv=fdatasync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 19.8251 s, 217 MB/s

For better read speed you should use a higher readahead value inside the VM:
Code:
cat /etc/udev/rules.d/99-virtio.rules 
SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"
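To apply the same value on a running guest without rebooting (vda as an example; blockdev counts in 512-byte sectors, so 16384 KB = 32768 sectors):

Code:
blockdev --setra 32768 /dev/vda   # set readahead
blockdev --getra /dev/vda         # verify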
Udo
 
The speed is OK for most things (file storage), but databases run on DRBD because of latency.
Udo

Your speed is just half of SSD speed. I guess that's because of your HDD speed, since you're not full SSD.

What do you mean by DRBD for databases? Is that system separate from Ceph? And is Ceph not good, performance-wise, for databases?

I plan to test again with a 10 Gbps network at the end of this month. I will post more results.


Sent from my iPhone using Tapatalk
 
Your speed is just half of SSD speed. I guess that's because of your HDD speed, since you're not full SSD.
Hi,
no - the write is acknowledged once the sync to the journal (for all replicas) is done. Full SSD is much better for reads, but not really for writes (if you have enough journal SSDs).
What do you mean by DRBD for databases? Is that system separate from Ceph? And is Ceph not good, performance-wise, for databases?
I use different storage on the PVE nodes: DRBD storage (SSD + SAS) for lower latency (databases...), and Ceph storage for file storage.

Ceph's benefit shows with parallel access (many VMs). One VM only drives one I/O thread (yet).

Udo
 
Hi,

I'm also running a Ceph cluster on 3 storage nodes (each 12x 4TB, no SSD journal, 20G InfiniBand).
Don't expect the same performance as from a raw disk. Ceph ensures each write operation is replicated to 2 or 3 journals. After a certain timeout, journal data is moved to its actual storage location. This is good for short random writes, as they are turned into sequential I/O; for sequential write operations it is a lot of overhead.
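A rough sanity check on the numbers at the top of the thread, assuming the default replica count of 3 and the dedicated gigabit replication link:

Code:
# 1 Gbit/s replication link          ~ 117 MB/s raw
# size=3: the primary OSD forwards 2 copies over that link
# 117 MB/s / 2                       ~ 58 MB/s ceiling per write stream
# journal double-write adds further overhead on each OSD's SSD
# => the observed 47-50 MB/s is about what one would expect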

Ceph scales linearly with more disks. You can easily expand or shrink your storage based on capacity or performance needs.

If you stick with a 1G connection, use a dedicated port for Ceph, or even trunk multiple links. You can get 20G and 40G InfiniBand as used parts fairly cheap (nice to play around with; search for bbcp for insane kernel-cache to kernel-cache file copy speeds).
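Trunking two gigabit ports for the Ceph network could look like this in /etc/network/interfaces (a sketch; interface names and the address are placeholders, and the ifenslave package is assumed):

Code:
auto bond0
iface bond0 inet static
        address 10.10.10.1
        netmask 255.255.255.0
        bond-slaves eth2 eth3
        # balance-rr can push a single stream past one link's throughput,
        # at the cost of possible packet reordering
        bond-mode balance-rr
        bond-miimon 100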


Patrick
 
