Hello everyone,
first of all I want to say thank you to each and every one in this community!
I've been a long-time reader (and user of pve) and have gotten so much valuable information from this forum!
Right now the deployment of our Ceph cluster is giving me some trouble.
We were using DRBD, but since we are expanding and there are more nodes in the pve-cluster, we decided to switch to Ceph.
The 3 Ceph server nodes are connected via a 6×GbE LACP bond with jumbo frames over two stacked switches, and the Ceph traffic is on a separate VLAN.
Currently there are 9 OSDs (3 × 15K SAS drives with BBWC per host).
The journal is 10 GB per OSD and sits on LVM volumes of an SSD RAID 1.
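For context, the relevant ceph.conf bits look roughly like this (the subnet is just a placeholder for our Ceph VLAN):
Code:
[global]
# placeholder subnet; all Ceph traffic sits on its own VLAN
public network  = 10.10.10.0/24
cluster network = 10.10.10.0/24

[osd]
# 10 GB journal per OSD, placed on LVM volumes of the SSD RAID 1
osd journal size = 10240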
pg_num and pgp_num are set to 512 for the pool.
Replication is 3 and the CRUSH map is configured to distribute the replicas across the 3 hosts.
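The pg_num roughly follows the usual rule of thumb of about 100 PGs per OSD, divided by the replica count and rounded up to the next power of two:
Code:
(9 OSDs * 100) / 3 replicas = 300  ->  next power of two: 512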
The performance of the rados benchmarks is good:
rados -p test bench 60 write -t 8 --no-cleanup
Code:
Total time run: 60.187142
Total writes made: 1689
Write size: 4194304
Bandwidth (MB/sec): 112.250
Stddev Bandwidth: 48.3496
Max bandwidth (MB/sec): 176
Min bandwidth (MB/sec): 0
Average Latency: 0.28505
Stddev Latency: 0.236462
Max latency: 1.91126
Min latency: 0.053685
rados -p test bench 60 seq -t 8
Code:
Total time run: 30.164931
Total reads made: 1689
Read size: 4194304
Bandwidth (MB/sec): 223.969
Average Latency: 0.142613
Max latency: 2.78286
Min latency: 0.003772
rados -p test bench 60 rand -t 8
Code:
Total time run: 60.287489
Total reads made: 4524
Read size: 4194304
Bandwidth (MB/sec): 300.162
Average Latency: 0.106474
Max latency: 0.768564
Min latency: 0.003791
What makes me wonder are the "Min bandwidth (MB/sec): 0" and "Max latency: 1.91126" in the write benchmark.
I've modified the Linux autotuning TCP buffer limits and the rx/tx ring parameters of the network cards (all Intel), which increased the bandwidth but didn't help with the latency of small I/O.
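For completeness, the tuning was along these lines (the exact values here are illustrative, not necessarily the ones we ended up with):
Code:
# larger TCP autotuning buffer limits
cat > /etc/sysctl.d/90-ceph-net.conf <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
EOF
sysctl --system

# larger rx/tx ring buffers on the Intel NICs (per slave of the bond)
ethtool -G eth0 rx 4096 tx 4096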
As an example of that small-I/O latency, in a wheezy KVM guest:
Code:
dd if=/dev/zero of=/tmp/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 9.99445 s, 51.2 kB/s
dd if=/dev/zero of=/tmp/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 10.0949 s, 406 kB/s
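To put those numbers in perspective, both runs work out to roughly 100 synchronous writes per second, so the limit looks like per-request latency rather than bandwidth:
Code:
51.2 kB/s / 0.512 kB ≈ 100 IOPS  (bs=512)
406 kB/s  / 4 kB     ≈ 100 IOPS  (bs=4k)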
I also tried putting flashcache in front of the OSDs, but that didn't help much, and since there's already 1 GB of cache on the RAID controller in front of the OSDs, I wonder why this is so slow in the guests.
Compared to the raw performance of the SSDs and the OSDs this is really bad...
Code:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 0.120224 s, 4.3 MB/s
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 0.137924 s, 29.7 MB/s
dd if=/dev/zero of=/mnt/ssd-test/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 0.147097 s, 3.5 MB/s
dd if=/dev/zero of=/mnt/ssd-test/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 0.235434 s, 17.4 MB/s
Running fio from a node directly against rbd gives the expected results, but also with some serious deviations:
Code:
rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.3-1-gaad9
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/13271KB/0KB /s] [0/3317/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=849098: Mon Mar 23 20:08:25 2015
write: io=2048.0MB, bw=12955KB/s, iops=3238, runt=161874msec
slat (usec): min=37, max=27268, avg=222.48, stdev=326.17
clat (usec): min=13, max=544666, avg=7937.85, stdev=11891.77
lat (msec): min=1, max=544, avg= 8.16, stdev=11.88
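For reference, the fio job was essentially the following (pool and image names are placeholders for the ones actually used):
Code:
[global]
ioengine=rbd
clientname=admin
pool=test
rbdname=fio-test
rw=randwrite
bs=4k
direct=1
size=2g

[rbd_iodepth32]
iodepth=32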
Thanks for reading so far!
I know this is my first post, but I have really run out of options here and would really appreciate your help.
My questions are:
Why is the performance in the guests so much worse?
What can we do to enhance this for Linux as well as Windows guests?
Thanks for reading this long post. I hope we can have a good discussion with a useful outcome for everyone, since this is, from my point of view, a common issue for quite a few users.