Hello everyone, first of all I want to say thank you to each and every one in this community! I've been a long-time reader (and user of PVE) and have found so much valuable information on this forum. Right now the deployment of our Ceph cluster is giving me some trouble. We were using DRBD, but since we are expanding and there are more nodes in the PVE cluster, we decided to switch to Ceph.

The 3 Ceph server nodes are connected via a 6*GbE LACP bond with jumbo frames over two stacked switches, and the Ceph traffic is on a separate VLAN. Currently there are 9 OSDs (3*15K SAS with BBWC per host). The journal is 10 GB per OSD and sits on LVM volumes of an SSD RAID1. pg_num and pgp_num are set to 512 for the pool. Replication is 3, and the CRUSH map is configured to distribute the requests over the 3 hosts.

The performance of the rados benchmarks is good:

rados -p test bench 60 write -t 8 --no-cleanup
Code:
Total time run:         60.187142
Total writes made:      1689
Write size:             4194304
Bandwidth (MB/sec):     112.250
Stddev Bandwidth:       48.3496
Max bandwidth (MB/sec): 176
Min bandwidth (MB/sec): 0
Average Latency:        0.28505
Stddev Latency:         0.236462
Max latency:            1.91126
Min latency:            0.053685

rados -p test bench 60 seq -t 8
Code:
Total time run:       30.164931
Total reads made:     1689
Read size:            4194304
Bandwidth (MB/sec):   223.969
Average Latency:      0.142613
Max latency:          2.78286
Min latency:          0.003772

rados -p test bench 60 rand -t 8
Code:
Total time run:       60.287489
Total reads made:     4524
Read size:            4194304
Bandwidth (MB/sec):   300.162
Average Latency:      0.106474
Max latency:          0.768564
Min latency:          0.003791

What makes me wonder are the "Min bandwidth (MB/sec): 0" and "Max latency: 1.91126" in the write benchmark. I've modified the Linux autotuning TCP buffer limits and the rx/tx ring parameters of the network cards (all Intel), which increased the bandwidth but didn't help with the latency of small IO.
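For reference, the tuning I applied looks roughly like this (a sketch only — the interface name eth0 and the buffer/ring sizes are placeholders, not our exact values; check your NIC's limits before applying anything):

```shell
# Sketch of the TCP buffer / NIC ring tuning mentioned above.
# Values are illustrative placeholders, not a recommendation.

# Raise the Linux autotuning TCP buffer limits (min, default, max in bytes)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Inspect and enlarge the rx/tx ring buffers of the NIC
ethtool -g eth0                  # shows current and hardware-maximum ring sizes
ethtool -G eth0 rx 4096 tx 4096  # only valid up to the maximums reported above
```

As expected, this mostly helps sustained throughput; it does nothing for the per-operation round-trip latency that small synchronous writes are bound by.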
For example, in a wheezy KVM guest:
Code:
dd if=/dev/zero of=/tmp/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 9.99445 s, 51.2 kB/s

dd if=/dev/zero of=/tmp/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 10.0949 s, 406 kB/s

I also put flashcache in front of the OSDs, but that didn't help much, and since there is 1 GB of cache from the RAID controller in front of the OSDs, I wonder why this is so slow in the guests. Compared to the raw performance of the SSDs and the OSDs this is really bad...

Code:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 0.120224 s, 4.3 MB/s

dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 0.137924 s, 29.7 MB/s

dd if=/dev/zero of=/mnt/ssd-test/test bs=512 count=1000 oflag=direct,dsync
512000 bytes (512 kB) copied, 0.147097 s, 3.5 MB/s

dd if=/dev/zero of=/mnt/ssd-test/test bs=4k count=1000 oflag=direct,dsync
4096000 bytes (4.1 MB) copied, 0.235434 s, 17.4 MB/s

Running fio from a node directly via rbd gives expected results, but also with some serious deviations:
Code:
rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=32
fio-2.2.3-1-gaad9
Starting 1 process
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/13271KB/0KB /s] [0/3317/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=849098: Mon Mar 23 20:08:25 2015
  write: io=2048.0MB, bw=12955KB/s, iops=3238, runt=161874msec
    slat (usec): min=37, max=27268, avg=222.48, stdev=326.17
    clat (usec): min=13, max=544666, avg=7937.85, stdev=11891.77
     lat (msec): min=1, max=544, avg= 8.16, stdev=11.88

Thanks for reading so far. I know this is my first post, but I have really run out of options here and would really appreciate your help. My questions are: Why is the performance in the guests so much worse?
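To put the guest numbers in perspective: with oflag=direct,dsync every dd write is a single synchronous round trip, so throughput is dominated by per-operation latency, not bandwidth. A quick back-of-the-envelope calculation from the 512-byte guest run above:

```shell
# Derive per-operation latency and effective IOPS from the dd run above
# (1000 synchronous 512-byte writes took 9.99445 s in the guest).
ops=1000
runtime=9.99445

lat_ms=$(awk -v t="$runtime" -v n="$ops" 'BEGIN { printf "%.2f", t / n * 1000 }')
iops=$(awk -v t="$runtime" -v n="$ops" 'BEGIN { printf "%.0f", n / t }')

echo "per-write latency: ${lat_ms} ms, effective IOPS: ${iops}"
# per-write latency: 9.99 ms, effective IOPS: 100
```

So each synchronous write spends roughly 10 ms in the stack, which lines up with the ~8 ms average completion latency (clat) that fio reports — at queue depth 1 that caps a guest at about 100 IOPS regardless of how fast the underlying SSDs are.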
What can we do to improve this for Linux as well as Windows guests? Thanks for reading this big post; I hope we can have a nice discussion with a good outcome for everyone, since this is, from my point of view, a common issue for quite a few users.