CEPH read performance

e100

7 mechanical disks in each node using xfs
3 nodes so 21 OSDs total
I've started moving journals to SSD which is only helping write performance.

CEPH nodes still running Proxmox 3.x
I have client nodes running 4.x and 3.x, both have the same issue.

Using 10G IPoIB, separate public/private networks and iperf shows no problems.

Inside a VM I can perform sequential writes at over 100MB/sec.

But inside a VM sequential read is around 30MB/sec max.

I've tried every guest change I could think of: noop, read-ahead, virtio, IDE, SCSI virtio; you name it, I likely tried it.

I've tried tuning on the CEPH side; turning off auth helped, but not by much.

Any suggestions?
 
What's your replication like?
Replication of 3 across hosts?
If you can, post the following parts from your crush map:

# buckets
rule-yourPool

I assume you have tried using virtio with IOthread, right ?

Can you execute the following commands and provide us the output via code tags?

Code:
ceph osd tree
Code:
ceph pg dump | awk '
/^pg_stat/ { col=1; while($col!="up") {col++}; col++ }
/^[0-9a-f]+\.[0-9a-f]+/ { match($0,/^[0-9a-f]+/); pool=substr($0, RSTART, RLENGTH); poollist[pool]=0;
up=$col; i=0; RSTART=0; RLENGTH=0; delete osds; while(match(up,/[0-9]+/)>0) { osds[++i]=substr(up,RSTART,RLENGTH); up = substr(up, RSTART+RLENGTH) }
for(i in osds) {array[osds[i],pool]++; osdlist[osds[i]];}
}
END {
printf("\n");
printf("pool :\t"); for (i in poollist) printf("%s\t",i); printf("| SUM \n");
for (i in poollist) printf("--------"); printf("----------------\n");
for (i in osdlist) { printf("osd.%i\t", i); sum=0;
   for (j in poollist) { printf("%i\t", array[i,j]); sum+=array[i,j]; sumpool[j]+=array[i,j] }; printf("| %i\n",sum) }
for (i in poollist) printf("--------"); printf("----------------\n");
printf("SUM :\t"); for (i in poollist) printf("%s\t",sumpool[i]); printf("|\n");
}'
 
I assume you have tried using virtio with IOthread, right ?
Yes, also tried krbd. Both of those seem to help but very little.

Thanks for the help, let me know if you need more info.
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host vm4 {
   id -2     # do not change unnecessarily
   # weight 18.140
   alg straw
   hash 0   # rjenkins1
   item osd.0 weight 3.630
   item osd.1 weight 3.630
   item osd.2 weight 3.630
   item osd.11 weight 3.630
   item osd.12 weight 2.720
   item osd.13 weight 0.450
   item osd.14 weight 0.450
}
host vm5 {
   id -3     # do not change unnecessarily
   # weight 16.770
   alg straw
   hash 0   # rjenkins1
   item osd.3 weight 3.630
   item osd.4 weight 3.630
   item osd.5 weight 3.630
   item osd.10 weight 3.630
   item osd.15 weight 0.450
   item osd.16 weight 0.900
   item osd.17 weight 0.900
}
host vm6 {
   id -4     # do not change unnecessarily
   # weight 18.140
   alg straw
   hash 0   # rjenkins1
   item osd.6 weight 3.630
   item osd.7 weight 3.630
   item osd.8 weight 3.630
   item osd.9 weight 3.630
   item osd.18 weight 2.720
   item osd.19 weight 0.450
   item osd.20 weight 0.450
}
root default {
   id -1     # do not change unnecessarily
   # weight 53.050
   alg straw
   hash 0   # rjenkins1
   item vm4 weight 18.140
   item vm5 weight 16.770
   item vm6 weight 18.140
}

# rules
rule rbd {
   ruleset 2
   type replicated
   min_size 1
   max_size 10
   step take default
   step chooseleaf firstn 0 type host
   step emit
}

# end crush map
Code:
# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 53.04982 root default  
-2 18.13994  host vm4  
0  3.62999  osd.0  up  1.00000  1.00000
1  3.62999  osd.1  up  1.00000  1.00000
2  3.62999  osd.2  up  1.00000  1.00000
11  3.62999  osd.11  up  1.00000  1.00000
12  2.71999  osd.12  up  1.00000  1.00000
13  0.45000  osd.13  up  1.00000  1.00000
14  0.45000  osd.14  up  1.00000  1.00000
-3 16.76994  host vm5  
3  3.62999  osd.3  up  1.00000  1.00000
4  3.62999  osd.4  up  1.00000  1.00000
5  3.62999  osd.5  up  1.00000  1.00000
10  3.62999  osd.10  up  1.00000  1.00000
15  0.45000  osd.15  up  1.00000  1.00000
16  0.89999  osd.16  up  1.00000  1.00000
17  0.89999  osd.17  up  1.00000  1.00000
-4 18.13994  host vm6  
6  3.62999  osd.6  up  1.00000  1.00000
7  3.62999  osd.7  up  1.00000  1.00000
8  3.62999  osd.8  up  1.00000  1.00000
9  3.62999  osd.9  up  1.00000  1.00000
18  2.71999  osd.18  up  1.00000  1.00000
19  0.45000  osd.19  up  1.00000  1.00000
20  0.45000  osd.20  up  1.00000  1.00000
Code:
dumped all in format plain

pool :  5       2       | SUM
--------------------------------
osd.17  17      10      | 27
osd.4   135     79      | 214
osd.5   123     76      | 199
osd.18  74      59      | 133
osd.19  15      6       | 21
osd.6   106     67      | 173
osd.7   121     76      | 197
osd.8   98      71      | 169
osd.9   87      69      | 156
osd.10  114     85      | 199
osd.20  11      11      | 22
osd.11  87      65      | 152
osd.12  83      45      | 128
osd.13  16      12      | 28
osd.0   94      66      | 160
osd.14  22      10      | 32
osd.1   101     72      | 173
osd.15  9       6       | 15
osd.2   109     58      | 167
osd.16  10      8       | 18
osd.3   104     73      | 177
--------------------------------
SUM :   1536    1024    |
 
You seem to replicate across your nodes by using the "host" bucket (which is good). Since you have 3 nodes, I assume you replicate with a size of 3. That's all good news.

Now the bad news.
1.)
You have differently sized disks. You can see this by the "weight" in "osd tree" and also by the PG allocation per OSD. I am gonna wager that some of the disks are also slower than others.

To put this in perspective:
  • You have OSD.17, which has a total of 27 PGs.
  • You have OSD.4, which has a total of 214 PGs.
That in itself means that some OSDs will get hammered more than others (because there are more PGs on them).

2.)
You also have a primary-affinity of 1.0 for all of your 21 OSDs, which means that when Ceph decides where to put a primary PG (the one it will read from later), each OSD has the same chance of holding primaries. Now, if your drives have different read speeds, that will end up biting you by potentially assigning slower drives the same number of primary PGs as faster drives, and thereby slowing you down.




Q1: Can you tell me more about your OSDs? You seem to have at least 4 different types (e.g. osd 6, 16, 18, 19) based on the weight that CRUSH assigned.

Q2: Your pg dump reveals 2 pools (ID 2 and ID 5), but your crush map only lists one ruleset; do you happen to have 2 pools using the same ruleset?
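You can check which ruleset each pool uses straight from the osdmap; something like this should do (just as a pointer, standard commands):
Code:
# lists every pool with its size, min_size and crush_ruleset
ceph osd dump | grep ^pool

# or query a single pool
ceph osd pool get rbd crush_ruleset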
 
Forgot to mention, we are running Hammer.
I'm not expecting anything super fast with this setup, but 30MB/sec read is horrible.

The rbd pool has a replica count of 2, not 3.

The weights are set based on their size (Proxmox does this by default) so you can easily ascertain their sizes by looking at the weights.
They are all roughly the same speed, just different capacities.
All are 7200 RPM mechanical disks.

This read performance issue existed when we had only 12 OSDs, all of the same size 4TB.
We recently added 9 more OSDs, 500GB, 1TB and 3TB sizes, performance is better with 21 vs 12.

The second pool is not really being used, I created it in Proxmox GUI and had not added a crushmap for it.
I'll likely just delete it, was thinking of creating a new pool named "rbd_safe" that has replica of 3, do you have additional suggestions for settings of that new pool?
 
Let's just say that 35 MB/s read and 100 MB/s write is not right. You should see, at best, around a 2x increase in reads over your writes (unless your OSDs are THAT slow that, without the SSDs, they only manage around 17 MB/s in an R2 pool).

As a comparison point, I have a home Ceph cluster (3 nodes) that runs OSDs from 2002-2013 (age-wise), with 24 OSDs (no dedicated journals on SSD). On an R2 pool I get 55 MB/s write and around 125 MB/s read (sorry, those were my R3 numbers); it's 78 MB/s write and 155 MB/s read on an R2 pool.
THAT is without a caching tier attached (which I normally use). Those drives are OLD and SLOW though, some even SATA-1 from when SATA-1 was new and cool.

You should see higher speeds, by a wide margin.


I assume you tested your OSDs on a one-by-one basis?
Code:
Write-Bench per OSD:

# note: bs=1G count=10 writes 10G per OSD without needing a single 10G buffer in RAM
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=1G count=10 oflag=direct
sleep 15
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-1/deleteme bs=1G count=10 oflag=direct
sleep 15
...


Read-Bench per OSD:

# optionally drop the page cache first: echo 3 > /proc/sys/vm/drop_caches
time dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=1G count=10 iflag=direct
sleep 15
time dd if=/var/lib/ceph/osd/ceph-1/deleteme of=/dev/null bs=1G count=10 iflag=direct
sleep 15
...


delete your written benchmark data:

rm /var/lib/ceph/osd/ceph-*/deleteme

ps.: keep a copy of those benchmark results - might be needed later

See if any of them fall outside the range of the others. If they do (and that is EXPECTED), either set their primary-affinity to 0, or, if there is a big spread all over the place, adjust all OSDs' primary-affinity based on their read speeds.

If it is not expected (as in, during a normal benchmark on a VM those OSDs show very, very high latency in the "OSD" tab on Proxmox), you want to take them out of the Ceph system as OSDs and investigate what is wrong with those drives.

The second pool is not really being used, I created it in Proxmox GUI and had not added a crushmap for it.
I'll likely just delete it, was thinking of creating a new pool named "rbd_safe" that has replica of 3, do you have additional suggestions for settings of that new pool?
No, that's fine. I had seen an issue before where there were "ghost PGs" left over from a previous pool that no longer existed. That is obviously not the case here, so it's safe to ignore.
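If you do go ahead with that "rbd_safe" pool, a minimal sketch would be something like this (the PG count of 512 and the min_size are only an illustration on my part, pick values that fit your OSD count):
Code:
ceph osd pool create rbd_safe 512 512 replicated
ceph osd pool set rbd_safe size 3
ceph osd pool set rbd_safe min_size 2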

This read performance issue existed when we had only 12 OSDs, all of the same size 4TB.
We recently added 9 more OSDs, 500GB, 1TB and 3TB sizes, performance is better with 21 vs 12.

That is to be expected. The more OSDs you have, the more the PGs get spread out over your OSDs. That reduces the load on your "old" OSDs, even if you increased your PG count afterwards. In my experience it scales pretty linearly.
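As a rough rule of thumb (the usual upstream guidance, not something specific to this thread): aim for on the order of 100 PGs per OSD across all pools, e.g.
Code:
# total PGs ~= (number of OSDs * 100) / replica size, rounded to a power of two
# for example, 21 OSDs in a size-2 pool: (21 * 100) / 2 = 1050  ->  1024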
 
Inside a VM I can perform sequential writes at over 100MB/sec.

But inside a VM sequential read is around 30MB/sec max.

Any suggestions?
Hi e100,
have you tried to enhance the read ahead cache in the VM?

This sped up my reads:
Code:
/etc/udev/rules.d/99-virtio.rules
SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"
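If you want to see the effect right away without re-triggering udev, the value can also be set directly in the running guest (a sketch, assuming the disk is vda):
Code:
# value is in KB; verify with: cat /sys/block/vda/queue/read_ahead_kb
echo 16384 > /sys/block/vda/queue/read_ahead_kb

# or reload and re-run the udev rules
udevadm control --reload-rules && udevadm trigger --subsystem-match=block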
Udo
 
Inside a VM I can perform sequential writes at over 100MB/sec.

But inside a VM sequential read is around 30MB/sec Max.
Hi e100,
have you tried to enhance the read ahead cache in the VM?


Totally overlooked this. Have you tried to do a synthetic benchmark of your pool? As in, from outside your VM, to see if it's a Ceph issue or a VM issue?
Code:
Example:

rados bench -p P12__HDD_OSD_REPLICATED_R3 450 write --no-cleanup
rados bench -p P12__HDD_OSD_REPLICATED_R3 450 seq
rados -p P12__HDD_OSD_REPLICATED_R3 cleanup --prefix bench
sleep 60
rados -p P2__HDD_OSD_REPLICATED_R2 cleanup --prefix bench
sleep 120


That would tell you where the issue is truly located.
If it's the same slow performance --> Ceph
If it's significantly faster --> VM / rados config
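In addition to the seq pass, the same bench objects can also be read back randomly, which is closer to what a busy VM does (a sketch; it needs the write --no-cleanup data to still be present):
Code:
rados bench -p P12__HDD_OSD_REPLICATED_R3 450 rand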
 
You should see higher speeds, by a wide margin.

I agree, got any other suggestions of tests to perform?

What does your ceph.conf look like on your home cluster?

The slowest disk was 115MB/sec, the disks are not the problem:
Code:
root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.857 s, 161 MB/s

root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-1/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.7304 s, 161 MB/s

root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/deleteme bs=1G count=10 oflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 65.9065 s, 163 MB/s

root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-11/deleteme bs=1G count=10 oflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.6315 s, 161 MB/s

root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-12/deleteme bs=1G count=10 oflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 90.6615 s, 118 MB/s

root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-13/deleteme bs=1G count=10 oflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 85.2583 s, 126 MB/s

root@vm4:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-14/deleteme bs=1G count=10 oflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 74.6866 s, 144 MB/s


echo 3 > /proc/sys/vm/drop_caches


root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 72.7822 s, 148 MB/s

real  1m12.862s
user  0m0.002s
sys  0m1.012s

root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-1/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 71.2202 s, 151 MB/s

real  1m11.296s
user  0m0.000s
sys  0m1.022s

root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-2/deleteme of=/dev/null bs=1G count=10 iflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 69.0858 s, 155 MB/s

real  1m9.160s
user  0m0.000s
sys  0m0.918s

root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-11/deleteme of=/dev/null bs=1G count=10 iflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.5757 s, 161 MB/s

real  1m6.641s
user  0m0.000s
sys  0m0.994s

root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-12/deleteme of=/dev/null bs=1G count=10 iflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 89.8726 s, 119 MB/s

real  1m29.960s
user  0m0.001s
sys  0m1.014s


root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-13/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 82.5259 s, 130 MB/s

real  1m22.581s
user  0m0.001s
sys  0m0.848s


root@vm4:~# time dd if=/var/lib/ceph/osd/ceph-14/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 72.6505 s, 148 MB/s

real  1m12.705s
user  0m0.001s
sys  0m0.908s








root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-3/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.8944 s, 161 MB/s

root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-4/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 70.2644 s, 153 MB/s

root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-5/deleteme bs=1G count=10 oflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 65.6497 s, 164 MB/s

root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-10/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.4549 s, 162 MB/s

root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-15/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 75.3248 s, 143 MB/s

root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-16/deleteme bs=1G count=10 oflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 77.1039 s, 139 MB/s

root@vm5:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-17/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 76.681 s, 140 MB/s


echo 3 > /proc/sys/vm/drop_caches

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-3/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 70.8108 s, 152 MB/s

real  1m10.888s
user  0m0.000s
sys  0m0.931s

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-4/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 71.5329 s, 150 MB/s

real  1m11.603s
user  0m0.002s
sys  0m0.874s

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-5/deleteme of=/dev/null bs=1G count=10 iflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 68.479 s, 157 MB/s

real  1m8.533s
user  0m0.000s
sys  0m0.852s

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-10/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 67.9805 s, 158 MB/s

real  1m8.028s
user  0m0.003s
sys  0m0.824s

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-15/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 72.2749 s, 149 MB/s

real  1m12.319s
user  0m0.002s
sys  0m0.812s

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-16/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 75.1798 s, 143 MB/s

real  1m15.233s
user  0m0.001s
sys  0m0.842s

root@vm5:~# time dd if=/var/lib/ceph/osd/ceph-17/deleteme of=/dev/null bs=1G count=10 iflag=direct  
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 75.5131 s, 142 MB/s

real  1m15.560s
user  0m0.002s
sys  0m0.827s






root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-6/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.0435 s, 163 MB/s

root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-7/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.5568 s, 161 MB/s

root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-8/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.4958 s, 161 MB/s

root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-9/deleteme bs=1G count=10 oflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 65.6829 s, 163 MB/s

root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-18/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 101.632 s, 106 MB/s

root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-19/deleteme bs=1G count=10 oflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 75.1904 s, 143 MB/s

root@vm6:~# dd if=/dev/zero of=/var/lib/ceph/osd/ceph-20/deleteme bs=1G count=10 oflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 73.8833 s, 145 MB/s

echo 3 > /proc/sys/vm/drop_caches

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-6/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 68.2689 s, 157 MB/s

real  1m8.337s
user  0m0.000s
sys  0m1.089s

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-7/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 68.1043 s, 158 MB/s

real  1m8.231s
user  0m0.002s
sys  0m0.917s

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-8/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 67.8468 s, 158 MB/s

real  1m7.894s
user  0m0.000s
sys  0m0.890s

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-9/deleteme of=/dev/null bs=1G count=10 iflag=direct
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 66.8821 s, 161 MB/s

real  1m6.951s
user  0m0.000s
sys  0m0.904s

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-18/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 93.038 s, 115 MB/s

real  1m33.112s
user  0m0.003s
sys  0m0.984s

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-19/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 77.8167 s, 138 MB/s

real  1m17.872s
user  0m0.002s
sys  0m0.928s

root@vm6:~# time dd if=/var/lib/ceph/osd/ceph-20/deleteme of=/dev/null bs=1G count=10 iflag=direct 
10+0 records in
10+0 records out
10737418240 bytes (11 GB) copied, 71.2732 s, 151 MB/s

real  1m11.323s
user  0m0.002s
sys  0m0.900s
 
The slowest disk was 115MB/sec, the disks are not the problem:

I'd drop the primary-affinity of OSDs 12, 13 and 18 (those below 140 MB/s write/read) to 0.0.

Then run the benchmark again. You should see a slight improvement (but I doubt it's the major culprit).
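For reference, that can be done at runtime; a minimal sketch (it assumes the mons allow primary-affinity changes, as in my ceph.conf below):
Code:
# enable at runtime, or set "mon osd allow primary affinity = true" permanently in ceph.conf
ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'

# stop placing primaries on the slower disks
ceph osd primary-affinity osd.12 0
ceph osd primary-affinity osd.13 0
ceph osd primary-affinity osd.18 0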

What does your ceph.conf look like on your home cluster?

Code:
[global]
     auth client required = none
     auth cluster required = none
     auth service required = none
     cluster network = redacted
     filestore xattr use omap = true
     fsid = redacted
     keyring = /etc/pve/priv/$cluster.$name.keyring
     osd journal size = 5120
     osd pool default min size = 1
     public network = redacted
     osd crush location hook = /home/redacted-crush-location-lookup.sh
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_journaler = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring

[mon.0]
     host = redacted
     mon addr = redacted
    mon osd allow primary affinity = true

[mon.1]
     host = redacted
     mon addr = redacted
    mon osd allow primary affinity = true


[mon.2]
     host = redacted
     mon addr = redacted
    mon osd allow primary affinity = true
[...]

Removing authentication gave me about a 50% boost in reads/writes (not the most up-to-date CPU/RAM combo).
The debug tunables added about 2% more performance.

My limiting factor (on non-cached pools) is that I do NOT have SSD journals, and most of my disks are in the 40-80 MB/s range, with some at 160 MB/s for reads/writes. On top of that, they range from 80GB to 6TB in capacity, which is a can of worms in and of itself (that is obviously not your issue - at least not to that degree). To circumvent most of that, I have stuck SSDs as an SSD-based cache tier in front of my R3 and EC pools (something I picked up at work on our medium-scale, large-capacity clusters).
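For completeness, attaching such a cache tier to an existing pool looks roughly like this; the pool names here are made up and this is only a sketch of the mechanism, not a recommendation for your cluster:
Code:
ceph osd tier add rbd ssd-cache                 # put the SSD pool in front of the rbd pool
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay rbd ssd-cache         # clients now go through the cache pool
ceph osd pool set ssd-cache hit_set_type bloom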


I agree, got any other suggestions of tests to perform?

check post #9 :)
 
Hi again,
ext4 performed better for us than xfs.

Have you tried to disable scrubbing temporarily? (and deep scrub)

Which cache mode do you use for the VM disk? writeback? And yes, write performance has an impact on read performance.

Udo

I did disable scrubbing and deep scrubbing; it did not help.
I've tested with cache=default/writeback/write-through; all have poor read speed.

I've increased read-ahead in the VMs; I have Debian 7 and Ubuntu 14.04 guests.
Do you have any specific suggestions of settings I can try? Maybe my brain fell out and I applied settings incorrectly.
 
I did disable scrubbing and deep scrubbing; it did not help.
That is a bad idea, btw. There is no way for Ceph to find errors on its own (they mostly happen during unscheduled restarts, or when your drives have a bad sector that sometimes does not get recognized by SMART as such, but also RAM errors and/or controller-related issues - onboard, HBA or RAID. I've even seen a "bad cable" cause this once at work).

IMHO it's better to specify a scrub window during which there is "low activity" on your ceph cluster.

Consult: http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
search for
Code:
osd scrub begin hour
osd scrub end hour

and
osd scrub load threshold

That way you know when stuff happens, before your data becomes unrecoverable.
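As an example, something along these lines in the [osd] section; the hours and the threshold here are only placeholder values, adjust them to your own quiet window:
Code:
[osd]
     osd scrub begin hour = 1
     osd scrub end hour = 6
     osd scrub load threshold = 0.5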
 
Totally overlooked this. Have you tried to do a synthetic benchmark of your pool? As in, from outside your VM, to see if it's a Ceph issue or a VM issue?

rados bench -p rbd 450 write --no-cleanup
Code:
Total time run:  451.708554
Total writes made:  12994
Write size:  4194304
Bandwidth (MB/sec):  115.065

Stddev Bandwidth:  33.7199
Max bandwidth (MB/sec): 196
Min bandwidth (MB/sec): 0
Average Latency:  0.555435
Stddev Latency:  0.514375
Max latency:  6.58996
Min latency:  0.0356727

rados bench -p rbd 450 seq
Code:
Total time run:  231.967478
Total reads made:  12994
Read size:  4194304
Bandwidth (MB/sec):  224.066

Average Latency:  0.285471
Max latency:  11.8254
Min latency:  0.00879239

That is a bad idea btw. No way for ceph to find errors on its own.
Yes, I only disabled it temporarily to see if it helped.

This speed up my reads:
Code:
/etc/udev/rules.d/99-virtio.rules
SUBSYSTEM=="block", ATTR{queue/rotational}=="1", ACTION=="add|change", KERNEL=="vd[a-z]", ATTR{bdi/read_ahead_kb}="16384", ATTR{queue/read_ahead_kb}="16384", ATTR{queue/scheduler}="deadline"

That does help slightly.
In a VM that would only get 20-30MB/sec I now get a steady 37MB/sec.
 
I did disable scrubbing and deep scrubbing; it did not help.
I've tested with cache=default/writeback/write-through; all have poor read speed.

I've increased read-ahead in the VMs; I have Debian 7 and Ubuntu 14.04 guests.
Do you have any specific suggestions of settings I can try? Maybe my brain fell out and I applied settings incorrectly.
Hi,
below are my ceph settings (ioprio needs the cfq scheduler on the host)
Code:
[osd]
...
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
osd mount options ext4 = "user_xattr,rw,relatime,nodiratime"
osd_mkfs_type = ext4
osd_mkfs_options_ext4 = -J size=1024 -E lazy_itable_init=0,lazy_journal_init=0
osd_scrub_load_threshold = 2.5
filestore_max_sync_interval = 10
filestore xattr use omap = true #enables the object map. Only if running ext4.
osd max backfills = 1
osd recovery max active = 1
osd_op_threads = 4
osd_disk_threads = 1 #disk threads, which are used to perform background disk intensive OSD operations such as scrubbing

filestore_op_threads = 4
osd_enable_op_tracker = false

osd_disk_thread_ioprio_class  = idle
osd_disk_thread_ioprio_priority = 7

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
How full are your OSDs? Especially with xfs, performance drops as the OSDs fill up...

And with 16MB read-ahead you don't see any difference?
Code:
cat /sys/block/vda/queue/read_ahead_kb
16384
Udo
 
Based on your rados bench results, your issue is not the Ceph subsystem (although adjusting primary-affinity as per post #11 will give you some more performance in that regard). It is what I'd expect to see based on your described config.


I'd look at anything that is not directly Ceph related.

What bus type are your VMs using?

SCSI or virtio?
Have you tried virtio with iothread? I get a massive boost from that over SCSI on my 3-node home cluster.
 
Based on your rados bench results, your issue is not the ceph subsystem..
Hi Q-wulf,
Are you sure? I also started with such bad read performance on Ceph, and a lot of work on the config / systems / upgrades helped me reach better (not perfect) values.

For me the rados bench read is not so good. Part of the reading is cached on the OSD nodes... (depends on how much RAM the nodes have).

@e100: which CPUs are in your OSD nodes? Normally I like AMD CPUs, but for Ceph, Intel is better.

Udo
 
have you tried virtio with iothread ?

With Udo's udev rules in a Debian 7 VM and this disk configuration:
virtio0: ceph_rbd:vm-101-disk-1,cache=writeback,iothread=on,size=500G

I get this:
dd if=/dev/vda bs=1M
3384803328 bytes (3.4 GB) copied, 46.031 s, 73.5 MB/s

Much better, but it still feels like it could be better.

Can the iothread option be enabled on 3.x by editing the config, or does it only work on 4.x?
 
@e100: which CPUs are in your osd-nodes? Normaly I like AMD CPUs, but for ceph are Intel better.

This whole cluster was built mostly from decommissioned production stuff, so it's older.
The three CEPH nodes are:
Code:
model name  : AMD Phenom(tm) II X6 1100T Processor
stepping  : 0
cpu MHz  : 3314.825

My lone Proxmox 4.x client is the same as the CEPH nodes.
That's where things run best, with iothreads and krbd; I need to test librbd with your udev rules though.

The other Proxmox 3.x clients are:
Code:
model name  : Intel(R) Xeon(R) CPU  E5420  @ 2.50GHz
stepping  : 6
microcode  : 1551
cpu MHz  : 2493.909
 
This whole cluster was built mostly from decommissioned production stuff, so it's older.
The three CEPH nodes are:
Code:
model name  : AMD Phenom(tm) II X6 1100T Processor
stepping  : 0
cpu MHz  : 3314.825
...
Hi,
I would guess you can be happy with your 73MB/s with such OSD nodes.

If you get much better performance with this config, let me know!

Udo
 
