Slow VM on external CEPH Cluster

fabilau · Jun 8, 2021

Hello all,

I have just set up an external CEPH cluster together with an external specialist.

This is configured as follows:
16 OSDs
1 pool
32PGs
7.1TiB free storage

on 4 nodes, each with:
64GB RAM
12 core processors
Only NVMe SSDs & normal SSDs
Connected in Cluster Net with 10G
Connected to Proxmox nodes with 1G

Our Proxmox Cluster is configured as follows:
4 servers, each with:
64GB RAM
12 core processors
512G RAID 1 SSDs main disks
Connected in Cluster Net with 1G
Proxmox Version 6.4-8

I have now moved two VMs to the Ceph cluster for testing.
One Windows VM & one Linux VM.

I have run disk tests on both of them.
The Windows VM does not hang, but the disk usage remains constant at 100%.

The Linux VM hangs completely after I try to test the speed with the "dd" command. (/proc/sys/kernel/hung_task_timeout_secs error)

Caching is not enabled on either VM.

Unfortunately the CEPH specialist is not familiar with Proxmox and can't help me there.

Therefore, does anyone possibly know what this could be due to?
What kind of data can I provide for troubleshooting?

Thanks in advance &
with kind regards,
Fabian L.

fabilau · Jun 8, 2021

I've already seen this post:
https://forum.proxmox.com/threads/krbd-and-external-ceph-slow-vm-disk-use-100.88684/

But this does not seem to be a solution for us.

fabilau · Jun 8, 2021

So we've done some performance testing on the cluster itself.
It seems to work as "designed"

So the problem is between proxmox and ceph..

fabilau · Jun 8, 2021

Well at the beginning I had entered the CEPH Mon IPs as below:
192.168.x.x,192.168.x.x,192.168.x.x

I've read an article, where they recommended using ";"

Now it seems to work better?!

Can someone confirm this?
I am completely confused ...

fabilau · Jun 10, 2021

Doesn't seem to fix the problem..
A VM froze again

Does anyone have an idea why this happens?

ph0x · Jun 10, 2021

Is 32 PGs the value that the autoscaler calculated? It seems a bit low to me, but then again, I'm not sure whether this would cause the problem.

fabilau · Jun 10, 2021

ph0x said:
Is 32 PGs the value that the autoscaler calculated? It seems a bit low to me, but then again, I'm not sure whether this would cause the problem.

Thank you very much for your reply

Well the 32PGs were recommended / set by the specialist..
He said that when the autoscaler starts to complain about too few PGs, I should increase it.

ph0x · Jun 10, 2021

How much total space do you have and how much is used?

fabilau · Jun 10, 2021

ph0x said:
How much total space do you have and how much is used?

7.1TiB Raw space
And due to the replicas there is about 2.4TiB space available for use

currently in use is only 311GiB

ph0x · Jun 10, 2021

I'm a bit wary of autoscaling, since it can lead to significant rebalancing as soon as the number of PGs needs to be adjusted. Apart from that, at your fill rate Ceph's PG calculator recommends 64 PGs, so I doubt that this is the problem here.
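
For reference, the arithmetic behind that 64 can be sketched roughly as follows. This is a minimal sketch of the formula as I understand it from Ceph's PG calculator (target PGs per OSD x OSD count x share of data / replica size, rounded to a power of two). The replica size of 3 is an assumption inferred from 7.1 TiB raw versus about 2.4 TiB usable, the target of 100 PGs per OSD is the calculator's usual default, the fill rate is used as the data share, and the function name is made up for illustration.

```python
import math

def suggested_pg_count(osds, replica_size, data_fraction, target_pgs_per_osd=100):
    """Rough PG suggestion in the style of Ceph's PG calculator (hypothetical helper)."""
    raw = max(osds * target_pgs_per_osd * data_fraction / replica_size, 1.0)
    lower = 2 ** math.floor(math.log2(raw))  # power of two at or below raw
    upper = lower * 2                        # next power of two above it
    nearest = lower if (raw - lower) <= (upper - raw) else upper
    # If the nearest power of two is more than 25% below raw, step up once.
    if nearest < 0.75 * raw:
        nearest = upper
    return nearest

fill = 311 / (2.4 * 1024)  # 311 GiB used of roughly 2.4 TiB usable
print(suggested_pg_count(osds=16, replica_size=3, data_fraction=fill))  # current fill rate
print(suggested_pg_count(osds=16, replica_size=3, data_fraction=1.0))   # pool completely full
```

With the current fill rate this lands on 64; if the pool were expected to fill up completely, the same formula would give 512.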

fabilau · Jun 10, 2021

ph0x said:
I'm a bit wary of autoscaling, since it can lead to significant rebalancing as soon as the number of PGs needs to be adjusted. Apart from that, at your fill rate Ceph's PG calculator recommends 64 PGs, so I doubt that this is the problem here.

Yes, I have also heard about the autoscaler.
I will ask the specialist again if it is really necessary.

So the PGs are not it then.... :/

I don't remember setting the block size.
Could that also be a cause?
I don't know if you have to pay attention to this under CEPH as well (?).

ph0x · Jun 11, 2021

I'm far from an experienced user with Ceph, but as far as I know, the block size is fixed.

Did you (or your Ceph expert) troubleshoot/benchmark the cluster on one of the nodes directly?

fabilau · Jun 11, 2021

ph0x said:
I'm far from an experienced user with Ceph, but as far as I know, the block size is fixed.

Did you (or your Ceph expert) troubleshoot/benchmark the cluster on one of the nodes directly?

Yes sure!

Here are the results:
rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_103380
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 44 28 111.994 112 0.484574 0.377677
2 16 76 60 119.992 128 1.15892 0.406142
3 16 107 91 121.325 124 0.844047 0.438868
4 16 139 123 122.991 128 0.348205 0.46094
5 16 167 151 120.791 112 0.349512 0.470014
6 16 200 184 122.658 132 0.0913956 0.481513
7 16 233 217 123.991 132 0.0946067 0.477677
8 16 267 251 125.49 136 0.0882063 0.471242
9 16 299 283 125.768 128 0.172326 0.465426
10 16 331 315 125.99 128 0.354578 0.484468
Total time run: 10.5449
Total writes made: 332
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 125.937
Stddev Bandwidth: 8.05536
Max bandwidth (MB/sec): 136
Min bandwidth (MB/sec): 112
Average IOPS: 31
Stddev IOPS: 2.01384
Max IOPS: 34
Min IOPS: 28
Average Latency(s): 0.502449
Stddev Latency(s): 0.519063
Max latency(s): 2.18493
Min latency(s): 0.0191345

------------------------------------------------

rados bench -p testbench 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 42 26 103.987 104 0.329363 0.368254
2 16 76 60 119.988 136 0.423038 0.42663
3 16 107 91 121.322 124 0.665843 0.454509
4 16 139 123 122.989 128 0.909961 0.480363
5 16 169 153 122.389 120 0.00722955 0.481459
6 16 200 184 122.656 124 0.759886 0.492079
7 16 230 214 122.275 120 0.174783 0.484788
8 16 266 250 124.989 144 0.193673 0.478455
9 16 300 284 126.211 136 0.166692 0.484694
10 16 329 313 125.189 116 0.791319 0.487616
Total time run: 10.4647
Total reads made: 330
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 126.138
Average IOPS: 31
Stddev IOPS: 2.86938
Max IOPS: 36
Min IOPS: 26
Average Latency(s): 0.50483
Max latency(s): 1.68897
Min latency(s): 0.00375365

------------------------------------------------

rados bench -p testbench 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 47 31 123.985 124 0.648521 0.342216
2 16 78 62 123.988 124 0.307707 0.422969
3 16 108 92 122.655 120 0.4973 0.449407
4 16 140 124 123.989 128 0.428458 0.475771
5 16 169 153 122.39 116 0.238793 0.476368
6 16 201 185 123.323 128 1.20252 0.482718
7 16 240 224 127.989 156 0.00158918 0.465556
8 16 272 256 127.99 128 0.607769 0.478087
9 16 303 287 127.545 124 1.397 0.480348
10 16 337 321 128.39 136 1.18878 0.481153
Total time run: 10.5381
Total reads made: 338
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 128.297
Average IOPS: 32
Stddev IOPS: 2.76687
Max IOPS: 39
Min IOPS: 29
Average Latency(s): 0.494304
Max latency(s): 1.84683
Min latency(s): 0.00137682

fabilau · Jun 11, 2021

And fio, which I ran on a VM:

fio --rw=randwrite --name=test --size=2G
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
clock setaffinity failed: Invalid argument
Starting 1 process
Jobs: 1 (f=1): [w(1)][71.4%][w=1441MiB/s][w=369k IOPS][eta 00m:08s]
test: (groupid=0, jobs=1): err= 0: pid=12353: Fri Jun 11 04:21:08 2021
write: IOPS=26.0k, BW=105MiB/s (110MB/s)(2048MiB/19441msec); 0 zone resets
clat (nsec): min=922, max=246879k, avg=2342.82, stdev=366495.14
lat (nsec): min=972, max=246879k, avg=2396.77, stdev=366495.21
clat percentiles (nsec):
| 1.00th=[ 1160], 5.00th=[ 1224], 10.00th=[ 1256], 20.00th=[ 1320],
| 30.00th=[ 1352], 40.00th=[ 1400], 50.00th=[ 1432], 60.00th=[ 1480],
| 70.00th=[ 1544], 80.00th=[ 1688], 90.00th=[ 2128], 95.00th=[ 2416],
| 99.00th=[ 5024], 99.50th=[ 5664], 99.90th=[ 8512], 99.95th=[ 9536],
| 99.99th=[17792]
bw ( MiB/s): min= 265, max= 2013, per=100.00%, avg=1139.68, stdev=1236.26, samples=2
iops : min=67970, max=515544, avg=291757.00, stdev=316482.61, samples=2
lat (nsec) : 1000=0.01%
lat (usec) : 2=87.91%, 4=10.34%, 10=1.71%, 20=0.03%, 50=0.01%
lat (usec) : 250=0.01%
lat (msec) : 2=0.01%, 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
cpu : usr=1.30%, sys=5.81%, ctx=1537, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=2048MiB (2147MB), run=19441-19441msec

Disk stats (read/write):
rbd0: ios=0/780, merge=0/588, ticks=0/1540996, in_queue=1539416, util=93.60%

-------------------------------------------------------

fio --rw=randread --name=test --size=2G
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
clock setaffinity failed: Invalid argument
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=7524KiB/s][r=1881 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12358: Fri Jun 11 04:26:34 2021
read: IOPS=1782, BW=7132KiB/s (7303kB/s)(2048MiB/294058msec)
clat (usec): min=402, max=28169, avg=528.49, stdev=112.29
lat (usec): min=402, max=28169, avg=528.63, stdev=112.30
clat percentiles (usec):
| 1.00th=[ 453], 5.00th=[ 474], 10.00th=[ 486], 20.00th=[ 498],
| 30.00th=[ 506], 40.00th=[ 519], 50.00th=[ 523], 60.00th=[ 529],
| 70.00th=[ 537], 80.00th=[ 545], 90.00th=[ 562], 95.00th=[ 578],
| 99.00th=[ 619], 99.50th=[ 742], 99.90th=[ 1926], 99.95th=[ 3064],
| 99.99th=[ 4555]
bw ( KiB/s): min= 2288, max= 7904, per=100.00%, avg=7541.76, stdev=312.28, samples=556
iops : min= 572, max= 1976, avg=1885.44, stdev=78.07, samples=556
lat (usec) : 500=21.73%, 750=77.78%, 1000=0.19%
lat (msec) : 2=0.21%, 4=0.08%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=0.44%, sys=2.08%, ctx=545409, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=7132KiB/s (7303kB/s), 7132KiB/s-7132KiB/s (7303kB/s-7303kB/s), io=2048MiB (2147MB), run=294058-294058msec

Disk stats (read/write):
rbd0: ios=523974/3931, merge=0/2727, ticks=271928/1486607, in_queue=1479152, util=100.00%

ph0x · Jun 11, 2021

The last test is a bit disappointing, isn't it? Especially if there's only SSDs and NVME disks in the cluster.
But the rest look okay for a Gigabit connection.
However, I won't hide that I'm not much use here, since I can't suggest a solution.
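
To put "okay for a Gigabit connection" into numbers, here is a small sketch. It recomputes the bandwidth from the rados bench totals (operations x 4194304 bytes / run time) and prints the raw rate of a 1 Gbit/s link next to it. Protocol overhead is ignored, and the helper names are made up for illustration.

```python
MIB = 1024 * 1024

def bench_bandwidth_mib_s(ops, op_size_bytes, seconds):
    """Bandwidth in MiB/s from a rados bench summary (hypothetical helper)."""
    return ops * op_size_bytes / seconds / MIB

def line_rate_mib_s(gbit_per_s):
    """Raw link rate in MiB/s, without any protocol overhead."""
    return gbit_per_s * 1e9 / 8 / MIB

# Totals copied from the write and seq runs posted above.
write_bw = bench_bandwidth_mib_s(332, 4194304, 10.5449)
read_bw = bench_bandwidth_mib_s(330, 4194304, 10.4647)

print(f"rados bench write:      {write_bw:.1f} MiB/s")
print(f"rados bench seq read:   {read_bw:.1f} MiB/s")
print(f"1 Gbit/s raw line rate: {line_rate_mib_s(1):.1f} MiB/s")
```

The recomputed values match the "Bandwidth (MB/sec)" lines in the output, and they land in the same range as the raw rate of a single Gigabit link.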

fabilau · Jun 11, 2021

ph0x said:
The last test is a bit disappointing, isn't it? Especially if there's only SSDs and NVME disks in the cluster.
But the rest look okay for a Gigabit connection.
However, I won't hide that I'm not much use here, since I can't suggest a solution.

Yes, that's true!
That was also my thought.

The problem is, that I don't know much about fio.
The specialist told me, that I need to use --ioengine=rbd
but the ceph cluster is not available inside the VM, so that doesn't make sense to me..

I mean the randwrite without --ioengine=rbd was also OK.
While the randread was that bad... I think the problem could be there in some way
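
One way to read the two fio runs, as a rough sketch rather than a diagnosis: with iodepth=1 each request waits for the previous one to finish, so IOPS cannot exceed 1 / average latency. The numbers below are copied from the fio output; the helper names are made up for illustration.

```python
def max_iops_at_depth_1(avg_latency_usec):
    """Upper bound on IOPS when only one request is in flight at a time."""
    return 1_000_000 / avg_latency_usec

def bandwidth_kib_s(iops, block_size_bytes):
    """Throughput in KiB/s for a given IOPS rate and block size."""
    return iops * block_size_bytes / 1024

# randread: avg lat 528.63 usec at bs=4096, iodepth=1
read_ceiling = max_iops_at_depth_1(528.63)
print(f"randread ceiling at iodepth=1: {read_ceiling:.0f} IOPS")
print(f"1782 IOPS x 4 KiB = {bandwidth_kib_s(1782, 4096):.0f} KiB/s")

# randwrite: fio issued 524288 writes, but rbd0 reports only 780 write I/Os
writes_per_device_io = 524288 / 780
print(f"writes issued per device write: {writes_per_device_io:.0f}")
```

So the randread result is about what a latency of roughly 0.5 ms per request allows at queue depth 1. The randwrite run, with an average completion latency of about 2.3 usec and only 780 device writes for 524288 issued writes, looks like it was mostly absorbed by the guest's page cache (fio was run without --direct=1), so the two results are probably not directly comparable.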

fabilau · Jun 11, 2021

ph0x said:
The last test is a bit disappointing, isn't it? Especially if there's only SSDs and NVME disks in the cluster.
But the rest look okay for a Gigabit connection.
However, I won't hide that I'm not much use here, since I can't suggest a solution.

The specialist just sent me this table.
Looks interesting regarding my problems / bad performance.

So maybe it is the block size?

ph0x · Jun 11, 2021

I guess that refers to the block size of the test, and it's pretty normal that with a smaller block size throughput goes down while IOPS go up, and vice versa.
But this is pretty theoretical since you usually don't saturate your Ceph network all the time. Although my cluster runs on the hypervisors, the idle write throughput is around 300KB/s and read around 50 KB/s with approximately 20 machines logging to an rsyslog VM plus one elasticsearch stack.

So running two guests should easily be possible ...
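
The block size trade-off mentioned above can be sketched like this: if bandwidth were the only limit, IOPS = bandwidth / block size. The ceiling used here is the roughly 126 MiB/s from the rados bench write run, and the function name is made up for illustration. Small-block results in practice are usually limited by latency long before this ceiling is reached.

```python
def iops_at_bandwidth(bandwidth_mib_s, block_size_kib):
    """IOPS that a fixed bandwidth allows at a given block size (hypothetical helper)."""
    return bandwidth_mib_s * 1024 / block_size_kib

# Same bandwidth ceiling, different block sizes.
for bs_kib in (4, 64, 1024, 4096):
    print(f"{bs_kib:>5} KiB blocks: {iops_at_bandwidth(125.937, bs_kib):>8.1f} IOPS")
```

At 4096 KiB blocks this gives about 31 IOPS, which is the "Average IOPS" that rados bench reported for its 4 MiB objects.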

fabilau · Jun 11, 2021

ph0x said:
So running two guests should easily be possible ...

Yes I know...
We currently run our "productive" VMs on a single iSCSI Server, which performs much better than CEPH at the moment
(no problems at all regarding performance)

Because the iSCSI server is our bottleneck, we wanted to move to CEPH now, so it's very disappointing to get these hangs.

fabilau · Jun 11, 2021

Repeated the same test on a second VM:
