Slow VM on external CEPH Cluster

fabilau

New Member
May 17, 2021
20
0
1
23
Hello all,

I have just set up an external CEPH cluster together with an external specialist.

This is configured as follows:
16 OSDs
1 pool
32 PGs
7.1TiB free storage

on 4 nodes, each with:
64GB RAM
12 core processors
Only NVMe SSDs & normal SSDs
Connected in Cluster Net with 10G
Connected to Proxmox nodes with 1G

Our Proxmox Cluster is configured as follows:
4 servers, each with:

64GB RAM
12 core processors
512G RAID 1 SSDs main disks
Connected in Cluster Net with 1G
Proxmox Version 6.4-8

I have now moved two VMs to the Ceph cluster for testing.
One Windows VM & one Linux VM.

I have run disk tests on both of them.
The Windows VM does not hang, but the disk usage remains constant at 100%.

The Linux VM hangs completely when I test the speed with the "dd" command (the kernel logs a hung-task warning that mentions /proc/sys/kernel/hung_task_timeout_secs).

Caching is not enabled on either VM.

Unfortunately the CEPH specialist is not familiar with Proxmox and can't help me there.

Does anyone know what could be causing this?
What kind of data can I provide for troubleshooting?

Thanks in advance &
with kind regards,
Fabian L.
 

fabilau

So we've done some performance testing on the cluster itself.
It seems to work as "designed".

So the problem seems to be between Proxmox and Ceph.
 

fabilau

At the beginning I had entered the Ceph monitor IPs separated by commas, like this:
192.168.x.x,192.168.x.x,192.168.x.x

I then read an article that recommended using ";" as the separator.

Now it seems to work better?!

Can someone confirm this?
I am completely confused ...
 

fabilau

That doesn't seem to have fixed the problem.
A VM froze again.

Does anyone have an idea why this happens?
 

ph0x

Active Member
Jul 5, 2020
937
145
43
/dev/null
Is 32 PGs the value that the autoscaler calculated? It seems a bit low to me; then again, I'm not sure whether this would cause the problem.
 

fabilau

Thank you very much for your reply :)

Well, the 32 PGs were recommended and set by the specialist.
He said that when the autoscaler starts to complain about too few PGs, I should increase the number.
 

ph0x

I'm a bit wary of the autoscaler, since it can lead to significant rebalancing as soon as the number of PGs needs to be adjusted. Apart from that, with your fill rate Ceph's PG calculator recommends 64 PGs, so I doubt that this is the problem here.
 

fabilau

Yes, I have also heard about the autoscaler.
I will ask the specialist again if it is really necessary.

So the PGs are not it, then... :/

I do not remember setting the block size.
Could that also be a cause?
I don't know whether you have to pay attention to this with Ceph as well.
 

ph0x

I'm far from an experienced user with Ceph, but as far as I know, the block size is fixed.

Did you (or your Ceph expert) troubleshoot/benchmark the cluster on one of the nodes directly?
 

fabilau

Yes sure!

Here are the results:
rados bench -p testbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_103380
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 44 28 111.994 112 0.484574 0.377677
2 16 76 60 119.992 128 1.15892 0.406142
3 16 107 91 121.325 124 0.844047 0.438868
4 16 139 123 122.991 128 0.348205 0.46094
5 16 167 151 120.791 112 0.349512 0.470014
6 16 200 184 122.658 132 0.0913956 0.481513
7 16 233 217 123.991 132 0.0946067 0.477677
8 16 267 251 125.49 136 0.0882063 0.471242
9 16 299 283 125.768 128 0.172326 0.465426
10 16 331 315 125.99 128 0.354578 0.484468
Total time run: 10.5449
Total writes made: 332
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 125.937
Stddev Bandwidth: 8.05536
Max bandwidth (MB/sec): 136
Min bandwidth (MB/sec): 112
Average IOPS: 31
Stddev IOPS: 2.01384
Max IOPS: 34
Min IOPS: 28
Average Latency(s): 0.502449
Stddev Latency(s): 0.519063
Max latency(s): 2.18493
Min latency(s): 0.0191345

------------------------------------------------

rados bench -p testbench 10 seq
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 42 26 103.987 104 0.329363 0.368254
2 16 76 60 119.988 136 0.423038 0.42663
3 16 107 91 121.322 124 0.665843 0.454509
4 16 139 123 122.989 128 0.909961 0.480363
5 16 169 153 122.389 120 0.00722955 0.481459
6 16 200 184 122.656 124 0.759886 0.492079
7 16 230 214 122.275 120 0.174783 0.484788
8 16 266 250 124.989 144 0.193673 0.478455
9 16 300 284 126.211 136 0.166692 0.484694
10 16 329 313 125.189 116 0.791319 0.487616
Total time run: 10.4647
Total reads made: 330
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 126.138
Average IOPS: 31
Stddev IOPS: 2.86938
Max IOPS: 36
Min IOPS: 26
Average Latency(s): 0.50483
Max latency(s): 1.68897
Min latency(s): 0.00375365

------------------------------------------------

rados bench -p testbench 10 rand
hints = 1
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 47 31 123.985 124 0.648521 0.342216
2 16 78 62 123.988 124 0.307707 0.422969
3 16 108 92 122.655 120 0.4973 0.449407
4 16 140 124 123.989 128 0.428458 0.475771
5 16 169 153 122.39 116 0.238793 0.476368
6 16 201 185 123.323 128 1.20252 0.482718
7 16 240 224 127.989 156 0.00158918 0.465556
8 16 272 256 127.99 128 0.607769 0.478087
9 16 303 287 127.545 124 1.397 0.480348
10 16 337 321 128.39 136 1.18878 0.481153
Total time run: 10.5381
Total reads made: 338
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 128.297
Average IOPS: 32
Stddev IOPS: 2.76687
Max IOPS: 39
Min IOPS: 29
Average Latency(s): 0.494304
Max latency(s): 1.84683
Min latency(s): 0.00137682
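
As a sanity check on the write run above, the reported bandwidth and IOPS can be recomputed from the totals. This is a minimal sketch and not part of the original benchmark: the variable names are made up for illustration, and it assumes that rados bench counts one "MB" as 2^20 bytes, which is what the totals imply. The second part applies Little's law (throughput = requests in flight / average latency) to the 16 concurrent writes.

```python
# Numbers copied from the "rados bench ... write" summary above.
total_writes = 332
write_size_bytes = 4194304      # 4 MiB per object
total_time_s = 10.5449
concurrent_ops = 16
avg_latency_s = 0.502449

MIB = 1024 * 1024

# Bandwidth recomputed from the totals (assumes "MB" means 2**20 bytes).
bandwidth_mib_s = total_writes * write_size_bytes / total_time_s / MIB

# Little's law: completed ops per second = ops in flight / average latency.
iops_from_latency = concurrent_ops / avg_latency_s

print(f"bandwidth: {bandwidth_mib_s:.1f} MiB/s")
print(f"IOPS from latency: {iops_from_latency:.1f}")
```

Both values line up with the reported 125.937 MB/s and 31 average IOPS, which suggests that with 4 MiB objects and 16 writes in flight the result is governed by the roughly 0.5 s average latency per write.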
 

fabilau

And fio, which I ran on a VM:

fio --rw=randwrite --name=test --size=2G
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
clock setaffinity failed: Invalid argument
Starting 1 process
Jobs: 1 (f=1): [w(1)][71.4%][w=1441MiB/s][w=369k IOPS][eta 00m:08s]
test: (groupid=0, jobs=1): err= 0: pid=12353: Fri Jun 11 04:21:08 2021
write: IOPS=26.0k, BW=105MiB/s (110MB/s)(2048MiB/19441msec); 0 zone resets
clat (nsec): min=922, max=246879k, avg=2342.82, stdev=366495.14
lat (nsec): min=972, max=246879k, avg=2396.77, stdev=366495.21
clat percentiles (nsec):
| 1.00th=[ 1160], 5.00th=[ 1224], 10.00th=[ 1256], 20.00th=[ 1320],
| 30.00th=[ 1352], 40.00th=[ 1400], 50.00th=[ 1432], 60.00th=[ 1480],
| 70.00th=[ 1544], 80.00th=[ 1688], 90.00th=[ 2128], 95.00th=[ 2416],
| 99.00th=[ 5024], 99.50th=[ 5664], 99.90th=[ 8512], 99.95th=[ 9536],
| 99.99th=[17792]
bw ( MiB/s): min= 265, max= 2013, per=100.00%, avg=1139.68, stdev=1236.26, samples=2
iops : min=67970, max=515544, avg=291757.00, stdev=316482.61, samples=2
lat (nsec) : 1000=0.01%
lat (usec) : 2=87.91%, 4=10.34%, 10=1.71%, 20=0.03%, 50=0.01%
lat (usec) : 250=0.01%
lat (msec) : 2=0.01%, 10=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
cpu : usr=1.30%, sys=5.81%, ctx=1537, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,524288,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=105MiB/s (110MB/s), 105MiB/s-105MiB/s (110MB/s-110MB/s), io=2048MiB (2147MB), run=19441-19441msec

Disk stats (read/write):
rbd0: ios=0/780, merge=0/588, ticks=0/1540996, in_queue=1539416, util=93.60%

-------------------------------------------------------

fio --rw=randread --name=test --size=2G
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
fio-3.16
clock setaffinity failed: Invalid argument
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=7524KiB/s][r=1881 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=12358: Fri Jun 11 04:26:34 2021
read: IOPS=1782, BW=7132KiB/s (7303kB/s)(2048MiB/294058msec)
clat (usec): min=402, max=28169, avg=528.49, stdev=112.29
lat (usec): min=402, max=28169, avg=528.63, stdev=112.30
clat percentiles (usec):
| 1.00th=[ 453], 5.00th=[ 474], 10.00th=[ 486], 20.00th=[ 498],
| 30.00th=[ 506], 40.00th=[ 519], 50.00th=[ 523], 60.00th=[ 529],
| 70.00th=[ 537], 80.00th=[ 545], 90.00th=[ 562], 95.00th=[ 578],
| 99.00th=[ 619], 99.50th=[ 742], 99.90th=[ 1926], 99.95th=[ 3064],
| 99.99th=[ 4555]
bw ( KiB/s): min= 2288, max= 7904, per=100.00%, avg=7541.76, stdev=312.28, samples=556
iops : min= 572, max= 1976, avg=1885.44, stdev=78.07, samples=556
lat (usec) : 500=21.73%, 750=77.78%, 1000=0.19%
lat (msec) : 2=0.21%, 4=0.08%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=0.44%, sys=2.08%, ctx=545409, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=524288,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=7132KiB/s (7303kB/s), 7132KiB/s-7132KiB/s (7303kB/s-7303kB/s), io=2048MiB (2147MB), run=294058-294058msec

Disk stats (read/write):
rbd0: ios=523974/3931, merge=0/2727, ticks=271928/1486607, in_queue=1479152, util=100.00%
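
The two fio runs above can be read through their latency figures. With iodepth=1 only one request is in flight at a time, so IOPS is bounded by 1 / average completion latency. The sketch below (illustrative names, numbers copied from the output above) applies that to the randread run, and compares the number of writes fio issued with the write I/Os that rbd0 actually saw in the randwrite run. A large gap there would be consistent with most 4 KiB writes being absorbed by the page cache, since fio does buffered I/O unless direct=1 is set; that reading is an assumption, not something the output proves.

```python
# Numbers copied from the fio output above.
block_size_bytes = 4096

# randread run: iodepth=1, average completion latency 528.49 usec.
read_avg_clat_s = 528.49e-6
read_iops_ceiling = 1 / read_avg_clat_s
read_bw_kib_s = read_iops_ceiling * block_size_bytes / 1024

# randwrite run: writes issued by fio vs. write I/Os seen by rbd0.
writes_issued = 524288
rbd0_write_ios = 780
issued_per_device_io = writes_issued / rbd0_write_ios

print(f"randread ceiling at iodepth=1: {read_iops_ceiling:.0f} IOPS, "
      f"{read_bw_kib_s:.0f} KiB/s")
print(f"fio writes per rbd0 write I/O: {issued_per_device_io:.0f}")
```

The ceiling of roughly 1890 IOPS is close to the 1782 IOPS that fio reported, which suggests the randread result is limited by the round trip of about 0.5 ms per 4 KiB request rather than by bandwidth. Repeating the test with a higher queue depth (for example with fio's --ioengine=libaio, --direct=1 and --iodepth options) would be one way to check that.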
 

ph0x

The last test is a bit disappointing, isn't it? Especially since there are only SSDs and NVMe disks in the cluster.
But the rest looks okay for a Gigabit connection.
However, I won't hide that I'm not of much use here, since I can't suggest a solution. :)
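
For reference, the raw ceiling of a 1 Gbit/s link can be worked out as below. This is a rough sketch that ignores Ethernet, IP and TCP overhead, so real-world throughput will be somewhat lower; the variable names are illustrative.

```python
# Raw capacity of a 1 Gbit/s link, ignoring protocol overhead.
link_bits_per_s = 1_000_000_000
link_bytes_per_s = link_bits_per_s / 8

link_mb_s = link_bytes_per_s / 1_000_000       # decimal megabytes
link_mib_s = link_bytes_per_s / (1024 * 1024)  # binary mebibytes

# fio randwrite result from the thread, in MiB/s.
fio_randwrite_mib_s = 105
share_of_link = fio_randwrite_mib_s / link_mib_s

print(f"1 Gbit/s = {link_mb_s:.0f} MB/s = {link_mib_s:.1f} MiB/s")
print(f"fio randwrite used {share_of_link:.0%} of the raw link rate")
```

The rados bench results of about 126 MB/s sit slightly above this ceiling if rados bench counts one MB as 2^20 bytes, so those runs may not have been limited by a single 1G link.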
 

fabilau

Yes, that's true!
That was also my thought.

The problem is that I don't know much about fio.
The specialist told me that I need to use --ioengine=rbd,
but the Ceph cluster is not reachable from inside the VM, so that doesn't make sense to me.

The randwrite test without --ioengine=rbd was OK,
while the randread test was that bad. I think the problem could lie there in some way.
 

fabilau

[Attachment: 1623406752572.png]

The specialist just sent me this table.
It looks interesting with regard to my performance problems.

So maybe it is the block size?
 

ph0x

I guess that refers to the block size of the test, and it's pretty normal that with a smaller block size throughput goes down while IOPS go up, and vice versa.
But this is pretty theoretical, since you usually don't saturate your Ceph network all the time. Although my cluster runs on the hypervisors, the idle write throughput is around 300 KB/s and the read throughput around 50 KB/s, with approximately 20 machines logging to an rsyslog VM plus one Elasticsearch stack.

So running two guests should easily be possible ...
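
The block-size relationship described above follows from throughput = IOPS × block size. A small sketch using figures already posted in this thread (the function name is illustrative):

```python
def throughput_mib_s(iops: float, block_size_bytes: int) -> float:
    """Throughput in MiB/s for a given IOPS figure and block size."""
    return iops * block_size_bytes / (1024 * 1024)

# rados bench write: 4 MiB objects at an average of 31 IOPS.
large_block = throughput_mib_s(31, 4194304)

# fio randread: 4 KiB blocks at 1782 IOPS.
small_block = throughput_mib_s(1782, 4096)

print(f"4 MiB blocks: {large_block:.1f} MiB/s")
print(f"4 KiB blocks: {small_block:.2f} MiB/s")
```

The 4 KiB run completes far more operations per second yet moves far less data, which is the trade-off a block-size table would typically show.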
 

fabilau

Yes, I know...
We currently run our production VMs on a single iSCSI server, which performs much better than Ceph at the moment
(no performance problems at all).

Because the iSCSI server is our bottleneck, we wanted to move to Ceph, so it's very disappointing to get these hangs.
 
