Proxmox Ceph - higher iowait times

chrispage1

Hi,

We're in the process of moving from a cluster that used networked Dell ScaleIO storage to Proxmox using Ceph. Since moving a couple of our VMs across, we've noticed quite an increase in iowait. Is this typical of Ceph due to the nature of its replication?

For reference, the Ceph cluster is 3 nodes, each with 4 x 4TB 6Gbps SSDs.

Thanks,
Chris.

[Attached image: 1638985715074.png]
 
I've tried modifying the Ceph configuration to disable debugging and increase the number of threads for each OSD -

Code:
# osd configuration
osd_pool_default_min_size = 2
osd_pool_default_size = 3
osd_op_num_threads_per_shard_ssd = 4

# disable debugging
debug ms=0
debug mds=0
debug osd=0
debug optracker=0
debug auth=0
debug asok=0
debug bluestore=0
debug bluefs=0
debug bdev=0
debug kstore=0
debug rocksdb=0
debug eventtrace=0
debug default=0
debug rados=0
debug client=0
debug perfcounter=0
debug finisher=0

This seems to have brought the apply/commit latency down somewhat.

Looking at my pool configuration, which I expect to hold ~7TB of data in production use, it has 64 PGs with an optimal PG count of 128. From everything I've read, increasing this to 512 should help significantly?

Chris.
 
Regarding the pg num and the autoscaler: If you have the autoscaler enabled, it will only become active once the optimal pg num is off by a factor of 3, so in your case, with currently 64 PGs and 128 optimal, you are at a factor of 2.

How many pools do you have? Did you configure a target_ratio for the pools? This helps the autoscaler a lot to determine how many PGs you will need.
You can also calculate it yourself ( https://old.ceph.com/pgcalc/ ). Select the "All in one" use case and configure how many OSDs you have that the pool can be placed on. If you have more pools, add them as well. The size ratio should be a rough estimate of how much each pool will use.

If you have different OSD types (HDD, SSD,...) and corresponding rules, the number of OSDs is of course only for the specific type.

Ideally, you will have around 100 PGs per OSD. Too few and you might have long recovery times and uneven usage of the OSDs. Too many and you waste CPU time and memory on managing them. The Ceph docs have a section on how the pg num affects recovery times: https://docs.ceph.com/en/latest/rados/operations/placement-groups/#placement-groups-tradeoffs

You can check with ceph osd df tree how many PGs you have per OSD.
 
Hi aaron,

Thanks for your reply. In total I have four pools -

device_health_metrics - 128 PGs / 128 Optimal PGs - 16MB used
ceph-cluster - 64 PGs / 128 Optimal PGs - 499GB used (the one we'll be storing the majority of data ~7TB)
cephfs_data - 128 PGs / 128 Optimal PGs - 4.6 GB (max of 500GB here)
cephfs_metadata - 256 PGs / 512 Optimal PGs - 19MB used

In total this gives us 576 PGs across three nodes with 4 OSDs each, averaging out at about 140 PGs per OSD.
Code:
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META      AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME     
-1         43.66425         -   44 TiB  515 GiB  504 GiB   17 MiB    10 GiB   43 TiB  1.15  1.00    -          root default 
-3         14.55475         -   15 TiB  171 GiB  168 GiB  5.5 MiB   2.7 GiB   14 TiB  1.15  1.00    -              host pve01
 0    ssd   3.63869   1.00000  3.6 TiB   35 GiB   34 GiB  2.3 MiB   878 MiB  3.6 TiB  0.93  0.81  147      up          osd.0
 1    ssd   3.63869   1.00000  3.6 TiB   51 GiB   50 GiB  1.4 MiB   721 MiB  3.6 TiB  1.37  1.19  139      up          osd.1
 2    ssd   3.63869   1.00000  3.6 TiB   43 GiB   42 GiB  1.4 MiB   581 MiB  3.6 TiB  1.15  1.00  142      up          osd.2
 3    ssd   3.63869   1.00000  3.6 TiB   42 GiB   42 GiB  456 KiB   596 MiB  3.6 TiB  1.14  0.99  148      up          osd.3
-5         14.55475         -   15 TiB  172 GiB  168 GiB  5.5 MiB   4.0 GiB   14 TiB  1.15  1.00    -              host pve02
 4    ssd   3.63869   1.00000  3.6 TiB   40 GiB   39 GiB  483 KiB   1.1 GiB  3.6 TiB  1.08  0.94  143      up          osd.4
 5    ssd   3.63869   1.00000  3.6 TiB   38 GiB   37 GiB  2.3 MiB   1.2 GiB  3.6 TiB  1.02  0.88  145      up          osd.5
 6    ssd   3.63869   1.00000  3.6 TiB   43 GiB   42 GiB  976 KiB   766 MiB  3.6 TiB  1.16  1.01  142      up          osd.6
 7    ssd   3.63869   1.00000  3.6 TiB   51 GiB   50 GiB  1.9 MiB  1020 MiB  3.6 TiB  1.36  1.18  146      up          osd.7
-7         14.55475         -   15 TiB  172 GiB  168 GiB  5.5 MiB   3.8 GiB   14 TiB  1.15  1.00    -              host pve03
 8    ssd   3.63869   1.00000  3.6 TiB   48 GiB   47 GiB  1.8 MiB   1.1 GiB  3.6 TiB  1.28  1.11  147      up          osd.8
 9    ssd   3.63869   1.00000  3.6 TiB   45 GiB   44 GiB  1.4 MiB   565 MiB  3.6 TiB  1.20  1.05  139      up          osd.9
10    ssd   3.63869   1.00000  3.6 TiB   46 GiB   45 GiB  971 KiB   1.1 GiB  3.6 TiB  1.24  1.08  144      up          osd.10
11    ssd   3.63869   1.00000  3.6 TiB   33 GiB   32 GiB  1.3 MiB   1.1 GiB  3.6 TiB  0.88  0.77  146      up          osd.11
                        TOTAL   44 TiB  515 GiB  504 GiB   17 MiB    10 GiB   43 TiB  1.15                                   
MIN/MAX VAR: 0.77/1.19  STDDEV: 0.15


Would I be right in saying that 128 PGs for device_health_metrics and 256 PGs for cephfs_metadata is completely excessive?

In my mind it would look better as below -

device_health_metrics - 32
ceph-cluster - 256
cephfs_data - 64
cephfs_metadata - 32

Giving us a total of 384 PGs, so ~96 per OSD. Does that sound a lot more reasonable? I don't see why device_health_metrics & cephfs_metadata have been automatically created with so many PGs.
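As a rough cross-check of these figures, here is a minimal sketch. It assumes a replica size of 3 and 12 OSDs, as described in this thread, and that every replica of a PG counts towards an OSD's PG total. pgs_per_osd is a made-up helper name, not a Ceph command.

```shell
#!/bin/sh
# Estimate PG replicas per OSD: sum(pg_num of all pools) * replica size / OSD count.
# Usage: pgs_per_osd <replica_size> <osd_count> <pg_num>...
pgs_per_osd() {
  size=$1; osds=$2; shift 2
  total=0
  for pg in "$@"; do
    total=$((total + pg))
  done
  echo $(( total * size / osds ))
}

pgs_per_osd 3 12 128 64 128 256   # current layout (576 PGs)   -> 144
pgs_per_osd 3 12 32 256 64 32     # proposed layout (384 PGs)  -> 96
```

The first result lines up with the 139 to 148 PGs per OSD shown in the ceph osd df tree output above.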

Thanks,
Chris.
 
I don't see why device_health_metrics & cephfs_metadata have been automatically created with so many PGs.
This is due to the way the autoscaler works with default settings since 16.2.6; it will be reverted in the upcoming 16.2.7 release. You can either define target_ratios for your pools (ignore the device_health one) or switch the autoscaler back with ceph osd pool set autoscale-profile scale-up, which sets the device_health_metrics pool back to 1 PG.

The ceph_cluster pool should definitely have more PGs. I recommend that you set the target_ratios to let the autoscaler know where you are headed. ceph_cluster will most likely end up with over 90% if the current situation does not change much with regard to how much data the cephfs pools hold.
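A minimal sketch of what setting those ratios could look like on the command line. On the Ceph CLI the pool property is called target_size_ratio. The pool names are the ones listed earlier in this thread; the ratios are only illustrative (0.9 reflects the "over 90%" estimate, the other two are made up). The helper prints the commands by default and only runs them when APPLY=1 is set.

```shell
#!/bin/sh
# set_target_ratio is a hypothetical wrapper, not part of Ceph.
# It prints the command unless APPLY=1 is set in the environment.
set_target_ratio() {
  pool=$1; ratio=$2
  cmd="ceph osd pool set $pool target_size_ratio $ratio"
  if [ "${APPLY:-0}" = "1" ]; then
    $cmd
  else
    echo "$cmd"
  fi
}

set_target_ratio ceph-cluster 0.9       # bulk of the data lives here
set_target_ratio cephfs_data 0.08       # illustrative value
set_target_ratio cephfs_metadata 0.02   # illustrative value

# Afterwards, review what the autoscaler plans to do with:
#   ceph osd pool autoscale-status
```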
 
Thanks Aaron - appreciate your help on this. I'll go ahead and change the scaling around a bit. With regards to making these changes, I presume it'll be a fairly intensive operation to reshuffle the 600GB or so we've got into new PGs, and we are likely to see some increased latency?

Thanks,
Chris.
 
It is a good test of how well the cluster will handle that, as you could also lose an OSD, which would cause some rebalance / recovery traffic as well. Since you have only a 3-node cluster, that would be limited to the affected node (3 replicas over 3 nodes).

Usually such actions are prioritized quite a bit lower to keep the impact low.

Regarding the latency overall, which SSDs (model) do you have in there and how is the network for Ceph set up?
 
Hi Aaron,

I made the changes with ratios and all has been cleverly recalculated and moved as needed. It's automatically chosen 512 placement groups for the main ceph_cluster pool.

With regards to latency, each machine has 4 x 4TB 6Gbps Samsung 870 SSDs. There's one OSD per disk and the machines are connected via 10Gbit active/active LACP (802.3ad) with layer 2 hashing.

There's a dedicated 10G link just for Ceph's private network and then Ceph monitors and VMs are on a separate 10G link. From everything I've read this is an optimum setup.

My concern is that in moving away from network attached storage we're seeing VMs that would have historically had an IOWait averaging 0.3% which has now increased to 1-2%.

Our apply/commit latency averages about 9ms across all OSDs. I'm not sure if this is something that can simply be optimised with Ceph settings?

Appreciate your help with this,
Chris.
 
Hi,

So I've been running some fio tests to see what the random read/write performance is like.

I've got a Ceph pool called ceph_cluster (512 PGs) - this is the pool that all of the VMs sit on.
I also have a CephFS mount on the PVE nodes called cephfs_data (32 PGs).

The command I am running is - fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=300m --iodepth=1 --runtime=20 --time_based --end_fsync=1

Against the CephFS mounted on the PVE I get -

Code:
write: IOPS=31.9k, BW=125MiB/s (131MB/s)(2557MiB/20507msec); 0 zone resets

A very respectable figure and clearly Ceph is performing well here.

If I then perform this test again on a Ceph mountpoint within a VM on the same node (SCSI bus) I get -

Code:
write: IOPS=6904, BW=26.0MiB/s (28.3MB/s)(770MiB/28535msec); 0 zone resets

So a pretty drastic drop in both IOPS and bandwidth. I'd imagine this could also be the reason why the iowait is so high?

Thanks,
Chris.
 
Samsung 870 SSD's
Samsung 870 QVOs?

Those are nice as long as you don't write too much data at once! Once the internal SLC cache is full, the write speed drops to less than 100 MB/s... there is a reason why they are as cheap as they are ;)

Do you monitor the network performance? How heavily used is the 10Gbit link? With 9 SSDs overall, you are already in the realm where the 10Gbit network could be a bottleneck. See the Ceph benchmark whitepaper from 2018: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

If I then perform this test again on a Ceph mountpoint within a VM on the same node (SCSI bus) I get -
How is the VM configured? qm config <vmid>

Doing ~7000 IOPS through the qemu layer is not too bad. If you want to check bandwidth, you will need to run it with a larger block size, for example 4M instead of 4k to run into bandwidth and not IOPS limits.
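To illustrate why block size decides which limit you hit: bandwidth is simply IOPS multiplied by block size. A small sketch using the IOPS figures posted above (bw_mib is a made-up helper name, and awk is assumed to be available):

```shell
#!/bin/sh
# Bandwidth in MiB/s for a given IOPS figure and block size in KiB.
bw_mib() {
  awk -v iops="$1" -v bs_kib="$2" 'BEGIN { printf "%.1f\n", iops * bs_kib / 1024 }'
}

bw_mib 6904 4      # VM result at bs=4k       -> 27.0
bw_mib 31900 4     # CephFS result at bs=4k   -> 124.6
bw_mib 6904 4096   # same IOPS at bs=4M       -> 27616.0
```

At 4k the test is nowhere near any bandwidth limit, so it measures IOPS and latency. At 4M the same IOPS would need roughly 27 GiB/s, which no 10Gbit link or SATA SSD can deliver, so the run becomes bandwidth-bound instead.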

Please also check out the newest Ceph benchmark paper from last year ( https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2020-09-hyper-converged-with-nvme.76516/ ) and especially the FIO commands. In order to not benchmark any caches, use a long runtime (600 seconds) and use direct=1 and sync=1. With that, any caches should be bypassed and if that is not possible, the long runtime should help to reduce their overall effect.
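A sketch of what such a run could look like for the 4k random write case. The target file /mnt/test/fio.dat and the 4G size are placeholders, not values from this thread; the fio options are the ones mentioned here plus those already used in the earlier commands. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Placeholder target file and size; point TARGET at the disk under test.
TARGET=${TARGET:-/mnt/test/fio.dat}
SIZE=${SIZE:-4G}

# Build the fio command line.
# direct=1 bypasses the page cache, sync=1 makes each write wait until it is
# on stable storage, and the 600 s runtime reduces the effect of device caches.
fio_cmd() {
  echo "fio --name=sync-randwrite --ioengine=libaio --filename=$TARGET --size=$SIZE" \
       "--direct=1 --sync=1 --rw=randwrite --bs=4k --numjobs=1 --iodepth=1" \
       "--runtime=600 --time_based"
}

if [ "${RUN:-0}" = "1" ]; then
  eval "$(fio_cmd)"
else
  fio_cmd
fi
```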

Since you only write 300 MB, the results will look nice. If you do a time-based benchmark, though, and let it run long enough, I guess performance will drop quite a bit once you hit those SSDs long enough for their cache to fill up.

I just got myself 2 of those for some mostly-read storage, and only once I did a 10 minute bandwidth benchmark (bs=4M) did I manage to fill the SLC cache (after about 3 to 4 minutes) and see performance drop by a lot:
Code:
# fio --ioengine=libaio --filename=/dev/disk/by-id/ata-Samsung_SSD_870_QVO_4TB_S5STNF0R607065Y --direct=1 --sync=1 --rw=write --bs=4M --numjobs=1 --iodepth=1 --runtime=600 --time_based --name=fio
fio: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=108MiB/s][w=27 IOPS][eta 00m:00s]
fio: (groupid=0, jobs=1): err= 0: pid=55567: Wed Dec  1 19:20:01 2021
  write: IOPS=51, BW=205MiB/s (215MB/s)(120GiB/600084msec); 0 zone resets
    slat (usec): min=43, max=367, avg=140.99, stdev=48.67
    clat (msec): min=8, max=103, avg=19.40, stdev=18.44
     lat (msec): min=8, max=103, avg=19.54, stdev=18.46
    clat percentiles (msec):
     |  1.00th=[    9],  5.00th=[    9], 10.00th=[    9], 20.00th=[    9],
     | 30.00th=[    9], 40.00th=[    9], 50.00th=[    9], 60.00th=[    9],
     | 70.00th=[   20], 80.00th=[   25], 90.00th=[   55], 95.00th=[   57],
     | 99.00th=[   83], 99.50th=[   90], 99.90th=[   95], 99.95th=[  100],
     | 99.99th=[  103]
   bw (  KiB/s): min=65536, max=475136, per=100.00%, avg=209725.56, stdev=163449.54, samples=1199
   iops        : min=   16, max=  116, avg=51.15, stdev=39.94, samples=1199
  lat (msec)   : 10=64.26%, 20=8.35%, 50=10.72%, 100=16.63%, 250=0.03%
  cpu          : usr=0.34%, sys=0.25%, ctx=122652, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,30698,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=205MiB/s (215MB/s), 205MiB/s-205MiB/s (215MB/s-215MB/s), io=120GiB (129GB), run=600084-600084msec

That fio benchmark was done on the bare disk, so there was no Ceph or virtualization in between that would also have an impact on the result.
These two lines are especially interesting: the min, max and avg values are quite far apart, which also shows in the high standard deviation.
Code:
   bw (  KiB/s): min=65536, max=475136, per=100.00%, avg=209725.56, stdev=163449.54, samples=1199
   iops        : min=   16, max=  116, avg=51.15, stdev=39.94, samples=1199
 
Samsung 870 QVOs?

Those are nice as long as you don't write too much data at once! Once the internal SLC cache is full, the write speed drops to less than 100 MB/s... there is a reason why they are as cheap as they are ;)

No, EVOs fortunately :) Ideally it would be good to have enterprise 12Gbps drives in there, but they're hard to get hold of at the minute!

Do you monitor the network performance? How heavily used is the 10Gbit link? With 9 SSDs overall, you are already in the realm where the 10Gbit network could be a bottleneck. See the Ceph benchmark whitepaper from 2018: https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/

We monitor the throughput constantly and haven't seen any real issues, although from our tests I can see we could get close to the limit, so it's certainly something I'll keep an eye on. Our general workload shouldn't put too much stress on this, though. Our applications are mostly small reads & writes.

How is the VM configured? qm config <vmid>

Here is an output of the config, if you spot anything obvious please do let me know!

Code:
boot: order=scsi0;ide2;net0
cores: 4
ide2: cephfs:iso/ubuntu-20.04.3-live-server-amd64.iso,media=cdrom
memory: 4096
meta: creation-qemu=6.1.0,ctime=1639144331
name: iotest
net0: virtio=6A:E3:F6:12:38:E3,bridge=vmbr0,tag=3103
numa: 0
ostype: l26
scsi0: ceph-cluster:vm-109-disk-0,discard=on,size=32G
scsi1: ceph-cluster:vm-109-disk-1,size=32G,ssd=1
scsihw: virtio-scsi-pci
smbios1: uuid=65557df6-4d98-49af-b019-efaa49af6a97
sockets: 1
virtio1: ceph-cluster:vm-109-disk-2,discard=on,size=32G
vmgenid: bc110a93-f080-4ef2-895a-b01e97438c37

- - - -

I noticed a separate anomaly which might actually be related to our IOWait issues, and I wonder if it might be significant. From my understanding, the VMs communicate with Proxmox, which talks directly to the Ceph OSDs, and the OSDs then synchronise this data among themselves using the cluster network. We have a public network which the VMs and Ceph monitors sit on and a cluster network for the OSDs (both 10G).

However, we had an issue with one of the bond ports (on the public network) on our PVE01 where the throughput was limited to around 40Mbit/s. The other bond port had no issues, but due to the active/active configuration traffic would route through either. While this was happening, the IOWait on the VMs on the affected node shot through the roof, to the point that the VMs became unresponsive. Is this expected behaviour? I didn't imagine that Ceph would depend on the public network for performing reads & writes. I wonder if there is a misconfiguration that means Ceph is taking a more complex route to read/write data than it needs?

Thanks,
Chris.
 
So I've just run a test from the VM I posted above. From monitoring the 'public' network while running an fio test, I can see it's sending data across the public network and distributing it to the other nodes.

I wasn't expecting this as I'd assumed (perhaps wrongly?) that it would read directly off the OSDs on the current server and via Ceph's private network?

Checking network latency with ping, we're looking at about 0.093ms to one node and 0.137ms to another, which I presume adds to the overhead associated with Ceph.
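For context: Ceph clients, including the QEMU process on the PVE host, reach the monitors and OSDs over the public network. The cluster network only carries OSD-to-OSD replication, recovery and heartbeat traffic, and reads are served by the primary OSD of each PG, which may sit on another node. So client I/O crossing the public network is expected. A minimal sketch for checking which subnets are configured, assuming the config lives at /etc/pve/ceph.conf (show_ceph_networks is a made-up helper name):

```shell
#!/bin/sh
# Print the public/cluster network lines from a ceph.conf-style file.
# Accepts both "public_network" and "public network" spellings.
show_ceph_networks() {
  grep -E '^[[:space:]]*(public|cluster)[_ ]network' "$1"
}

CONF=${CONF:-/etc/pve/ceph.conf}   # assumed default location on a PVE node
if [ -r "$CONF" ]; then
  show_ceph_networks "$CONF"
fi
```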
 
@spirit do you think that moving the journals off to enterprise disks but keeping our actual storage on the Samsung EVOs will resolve our IOWait issue? I'd imagine it should, because the fast disks will very quickly be able to tell Ceph where the data is.

So from what you mention, our current problem is that locating the data, or writing its location to the journal, takes too long, and as such the IOWait of the VMs is higher.

Thanks,
Chris.
 
I thought it'd be good to follow up on this off the back of @Deepen Dhulla's like...

Long story short, EVOs just don't cut it when it comes to Ceph. In the end we replaced all 12 drives with Samsung PM893 3.84TB SATA SSDs. These have power loss protection, which is what Ceph depends on for running quickly.

At the server level we also disabled write caching on each disk and changed the cache type to write through with a little script stored in /etc/init.d/ceph-disk-cache.sh. It's worth testing this first as the below configuration may not be optimal for everyone - it depends on many factors.

Code:
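#!/bin/bash
# Disable the volatile write cache on each Ceph data disk and tell the
# kernel to treat the disk as write through.

# SCSI addresses (host:channel:target:lun) of the Ceph disks on this node.
DISKS=("0:0:2:0" "0:0:3:0" "0:0:4:0" "0:0:5:0")

for DISK in "${DISKS[@]}"; do
  echo "Setting write through for SCSI ${DISK}"
  # Turn off the drive's own write cache.
  hdparm -W0 "/dev/disk/by-path/pci-0000:02:00.0-scsi-${DISK}"
  # Set the cache type the kernel assumes for this disk.
  echo "write through" > "/sys/class/scsi_disk/${DISK}/cache_type"
done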

We created a systemd unit, stored in /etc/systemd/system/ceph-disk-cache.service, to run this script at every boot, so that write caching is disabled and the cache type is set to write through each time the server starts.

Code:
[Unit]
Description=Set Ceph disk cache setup on boot
After=local-fs.target
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStart=/etc/init.d/ceph-disk-cache.sh

[Install]
WantedBy=multi-user.target

After this, it is simply a case of running systemctl enable ceph-disk-cache, which makes systemd run the script at startup and set the caching configuration.
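To confirm after a reboot that the setting has stuck, something like the following sketch can be used. It reads the same cache_type files the script writes to; report_cache_types is a made-up helper name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# List the cache type the kernel reports for every SCSI disk.
# Optional argument: base directory (defaults to /sys/class/scsi_disk).
report_cache_types() {
  base=${1:-/sys/class/scsi_disk}
  for f in "$base"/*/cache_type; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
  done
}

report_cache_types
```

Each Ceph disk should show "write through". The drive-level cache state can be checked with hdparm -W followed by the device path.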

With all of the above done, last month we had an average commit/apply latency of 0.28ms.

I hope this is of use to someone - probably to me in a few months' time!
 
Hi, many thanks for sharing.

Do you see a big difference in latency with writethrough vs writeback?

My NVMe drives are writethrough by default, but my SSDs (datacenter grade, mostly Intel) are indeed writeback by default.
(The Ceph journal/DB uses synchronous writes, so it shouldn't be cached anyway.)
 
When I initially installed the drives I was disappointed to see we were still having issues as we started to apply load. It was mainly the IOWait of guests rather than apply/commit figures.

Turning the write cache off and setting the cache type to write through definitely helped us, although this was done back in February so I don't have the exact benchmarks to hand.

I'd love to have NVMe within our setup but unfortunately we were limited by costs - mainly chassis.