Hi Everyone,
We have a constant-write application (4-10 MB/s total) across ~120 VMs and we burned through SSDs in just under 2 years with our dedicated hosting provider, so they recommended we go back to spindles. We have built 5 new servers as follows:
- high core count, 256 GB RAM, a 480 GB SSD for the OS, and 4 x 2 TB spinning disks
- one OSD per spindle
- 1 x 10G network for Ceph, cluster, and VM traffic, each running in its own VLAN
- 1 x 10G network for our internet connection, which reports a speed of 10G but is throttled to 1G for internet
We had Ceph running on our last 5-node cluster, but there were three front ends and two file servers acting as the primary Ceph nodes, with smaller OSDs on the three front-end nodes, so again not optimal, but we got good performance until the disks started to drop like flies after around 1.5 years.
VMs are set up as follows (a rough config sketch follows the list):
- SCSI controller: VirtIO SCSI single
- disks: iothread=1, discard=0
- both VMs imported from the old cluster and a newly created VM were tested, with the same results
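For reference, the relevant lines from one of the test VMs look roughly like this (VM ID 100, the storage name ceph-vm, and the disk size are placeholders, not our actual values):

# qm config 100   (relevant lines only)
scsihw: virtio-scsi-single
scsi0: ceph-vm:vm-100-disk-0,iothread=1,size=100G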
PVE Host: %Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 98.8 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
Proxmox: 7.4-3 (all nodes updated and running the same version)
During Ceph rebalance we see an average of 60 MiB/s.
Those average write rates were seen when iowait was recorded above 10-15%.
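If it helps to compare, raw pool write throughput can be checked from a node with something like the following (the pool name ceph-vm is a placeholder; --no-cleanup keeps the benchmark objects so a sequential read test can follow, and the last command removes them):

rados bench -p ceph-vm 60 write -b 4M -t 16 --no-cleanup
rados bench -p ceph-vm 60 seq -t 16
rados -p ceph-vm cleanup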
Ceph:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.62.2.97/27
fsid = 460edbc2-56d9-44d0-a07d-891edffe6f0b
mon_allow_pool_delete = true
mon_host = 10.62.2.97 10.62.2.99 10.62.2.101 10.62.2.98 10.62.2.100
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.62.2.97/27

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve1]
host = pve1
mds_standby_for_name = pve

[mds.pve2]
host = pve2
mds_standby_for_name = pve

[mds.pve3]
host = pve3
mds_standby_for_name = pve

[mds.pve4]
host = pve4
mds_standby_for_name = pve

[mds.pve5]
host = pve5
mds_standby_for_name = pve

[mon.pve1]
public_addr = 10.62.2.97

[mon.pve2]
public_addr = 10.62.2.98

[mon.pve3]
public_addr = 10.62.2.99
OSD max_capacity values range from 191 to 430.
Weight per node is ~7.2.
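Those per-OSD max_capacity figures can be double-checked with something like the commands below (osd.0 is just an example ID; this assumes the option in question is the mclock scheduler's osd_mclock_max_capacity_iops_hdd):

ceph config dump | grep mclock_max_capacity
ceph config show osd.0 osd_mclock_max_capacity_iops_hdd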
Not sure what else would be helpful.
The cluster is not in production yet, but we need to get this sorted. It feels like a VM config issue; we cannot find any evidence of a bottleneck in Ceph, and even with 40 VMs running we do not see any more iowait than when we are running just 1.
All help would be appreciated.