Slow performance on Ceph per VM

twey

Hi

I'm testing Ceph on 4 Proxmox servers. Each server has 4x Intel SSDPE2KE032T8 NVMe disks. All 16 disks have 3 partitions, each representing one OSD (48 OSDs in total).
If I test the disk speed with DiskSpd in one Windows VM, I'm getting only 40MB/s. If I test in 2 Windows VMs at the same time on the same host, I'm getting 40MB/s in each VM. Are there any limitations per VM, like a fair use policy? And can this be disabled? FYI: the speeds with a Linux VM are the same or slightly better (45MB/s).

On the same machines with Hyper-V and Storage Spaces Direct, in nearly the same setup, I'm getting 190MB/s in a single Windows VM.

 
Can you post the output of ceph osd df tree inside of [code][/code] tags?

Please post the config of one such VM: qm config {vmid}

Ceph is very latency sensitive. How is the network for Ceph configured?

Please verify that the network used for Ceph can achieve the speeds you expect. For example, use iperf, install it with apt install iperf.
On one node (server), run
Code:
iperf -s -e -i 1
On the other node (client), run
Code:
iperf -c {ip/hostname of server} -e -i 1
If one CPU is not enough to saturate the network, you can add the -P 2 parameter. To reverse the direction, add the -R parameter.

Configure the BIOS of the servers for maximum performance and low latency. There might be guides from the server vendor for this. Otherwise, disable everything that looks like a power saving option.
 
Hi Aaron

Thanks for your answer. I don't know what you mean with "[code][/code] tags?" but here is the output of ceph osd df tree on the first host; the VMs are also located on this host.

Code:
root@PXN-A11:~# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP    META      AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
-1         46.54541         -   47 TiB  1.3 TiB  1.2 TiB  78 KiB    54 GiB   45 TiB  2.73  1.00    -          root default
-9         11.63635         -   12 TiB  330 GiB  316 GiB  14 KiB    14 GiB   11 TiB  2.77  1.01    -              host PXN-A11
36    ssd   0.96970   1.00000  993 GiB   23 GiB   22 GiB   3 KiB   1.1 GiB  970 GiB  2.36  0.86    7      up          osd.36
37    ssd   0.96970   1.00000  993 GiB   20 GiB   20 GiB   1 KiB   714 MiB  973 GiB  2.04  0.75    6      up          osd.37
38    ssd   0.96970   1.00000  993 GiB   24 GiB   23 GiB   1 KiB   1.4 GiB  969 GiB  2.44  0.89    7      up          osd.38
39    ssd   0.96970   1.00000  993 GiB   36 GiB   36 GiB   1 KiB   590 MiB  957 GiB  3.64  1.33   11      up          osd.39
40    ssd   0.96970   1.00000  993 GiB   17 GiB   16 GiB   1 KiB   496 MiB  976 GiB  1.69  0.62    5      up          osd.40
41    ssd   0.96970   1.00000  993 GiB   28 GiB   26 GiB   1 KiB   1.7 GiB  965 GiB  2.82  1.03    8      up          osd.41
42    ssd   0.96970   1.00000  993 GiB   17 GiB   16 GiB   1 KiB   1.4 GiB  976 GiB  1.75  0.64    5      up          osd.42
43    ssd   0.96970   1.00000  993 GiB   30 GiB   30 GiB   1 KiB   691 MiB  963 GiB  3.05  1.12    9      up          osd.43
44    ssd   0.96970   1.00000  993 GiB   37 GiB   36 GiB   1 KiB   1.7 GiB  956 GiB  3.75  1.37   11      up          osd.44
45    ssd   0.96970   1.00000  993 GiB   35 GiB   33 GiB   1 KiB   1.5 GiB  958 GiB  3.49  1.28   10      up          osd.45
46    ssd   0.96970   1.00000  993 GiB   38 GiB   36 GiB   1 KiB   1.7 GiB  955 GiB  3.78  1.38   11      up          osd.46
47    ssd   0.96970   1.00000  993 GiB   24 GiB   23 GiB   1 KiB   1.1 GiB  969 GiB  2.43  0.89    7      up          osd.47
-3         11.63635         -   12 TiB  337 GiB  325 GiB  20 KiB    12 GiB   11 TiB  2.83  1.04    -              host PXN-A12
 0    ssd   0.96970   1.00000  993 GiB   37 GiB   36 GiB   1 KiB   787 MiB  956 GiB  3.70  1.36   11      up          osd.0
 1    ssd   0.96970   1.00000  993 GiB   21 GiB   20 GiB   1 KiB   808 MiB  972 GiB  2.07  0.76    6      up          osd.1
 2    ssd   0.96970   1.00000  993 GiB   30 GiB   29 GiB   5 KiB   855 MiB  963 GiB  3.02  1.10    9      up          osd.2
 3    ssd   0.96970   1.00000  993 GiB   17 GiB   16 GiB   1 KiB   709 MiB  976 GiB  1.69  0.62    5      up          osd.3
 4    ssd   0.96970   1.00000  993 GiB   34 GiB   32 GiB   1 KiB   1.5 GiB  959 GiB  3.41  1.25   10      up          osd.4
 5    ssd   0.96970   1.00000  993 GiB   27 GiB   26 GiB   3 KiB   1.2 GiB  966 GiB  2.71  0.99    8      up          osd.5
 6    ssd   0.96970   1.00000  993 GiB   14 GiB   13 GiB   1 KiB   1.2 GiB  979 GiB  1.44  0.53    5      up          osd.6
 7    ssd   0.96970   1.00000  993 GiB   30 GiB   29 GiB   1 KiB   1.4 GiB  963 GiB  3.06  1.12    9      up          osd.7
 8    ssd   0.96970   1.00000  993 GiB   37 GiB   36 GiB   1 KiB   733 MiB  956 GiB  3.70  1.36   11      up          osd.8
 9    ssd   0.96970   1.00000  993 GiB   27 GiB   26 GiB   1 KiB   1.6 GiB  966 GiB  2.73  1.00    8      up          osd.9
10    ssd   0.96970   1.00000  993 GiB   36 GiB   35 GiB   3 KiB   866 MiB  957 GiB  3.65  1.34   11      up          osd.10
11    ssd   0.96970   1.00000  993 GiB   27 GiB   26 GiB   1 KiB   686 MiB  966 GiB  2.73  1.00    8      up          osd.11
-5         11.63635         -   12 TiB  287 GiB  273 GiB  19 KiB    14 GiB   11 TiB  2.41  0.88    -              host PXN-A13
12    ssd   0.96970   1.00000  993 GiB   14 GiB   13 GiB   1 KiB   1.2 GiB  979 GiB  1.40  0.51    4      up          osd.12
13    ssd   0.96970   1.00000  993 GiB   24 GiB   23 GiB   3 KiB   1.3 GiB  969 GiB  2.44  0.89    7      up          osd.13
14    ssd   0.96970   1.00000  993 GiB   21 GiB   19 GiB   1 KiB   1.5 GiB  972 GiB  2.12  0.78    6      up          osd.14
15    ssd   0.96970   1.00000  993 GiB   30 GiB   29 GiB   1 KiB  1003 MiB  963 GiB  3.03  1.11    9      up          osd.15
16    ssd   0.96970   1.00000  993 GiB   27 GiB   26 GiB   5 KiB   680 MiB  966 GiB  2.71  0.99    8      up          osd.16
17    ssd   0.96970   1.00000  993 GiB   24 GiB   22 GiB   1 KiB   1.4 GiB  969 GiB  2.41  0.88    8      up          osd.17
18    ssd   0.96970   1.00000  993 GiB   24 GiB   23 GiB   1 KiB   1.6 GiB  969 GiB  2.44  0.89    7      up          osd.18
19    ssd   0.96970   1.00000  993 GiB   18 GiB   16 GiB   2 KiB   1.7 GiB  975 GiB  1.77  0.65    5      up          osd.19
20    ssd   0.96970   1.00000  993 GiB   36 GiB   36 GiB   1 KiB   773 MiB  957 GiB  3.67  1.35   11      up          osd.20
21    ssd   0.96970   1.00000  993 GiB   17 GiB   17 GiB   1 KiB   684 MiB  976 GiB  1.73  0.63    5      up          osd.21
22    ssd   0.96970   1.00000  993 GiB   27 GiB   26 GiB   1 KiB   782 MiB  966 GiB  2.69  0.99    8      up          osd.22
23    ssd   0.96970   1.00000  993 GiB   25 GiB   23 GiB   1 KiB   1.8 GiB  968 GiB  2.50  0.92    7      up          osd.23
-7         11.63635         -   12 TiB  348 GiB  335 GiB  25 KiB    13 GiB   11 TiB  2.92  1.07    -              host PXN-A14
24    ssd   0.96970   1.00000  993 GiB   28 GiB   26 GiB   3 KiB   1.6 GiB  965 GiB  2.78  1.02    9      up          osd.24
25    ssd   0.96970   1.00000  993 GiB   14 GiB   13 GiB   1 KiB   1.1 GiB  979 GiB  1.43  0.52    4      up          osd.25
26    ssd   0.96970   1.00000  993 GiB   24 GiB   23 GiB   1 KiB   1.2 GiB  969 GiB  2.43  0.89    7      up          osd.26
27    ssd   0.96970   1.00000  993 GiB   36 GiB   36 GiB  10 KiB   769 MiB  957 GiB  3.65  1.34   11      up          osd.27
28    ssd   0.96970   1.00000  993 GiB   20 GiB   20 GiB   1 KiB   654 MiB  973 GiB  2.05  0.75    6      up          osd.28
29    ssd   0.96970   1.00000  993 GiB   34 GiB   32 GiB   1 KiB   1.8 GiB  959 GiB  3.44  1.26   10      up          osd.29
30    ssd   0.96970   1.00000  993 GiB   40 GiB   39 GiB   1 KiB   966 MiB  953 GiB  4.00  1.46   12      up          osd.30
31    ssd   0.96970   1.00000  993 GiB   37 GiB   36 GiB   3 KiB   738 MiB  956 GiB  3.68  1.35   11      up          osd.31
32    ssd   0.96970   1.00000  993 GiB   30 GiB   29 GiB   1 KiB   1.0 GiB  963 GiB  3.03  1.11    9      up          osd.32
33    ssd   0.96970   1.00000  993 GiB   24 GiB   23 GiB   1 KiB   1.7 GiB  969 GiB  2.46  0.90    7      up          osd.33
34    ssd   0.96970   1.00000  993 GiB   30 GiB   29 GiB   1 KiB   1.0 GiB  963 GiB  3.03  1.11    9      up          osd.34
35    ssd   0.96970   1.00000  993 GiB   30 GiB   30 GiB   1 KiB   815 MiB  963 GiB  3.06  1.12    9      up          osd.35
                        TOTAL   47 TiB  1.3 TiB  1.2 TiB  94 KiB    54 GiB   45 TiB  2.73
MIN/MAX VAR: 0.51/1.46  STDDEV: 0.72

Code:
root@PXN-A11:~# qm config 100
balloon: 0
boot: order=virtio0;ide2;net0
cores: 4
cpu: x86-64-v2-AES
machine: pc-i440fx-9.0
memory: 16384
meta: creation-qemu=9.0.0,ctime=1720788913
name: Win2019-Test
net0: virtio=BC:24:11:5B:67:7F,bridge=vmbr0,firewall=1,tag=102
numa: 0
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=818e9380-837b-4400-9d4e-e5e28f0dba4f
sockets: 1
virtio0: nvmePool:vm-100-disk-0,cache=writeback,iothread=1,size=50G
vmgenid: 7944a8ac-1543-488a-bd4f-024384dcd7f9

Code:
root@PXN-A11:~# qm config 101
agent: 1
balloon: 0
boot: order=virtio0;net0
cores: 8
cpu: x86-64-v2-AES
machine: pc-i440fx-9.0
memory: 16384
meta: creation-qemu=9.0.0,ctime=1720796676
name: Win2019-Test2
net0: virtio=BC:24:11:15:C2:B4,bridge=vmbr0,firewall=1,tag=102
numa: 1
ostype: win10
scsihw: virtio-scsi-single
smbios1: uuid=5368dfb5-4549-4856-90bc-73dae23b997e
sockets: 1
virtio0: nvmePool:vm-101-disk-0,cache=writeback,iothread=1,size=50G
vmgenid: ca24ccfa-ab85-4875-ae77-bec54c27c30e

The hosts are connected to a Huawei CE6820H-48S6CQ switch with 10Gb/s Ethernet over fiber, so latency should not be a problem. The ports on the switch are only 5-10% saturated. Jumbo frames are activated.

Here is the output of /etc/network/interfaces; it is the same on every node except for the IPs, of course. The Ceph traffic runs on the iscsi1 network at the moment.

Code:
root@PXN-A11:/etc/network# cat interfaces
auto lo
iface lo inet loopback

auto ens6f0
iface ens6f0 inet manual

auto ens6f1
iface ens6f1 inet manual

auto ens7f0np0
iface ens7f0np0 inet manual
        mtu 9014

auto ens7f1np1
iface ens7f1np1 inet manual
        mtu 9014

auto bond0
iface bond0 inet manual
        ovs_bonds ens7f0np0 ens7f1np1
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_mtu 9014
        ovs_options bond_mode=balance-slb other_config:bond-detect-mode=miimon

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 mgmt
        ovs_mtu 9014

auto mgmt
iface mgmt inet static
        address 10.35.100.11/16
        gateway 10.35.3.1
        ovs_type OVSIntPort
        ovs_bridge vmbr0
        ovs_options tag=119

auto vlan107
iface vlan107 inet static
        address 172.17.100.11/16
        mtu 9014
        vlan-raw-device ens7f0np0
#iscsi1

auto vlan108
iface vlan108 inet static
        address 172.18.100.11/16
        mtu 9014
        vlan-raw-device ens7f1np1
#iscsi2

source /etc/network/interfaces.d/*

And this is the Ceph config:

Code:
root@PXN-A11:/etc/ceph# cat ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 172.17.100.11/16
        fsid = d075b9b1-7736-435e-923a-ff05925e5500
        mon_allow_pool_delete = true
        mon_host = 172.17.100.12 172.17.100.13 172.17.100.14 172.17.100.11
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 172.17.100.11/16

        debug asok = 0/0
        debug auth = 0/0
        debug buffer = 0/0
        debug client = 0/0
        debug context = 0/0
        debug crush = 0/0
        debug filer = 0/0
        debug filestore = 0/0
        debug finisher = 0/0
        debug heartbeatmap = 0/0
        debug journal = 0/0
        debug journaler = 0/0
        debug lockdep = 0/0
        debug mds = 0/0
        debug mds balancer = 0/0
        debug mds locker = 0/0
        debug mds log = 0/0
        debug mds log expire = 0/0
        debug mds migrator = 0/0
        debug mon = 0/0
        debug monc = 0/0
        debug ms = 0/0
        debug objclass = 0/0
        debug objectcacher = 0/0
        debug objecter = 0/0
        debug optracker = 0/0
        debug osd = 0/0
        debug paxos = 0/0
        debug perfcounter = 0/0
        debug rados = 0/0
        debug rbd = 0/0
        debug rgw = 0/0
        debug throttle = 0/0
        debug timer = 0/0
        debug tp = 0/0


[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.PXN-A11]
        public_addr = 172.17.100.11

[mon.PXN-A12]
        public_addr = 172.17.100.12

[mon.PXN-A13]
        public_addr = 172.17.100.13

[mon.PXN-A14]
        public_addr = 172.17.100.14

I had tested the network connection with iperf3, but I have now done the test with iperf as you mentioned, see the output below.
The server is running on the first node (PXN-A11).

Code:
root@PXN-A12:~# iperf -c 172.17.100.11 -e -i 1
------------------------------------------------------------
Client connecting to 172.17.100.11, TCP port 5001 with pid 1053916 (1 flows)
Write buffer size: 131072 Byte
TOS set to 0x0 (Nagle on)
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  1] local 172.17.100.12%vlan107 port 38920 connected with 172.17.100.11 port 5001 (sock=3) (icwnd/mss/irtt=87/8962/115) (ct=0.15 ms) on 2024-07-15 14:07:16 (CEST)
[ ID] Interval            Transfer    Bandwidth       Write/Err  Rtry     Cwnd/RTT(var)        NetPwr
[  1] 0.0000-1.0000 sec  1.15 GBytes  9.88 Gbits/sec  9426/0          0     2599K/1669(176) us  740254
[  1] 1.0000-2.0000 sec  1.15 GBytes  9.89 Gbits/sec  9434/0          0     2599K/1498(155) us  825456
[  1] 2.0000-3.0000 sec  1.15 GBytes  9.90 Gbits/sec  9440/0          0     2599K/1247(133) us  992237
[  1] 3.0000-4.0000 sec  1.15 GBytes  9.89 Gbits/sec  9434/0          0     2599K/1273(92) us  971354
[  1] 4.0000-5.0000 sec  1.15 GBytes  9.89 Gbits/sec  9429/0          0     2599K/1254(119) us  985549
[  1] 5.0000-6.0000 sec  1.15 GBytes  9.90 Gbits/sec  9445/0          0     2599K/1243(90) us  995957
[  1] 6.0000-7.0000 sec  1.15 GBytes  9.89 Gbits/sec  9432/0          0     2599K/1272(117) us  971911
[  1] 7.0000-8.0000 sec  1.15 GBytes  9.89 Gbits/sec  9431/0          0     2599K/1246(139) us  992087
[  1] 8.0000-9.0000 sec  1.15 GBytes  9.88 Gbits/sec  9426/0          0     2599K/1160(133) us  1065073
[  1] 9.0000-10.0000 sec  1.15 GBytes  9.90 Gbits/sec  9437/0          0     2599K/1221(76) us  1013044
[  1] 0.0000-10.0320 sec  11.5 GBytes  9.86 Gbits/sec  94335/0          0     2599K/1174(101) us  1049854

The BIOS settings are all set to high performance. We used the same settings with Hyper-V in the past.

The servers are Supermicro SYS-2029BT-HNC0R, which is 4 nodes in one chassis. Each server has 2x Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz with 12 cores (24 cores per server in total = 96 cores per cluster).

Kind regards.
Thomas
 
Thanks for your answer. I don't know what you mean with "
Exactly that: either place the CLI outputs manually within [code]my output[/code] or alternatively use the formatting options of the editor. Otherwise, all that output is barely readable if it is not formatted correctly.

Please edit your response accordingly. Reading it will be a lot easier for everyone once the CLI output is formatted with the correct spacing :)
 
Hi Aaron

Sorry for that stupid question about the tags. I have now edited my posting.
 
Aah, much nicer to read :)

So, a few things I noticed.

PGs per OSD: they are quite low, which can happen in a new cluster. I assume you only have one main pool for your guests?
pveceph pool ls --noborder could be interesting, but make sure to run it in a window that is wide enough, otherwise the output will be cut off.

If you assign this pool a target_ratio, the autoscaler can determine the optimal number of PGs for that pool, so that in the end each OSD ends up with roughly 100 PGs.
The .mgr pool can be ignored in this calculation as it usually only has one PG.
If you have multiple pools, the target_ratio is used to estimate how much of the space each pool will consume in the end. It is a ratio; therefore, with two pools, giving each a target_ratio of 1 would mean that both pools are expected to consume about the same space.

With multiple pools, I prefer to use values between either 0.0 and 1.0 or 0 and 100 to make it easier to think in percentages.
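For reference, the target ratio can also be set on the CLI. This is only a minimal sketch, assuming the guest pool is called nvmePool as in your output and that the current pveceph/ceph option names apply:

Code:
# let the autoscaler manage the pool and tell it that this pool is
# expected to consume (roughly) all of the available space
pveceph pool set nvmePool --pg_autoscale_mode on --target_size_ratio 1

# the plain Ceph equivalent would be:
# ceph osd pool set nvmePool target_size_ratio 1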


Another factor, given that each node has a total of 24 real cores (48 threads), is that the CPU might be a bit overprovisioned. Each Ceph service consumes roughly one CPU core, and you have 12 OSDs plus a MON on each node, and then one active MGR in the cluster.

If you check
Code:
cat /proc/pressure/cpu
you see the CPU pressure for the last 10, 60 and 300 seconds. Are these values in the "some" line anything besides 0.00? If so, then you are likely to have overprovisioned CPUs.


What kind of SSDs are they? If they are good enterprise/datacenter SSDs with PLP, it could be that the 10 Gbit network becomes a bottleneck. Keep an eye on that in the future, ideally with some dedicated performance monitoring.
 
In the first run we had the NVMe disks unpartitioned, so we had only 4 OSDs per node and 33 PGs. After some reading I increased the PGs to 128 with autoscaling disabled and created 3 partitions per NVMe for a second run (the current configuration), but the speeds are the same.
Can I increase the PGs on an existing pool? It fails in the GUI. Or do I have to create a new pool?

Is this the setting you mention in your answer? Do I have to set the min. PGs?

[screenshot attached]

Below is the output of the commands:

Code:
root@PXN-A11:~# pveceph pool ls --noborder
Name     Size Min Size PG Num min. PG Num Optimal PG Num PG Autoscale Mode PG Autoscale Target Size PG Autoscale Target Ratio Crush Rule Name               %-Used Used
.mgr        3        2      1           1              1 on                                                                   replicated_rule 6.36162056366629e-08 2961408
nvmePool    3        2    128         128            128 off                                                                  replicated_rule   0.0281219724565744 1346990332321

Code:
root@PXN-A11:~# cat /proc/pressure/cpu
some avg10=0.43 avg60=0.33 avg300=0.29 total=392589608
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

Yes, the hosts would be overprovisioned for Ceph alone, but the same hosts are also virtualization hosts for around 100 VMs.
And yes, the NVMe drives are enterprise disks which can saturate the NICs, but we use Zabbix to monitor the performance and NIC utilization.
 
I have now set the PG autoscale mode to on and the target ratio to 1. The number of PGs has automatically increased to 1569, but the results are even worse. Now I'm getting 7-39MB/s with the same test, and it is not consistent from run to run.
And I'm getting the warning "1 pool(s) have non-power-of-two pg_num".
 
Well, depending on what the autoscaler now determines to be the new optimal PG num for the pool nvmePool, it will automatically adapt it slowly.

You should see the PG num for the pool increase slowly over time. Until then, the warning can show up, but once the optimal number of PGs is reached, it should be back at a power of 2.

Is this the setting you mention in your answer? Do I have to set the min. PGs?
Please leave the min. PG setting at the default of 32. With a target ratio, and with just the one pool, the autoscaler knows that you expect this pool to take up all the available space in the cluster. It will determine the correct PG num.

Can I increase the PGs on an existing pool? It fails in the GUI. Or do I have to create a new pool?
On a replicated pool (erasure-coded pools are another possibility), you can change the PG num on a running cluster. Ceph will then rebalance the data and either split or merge PGs.
Did you get an error when trying to set it to 4800 PGs? That might be a bit too much, or it fails because it is not a power of 2. The PG calculator thinks 2048 PGs work best for that number of OSDs; therefore, the autoscaler should come to the same conclusion.
[screenshot attached]

Yes, the hosts would be overprovisioned for Ceph alone, but the same hosts are also virtualization hosts for around 100 VMs.
To avoid any misunderstanding here: with overprovisioned I don't mean that the CPU is sized too large, but rather that there is too much load for the CPU. I.e., it is overprovisioned in terms of what should be running on it vs. what it can handle.

You already see some CPU pressure, which means some processes have to wait until they get CPU time. Adding additional guests to the hosts will most likely only worsen the situation. That's why you need to scale your hosts considerably larger when you plan a hyperconverged setup, with not just additional memory, but also additional CPU cores.

Ignoring the threads, as hyperthreading will only help in some situations, you have 12 OSDs per node, plus 1 MON, plus at minimum one core for Proxmox VE and its services, better two.
That means that of the 24 available cores, you are already spending 15 just to keep the cluster working. If you then hand out additional cores to VMs, the situation will get worse, depending of course on how CPU intensive your guests are. If they are idling almost all the time, you can overprovision the hosts quite a bit (configure more CPU cores than physically available), but if the guests run CPU intensive tasks, you will see the CPU pressure go up. The result can be worse performance and, in the worst case, an unstable Ceph cluster, once OSDs and other services have to wait too long for CPU time.
 
Thanks for the explanation. After 14 hours of waiting, the PGs on the pool are still at 1569 (same as 14 hours ago), even though, as you mentioned, the optimal PG num is displayed as 2048 (see below in red). Does this only increase if there is some load on the pool? Here are the settings of my pool.

[screenshot of the pool settings]

So maybe it would be better to have only 4 OSDs (1 OSD per disk) on each node, with regard to the CPU count? That would mean only 6-7 CPU cores are needed without VMs. Or would this increase the CPU pressure?

Back to my initial question: is there a limitation per VM, or some fair use policy, that prevents a single VM from getting the full speed of the underlying Ceph storage? As you see in the first post, if the disk speed test runs in only 1 VM it transfers 40MB/s, but if I run it in 2 separate VMs simultaneously on the same host I'm getting 80MB/s in total, so I think the limiting factor is not the Ceph part. Or am I wrong with this assumption?
 
After 14 hours of waiting, the PGs on the pool are still at 1569 (same as 14 hours ago), even though, as you mentioned, the optimal PG num is displayed as 2048 (see below in red)
Then it is possible that something interrupted the autoscaler. You can fix it by manually setting the # of PGs to 2048.
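If the GUI does not cooperate, this can also be done on the CLI; a sketch, assuming the pool is still called nvmePool:

Code:
# raise the number of placement groups on the existing pool
ceph osd pool set nvmePool pg_num 2048
# recent Ceph releases adjust pgp_num automatically; otherwise also run:
# ceph osd pool set nvmePool pgp_num 2048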

So maybe it would be better to have only 4 OSDs (1 OSD per disk) on each node?
How are they currently set up? Multiple OSDs per SSD? If so, then yes, having only one OSD per SSD will most likely improve the situation on the CPU pressure side.

To do so in the safest way, set the OSDs that share one disk to OUT. Wait for Ceph to rebalance the data. Once the PG column in the OSD panel is down to zero and Ceph shows healthy, with all PGs in green state, you can stop and destroy these OSDs. Once all OSDs on a single SSD are destroyed, you can go ahead and create a new OSD on this SSD.
Wait for Ceph to be done with the rebalance and repeat for the next SSD.
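The corresponding CLI sequence could look roughly like this. This is only a sketch; the OSD ID and the device path are examples, adjust them to your setup:

Code:
# mark one of the OSDs sharing an NVMe (e.g. osd.36) as out
ceph osd out 36
# wait until the rebalance is done and Ceph is healthy again, then
# stop and destroy the OSD (--cleanup wipes its data on the disk)
systemctl stop ceph-osd@36.service
pveceph osd destroy 36 --cleanup
# once all OSDs on that NVMe are destroyed, create one new OSD on it
pveceph osd create /dev/nvme0n1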

Back to my initial question: is there a limitation per VM, or some fair use policy, that prevents a single VM from getting the full speed of the underlying Ceph storage? As you see in the first post, if the disk speed test runs in only 1 VM it transfers 40MB/s, but if I run it in 2 separate VMs simultaneously on the same host I'm getting 80MB/s in total, so I think the limiting factor is not the Ceph part. Or am I wrong with this assumption?
If a Ceph client (a VM in this case) wants to read or write data, it will not be rate limited out of the box, if that is what you are asking.
A Ceph cluster, if built properly, can provide a lot of performance, more than a single client might be able to consume. See our benchmark paper from last fall.
If you want to rate limit guests, configure bandwidth limits on their disks on the Proxmox VE level. There is a "Bandwidth" tab when you edit the disks of a guest.

Should you not get the expected performance, then you need to figure out what is bottlenecking the cluster. Ceph needs low latency and bandwidth on the network side. But also low latency on the CPU side. If an OSD can't get the CPU time it requires quickly, it will slow down the entire cluster.

SSDs with PLP are recommended because most writes of an OSD will be in "sync" mode, which the SSD is only allowed to ACK once the data is written in a way that power can be lost at that moment without any data loss. SSDs with PLP can do that with a good conscience once the data is in their internal RAM, as the capacitors provide enough energy to write the RAM contents to non-volatile memory should power be lost. Consumer SSDs without PLP either appear slow because of this, or, if they appear similarly fast, chances are good that the firmware is lying about it.
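If you ever want to check how a disk behaves for such sync writes, a single-threaded fio run against the raw device is one common approach. This is only a sketch, it is not part of the steps above, and it is destructive (it overwrites data on the given device), so only run it against an empty spare disk; the device path is an example:

Code:
# WARNING: this writes directly to the device and destroys its contents
fio --name=sync-write-test --filename=/dev/nvme0n1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --ioengine=libaio --direct=1 --fsync=1 \
    --runtime=60 --time_based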
 
I'm not asking how to limit the disk speed of 1 VM; I know that is possible in the settings of a VM.
I also don't think that Ceph itself limits the VM disk speed. What I mean is: if 1 VM runs the disk speed test, it reaches 40MB/s, but if the test runs in 2 VMs, Ceph delivers 80MB/s in total. So there has to be some limitation between the virtualization part and the Ceph part. I have not set any limits, so I suspect there is some default setting in Proxmox VE, KVM, the integration tools or some other component.
I mean the Ceph part is clearly capable of at least 80MB/s, but if the test runs in 1 VM I'm getting only 40MB/s.
 
Hmm, okay, so if you can achieve double the performance with 2 VMs in total, you can try to see if the VM configs can be improved upon.

One thing could be to switch from direct RBD to KRBD (the host kernel connects to RBD instead of QEMU directly). To change this, edit the storage in Datacenter -> Storage and enable the KRBD checkbox. Once a VM is booted from scratch, or is live migrated, this option will take effect.
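For reference, the same flag can be toggled on the CLI as well; a sketch, assuming the storage is named nvmePool as in the VM configs above and that the krbd option name is unchanged:

Code:
# enable KRBD for the RBD storage definition
pvesm set nvmePool --krbd 1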

Use SCSI + the virtio-scsi-single controller instead of VirtIO block.
This will be a bit tougher, especially on Windows VMs, if it is the boot disk. The procedure is basically the same as here: https://pve.proxmox.com/wiki/Paravirtualized_Block_Drivers_for_Windows
Attach a dummy disk with the SCSI bus type and wait until Windows detects it before you attempt to switch the boot disk.
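Such a temporary dummy disk could also be attached from the CLI; VMID, storage name and size are placeholders in this sketch:

Code:
# attach a small temporary SCSI disk so Windows loads the driver
qm set 100 --scsi1 nvmePool:1
# after Windows has detected it, detach it again
# (it then shows up as an unused disk and can be removed)
qm set 100 --delete scsi1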
 
Hi Aaron

I have now configured the Ceph part from scratch with only one OSD per NVMe disk. The speed was at 40MB/s again. After I set the KRBD flag on the storage and changed the disk controller to virtio-scsi-single and the disk interface to SCSI, I'm getting 107MB/s. I have tested a little bit further: changing the Async IO on the VM disk from the default (io_uring) to threads makes a big difference. With this setting I'm getting 173MB/s, which is close to the tests with Hyper-V and Storage Spaces Direct; this is OK for me. I will do some further testing if I find the time.

Here is my VM config for the sake of completeness:
Code:
agent: 1
balloon: 0
boot: order=scsi0;ide0;net0
cores: 8
cpu: x86-64-v2-AES
ide0: none,media=cdrom
machine: pc-i440fx-9.0
memory: 16384
meta: creation-qemu=9.0.0,ctime=1721223854
name: Win2019-Test4
net0: virtio=BC:24:11:B4:3E:F8,bridge=vmbr0,firewall=1,tag=102
numa: 1
ostype: win10
scsi0: nvmePool1:vm-103-disk-0,aio=threads,iothread=1,size=50G
scsihw: virtio-scsi-single
smbios1: uuid=f74723d6-46b2-49b0-8005-e37bcbfdb2d7
sockets: 1
vmgenid: d61f0991-27d7-46da-bcd2-a1dee2e27d69

One other question:
I have 4 nodes with 4 NVMe disks in each node for the Ceph part. The crush rule is configured as replicated_rule 3/2.
Is Ceph aware of which OSD is in which host, so that the 3 copies will not be saved on 2 OSDs in the same node and a whole node can crash without Ceph having any problem?
 
I have 4 nodes with 4 NVMe disks in each node for the Ceph part. The crush rule is configured as replicated_rule 3/2.
Is Ceph aware of which OSD is in which host, so that the 3 copies will not be saved on 2 OSDs in the same node and a whole node can crash without Ceph having any problem?
Yes.
The crush rule defines on which level the replicas will be placed (for example OSD, host, server rack, etc.). The default is the host failure domain, which means only one replica per host. With this you can lose a complete host without any problems.
 
I have 4 nodes with 4 NVMe disks in each node for the Ceph part. The crush rule is configured as replicated_rule 3/2.
Is Ceph aware of which OSD is in which host, so that the 3 copies will not be saved on 2 OSDs in the same node and a whole node can crash without Ceph having any problem?
@Azunai333 already explained it, but here it is in more detail:

The crush map, which you can see on the right side of the Ceph -> Configuration panel, defines how Ceph sees the cluster topology. You have several buckets, which form the tree of the cluster.
You most likely have the default hierarchy: the first object of type root, named default, contains buckets of type host, and the hosts contain the OSDs.

The replicated_rule that Ceph creates by default looks like this (further down in the crush map):
Code:
# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

If you look at the steps, you see that it takes the default bucket, and the next step is already the chooseleaf step, which makes sure that the replicas are spread over the buckets of type host.

You can also run ceph pg dump pgs_brief, which will list all PGs and on which OSDs their replicas are stored. If you randomly pick a few and check them, the OSDs listed should all be on different nodes.
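To cross-check on which node a given OSD lives, something like this can be used; a small sketch, the OSD ID is an example:

Code:
# list the PGs with their acting OSD sets
ceph pg dump pgs_brief | head
# show where a specific OSD (e.g. osd.36) is located, including its host
ceph osd find 36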
 
The Ceph cluster is running well so far, but now I have warnings that the monitors have low disk space. If I delete some data on the storage "local", the warning disappears. Is the storage for the monitors configurable? And how do I calculate the space needed? Will the needed space grow over time, or is it defined by the size and settings of the pool?
 
Is the storage for the monitors configurable?
They are stored on the root FS of the host they run on: /var/lib/ceph/mon/…

Ideally you have enough space available. If you don't need the local-lvm storage at all, you could think about removing the automatically created pve/data LV and expanding the pve/root LV and the filesystem in /.

Don't forget to remove the entry for local-lvm in Datacenter -> Storage as well.

Should you, or anyone else who reads this in the future, not need the local-lvm storage, you can set the maxvz disk option to 0 in the installer. Then there won't be a local-lvm and the root FS will get that space.
See https://pve.proxmox.com/pve-docs/pve-admin-guide.html#advanced_lvm_options
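On an already installed node, the manual variant could look roughly like this. This is only a sketch and it is destructive: it deletes everything stored on local-lvm, so only do it if that storage is empty (or backed up), and it assumes the default LVM layout with an ext4 root:

Code:
# remove the thin pool backing local-lvm (destroys its contents!)
lvremove /dev/pve/data
# give the freed space to the root LV and grow the filesystem
lvextend -l +100%FREE /dev/pve/root
resize2fs /dev/pve/root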
 
After I set the KRBD flag on the storage and changed the disk controller to virtio-scsi-single and the disk interface to SCSI, I'm getting 107MB/s. [...] Changing the Async IO on the VM disk from the default (io_uring) to threads makes a big difference. With this setting I'm getting 173MB/s [...]
Hello!
Please tell me: after enabling the KRBD option on the storage, were any further actions needed, for example restarting all servers of the Ceph cluster or rebooting the VMs? Or did the setting work immediately and the tests show a speed increase in the guest OS?
 
Please tell me: after enabling the KRBD option on the storage, were any further actions needed, for example restarting all servers of the Ceph cluster or rebooting the VMs? Or did the setting work immediately and the tests show a speed increase in the guest OS?
The guest needs to do a clean restart or a live migration to another node to switch to KRBD.
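For example, a live migration triggered on the CLI is already enough; VMID and target node are placeholders in this sketch:

Code:
# live migrate the guest so it is started with the new (KRBD) mapping
qm migrate 100 PXN-A12 --online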
 
