[SOLVED] Ceph OSD latency on one node

Jan 15, 2024
Hi,

We are running a three-node hyperconverged cluster and are seeing high latency on only one of the nodes.
On two nodes, OSD latency varies between 1 and 5 ms.
On the last node it can reach up to 300 ms.
PG autoscale is on and the CRUSH rule is replicated_rule.
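For reference, the per-OSD latencies the GUI shows and the pool settings can be cross-checked from the CLI (standard Ceph commands, nothing cluster-specific):
Code:
# per-OSD commit/apply latency in ms, as reported by the mgr
ceph osd perf

# autoscaler status and the CRUSH rule in use
ceph osd pool autoscale-status
ceph osd crush rule dump replicated_rule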

The nodes are strictly identical:

Dell PowerEdge R640
CPU Intel(R) Xeon(R) Gold 6130
RAM 6x 32 GB DDR4 2666 MHz
Controller Dell HBA330 v16.17.01.00
3x SSD SATA VK001920GWSXK 1.92 TB
Network Intel(R) Ethernet 10G 4P X540
PVE 7.4-17 & Ceph 17.2.6

I haven't identified any hardware problems and the network seems OK.
Any idea where this latency could be coming from? Are there any known issues with these versions?


Technical information:

root@asi-prd-01:~# ceph -s
Code:
cluster:
    id:     7aacf6bf-19ba-4abc-868b-68cbdc9a0bb8
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum asi-prd-01,asi-prd-02,asi-prd-03 (age 2d)
    mgr: asi-prd-03(active, since 10d), standbys: asi-prd-02, asi-prd-01
    osd: 9 osds: 9 up (since 2d), 9 in (since 2d)
 
  data:
    pools:   3 pools, 97 pgs
    objects: 287.00k objects, 1.0 TiB
    usage:   3.2 TiB used, 13 TiB / 16 TiB avail
    pgs:     97 active+clean
 
  io:
    client:   16 MiB/s rd, 315 KiB/s wr, 310 op/s rd, 35 op/s wr

root@asi-prd-01:~# hdparm -tT /dev/sdb
Code:
/dev/sdb:
 Timing cached reads:   18478 MB in  1.99 seconds = 9305.55 MB/sec
 Timing buffered disk reads: 1324 MB in  3.01 seconds = 440.27 MB/sec
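hdparm only measures sequential reads, which says little about the small synchronous writes Ceph cares about. A minimal fio sketch for that, assuming fio is installed and /dev/sdX is a disk without data on it (the test writes to the raw device):
Code:
# 4k synchronous writes at queue depth 1, roughly the pattern of the BlueStore WAL
# WARNING: destructive, only run against a wiped disk / an OSD that has been removed
fio --name=syncwrite --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based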


root@asi-prd-01:~# pveversion -v
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.131-2-pve)
pve-manager: 7.4-17 (running version: 7.4-17/513c62be)
pve-kernel-5.15: 7.4-9
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.126-1-pve: 5.15.126-1
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.4-1
proxmox-backup-file-restore: 2.4.4-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-6
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.14-pve1

root@asi-prd-01:~# rados -p test bench 30 write
Code:
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_asi-prd-01_989622
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        16         0         0         0           -           0
    2      16        21         5   9.95536        10     1.97451     1.60941
    3      16        24         8    10.612        12       2.767     2.02831
    4      16        29        13   12.9299        20     1.77919     2.27624
    5      15        33        18   14.3197        20     3.77611       2.757
    6      16        43        27   17.8922        36     1.58408     2.88553
    7      16        50        34   19.2919        28      1.8946     2.79278
    8      16        54        38    18.867        16     2.14249      2.7758
    9      16        59        43   18.9795        20     2.09039     2.76092
   10      16        65        49   19.4603        24     2.68145     2.76326
   11      16        70        54   19.4949        20     1.46891     2.74555
   12      16        76        60    19.855        24     3.21977     2.80772
   13      16        81        65   19.8572        20     4.12265     2.82984
   14      16        87        71   20.1416        24     2.35672     2.84354
   15      16        93        77    20.388        24     2.44081     2.81861
   16      15        95        80    19.854        12     1.99903     2.83293
   17      16       101        85   19.8517        20     2.03777     2.84295
   18      16       110        94   20.7329        36     2.33814     2.88426
   19      15       113        98   20.4677        16     1.90445     2.84914
2024-01-15T14:22:34.919623+0100 min lat: 1.24598 max lat: 5.41299 avg lat: 2.85163
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   20      16       120       104   20.6389        24      2.8987     2.85163
   21      16       123       107   20.2227        12     1.70059     2.86109
   22      16       132       116   20.9259        36     2.46603     2.89194
   23      16       135       119   20.5351        12     2.41966     2.89056
   24      16       139       123   20.3403        16     3.17779     2.89162
   25      16       148       132   20.9553        36     3.34379     2.90209
   26      16       150       134   20.4556         8      2.6792        2.89
   27      16       159       143   21.0183        36     2.18752      2.9074
   28      15       164       149   21.1184        24     2.39634     2.91213
   29      16       169       153   20.9379        16     1.84626     2.88639
   30      14       175       161   21.2989        32     3.36069     2.86418
   31       8       175       167     21.38        24      2.8363     2.85464
Total time run:         31.8613
Total writes made:      175
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     21.9702
Stddev Bandwidth:       9.16046
Max bandwidth (MB/sec): 36
Min bandwidth (MB/sec): 0
Average IOPS:           5
Stddev IOPS:            2.31219
Max IOPS:               9
Min IOPS:               0
Average Latency(s):     2.84716
Stddev Latency(s):      0.879404
Max latency(s):         5.41299
Min latency(s):         1.24598
Cleaning up (deleting benchmark objects)
Removed 175 objects
Clean up completed and total clean up time :0.10798
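To see whether a single disk stands out, each OSD can also be benchmarked individually with the built-in OSD bench (writes 1 GiB in 4 MiB chunks per OSD):
Code:
ceph tell osd.* bench 1073741824 4194304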


root@asi-prd-01:~# iperf3 -c 10.0.50.2
Code:
Connecting to host 10.0.50.2, port 5201
[  5] local 10.0.50.1 port 35980 connected to 10.0.50.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.06 GBytes  9.15 Gbits/sec  1687    987 KBytes       
[  5]   1.00-2.00   sec  1.05 GBytes  9.02 Gbits/sec  1467   1.01 MBytes       
[  5]   2.00-3.00   sec  1.06 GBytes  9.09 Gbits/sec   17   1.47 MBytes       
[  5]   3.00-4.00   sec  1.08 GBytes  9.24 Gbits/sec  605   1.24 MBytes       
[  5]   4.00-5.00   sec   840 MBytes  7.05 Gbits/sec  1475    990 KBytes       
[  5]   5.00-6.00   sec  1.07 GBytes  9.20 Gbits/sec  512   1.17 MBytes       
[  5]   6.00-7.00   sec  1.07 GBytes  9.17 Gbits/sec  324   1.36 MBytes       
[  5]   7.00-8.00   sec  1.07 GBytes  9.23 Gbits/sec  1087   1.05 MBytes       
[  5]   8.00-9.00   sec  1.06 GBytes  9.07 Gbits/sec    0   1.48 MBytes       
[  5]   9.00-10.00  sec   860 MBytes  7.22 Gbits/sec  450   1.37 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  10.2 GBytes  8.74 Gbits/sec  7624             sender
[  5]   0.00-10.04  sec  10.2 GBytes  8.71 Gbits/sec                  receiver


iperf Done.
 
On the last node it can reach up to 300 ms.
Have you tried restarting these OSDs?

What is the output of ceph osd df tree?

In the future I would use iperf instead of iperf3. The former is properly multithreaded and will not be CPU bottlenecked when using the -P parameter. That can become an issue once you get into the realm of 100 Gbit networks.
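For example, a parallel-stream test with iperf2 could look like this (assuming the iperf package is installed on both nodes):
Code:
# on the receiving node
iperf -s

# on the sending node: 4 parallel streams for 30 seconds
iperf -c 10.0.50.2 -P 4 -t 30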
 
On two nodes, OSD latency varies between 1 and 5 ms.
On the last node it can reach up to 300 ms.
Where do you see that? In the GUI, on the OSD overview?

The problem always occurs, no matter which node you start the benchmark on, right?

As far as I can tell, these aren't really the best SSDs, so they might be part of the problem themselves. Otherwise, have you looked at the SMART values?

What do your metrics say regarding the latency of the individual OSDs? Are these sometimes lower?
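The SMART values can be pulled with smartmontools, which is already installed according to your pveversion output (device names are just examples, adjust to your OSD disks):
Code:
smartctl -a /dev/sdb
smartctl -a /dev/sdc
smartctl -a /dev/sdd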
 
All the OSDs were restarted without any improvement.
I also stopped, destroyed and recreated the OSDs; the latency no longer reaches 300 ms, but it is still high on that node.
I will try with iperf, but the network seems OK.

1705584447923.png

This is the latency I'm seeing in the GUI. When the OSDs are not in use, the latencies look normal.

1705583438241.png
I can reproduce this easily by running a backup. I ran two tests:

Backup of a VM on the affected node:

1705583664243.png

I can see that the CPU usage increases quickly on the node.
1705583753935.png

The backup ran with transfer rates fluctuating between 50 and 300 MB/s:
Code:
INFO: transferred 52.00 GiB in 324 seconds (164.3 MiB/s)

I migrated the VM to node 2. When I start the backup there, the latency on that node doesn't change, but I still see latency on node 1.
The CPU usage of node 2, where the backup runs, stays constant.

Code:
INFO: transferred 52.00 GiB in 41 seconds (1.3 GiB/s)

1705583900178.png
I'm well aware that these disks aren't the best performers, but I don't understand the difference between the nodes; the hardware is exactly the same.
What's the best way to monitor this latency, and Ceph performance in general?
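A possible starting point, using only the built-in Ceph tooling (osd.0 is just an example ID; run the daemon command on the node hosting that OSD):
Code:
# commit/apply latency for all OSDs, the same numbers the GUI shows
ceph osd perf

# detailed per-OSD counters, e.g. read/write op latency
ceph daemon osd.0 perf dump | grep -A 3 -E '"op_w_latency"|"op_r_latency"'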
 

Attachments

  • 1705583605791.png (47.7 KB)