Ceph OSD slow OPS on all OSD's

bbn

New Member
Dec 23, 2020
4
0
1
48
Hi,

I recently moved from Proxmox + iSCSI ZFS storage to a 3-node hyper-converged Proxmox cluster running Proxmox 6.3 and Ceph Octopus.
The cluster has 1GbE interfaces for VM traffic and uses a 40Gbps InfiniBand network for the Proxmox cluster and the Ceph cluster.
I have a redundant pair of InfiniBand switches, with different partitions for the Proxmox cluster, Ceph frontend and Ceph backend interfaces. I run the Ceph frontend and backend across different switches to be sure they take different paths.
The Ceph cluster has 2 storage pools with 66 disks in total (36 SSD & 30 HDD) equally distributed across the 3 nodes.

I tested the cluster before bringing it into operation (iperf, rados bench, dd) and got good read and write performance of over 1 GB/s with no issues. But once I started loading it with VMs under normal operation, I began to get slow OPS whenever VMs handle medium loads of data, such as copying large files or, recently, installing a new VM. The slow OPS appear on pretty much all OSDs, seemingly at random, across all nodes. The OSDs tend to 'hang', basically locking up every VM on the cluster. Restarting the OSDs in question solves the problem.

I've also seen slow OSD heartbeat messages, so I suspected the InfiniBand network. I set up a continuous ping and was able to reproduce the issues while it was running. The ping, however, does not show a single RTT above 0.15 ms on the front network; the back network occasionally goes up to 3 ms, but is mostly also around 0.1 ms. The traffic on the interfaces does not go over 4-5 Gbps. The OSD ping delays are much bigger and again seem to go away when I restart the OSDs, so it looks to me like the OSD processes are hanging. I upgraded from Nautilus to Octopus, but that did not improve the situation.

I would appreciate any help or pointers.

Here is my pveversion -v:
# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph: 15.2.6-pve1
ceph-fuse: 15.2.6-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Benchmarks (no issues):
(inside vm)
# dd if=/dev/zero of=here bs=20G count=1 oflag=direct
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 4.9323 s, 435 MB/s
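
A side note on the dd result above, for anyone reading along: with bs=20G dd issues a single 20 GiB read, but Linux transfers at most 0x7ffff000 bytes per read()/write() call. That is why the output shows "0+1 records" (one partial record) and 2147479552 bytes, so this run wrote about 2 GiB rather than 20 GiB. The small check below is only a sketch of that arithmetic; the variable names are mine, the numbers come from the output above.

```python
# Linux caps a single read()/write() call at 0x7ffff000 bytes.
MAX_RW_COUNT = 0x7FFFF000

bytes_copied = 2147479552   # from the dd output above
seconds = 4.9323            # from the dd output above

# The one partial record is exactly the per-call cap.
assert bytes_copied == MAX_RW_COUNT

# dd reports decimal megabytes (1 MB = 10**6 bytes).
throughput_mb_s = bytes_copied / seconds / 1e6
print(f"{bytes_copied / 2**30:.2f} GiB copied at {throughput_mb_s:.0f} MB/s")
```

To actually write 20 GiB, a smaller block size with a count (for example bs=1M count=20480) would avoid the partial record.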

(on cluster nodes)
# rados bench -p ssd_pool 30 write --no-cleanup
Total time run: 30.0452
Total writes made: 8675
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1154.93
Stddev Bandwidth: 56.661
Max bandwidth (MB/sec): 1256
Min bandwidth (MB/sec): 1028
Average IOPS: 288
Stddev IOPS: 14.1652
Max IOPS: 314
Min IOPS: 257
Average Latency(s): 0.055374
Stddev Latency(s): 0.0133843
Max latency(s): 0.275931
Min latency(s): 0.0245358
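
The write figures above are internally consistent, which can be checked with a few lines. This is only a sketch: the concurrency of 16 is an assumption (it is the rados bench default and no -t option was given), and the variable names are mine.

```python
OBJECT_BYTES = 4194304      # "Write size" from the output above (4 MiB)
writes = 8675               # "Total writes made"
seconds = 30.0452           # "Total time run"
avg_latency_s = 0.055374    # "Average Latency(s)"
CONCURRENCY = 16            # assumption: rados bench default number of in-flight ops

# The reported bandwidth matches megabytes of 2**20 bytes.
bandwidth = writes * OBJECT_BYTES / seconds / 2**20
iops = writes / seconds
# Little's law: ops in flight = throughput * latency.
iops_from_latency = CONCURRENCY / avg_latency_s

print(f"{bandwidth:.2f} MB/s, {iops:.1f} IOPS, "
      f"{iops_from_latency:.1f} IOPS implied by latency")
```

Note that this benchmark is 4 MiB objects with many operations in flight, which is a different pattern from the small 4 KiB writes that show up in the slow request log further down.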

# rados bench -p ssd_pool 30 seq
Total time run: 26.2841
Total reads made: 8675
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1320.19
Average IOPS: 330
Stddev IOPS: 23.8767
Max IOPS: 397
Min IOPS: 281
Average Latency(s): 0.0474705
Max latency(s): 0.315668
Min latency(s): 0.0128199

# rados bench -p ssd_pool 30 rand
Total time run: 30.0651
Total reads made: 10072
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1340.02
Average IOPS: 335
Stddev IOPS: 25.6896
Max IOPS: 396
Min IOPS: 297
Average Latency(s): 0.046748
Max latency(s): 0.249803
Min latency(s): 0.00535442

In normal operations when issue occurs:

2020-12-23T16:14:56.492856+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:97068 5.e 5:700fc2e6:::rbd_data.cf37edf65bf59f.0000000000001b6c:head [set-alloc-hint object_size 4194304 write_size 4194304,write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:23.322144+0100 currently waiting for sub ops


2020-12-23T16:14:56.492875+0100 osd.57 [WRN] slow request osd_op(client.10751556.0:1506598 5.e 5:70493bc6:::rbd_data.ebc2a19f8930c.0000000000000818:head [write 32768~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:54.920373+0100 currently delayed


2020-12-23T16:14:56.492887+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:98407 5.1b6 5:6deb6e6f:::rbd_data.cf37edf65bf59f.0000000000001d18:head [write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:09:01.693417+0100 currently delayed


2020-12-23T16:14:56.492905+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:96803 5.1b6 5:6d96f34a:::rbd_data.cf37edf65bf59f.0000000000001b1c:head [write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:19.537915+0100 currently waiting for sub ops


2020-12-23T16:14:56.492923+0100 osd.57 [WRN] slow request osd_op(client.13582959.0:97522 5.1b6 5:6dec277f:::rbd_data.cf37edf65bf59f.0000000000001bbb:head [write 262144~3932160 in=3932160b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:07:31.146953+0100 currently waiting for sub ops


2020-12-23T16:14:56.492935+0100 osd.57 [WRN] slow request osd_op(client.10437540.0:2942813 5.1b6 5:6de767e2:::rbd_data.450f3449e0b4c.00000000000000eb:head [write 3878912~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected e5635) initiated 2020-12-23T16:08:27.815367+0100 currently delayed
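
Each of these lines carries two timestamps: when the warning was logged and when the op was initiated. The difference is how long the op had been stuck. A minimal sketch for extracting that, assuming the line layout shown above; the function name is hypothetical and the sample lines are abridged copies of the first two log lines (the osd_op(...) body is cut).

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"
SLOW_REQUEST = re.compile(
    r"^(?P<logged>\S+) (?P<osd>osd\.\d+) \[WRN\] slow request .*"
    r" initiated (?P<initiated>\S+) currently (?P<state>.+)$"
)

def parse_slow_request(line):
    """Return (osd, seconds_blocked, state), or None if the line does not match."""
    m = SLOW_REQUEST.match(line.strip())
    if m is None:
        return None
    logged = datetime.strptime(m["logged"], TS_FORMAT)
    initiated = datetime.strptime(m["initiated"], TS_FORMAT)
    return m["osd"], (logged - initiated).total_seconds(), m["state"]

# Abridged copies of the first two log lines above.
sample = [
    "2020-12-23T16:14:56.492856+0100 osd.57 [WRN] slow request osd_op(...) "
    "initiated 2020-12-23T16:07:23.322144+0100 currently waiting for sub ops",
    "2020-12-23T16:14:56.492875+0100 osd.57 [WRN] slow request osd_op(...) "
    "initiated 2020-12-23T16:07:54.920373+0100 currently delayed",
]

for line in sample:
    osd, blocked, state = parse_slow_request(line)
    print(f"{osd}: blocked for {blocked:.0f} s, {state}")
```

For the lines quoted here, the ops had been blocked for roughly seven minutes by the time the warning was logged.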


When I run: ceph daemon osd.33 dump_ops_in_flight
Some OPS are delayed, last event: waiting for readable
Some OPS are waiting for sub ops, last event: sub_op_commit_rec


"ops": [
    {
        "description": "osd_op(client.10475257.0:213323 5.11e 5:78ef0274:::rbd_header.85751850770c50:head [watch ping cookie 140573457871744 gen 2] snapc 0=[] ondisk+write+known_if_redirected e5619)",
        "initiated_at": "2020-12-23T09:10:47.469872+0100",
        "age": 307.06381716800001,
        "duration": 307.06390489900002,
        "type_data": {
            "flag_point": "delayed",
            "client_info": {
                "client": "client.10475257",
                "client_addr": "10.2.20.112:0/1863972183",
                "tid": 213323
            },
            "events": [
                {
                    "event": "initiated",
                    "time": "2020-12-23T09:10:47.469872+0100",
                    "duration": 0
                },
                {
                    "event": "throttled",
                    "time": "2020-12-23T09:10:47.469872+0100",
                    "duration": 2.8059999999999999e-06
                },
                {
                    "event": "header_read",
                    "time": "2020-12-23T09:10:47.469875+0100",
                    "duration": 5.2959999999999998e-06
                },
                {
                    "event": "all_read",
                    "time": "2020-12-23T09:10:47.469880+0100",
                    "duration": 6.0100000000000005e-07
                },
                {
                    "event": "dispatched",
                    "time": "2020-12-23T09:10:47.469880+0100",
                    "duration": 6.2639999999999997e-06
                },
                {
                    "event": "queued_for_pg",
                    "time": "2020-12-23T09:10:47.469887+0100",
                    "duration": 5.1799999999999999e-05
                },
                {
                    "event": "reached_pg",
                    "time": "2020-12-23T09:10:47.469939+0100",
                    "duration": 2.7971999999999999e-05
                },
                {
                    "event": "waiting for readable",
                    "time": "2020-12-23T09:10:47.469966+0100",
                    "duration": 9.4739000000000002e-05
                }
            ]
        }
    },


Slow heartbeat:
2020-12-23T17:04:24.087770+0100 mon.node1 [WRN] Health check failed: Slow OSD heartbeats on back (longest 1258.257ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-23T17:04:24.087853+0100 mon.node1 [WRN] Health check failed: Slow OSD heartbeats on front (longest 1060.077ms) (OSD_SLOW_PING_TIME_FRONT)
2020-12-23T17:04:24.087890+0100 mon.node1 [WRN] Health check failed: 1 slow ops, oldest one blocked for 31 sec, osd.55 has slow ops (SLOW_OPS)
2020-12-23T17:04:28.116156+0100 mon.node1 [INF] Health check cleared: SLOW_OPS (was: 1 slow ops, oldest one blocked for 31 sec, osd.55 has slow ops)
2020-12-23T17:04:30.179199+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on back (longest 1851.623ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-23T17:04:38.283704+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on front (longest 1295.957ms) (OSD_SLOW_PING_TIME_FRONT)
2020-12-23T17:04:54.435691+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on back (longest 2171.717ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-23T17:04:54.435744+0100 mon.node1 [WRN] Health check update: Slow OSD heartbeats on front (longest 1901.092ms) (OSD_SLOW_PING_TIME_FRONT)

# ceph daemon /var/run/ceph/ceph-mgr.node1.asok dump_osd_network 0|more
{
    "threshold": 0,
    "entries": [
        {
            "last update": "Wed Dec 23 17:08:08 2020",
            "stale": false,
            "from osd": 62,
            "to osd": 19,
            "interface": "back",
            "average": {
                "1min": 0.709,
                "5min": 480.570,
                "15min": 179.110
            },
            "min": {
                "1min": 0.548,
                "5min": 0.524,
                "15min": 0.502
            },
            "max": {
                "1min": 0.917,
                "5min": 13553.647,
                "15min": 13553.647
            },
            "last": 0.663
        },
        {
            "last update": "Wed Dec 23 17:07:53 2020",
            "stale": false,
            "from osd": 62,
            "to osd": 18,
            "interface": "back",
            "average": {
                "1min": 0.728,
                "5min": 480.443,
                "15min": 171.896
            },
            "min": {
                "1min": 0.537,
                "5min": 0.486,
                "15min": 0.486
            },
            "max": {
                "1min": 0.961,
                "5min": 13553.526,
                "15min": 13553.526
            },
            "last": 0.682
        },
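
The notable thing in this output is the gap between the 1-minute figures (under 1 ms) and the 5-minute maximum (about 13.5 s, assuming the values are in milliseconds like the health warnings). A sketch for pulling such spikes out of the full dump follows. The function name and the 1000 ms limit are my assumptions, and the embedded sample is the two entries above with some fields trimmed and closed off so that it parses as JSON.

```python
import json

# The two entries quoted above, trimmed and closed so that they form valid JSON.
RAW = """
{"threshold": 0, "entries": [
  {"from osd": 62, "to osd": 19, "interface": "back",
   "average": {"1min": 0.709, "5min": 480.570, "15min": 179.110},
   "max": {"1min": 0.917, "5min": 13553.647, "15min": 13553.647}},
  {"from osd": 62, "to osd": 18, "interface": "back",
   "average": {"1min": 0.728, "5min": 480.443, "15min": 171.896},
   "max": {"1min": 0.961, "5min": 13553.526, "15min": 13553.526}}
]}
"""

def spikes(report, window="5min", limit_ms=1000.0):
    """Yield (from_osd, to_osd, interface, max_ms) where the max exceeds limit_ms."""
    for entry in report["entries"]:
        worst = entry["max"][window]
        if worst > limit_ms:
            yield entry["from osd"], entry["to osd"], entry["interface"], worst

report = json.loads(RAW)
for src, dst, iface, worst in sorted(spikes(report), key=lambda t: -t[3]):
    print(f"osd.{src} -> osd.{dst} ({iface}): max {worst / 1000:.1f} s in 5 min")
```

Run against the complete dump, this would show whether the spikes cluster on particular OSDs or hosts, or are spread evenly.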

Infiniband tunables per proxmox wiki recommendations:
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=16777216
net.core.wmem_default=16777216
net.core.optmem_max=1524288
net.ipv4.tcp_sack=0
net.ipv4.tcp_timestamps=0
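
One detail that is easy to miss in these settings is that the units differ: net.ipv4.tcp_mem is counted in memory pages, while tcp_rmem and tcp_wmem are in bytes. The sketch below converts the values above; the 4 KiB page size is an assumption (the usual x86-64 default) and the helper name is hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86-64 default

SYSCTL = """
net.ipv4.tcp_mem=1280000 1280000 1280000
net.ipv4.tcp_wmem = 32768 131072 1280000
net.ipv4.tcp_rmem = 32768 131072 1280000
net.core.rmem_max=16777216
"""

def parse_sysctl(text):
    """Parse 'key = v1 v2 ...' lines into {key: [int, ...]}, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(v) for v in value.split()]
    return settings

settings = parse_sysctl(SYSCTL)
tcp_mem_bytes = settings["net.ipv4.tcp_mem"][2] * PAGE_SIZE  # pages -> bytes
tcp_rmem_max = settings["net.ipv4.tcp_rmem"][2]              # already bytes

print(f"tcp_mem max:  {tcp_mem_bytes / 2**30:.2f} GiB for all TCP sockets")
print(f"tcp_rmem max: {tcp_rmem_max / 2**20:.2f} MiB per socket")
```

So the same number, 1280000, means roughly 4.9 GiB in tcp_mem but only about 1.2 MiB as the per-socket ceiling in tcp_rmem and tcp_wmem.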
 

Alwin

Proxmox Staff Member
Aug 1, 2017
4,617
451
88
The Ceph cluster has 2 storage pools with 66 disks in total (36 SSD & 30 HDD) equally distributed across the 3 nodes.
Can you elaborate a little bit more on the hardware used? And how you configured Ceph with it?
 

bbn

Hi Alwin,

It is a standard Proxmox hyperconverged cluster install. The hardware is 3 identical HP DL380 G9 servers, each with:
  1. Dual CPU Xeon E5-2690
  2. 192GB RAM
  3. 1 x P840ar RAID controller in HBA mode
  4. 14x 12Gbps SAS SSD
  5. 10x 12Gbps SAS 10k HDD
  6. Dual port Mellanox ConnectX-3 QDR QSFP+ InfiniBand MCX354A-QCBT CX354A
The nodes are connected over 1Gbps Ethernet, but the Proxmox cluster runs over InfiniBand, using 2 Mellanox 4036 QDR InfiniBand switches. I was running iSCSI over ZFS on my old Proxmox cluster on these as well.
I have 4 InfiniBand partitions:
1) for the Proxmox cluster
2) for iSCSI over ZFS (while migrating VMs from the old cluster)
3) for the Ceph frontend network
4) for the Ceph backend network
I do not see the load on any InfiniBand port go over 5Gbps, and ping on the InfiniBand network is around 0.1 ms.

I now have about 8 VMs running on it, and I am holding off on further migration until I can solve this issue, since it impacts all VMs running on the cluster.

Is there any more information you are looking for?

Thanks,

Bart
 

bbn

Hi Alwin,

Here is the Ceph config, if this helps:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.2.21.111/24
fsid = 907b139a-1bf2-4010-a3f6-d89fda2347e4
mon_allow_pool_delete = true
mon_host = 10.2.20.111 10.2.20.112 10.2.20.114
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.2.20.111/24

[client]
keyring = /etc/pve/priv/$cluster.$name.keyring

[mon.node1]
public_addr = 10.2.20.111

[mon.node2]
public_addr = 10.2.20.112

[mon.node4]
public_addr = 10.2.20.114

Crush map:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class hdd
device 29 osd.29 class hdd
device 30 osd.30 class hdd
device 31 osd.31 class hdd
device 32 osd.32 class hdd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class hdd
device 37 osd.37 class hdd
device 38 osd.38 class hdd
device 39 osd.39 class hdd
device 40 osd.40 class hdd
device 41 osd.41 class ssd
device 42 osd.42 class ssd
device 43 osd.43 class ssd
device 44 osd.44 class ssd
device 45 osd.45 class ssd
device 46 osd.46 class ssd
device 47 osd.47 class ssd
device 48 osd.48 class ssd
device 49 osd.49 class ssd
device 50 osd.50 class hdd
device 51 osd.51 class hdd
device 52 osd.52 class hdd
device 53 osd.53 class hdd
device 54 osd.54 class hdd
device 55 osd.55 class ssd
device 56 osd.56 class ssd
device 57 osd.57 class ssd
device 58 osd.58 class hdd
device 59 osd.59 class hdd
device 60 osd.60 class hdd
device 61 osd.61 class hdd
device 62 osd.62 class hdd
device 63 osd.63 class ssd
device 64 osd.64 class ssd
device 65 osd.65 class ssd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host node1 {
id -3 # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
id -9 class hdd # do not change unnecessarily
# weight 12.552
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.364
item osd.2 weight 0.364
item osd.3 weight 0.364
item osd.4 weight 0.364
item osd.5 weight 0.364
item osd.6 weight 0.819
item osd.7 weight 0.819
item osd.8 weight 0.819
item osd.9 weight 0.819
item osd.10 weight 0.819
item osd.11 weight 0.364
item osd.12 weight 0.364
item osd.13 weight 0.364
item osd.14 weight 0.819
item osd.15 weight 0.819
item osd.16 weight 0.819
item osd.17 weight 0.819
item osd.18 weight 0.819
item osd.19 weight 0.364
item osd.20 weight 0.364
item osd.21 weight 0.364
item osd.1 weight 0.364
}
host node2 {
id -5 # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
id -10 class hdd # do not change unnecessarily
# weight 12.552
alg straw2
hash 0 # rjenkins1
item osd.22 weight 0.364
item osd.23 weight 0.364
item osd.24 weight 0.364
item osd.25 weight 0.364
item osd.26 weight 0.364
item osd.28 weight 0.819
item osd.29 weight 0.819
item osd.30 weight 0.819
item osd.31 weight 0.819
item osd.32 weight 0.819
item osd.33 weight 0.364
item osd.34 weight 0.364
item osd.36 weight 0.819
item osd.37 weight 0.819
item osd.38 weight 0.819
item osd.39 weight 0.819
item osd.40 weight 0.819
item osd.41 weight 0.364
item osd.42 weight 0.364
item osd.43 weight 0.364
item osd.27 weight 0.364
item osd.35 weight 0.364
}
host node4 {
id -7 # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 12.552
alg straw2
hash 0 # rjenkins1
item osd.44 weight 0.364
item osd.45 weight 0.364
item osd.46 weight 0.364
item osd.47 weight 0.364
item osd.48 weight 0.364
item osd.50 weight 0.819
item osd.51 weight 0.819
item osd.53 weight 0.819
item osd.54 weight 0.819
item osd.56 weight 0.364
item osd.57 weight 0.364
item osd.58 weight 0.819
item osd.59 weight 0.819
item osd.60 weight 0.819
item osd.61 weight 0.819
item osd.62 weight 0.819
item osd.63 weight 0.364
item osd.64 weight 0.364
item osd.49 weight 0.364
item osd.52 weight 0.819
item osd.55 weight 0.364
item osd.65 weight 0.364
}
root default {
id -1 # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
id -12 class hdd # do not change unnecessarily
# weight 37.657
alg straw2
hash 0 # rjenkins1
item node1 weight 12.552
item node2 weight 12.552
item node4 weight 12.552
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule replicated_hdd {
id 1
type replicated
min_size 1
max_size 10
step take default class hdd
step chooseleaf firstn 0 type host
step emit
}
rule replicated_ssd {
id 2
type replicated
min_size 1
max_size 10
step take default class ssd
step chooseleaf firstn 0 type host
step emit
}
# end crush map
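
Since the map is long, the number of OSDs per device class is easier to count with a script than by eye. A minimal sketch, assuming the decompiled text format shown above; the function name is hypothetical and the excerpt is just four device lines copied from the map.

```python
import re
from collections import Counter

DEVICE = re.compile(r"^device \d+ (osd\.\d+) class (\w+)$")

def count_by_class(crush_text):
    """Count OSDs per device class in a decompiled CRUSH map."""
    counts = Counter()
    for line in crush_text.splitlines():
        m = DEVICE.match(line.strip())
        if m:
            counts[m.group(2)] += 1
    return counts

# A short excerpt of the devices section above; pass the full map for real totals.
excerpt = """
device 5 osd.5 class ssd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 11 osd.11 class ssd
"""

print(dict(count_by_class(excerpt)))
```

Counting the full devices section above gives 36 ssd and 30 hdd OSDs, which is 12 SSD and 10 HDD OSDs per node.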
 

bbn

Update:
I commented out the InfiniBand tuning parameters in /etc/sysctl.conf about a week and a half ago:
#net.ipv4.tcp_mem=1280000 1280000 1280000
#net.ipv4.tcp_wmem = 32768 131072 1280000
#net.ipv4.tcp_rmem = 32768 131072 1280000
#net.core.rmem_max=16777216
#net.core.wmem_max=16777216
#net.core.rmem_default=16777216
#net.core.wmem_default=16777216
#net.core.optmem_max=1524288
#net.ipv4.tcp_sack=0
#net.ipv4.tcp_timestamps=0

I ran RADOS benchmarks after that and found it did not really affect performance, so I left them commented out.
I am still seeing the slow OSD issues, but there are fewer of them (about 1 per day), and so far they have resolved on their own after several minutes (no need to restart the OSDs). This was over the Christmas break, when there was not a lot of usage, so I'm not sure at this point whether the change improved the situation or there was simply less IO. It definitely did not solve the issue, though.

I have a dual-CPU system, so I am planning to install a second PCI riser and move the InfiniBand interfaces to the PCI bus connected to CPU 2, as both the HBA and the InfiniBand interfaces are on the same PCI bus now. I do not think it will change anything though, as the issues seem to arise even when there is not much IO.

I'd appreciate any further ideas.
 
