Hello,
I've upgraded a Proxmox 6.4-13 cluster with Ceph 15.2.x, which was running fine without any issues, to Proxmox 7.0-14 and Ceph 16.2.6. The cluster still works fine until a node is rebooted. Which OSDs generate the front and back slow ops is not predictable; each time different OSDs are affected.
Cluster:
3x Dell R740xd
Each server:
2x Xeon Gold 6130
384 GB RAM
1x Intel X550 10 GbE NIC for WAN - VMs
1x Intel X550 10 GbE NIC for Corosync
2x bonded Intel X550 10 GbE for migration - switchless
2x Mellanox ConnectX-6 100 GbE (Ethernet mode) as mesh network - switchless - for Ceph (public/cluster), with the default drivers provided by Proxmox
8x Intel P4500 NVMe drives
So I thought it might be a problem with the Mellanox cards and switched to the Intel NICs - same issue. While the node is offline, the Ceph network is not reachable for a long time and ceph -s times out. The nodes that are still online are pingable (10.10.10.x). The disks show no S.M.A.R.T. errors, and the network ping stays around 0.030 ms, even when one node is down.
Code:
Slow OSD heartbeats on back (longest 22272.685ms)
Slow OSD heartbeats on back from osd.8 [] to osd.2 [] 22272.685 msec
Slow OSD heartbeats on back from osd.8 [] to osd.5 [] 22271.643 msec
Slow OSD heartbeats on back from osd.13 [] to osd.2 [] 21997.950 msec
Slow OSD heartbeats on back from osd.13 [] to osd.5 [] 21997.931 msec
Slow OSD heartbeats on back from osd.11 [] to osd.2 [] 21806.339 msec
Slow OSD heartbeats on back from osd.5 [] to osd.9 [] 21188.398 msec
Slow OSD heartbeats on back from osd.5 [] to osd.14 [] 21188.013 msec possibly improving
Slow OSD heartbeats on back from osd.5 [] to osd.10 [] 21184.563 msec possibly improving
Slow OSD heartbeats on back from osd.5 [] to osd.8 [] 21184.539 msec
Slow OSD heartbeats on back from osd.5 [] to osd.11 [] 21184.367 msec
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
Code:
Slow OSD heartbeats on front (longest 22272.255ms)
Slow OSD heartbeats on front from osd.8 [] to osd.2 [] 22272.255 msec
Slow OSD heartbeats on front from osd.8 [] to osd.5 [] 22272.178 msec
Slow OSD heartbeats on front from osd.13 [] to osd.2 [] 21998.689 msec possibly improving
Slow OSD heartbeats on front from osd.13 [] to osd.5 [] 21998.052 msec possibly improving
Slow OSD heartbeats on front from osd.11 [] to osd.2 [] 21806.150 msec
Slow OSD heartbeats on front from osd.5 [] to osd.13 [] 21188.659 msec possibly improving
Slow OSD heartbeats on front from osd.5 [] to osd.11 [] 21188.538 msec
Slow OSD heartbeats on front from osd.5 [] to osd.12 [] 21188.376 msec
Slow OSD heartbeats on front from osd.5 [] to osd.8 [] 21187.845 msec possibly improving
Slow OSD heartbeats on front from osd.5 [] to osd.9 [] 21184.700 msec
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
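The full heartbeat list can be pulled from the active manager as the truncated message suggests - roughly like this (mgr.pve-01 is just an assumed name here, use whichever mgr is currently active):
Code:
# Dump the complete OSD heartbeat/network statistics behind the truncated warning above
# (run on the node hosting the active mgr; mgr.pve-01 is an assumed name)
ceph daemon mgr.pve-01 dump_osd_network

# Current cluster status and slow-ops details
ceph -s
ceph health detail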
pveversion:
Code:
proxmox-ve: 7.0-2 (running kernel: 5.11.22-7-pve)
pve-manager: 7.0-14 (running version: 7.0-14/a9dbe7e3)
pve-kernel-helper: 7.1-4
pve-kernel-5.11: 7.0-10
pve-kernel-5.11.22-7-pve: 5.11.22-12
pve-kernel-5.11.22-5-pve: 5.11.22-10
pve-kernel-5.11.22-1-pve: 5.11.22-2
ceph: 16.2.6-pve2
ceph-fuse: 16.2.6-pve2
corosync: 3.1.5-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve1
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.0-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-12
libpve-guest-common-perl: 4.0-2
libpve-http-server-perl: 4.0-3
libpve-storage-perl: 7.0-13
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.9-4
lxcfs: 4.0.8-pve2
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.0.13-1
proxmox-backup-file-restore: 2.0.13-1
proxmox-mini-journalreader: 1.2-1
proxmox-widget-toolkit: 3.3-6
pve-cluster: 7.0-3
pve-container: 4.1-1
pve-docs: 7.0-5
pve-edk2-firmware: 3.20210831-1
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.5-1
pve-qemu-kvm: 6.1.0-1
pve-xtermjs: 4.12.0-1
qemu-server: 7.0-18
smartmontools: 7.2-1
spiceterm: 3.2-2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve1
Interfaces for the Mellanox mesh:
Code:
auto enp59s0f0np0
iface enp59s0f0np0 inet manual
mtu 9000
auto enp59s0f1np1
iface enp59s0f1np1 inet manual
mtu 9000
auto bond1
iface bond1 inet static
address 10.10.10.1/24
bond-slaves enp59s0f0np0 enp59s0f1np1
bond-miimon 100
bond-mode broadcast
mtu 9000
#San
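This is roughly how I check the bond state and that jumbo frames actually pass over the mesh (10.10.10.2 is simply the peer node's mesh address in this subnet):
Code:
# Show bond mode, slave state and link of both ConnectX-6 ports
cat /proc/net/bonding/bond1

# Verify MTU 9000 end-to-end: 8972 bytes ICMP payload + 28 bytes headers, don't fragment
ping -M do -s 8972 -c 4 10.10.10.2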
Ceph configuration:
Code:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.10.10.0/24
fsid = d7fb8413-521b-43eb-9deb-c24fd2f8fec4
mon_allow_pool_delete = true
mon_host = 10.10.10.1 10.10.10.2 10.10.10.3
ms_bind_ipv4 = true
ms_bind_ipv6 = false
osd_pool_default_min_size = 2
osd_pool_default_size = 3
public_network = 10.10.10.0/24
[client]
keyring = /etc/pve/priv/$cluster.$name.keyring
rbd_cache_size = 134217728
[mon.pve-01]
public_addr = 10.10.10.1
[mon.pve-02]
public_addr = 10.10.10.2
[mon.pve-03]
public_addr = 10.10.10.3
Configuration database:
Code:
mon auth_allow_insecure_global_id_reclaim false
mgr mgr/pg_autoscaler/autoscale_profile scale-down
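Both entries were set through the config database, roughly like this (ceph config dump shows the listing above):
Code:
# Set the two options shown above in the cluster configuration database
ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph config set mgr mgr/pg_autoscaler/autoscale_profile scale-down

# Verify
ceph config dump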
Crush Map:
Code:
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme
device 12 osd.12 class nvme
device 13 osd.13 class nvme
device 14 osd.14 class nvme
device 15 osd.15 class nvme
device 16 osd.16 class nvme
device 17 osd.17 class nvme
device 18 osd.18 class nvme
device 19 osd.19 class nvme
device 20 osd.20 class nvme
device 21 osd.21 class nvme
device 22 osd.22 class nvme
device 23 osd.23 class nvme
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# buckets
host pve-01 {
id -3 # do not change unnecessarily
id -2 class nvme # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.7 weight 0.910
item osd.6 weight 0.910
item osd.5 weight 0.910
item osd.4 weight 0.910
item osd.3 weight 0.910
item osd.2 weight 0.910
item osd.0 weight 0.910
item osd.1 weight 0.910
}
host pve-02 {
id -5 # do not change unnecessarily
id -4 class nvme # do not change unnecessarily
# weight 7.278
alg straw2
hash 0 # rjenkins1
item osd.15 weight 0.910
item osd.14 weight 0.910
item osd.13 weight 0.910
item osd.12 weight 0.910
item osd.11 weight 0.910
item osd.10 weight 0.910
item osd.9 weight 0.910
item osd.8 weight 0.910
}
host pve-03 {
id -7 # do not change unnecessarily
id -6 class nvme # do not change unnecessarily
# weight 7.205
alg straw2
hash 0 # rjenkins1
item osd.23 weight 0.873
item osd.22 weight 0.910
item osd.21 weight 0.910
item osd.20 weight 0.910
item osd.19 weight 0.873
item osd.18 weight 0.910
item osd.17 weight 0.910
item osd.16 weight 0.910
}
root default {
id -1 # do not change unnecessarily
id -8 class nvme # do not change unnecessarily
# weight 21.760
alg straw2
hash 0 # rjenkins1
item pve-01 weight 7.278
item pve-02 weight 7.278
item pve-03 weight 7.205
}
# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
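For completeness, the map above was dumped and decompiled roughly like this (the file names are just examples):
Code:
# Export the binary CRUSH map and decompile it to the text form shown above
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt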
After the node is back, the slow ops disappear after a few minutes and everything works fine again.
Any ideas? On Proxmox 6.4-13 with Ceph 15.2.x I did not have such issues.