Ceph Octopus - Monitor sometimes inconsistent

We are having an inconsistent experience where one of the monitors occasionally appears to misbehave. Ceph health shows a warning with slow operations:
Code:
[admin@kvm6b ~]# ceph -s
  cluster:
    id:     2a554db9-5d56-4d6a-a1e2-e4f98ef1052f
    health: HEALTH_WARN
            17 slow ops, oldest one blocked for 337272 sec, mon.kvm6a has slow ops

  services:
    mon: 3 daemons, quorum kvm6a,kvm6b,kvm6c (age 3d)
    mgr: kvm6a(active, since 4d), standbys: kvm6b, kvm6c
    mds: cephfs:1 {0=kvm6c=up:active} 2 up:standby
    osd: 24 osds: 24 up (since 3d), 24 in

  task status:
    scrub status:
        mds.kvm6c: idle

  data:
    pools:   6 pools, 225 pgs
    objects: 602.79k objects, 1.9 TiB
    usage:   4.9 TiB used, 30 TiB / 35 TiB avail
    pgs:     225 active+clean

  io:
    client:   15 KiB/s rd, 21 MiB/s wr, 33 op/s rd, 561 op/s wr
    cache:    14 MiB/s flush


Logging in to kvm6a and running:
Code:
systemctl restart ceph-mon@kvm6a


Thereafter everything is immediately healthy:
Code:
[admin@kvm6b ~]# ceph -s
  cluster:
    id:     2a554db9-5d56-4d6a-a1e2-e4f98ef1052f
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum kvm6a,kvm6b,kvm6c (age 12m)
    mgr: kvm6a(active, since 13m), standbys: kvm6b, kvm6c
    mds: cephfs:1 {0=kvm6c=up:active} 2 up:standby
    osd: 24 osds: 24 up (since 4d), 24 in

  task status:
    scrub status:
        mds.kvm6c: idle

  data:
    pools:   6 pools, 225 pgs
    objects: 605.09k objects, 1.9 TiB
    usage:   4.9 TiB used, 30 TiB / 35 TiB avail
    pgs:     225 active+clean

  io:
    client:   9.3 KiB/s rd, 6.7 MiB/s wr, 30 op/s rd, 75 op/s wr
    cache:    4.7 MiB/s flush, 0 op/s promote


All updates have been installed on all servers:
Code:
[admin@kvm6a ~]# pveversion -v
proxmox-ve: 6.3-1 (running kernel: 5.4.65-1-pve)
pve-manager: 6.3-2 (running version: 6.3-2/22f57405)
pve-kernel-5.4: 6.3-1
pve-kernel-helper: 6.3-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
ceph: 15.2.6-pve1
ceph-fuse: 15.2.6-pve1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-6
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.3-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.0-1
proxmox-backup-client: 1.0.5-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-1
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-1
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
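
A quick way to confirm that every Ceph daemon is actually running the upgraded release (a small sketch, run from any monitor node; all entries should report the installed 15.2.6 here):
Code:
# counts the running mon/mgr/osd/mds daemons per version
ceph versions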


PS: Any news on Ceph 15.2.7?
 
17 slow ops, oldest one blocked for 337272 sec, mon.kvm6a has slow ops
The ceph-mon log should show you the operation that is taking so long.
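
The monitor's admin socket can also dump the operations it currently has in flight; a small sketch, assuming the socket is reachable on kvm6a (daemon name taken from your output):
Code:
# run on kvm6a: list the operations the monitor is still holding
ceph daemon mon.kvm6a ops

# cluster-wide view of which daemon is reporting the slow ops
ceph health detail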

Depending on which version Ceph was upgraded from, the change to buffered IO could be taking effect.
https://forum.proxmox.com/threads/c...very-night-slowing-down-vms.71573/post-355081
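
If it is the bluefs_buffered_io default change discussed in that thread (an assumption on my part), the value can be checked and adjusted cluster-wide; a sketch:
Code:
# show what the OSDs are currently configured with
ceph config get osd bluefs_buffered_io

# set it explicitly (the appropriate value is discussed in the linked thread;
# the OSDs may need a restart for it to take effect)
ceph config set osd bluefs_buffered_io true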

PS: Any news on Ceph 15.2.7?
Are you running RGW? That's the only fix mentioned in the release notes.
https://docs.ceph.com/en/latest/releases/octopus/
 
Hi Alwin,

Thanks for the feedback. The issue appears to affect Octopus monitors, so I would expect more reports along these lines as more and more people upgrade from Nautilus. The outstanding operations were stuck for almost 4 days whilst everything was active+clean with virtually no load.

To answer your question though: yes, we make use of RGW to handle incremental differential backups to a separate cluster using Benji (https://benji-backup.me/).


Regards
David Herselman
 
Thanks for the feedback. The issue appears to affect Octopus monitors, so I would expect more reports along these lines as more and more people upgrade from Nautilus. The outstanding operations were stuck for almost 4 days whilst everything was active+clean with virtually no load.
Can you please be more specific about which operations were stuck? Slow operations can also happen on Nautilus or earlier versions.
 
Hi Alwin,

The slow ops were reported by only one of the 3 monitors after OSD 22 had been restarted. The warnings continued for 4+ days and only cleared once that monitor was restarted. During this time all placement groups were active+clean, as shown in the first message above.

My understanding of the logs from the problematic monitor (kvm6a) is that it indicates kvm6c has slow operations, whereas neither kvm6b nor kvm6c log anything themselves. See the entry at '2020-12-04T12:52:11.349+0200' in the monitor log from kvm6a:


kvm6a:
Code:
2020-12-04T12:52:04.329+0200 7f66797f9700  1 mon.kvm6a@0(leader).osd e57825 e57825: 24 total, 24 up, 24 in
2020-12-04T12:52:04.333+0200 7f66797f9700  0 log_channel(cluster) log [DBG] : osdmap e57825: 24 total, 24 up, 24 in
2020-12-04T12:52:05.313+0200 7f667d801700  0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2020-12-04T12:52:07.321+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 123239/1819005 objects degraded (6.775%), 17 pgs degraded (PG_DEGRADED)
2020-12-04T12:52:07.849+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:08.717+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:08.717+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/3647236440' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:08.765+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:08.765+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/2551954269' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:09.341+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: 1 slow ops, oldest one blocked for 32 sec, mon.kvm6c has slow ops (SLOW_OPS)
2020-12-04T12:52:09.341+0200 7f667d801700  0 log_channel(cluster) log [INF] : Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 123239/1819005 objects degraded (6.775%), 17 pgs degraded)
2020-12-04T12:52:11.349+0200 7f667d801700  0 log_channel(cluster) log [INF] : Health check cleared: SLOW_OPS (was: 1 slow ops, oldest one blocked for 32 sec, mon.kvm6c has slow ops)
2020-12-04T12:52:12.857+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:17.861+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:17.861+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)
2020-12-04T12:52:19.033+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:19.033+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/3731941632' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:19.085+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:19.085+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/1519995793' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:20.901+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check failed: 9 slow ops, oldest one blocked for 32 sec, mon.kvm6a has slow ops (SLOW_OPS)
2020-12-04T12:52:22.865+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:22.865+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 17 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)
2020-12-04T12:52:22.961+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: Slow OSD heartbeats on back (longest 3967.927ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-04T12:52:26.885+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: Slow OSD heartbeats on front (longest 7017.555ms) (OSD_SLOW_PING_TIME_FRONT)
2020-12-04T12:52:27.865+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: 17 slow ops, oldest one blocked for 37 sec, mon.kvm6a has slow ops (SLOW_OPS)
2020-12-04T12:52:27.865+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:27.869+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 17 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)
2020-12-04T12:52:28.329+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:28.329+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/3076981816' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:28.369+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:28.369+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/4089655273' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:32.869+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: Slow OSD heartbeats on back (longest 7478.382ms) (OSD_SLOW_PING_TIME_BACK)
2020-12-04T12:52:32.869+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: 17 slow ops, oldest one blocked for 42 sec, mon.kvm6a has slow ops (SLOW_OPS)
2020-12-04T12:52:32.869+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:32.873+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 17 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)
2020-12-04T12:52:37.873+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: 17 slow ops, oldest one blocked for 47 sec, mon.kvm6a has slow ops (SLOW_OPS)
2020-12-04T12:52:37.873+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:37.877+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 17 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)
2020-12-04T12:52:38.621+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:38.621+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/3482483891' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:38.665+0200 7f667affc700  0 mon.kvm6a@0(leader) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:38.665+0200 7f667affc700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.2:0/399616868' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:42.877+0200 7f667d801700  0 log_channel(cluster) log [WRN] : Health check update: 17 slow ops, oldest one blocked for 52 sec, mon.kvm6a has slow ops (SLOW_OPS)
2020-12-04T12:52:42.877+0200 7f667d801700  1 mon.kvm6a@0(leader).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:42.881+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 17 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)


kvm6b:
Code:
2020-12-04T12:52:03.295+0200 7fc1ccbbd700  1 mon.kvm6b@1(peon).osd e57824 e57824: 24 total, 24 up, 24 in
2020-12-04T12:52:03.299+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"prefix": "osd metadata", "id": 32} v 0) v1
2020-12-04T12:52:03.299+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='mgr.190540688 10.254.1.2:0/2009496' entity='mgr.kvm6a' cmd=[{"prefix": "osd metadata", "id": 32}]: dispatch
2020-12-04T12:52:04.359+0200 7fc1ccbbd700  1 mon.kvm6b@1(peon).osd e57825 e57825: 24 total, 24 up, 24 in
2020-12-04T12:52:04.943+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:06.595+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:06.599+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.3:0/1357251391' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:06.647+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:06.647+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.3:0/621942878' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:09.947+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:14.947+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:16.847+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:16.851+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.3:0/2385410752' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:16.895+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:16.895+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.3:0/1603647083' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:19.947+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:24.947+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:26.079+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:26.079+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.3:0/3537921673' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:26.119+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:26.119+0200 7fc1ccbbd700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.3:0/4155055808' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:29.947+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:34.951+0200 7fc1cf3c2700  1 mon.kvm6b@1(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:36.299+0200 7fc1ccbbd700  0 mon.kvm6b@1(peon) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
 
kvm6c:
Code:
2020-12-04T12:52:04.363+0200 7ff0f1dc4700  1 mon.kvm6c@2(peon).osd e57825 e57825: 24 total, 24 up, 24 in
2020-12-04T12:52:04.815+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:06.547+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:06.547+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/2772873966' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:06.603+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:06.603+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/605218279' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:09.816+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:14.816+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:16.848+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:16.848+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/458464263' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:16.896+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:16.896+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/3072053022' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:19.820+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:24.816+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:26.124+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:26.124+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/2451772925' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:26.168+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"prefix":"df","format":"json"} v 0) v1
2020-12-04T12:52:26.168+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/1877622460' entity='client.admin' cmd=[{"prefix":"df","format":"json"}]: dispatch
2020-12-04T12:52:29.820+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:34.816+0200 7ff0f45c9700  1 mon.kvm6c@2(peon).osd e57825 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 553648128 kv_alloc: 390070272
2020-12-04T12:52:36.392+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:36.392+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/4213175342' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
2020-12-04T12:52:36.440+0200 7ff0f1dc4700  0 mon.kvm6c@2(peon) e6 handle_command mon_command({"format":"json","prefix":"df"} v 0) v1
2020-12-04T12:52:36.440+0200 7ff0f1dc4700  0 log_channel(audit) log [DBG] : from='client.? 10.254.1.4:0/3635481511' entity='client.admin' cmd=[{"format":"json","prefix":"df"}]: dispatch
 
2020-12-04T12:52:09.341+0200 7f667d801700 0 log_channel(cluster) log [WRN] : Health check update: 1 slow ops, oldest one blocked for 32 sec, mon.kvm6c has slow ops (SLOW_OPS)
kvm6a seems to complain about the other MONs as well.

2020-12-04T12:52:17.861+0200 7f667d801700 -1 mon.kvm6a@0(leader) e6 get_health_metrics reporting 9 slow ops, oldest is osd_failure(failed timeout osd.22 [v2:10.254.1.3:6802/4170055,v1:10.254.1.3:6804/4170055] for 21sec e57816 v57816)
What did the OSD.22 get stuck on? Maybe the MON still had requests for that OSD in the queue that timed out (> 30 sec).
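
The OSD side can be inspected from its admin socket on the host carrying osd.22; a small sketch, assuming the socket is available there:
Code:
# run on the node hosting osd.22
ceph daemon osd.22 dump_ops_in_flight
ceph daemon osd.22 dump_historic_slow_ops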
 
