Ceph outage: "active+clean+laggy" resulting in "task kmmpd-rbd*:7998 blocked"

Hello,

Tonight we had quite the outage.

  • The cluster has been healthy and not overloaded
  • The NVMe/SSD disks are all fine, 2-4% wearout (a quick sketch of the checks is below)
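
For reference, this is roughly how we verify those two points. Standard tooling only; /dev/nvme0n1 is just an example device on one node:

Bash:
# Overall cluster health and per-OSD commit/apply latency at a glance
ceph -s
ceph osd perf

# NVMe wearout (smartctl from smartmontools); the "Percentage Used" field
# is where the 2-4% figure comes from
smartctl -a /dev/nvme0n1 | grep -i -e "percentage used" -e wear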

It all started with:

Code:
2022-06-22T01:35:34.335404+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351345 : cluster [DBG] pgmap v2353839: 513 pgs: 1 active+clean+laggy, 512 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 0 B/s rd, 297 KiB/s wr, 24 op/s
2022-06-22T01:35:36.345523+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351346 : cluster [DBG] pgmap v2353840: 513 pgs: 1 active+clean+laggy, 512 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 0 B/s rd, 282 KiB/s wr, 23 op/s
2022-06-22T01:35:38.369582+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351347 : cluster [DBG] pgmap v2353841: 513 pgs: 2 active+clean+laggy, 511 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 12 KiB/s rd, 411 KiB/s wr, 29 op/s
2022-06-22T01:35:40.377989+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351348 : cluster [DBG] pgmap v2353842: 513 pgs: 2 active+clean+laggy, 511 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 12 KiB/s rd, 264 KiB/s wr, 24 op/s

And resulted in:

Bash:
[4723292.988045] libceph: osd13 down
[4723294.005820] libceph: osd1 down
[4723294.005825] libceph: osd7 down
[4723386.083746] libceph: osd1 up
[4723395.327156] libceph: osd7 up
[4723434.069818] libceph: osd5 down
[4723456.172685] libceph: osd4 down
[4723464.222986] libceph: osd5 up
[4723478.364016] libceph: osd4 up
[4723503.019297] libceph: osd3 down
[4723514.553360] libceph: osd10 down
[4723515.570309] libceph: osd10 up
[4723531.521479] INFO: task kmmpd-rbd4:7998 blocked for more than 120 seconds.
[4723531.521760]       Tainted: P           O      5.13.19-6-pve #1
[4723531.522005] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4723531.522213] task:kmmpd-rbd4      state:D stack:    0 pid: 7998 ppid:     2 flags:0x00004000
[4723531.522217] Call Trace:
[4723531.522219]  <TASK>
[4723531.522220]  ? bit_wait+0x70/0x70
[4723531.522230]  __schedule+0x2fa/0x910
[4723531.522233]  ? bit_wait+0x70/0x70
[4723531.522236]  schedule+0x4f/0xc0
[4723531.522239]  io_schedule+0x46/0x70
[4723531.522242]  bit_wait_io+0x11/0x70
[4723531.522244]  __wait_on_bit+0x33/0xa0
[4723531.522247]  ? submit_bio+0x4f/0x1b0
[4723531.522252]  out_of_line_wait_on_bit+0x8d/0xb0
[4723531.522256]  ? var_wake_function+0x30/0x30
[4723531.522260]  __wait_on_buffer+0x34/0x40
[4723531.522264]  write_mmp_block+0xd5/0x130
[4723531.522267]  kmmpd+0x1b9/0x450
[4723531.522269]  ? write_mmp_block+0x130/0x130
[4723531.522271]  kthread+0x12b/0x150
[4723531.522276]  ? set_kthread_struct+0x50/0x50
[4723531.522279]  ret_from_fork+0x22/0x30
[4723531.522285]  </TASK>
[4723630.068398] libceph: osd5 down
[4723632.643646] libceph: osd5 up

No special operations were running at the time. The RBD in question (rbd4) belongs to our MariaDB database cluster, which had next to no writes at that moment (middle of the night).
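
If I read the call trace correctly, kmmpd is ext4's multi-mount protection (MMP) kthread (the trace goes through write_mmp_block), so the blocked task is just the MMP heartbeat write to rbd4 that could not complete while the OSDs were flapping. A rough sketch of how this can be double-checked, assuming dumpe2fs/tune2fs from e2fsprogs and /dev/rbd4 being the mapped device:

Bash:
# Which image/pool is mapped as rbd4
rbd showmapped

# Confirm the filesystem on it actually has the mmp feature enabled
dumpe2fs -h /dev/rbd4 2>/dev/null | grep -i mmp

# MMP could be disabled while the filesystem is unmounted, at the cost
# of losing the multi-mount guard:
#   tune2fs -O ^mmp /dev/rbd4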

I would be happy to hear any suggestions, as I think this might be a Ceph or kernel problem.
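
For completeness, this is roughly what I plan to look at next on the Ceph side, in case someone wants numbers. Standard ceph CLI only; osd.13 is just one of the OSDs that flapped above:

Bash:
# Which PGs/OSDs are currently reported laggy or causing warnings
ceph health detail
ceph pg dump pgs 2>/dev/null | grep laggy

# Per-OSD commit/apply latency
ceph osd perf

# Recent slow/historic ops on a flapping OSD (run on the node hosting osd.13)
ceph daemon osd.13 dump_historic_ops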

Regards

Florian