Ceph outage "active+clean+laggy" resulted in task kmmpd-rbd*:7998 blocked

fstrankowski

Renowned Member
Nov 28, 2016
Hello,

Tonight we had quite the outage.

  • The cluster was healthy and not overloaded
  • The NVMe/SSD disks are all fine, 2-4% wearout (see the checks below)

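For reference, this is roughly how we verified those two points (exact commands from memory, so take them as a sketch; the NVMe device name is just an example):

Bash:
# Overall cluster health and per-OSD utilization
ceph -s
ceph osd df tree
# NVMe wearout ("Percentage Used" in the NVMe SMART log; device name is an example)
smartctl -a /dev/nvme0n1 | grep -i 'percentage used'
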
It all started with:

Code:
2022-06-22T01:35:34.335404+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351345 : cluster [DBG] pgmap v2353839: 513 pgs: 1 active+clean+laggy, 512 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 0 B/s rd, 297 KiB/s wr, 24 op/s
2022-06-22T01:35:36.345523+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351346 : cluster [DBG] pgmap v2353840: 513 pgs: 1 active+clean+laggy, 512 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 0 B/s rd, 282 KiB/s wr, 23 op/s
2022-06-22T01:35:38.369582+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351347 : cluster [DBG] pgmap v2353841: 513 pgs: 2 active+clean+laggy, 511 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 12 KiB/s rd, 411 KiB/s wr, 29 op/s
2022-06-22T01:35:40.377989+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351348 : cluster [DBG] pgmap v2353842: 513 pgs: 2 active+clean+laggy, 511 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 12 KiB/s rd, 264 KiB/s wr, 24 op/s
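
As far as I can tell, a PG goes active+clean+laggy when heartbeats/lease renewals between its OSDs are delayed, and client I/O on that PG stalls until the lease is renewed. While it was happening, the affected PG could have been identified with something like:

Bash:
# Overall health warnings, including slow-op/laggy details
ceph health detail
# pgs_brief includes the state column, so laggy PGs can be grepped out
ceph pg dump pgs_brief | grep laggy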

And resulted in:

Bash:
[4723292.988045] libceph: osd13 down
[4723294.005820] libceph: osd1 down
[4723294.005825] libceph: osd7 down
[4723386.083746] libceph: osd1 up
[4723395.327156] libceph: osd7 up
[4723434.069818] libceph: osd5 down
[4723456.172685] libceph: osd4 down
[4723464.222986] libceph: osd5 up
[4723478.364016] libceph: osd4 up
[4723503.019297] libceph: osd3 down
[4723514.553360] libceph: osd10 down
[4723515.570309] libceph: osd10 up
[4723531.521479] INFO: task kmmpd-rbd4:7998 blocked for more than 120 seconds.
[4723531.521760]       Tainted: P           O      5.13.19-6-pve #1
[4723531.522005] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4723531.522213] task:kmmpd-rbd4      state:D stack:    0 pid: 7998 ppid:     2 flags:0x00004000
[4723531.522217] Call Trace:
[4723531.522219]  <TASK>
[4723531.522220]  ? bit_wait+0x70/0x70
[4723531.522230]  __schedule+0x2fa/0x910
[4723531.522233]  ? bit_wait+0x70/0x70
[4723531.522236]  schedule+0x4f/0xc0
[4723531.522239]  io_schedule+0x46/0x70
[4723531.522242]  bit_wait_io+0x11/0x70
[4723531.522244]  __wait_on_bit+0x33/0xa0
[4723531.522247]  ? submit_bio+0x4f/0x1b0
[4723531.522252]  out_of_line_wait_on_bit+0x8d/0xb0
[4723531.522256]  ? var_wake_function+0x30/0x30
[4723531.522260]  __wait_on_buffer+0x34/0x40
[4723531.522264]  write_mmp_block+0xd5/0x130
[4723531.522267]  kmmpd+0x1b9/0x450
[4723531.522269]  ? write_mmp_block+0x130/0x130
[4723531.522271]  kthread+0x12b/0x150
[4723531.522276]  ? set_kthread_struct+0x50/0x50
[4723531.522279]  ret_from_fork+0x22/0x30
[4723531.522285]  </TASK>
[4723630.068398] libceph: osd5 down
[4723632.643646] libceph: osd5 up
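
For context on the blocked task: kmmpd is the ext4 multi-mount protection (MMP) kthread, which rewrites the MMP block every few seconds (the write_mmp_block frames in the trace). Once the rbd device stopped completing that write for more than 120 seconds, the hung-task watchdog fired. Whether MMP is enabled on the filesystem can be checked like this (a sketch; run tune2fs only on an unmounted filesystem):

Bash:
# Show MMP-related superblock fields of the ext4 filesystem on the RBD device
dumpe2fs -h /dev/rbd4 | grep -i mmp
# If only one host ever mounts it, MMP could be dropped (unmounted fs only):
# tune2fs -O ^mmp /dev/rbd4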

No special operations were running at the time. The RBD in question (rbd4) backs our MariaDB database cluster, which had next to no writes at that moment (middle of the night).
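
For completeness, mapping the kernel device back to its Ceph pool/image can be done with:

Bash:
# List /dev/rbdX devices with their pool, image and snapshot
rbd showmapped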

I'd be happy to hear any suggestions, as I think this might be a Ceph or kernel problem.

Regards

Florian
 
