Hello,
Tonight we had quite an outage.
- The cluster was healthy and not overloaded
- NVMe/SSD disks are all fine, 2-4% wearout
It all started with:
Code:
2022-06-22T01:35:34.335404+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351345 : cluster [DBG] pgmap v2353839: 513 pgs: 1 active+clean+laggy, 512 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 0 B/s rd, 297 KiB/s wr, 24 op/s
2022-06-22T01:35:36.345523+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351346 : cluster [DBG] pgmap v2353840: 513 pgs: 1 active+clean+laggy, 512 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 0 B/s rd, 282 KiB/s wr, 23 op/s
2022-06-22T01:35:38.369582+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351347 : cluster [DBG] pgmap v2353841: 513 pgs: 2 active+clean+laggy, 511 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 12 KiB/s rd, 411 KiB/s wr, 29 op/s
2022-06-22T01:35:40.377989+0200 mgr.PXMGMT-AAA-N01 (mgr.172269982) 2351348 : cluster [DBG] pgmap v2353842: 513 pgs: 2 active+clean+laggy, 511 active+clean; 598 GiB data, 1.7 TiB used, 24 TiB / 26 TiB avail; 12 KiB/s rd, 264 KiB/s wr, 24 op/s
And resulted in:
Bash:
[4723292.988045] libceph: osd13 down
[4723294.005820] libceph: osd1 down
[4723294.005825] libceph: osd7 down
[4723386.083746] libceph: osd1 up
[4723395.327156] libceph: osd7 up
[4723434.069818] libceph: osd5 down
[4723456.172685] libceph: osd4 down
[4723464.222986] libceph: osd5 up
[4723478.364016] libceph: osd4 up
[4723503.019297] libceph: osd3 down
[4723514.553360] libceph: osd10 down
[4723515.570309] libceph: osd10 up
[4723531.521479] INFO: task kmmpd-rbd4:7998 blocked for more than 120 seconds.
[4723531.521760] Tainted: P O 5.13.19-6-pve #1
[4723531.522005] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4723531.522213] task:kmmpd-rbd4 state:D stack: 0 pid: 7998 ppid: 2 flags:0x00004000
[4723531.522217] Call Trace:
[4723531.522219] <TASK>
[4723531.522220] ? bit_wait+0x70/0x70
[4723531.522230] __schedule+0x2fa/0x910
[4723531.522233] ? bit_wait+0x70/0x70
[4723531.522236] schedule+0x4f/0xc0
[4723531.522239] io_schedule+0x46/0x70
[4723531.522242] bit_wait_io+0x11/0x70
[4723531.522244] __wait_on_bit+0x33/0xa0
[4723531.522247] ? submit_bio+0x4f/0x1b0
[4723531.522252] out_of_line_wait_on_bit+0x8d/0xb0
[4723531.522256] ? var_wake_function+0x30/0x30
[4723531.522260] __wait_on_buffer+0x34/0x40
[4723531.522264] write_mmp_block+0xd5/0x130
[4723531.522267] kmmpd+0x1b9/0x450
[4723531.522269] ? write_mmp_block+0x130/0x130
[4723531.522271] kthread+0x12b/0x150
[4723531.522276] ? set_kthread_struct+0x50/0x50
[4723531.522279] ret_from_fork+0x22/0x30
[4723531.522285] </TASK>
[4723630.068398] libceph: osd5 down
[4723632.643646] libceph: osd5 up
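To make the flapping easier to read, here is a minimal Python sketch that pairs each libceph "down" line with the next "up" line for the same OSD and reports how long it was gone. It assumes the dmesg format shown above. The function name osd_downtimes and the sample input are only illustrative and are not part of any Ceph tooling.

```python
import re

# Matches kernel log lines such as "[4723292.988045] libceph: osd13 down"
EVENT_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+libceph: (osd\d+) (up|down)")


def osd_downtimes(log_text):
    """Pair each 'down' event with the next 'up' for the same OSD.

    Returns (intervals, still_down):
      intervals  - list of (osd, down_ts, up_ts, seconds_down)
      still_down - dict of OSDs with no matching 'up' in the text,
                   mapped to the timestamp of their 'down' event
    """
    pending = {}
    intervals = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if not match:
            continue  # skip call traces and other kernel messages
        ts = float(match.group(1))
        osd, state = match.group(2), match.group(3)
        if state == "down":
            # keep the first 'down' if several appear without an 'up'
            pending.setdefault(osd, ts)
        elif osd in pending:
            down_ts = pending.pop(osd)
            intervals.append((osd, down_ts, ts, round(ts - down_ts, 3)))
    return intervals, pending


# Illustrative input: three lines copied from the log above
SAMPLE = """\
[4723292.988045] libceph: osd13 down
[4723294.005820] libceph: osd1 down
[4723386.083746] libceph: osd1 up
"""

if __name__ == "__main__":
    intervals, still_down = osd_downtimes(SAMPLE)
    for osd, down_ts, up_ts, secs in intervals:
        print(f"{osd}: down for {secs} s")
    for osd, ts in still_down.items():
        print(f"{osd}: down at {ts}, no 'up' seen in this excerpt")
```

Feeding it the full output of dmesg instead of SAMPLE should list every OSD that dropped out and which ones never reported "up" again within the excerpt.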
No special operations were carried out during that time. The RBD in question (rbd4) belongs to our MariaDB database cluster, which had almost no writes at that moment (middle of the night).
I would be happy to hear any suggestions, as I think this might be a Ceph or kernel problem.
Regards
Florian