Kernel Crashes on IO error on dev dm-X

jakoberpf

Hey People,

I have some devastating issues with my backup Ceph filesystem. It started with a failed disk a few days ago, which put the cluster into recovery/rebuild mode. It is based on two erasure-coded pools, one being 8/7 and the other 12/11, with a total of 16 disks. I didn't worry at first, as I had more than enough spare capacity and a single disk failure still fits within my failure domain. But for the past two days the system has been crashing shortly after booting (10-20 min), with errors like the following (pulled from `/var/log/syslog` after restarting).

Code:
Mar 14 18:21:28 backup kernel: [  282.441797] Buffer I/O error on dev dm-4, logical block 13945792, async page read
Mar 12 11:26:45 backup kernel: [ 1050.328095] Buffer I/O error on dev dm-5, logical block 13945792, async page read
Mar 15 18:49:38 backup kernel: [  203.866160] Buffer I/O error on dev dm-6, logical block 13945792, async page read
Mar 15 18:36:44 backup kernel: [  561.143378] Buffer I/O error on dev dm-7, logical block 13945792, async page read
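
To see at a glance which devices are affected and which blocks they report, the lines above can be tallied with a small script. This is only a rough sketch: the function name `tally_buffer_io_errors` is made up for illustration, and the regular expression is based solely on the message format shown in the excerpt. In that excerpt, all four devices report the same logical block (13945792).

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   "... Buffer I/O error on dev dm-4, logical block 13945792, async page read"
# (pattern derived from the excerpt above; other kernels may word this differently)
BUFFER_IO_RE = re.compile(r"Buffer I/O error on dev ([^,\s]+), logical block (\d+)")


def tally_buffer_io_errors(lines):
    """Return {device: sorted list of distinct logical blocks reported}."""
    blocks = defaultdict(set)
    for line in lines:
        match = BUFFER_IO_RE.search(line)
        if match:
            blocks[match.group(1)].add(int(match.group(2)))
    return {dev: sorted(found) for dev, found in blocks.items()}


# Hypothetical usage against the live log:
#   with open("/var/log/syslog", errors="replace") as f:
#       for dev, found in sorted(tally_buffer_io_errors(f).items()):
#           print(dev, found)
```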

And this is what my dashboard looks like before Proxmox crashes:

[Attachment: Screenshot 2023-03-15 at 19.24.49.png]

So, from what I understand, I have 4 OSDs with corrupted sectors.

- Damn, how did this happen?
- But more importantly, can I fix this, and if so, how?
- Removing all 4 corrupted OSDs is not an option, as my 12/11 EC pool can only lose 2 OSDs.
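
Before deciding anything about individual OSDs, it helps to confirm which logical volume and physical disk sit behind each `dm-X`. On Linux, `/sys/block/dm-N/dm/name` holds the device-mapper name and `/sys/block/dm-N/slaves/` lists the underlying block devices. The helper below is a minimal sketch: the function name `dm_device_info` and the `sysfs_root` parameter are made up for illustration. For OSDs created with ceph-volume on LVM, the mapper name typically contains the OSD's volume group and logical volume names, but that is an assumption about this setup.

```python
from pathlib import Path


def dm_device_info(dm_name, sysfs_root="/sys/block"):
    """Look up the mapper name and underlying devices of e.g. 'dm-4'.

    Returns a dict; 'mapper_name' is None and 'slaves' is empty when the
    corresponding sysfs entries are missing.
    """
    base = Path(sysfs_root) / dm_name
    name_file = base / "dm" / "name"
    slaves_dir = base / "slaves"

    mapper_name = name_file.read_text().strip() if name_file.exists() else None
    slaves = sorted(p.name for p in slaves_dir.iterdir()) if slaves_dir.is_dir() else []
    return {"device": dm_name, "mapper_name": mapper_name, "slaves": slaves}


# Hypothetical usage for the devices named in the log excerpt:
#   for dev in ("dm-4", "dm-5", "dm-6", "dm-7"):
#       print(dm_device_info(dev))
```

The physical disks found this way can then be checked individually, for example with their SMART data, to tell real media errors apart from a problem higher up in the stack.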


I realise that an EC pool this big was not a good solution, but it is what it is now. Any hints on how I could improve the state of my CephFS would be much appreciated. As a last resort I do have the option to restore everything from my cloud backup, but I would like to avoid this if possible.

Best Regards,
Jakob

Complete Error Logs

Code:
Mar 15 19:06:32 backup kernel: [  605.867967] INFO: task bstore_kv_sync:11549 blocked for more than 362 seconds.
Mar 15 19:06:32 backup kernel: [  605.868192]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.868908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.869623] task:bstore_kv_sync  state:D stack:    0 pid:11549 ppid:     1 flags:0x00000224
Mar 15 19:06:32 backup kernel: [  605.869626] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.869628]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.869629]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.869633]  ? blk_flush_plug_list+0xdd/0x110
Mar 15 19:06:32 backup kernel: [  605.869636]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.869637]  io_schedule+0x46/0x80
Mar 15 19:06:32 backup kernel: [  605.869638]  wait_on_page_bit_common+0x114/0x3e0
Mar 15 19:06:32 backup kernel: [  605.869642]  ? filemap_invalidate_unlock_two+0x50/0x50
Mar 15 19:06:32 backup kernel: [  605.869644]  wait_on_page_bit+0x3f/0x50
Mar 15 19:06:32 backup kernel: [  605.869646]  wait_on_page_writeback+0x26/0x80
Mar 15 19:06:32 backup kernel: [  605.869647]  __filemap_fdatawait_range+0x97/0x120
Mar 15 19:06:32 backup kernel: [  605.869649]  ? filemap_fdatawrite_wbc+0x94/0xe0
Mar 15 19:06:32 backup kernel: [  605.869651]  ? __filemap_fdatawrite_range+0x54/0x70
Mar 15 19:06:32 backup kernel: [  605.869653]  file_fdatawait_range+0x1a/0x30
Mar 15 19:06:32 backup kernel: [  605.869655]  sync_file_range+0xca/0x100
Mar 15 19:06:32 backup kernel: [  605.869658]  __x64_sys_sync_file_range+0x44/0x90
Mar 15 19:06:32 backup kernel: [  605.869660]  do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [  605.869661]  ? handle_mm_fault+0xd8/0x2c0
Mar 15 19:06:32 backup kernel: [  605.869664]  ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [  605.869666]  ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [  605.869668]  ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [  605.869670]  ? exc_page_fault+0x89/0x170
Mar 15 19:06:32 backup kernel: [  605.869671]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.869674] RIP: 0033:0x7fbaf2cf8598
Mar 15 19:06:32 backup kernel: [  605.869675] RSP: 002b:00007fbae3003930 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
Mar 15 19:06:32 backup kernel: [  605.869677] RAX: ffffffffffffffda RBX: 000055fd1f74cfc0 RCX: 00007fbaf2cf8598
Mar 15 19:06:32 backup kernel: [  605.869678] RDX: 0000000000001000 RSI: 000000001b163000 RDI: 000000000000002d
Mar 15 19:06:32 backup kernel: [  605.869679] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 15 19:06:32 backup kernel: [  605.869679] R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000000001
Mar 15 19:06:32 backup kernel: [  605.869680] R13: 0000000000000001 R14: 000000001b163000 R15: 000055fd1f7d2c00
Mar 15 19:06:32 backup kernel: [  605.869682]  </TASK>
Mar 15 19:06:32 backup kernel: [  605.869749] INFO: task vgs:15284 blocked for more than 241 seconds.
Mar 15 19:06:32 backup kernel: [  605.870328]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.871030] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.871735] task:vgs             state:D stack:    0 pid:15284 ppid:  3651 flags:0x00000000
Mar 15 19:06:32 backup kernel: [  605.871737] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.871738]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.871738]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.871740]  ? __smp_call_single_queue+0x59/0x90
Mar 15 19:06:32 backup kernel: [  605.871742]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.871744]  schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.871745]  __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [  605.871747]  __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [  605.871748]  mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [  605.871750]  blkdev_put+0x3a/0x210
Mar 15 19:06:32 backup kernel: [  605.871752]  blkdev_close+0x27/0x40
Mar 15 19:06:32 backup kernel: [  605.871754]  __fput+0x9c/0x280
Mar 15 19:06:32 backup kernel: [  605.871757]  ____fput+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.871758]  task_work_run+0x6a/0xb0
Mar 15 19:06:32 backup kernel: [  605.871760]  exit_to_user_mode_prepare+0x1a8/0x1b0
Mar 15 19:06:32 backup kernel: [  605.871762]  syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [  605.871764]  ? __x64_sys_close+0x12/0x50
Mar 15 19:06:32 backup kernel: [  605.871765]  do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.871766]  ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [  605.871768]  ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [  605.871769]  ? syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [  605.871771]  ? __x64_sys_close+0x12/0x50
Mar 15 19:06:32 backup kernel: [  605.871772]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.871773]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.871774]  ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [  605.871776]  ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [  605.871778]  ? sysvec_apic_timer_interrupt+0x4e/0x90
Mar 15 19:06:32 backup kernel: [  605.871779]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.871782] RIP: 0033:0x7f6a44e6efc3
Mar 15 19:06:32 backup kernel: [  605.871783] RSP: 002b:00007fffab3fd318 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Mar 15 19:06:32 backup kernel: [  605.871784] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f6a44e6efc3
Mar 15 19:06:32 backup kernel: [  605.871785] RDX: 0000558752624020 RSI: 000055875269dac0 RDI: 0000000000000007
Mar 15 19:06:32 backup kernel: [  605.871786] RBP: 00007fffab3fd340 R08: 000055875269dac0 R09: 0000558751dda010
Mar 15 19:06:32 backup kernel: [  605.871786] R10: 00007f6a44f51b80 R11: 0000000000000246 R12: 000055875127af00
Mar 15 19:06:32 backup kernel: [  605.871787] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 15 19:06:32 backup kernel: [  605.871788]  </TASK>
Mar 15 19:06:32 backup kernel: [  605.871789] INFO: task systemd-udevd:15529 blocked for more than 362 seconds.
Mar 15 19:06:32 backup kernel: [  605.872448]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.873162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.873874] task:systemd-udevd   state:D stack:    0 pid:15529 ppid:   773 flags:0x00000224
Mar 15 19:06:32 backup kernel: [  605.873876] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.873877]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.873877]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.873879]  ? page_counter_uncharge+0x22/0x40
Mar 15 19:06:32 backup kernel: [  605.873882]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.873883]  schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.873884]  __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [  605.873886]  ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [  605.873887]  __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [  605.873889]  mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [  605.873890]  blkdev_get_by_dev.part.0+0x55/0x350
Mar 15 19:06:32 backup kernel: [  605.873892]  blkdev_get_by_dev+0x55/0x70
Mar 15 19:06:32 backup kernel: [  605.873894]  ? blkdev_close+0x40/0x40
Mar 15 19:06:32 backup kernel: [  605.873895]  blkdev_open+0x50/0x90
Mar 15 19:06:32 backup kernel: [  605.873897]  do_dentry_open+0x167/0x3f0
Mar 15 19:06:32 backup kernel: [  605.873899]  vfs_open+0x2d/0x40
Mar 15 19:06:32 backup kernel: [  605.873900]  path_openat+0xb69/0x12f0
Mar 15 19:06:32 backup kernel: [  605.873902]  do_filp_open+0xb6/0x160
Mar 15 19:06:32 backup kernel: [  605.873904]  ? __check_object_size+0x14f/0x160
Mar 15 19:06:32 backup kernel: [  605.873906]  do_sys_openat2+0x9f/0x160
Mar 15 19:06:32 backup kernel: [  605.873908]  __x64_sys_openat+0x56/0xa0
Mar 15 19:06:32 backup kernel: [  605.873909]  do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [  605.873911]  ? handle_mm_fault+0xd8/0x2c0
Mar 15 19:06:32 backup kernel: [  605.873913]  ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [  605.873915]  ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [  605.873916]  ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [  605.873918]  ? exc_page_fault+0x89/0x170
Mar 15 19:06:32 backup kernel: [  605.873919]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.873922] RIP: 0033:0x7f979d188767
Mar 15 19:06:32 backup kernel: [  605.873922] RSP: 002b:00007ffe4f2dfe90 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Mar 15 19:06:32 backup kernel: [  605.873923] RAX: ffffffffffffffda RBX: 00007ffe4f2dffc4 RCX: 00007f979d188767
Mar 15 19:06:32 backup kernel: [  605.873924] RDX: 00000000000a0800 RSI: 0000564af4ccc930 RDI: 00000000ffffff9c
Mar 15 19:06:32 backup kernel: [  605.873925] RBP: 0000564af4ccc930 R08: 0000564af4b37600 R09: 0000564af4c38110
Mar 15 19:06:32 backup kernel: [  605.873926] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
Mar 15 19:06:32 backup kernel: [  605.873926] R13: 0000000000000000 R14: 00007ffe4f2dff20 R15: 00007ffe4f2dffc4
Mar 15 19:06:32 backup kernel: [  605.873928]  </TASK>
Mar 15 19:06:32 backup kernel: [  605.873928] INFO: task vgs:16253 blocked for more than 241 seconds.
Mar 15 19:06:32 backup kernel: [  605.874596]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.875320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.876126] task:vgs             state:D stack:    0 pid:16253 ppid:  4359 flags:0x00000000
Mar 15 19:06:32 backup kernel: [  605.876128] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.876128]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.876129]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.876131]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.876132]  schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.876134]  __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [  605.876135]  ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [  605.876137]  __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [  605.876138]  mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [  605.876140]  blkdev_get_by_dev.part.0+0x55/0x350
Mar 15 19:06:32 backup kernel: [  605.876142]  blkdev_get_by_dev+0x55/0x70
Mar 15 19:06:32 backup kernel: [  605.876144]  ? blkdev_close+0x40/0x40
Mar 15 19:06:32 backup kernel: [  605.876145]  blkdev_open+0x50/0x90
Mar 15 19:06:32 backup kernel: [  605.876147]  do_dentry_open+0x167/0x3f0
Mar 15 19:06:32 backup kernel: [  605.876148]  vfs_open+0x2d/0x40
Mar 15 19:06:32 backup kernel: [  605.876149]  path_openat+0xb69/0x12f0
Mar 15 19:06:32 backup kernel: [  605.876151]  ? filename_lookup+0xcb/0x1d0
Mar 15 19:06:32 backup kernel: [  605.876152]  do_filp_open+0xb6/0x160
Mar 15 19:06:32 backup kernel: [  605.876153]  ? __check_object_size+0x14f/0x160
Mar 15 19:06:32 backup kernel: [  605.876156]  do_sys_openat2+0x9f/0x160
Mar 15 19:06:32 backup kernel: [  605.876158]  __x64_sys_openat+0x56/0xa0
Mar 15 19:06:32 backup kernel: [  605.876159]  do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [  605.876160]  ? syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [  605.876162]  ? __x64_sys_newstat+0x16/0x20
Mar 15 19:06:32 backup kernel: [  605.876163]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.876165]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.876166]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.876168] RIP: 0033:0x7f208eb964e7
Mar 15 19:06:32 backup kernel: [  605.876169] RSP: 002b:00007ffc85497db0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Mar 15 19:06:32 backup kernel: [  605.876170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f208eb964e7
Mar 15 19:06:32 backup kernel: [  605.876171] RDX: 0000000000044000 RSI: 00005626e096da00 RDI: 00000000ffffff9c
Mar 15 19:06:32 backup kernel: [  605.876171] RBP: 00005626e096da00 R08: 0000000000000001 R09: 00007f208ec79be0
Mar 15 19:06:32 backup kernel: [  605.876172] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000044000
Mar 15 19:06:32 backup kernel: [  605.876173] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 15 19:06:32 backup kernel: [  605.876174]  </TASK>
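
The trace is long, so here is a rough sketch for condensing it to one line per blocked task. The function name `summarize_hung_tasks` is made up for illustration, and the pattern is taken only from the `INFO: task ... blocked for more than ... seconds` lines above. In this log it would list `bstore_kv_sync` (the BlueStore key-value sync thread of an OSD) waiting in `sync_file_range`, plus `vgs` and `systemd-udevd` waiting on a mutex while opening or closing a block device.

```python
import re

# Matches hung-task headers such as:
#   "... INFO: task bstore_kv_sync:11549 blocked for more than 362 seconds."
HUNG_TASK_RE = re.compile(r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds")


def summarize_hung_tasks(lines):
    """Return (task_name, pid, seconds_blocked) tuples in log order."""
    tasks = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            tasks.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return tasks
```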