Kernel Crashes on IO error on dev dm-X

jakoberpf

Hey People,

I have some devastating issues with my backup Ceph filesystem. It started with a failed disk some days ago, which put the cluster into recovery/rebuild mode. It's based on two erasure coding pools, one being 8/7 and the other 12/11, with a total of 16 disks. I didn't worry about it, as I had more than enough spare capacity and a single failed disk is still within my failure domain. But for the past two days the system has been crashing shortly (10-20 min) after booting, with the following errors (pulled from `/var/log/syslog` after restarting):

Code:
Mar 14 18:21:28 backup kernel: [  282.441797] Buffer I/O error on dev dm-4, logical block 13945792, async page read
Mar 12 11:26:45 backup kernel: [ 1050.328095] Buffer I/O error on dev dm-5, logical block 13945792, async page read
Mar 15 18:49:38 backup kernel: [  203.866160] Buffer I/O error on dev dm-6, logical block 13945792, async page read
Mar 15 18:36:44 backup kernel: [  561.143378] Buffer I/O error on dev dm-7, logical block 13945792, async page read

And this is what my dashboard looks like before Proxmox crashes...

Screenshot 2023-03-15 at 19.24.49.png

So from what I understand I have 4 OSDs which have corrupted sectors.
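
To see exactly which devices and blocks the kernel complained about, the `Buffer I/O error` lines can be tallied per device. The following is a minimal sketch, assuming the line format quoted above; the function name `tally_buffer_io_errors` and the `SAMPLE` constant are made up for illustration, and in practice the lines would come from `/var/log/syslog`.

```python
import re
from collections import defaultdict

# Matches kernel lines of the form quoted in this post, e.g.
# "... Buffer I/O error on dev dm-4, logical block 13945792, async page read"
BUFFER_IO_ERROR = re.compile(
    r"Buffer I/O error on dev (?P<dev>[^,\s]+), logical block (?P<block>\d+)"
)


def tally_buffer_io_errors(lines):
    """Return {device: sorted distinct logical blocks} for all matching lines."""
    blocks_by_dev = defaultdict(set)
    for line in lines:
        match = BUFFER_IO_ERROR.search(line)
        if match:
            blocks_by_dev[match.group("dev")].add(int(match.group("block")))
    return {dev: sorted(blocks) for dev, blocks in sorted(blocks_by_dev.items())}


# The four lines quoted in this post (hypothetical constant, for illustration only).
SAMPLE = """\
Mar 14 18:21:28 backup kernel: [  282.441797] Buffer I/O error on dev dm-4, logical block 13945792, async page read
Mar 12 11:26:45 backup kernel: [ 1050.328095] Buffer I/O error on dev dm-5, logical block 13945792, async page read
Mar 15 18:49:38 backup kernel: [  203.866160] Buffer I/O error on dev dm-6, logical block 13945792, async page read
Mar 15 18:36:44 backup kernel: [  561.143378] Buffer I/O error on dev dm-7, logical block 13945792, async page read
""".splitlines()

if __name__ == "__main__":
    for dev, blocks in tally_buffer_io_errors(SAMPLE).items():
        print(dev, blocks)
```

For the four lines quoted above, this reports one block per device, and it is the same logical block number, 13945792, on dm-4 through dm-7.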

- Damn, how did this happen?
- But more importantly, can I fix this, and how?
- Removing all 4 corrupted OSDs is not an option, as my 12/11 EC pool can only lose 2 OSDs
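
The arithmetic behind the last point can be sketched as follows. This is only an illustration under assumptions: it reads "12/11" as size/min_size and, from the statement that the pool can lose 2 OSDs, takes k=10 data chunks and m=2 coding chunks. The k/m split of the 8/7 pool is not stated, so it is left out. The helper name `ec_pool_tolerance` is hypothetical, and the `k + 1` default for min_size is an assumption about the pool's configuration.

```python
def ec_pool_tolerance(k, m, min_size=None):
    """Summarise how many chunks of one placement group an EC pool can lose.

    k: number of data chunks, m: number of coding chunks.
    min_size: chunks that must be available for the PG to accept I/O;
              assumed to be k + 1 (capped at k + m) when not given.
    """
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    size = k + m
    if min_size is None:
        min_size = min(k + 1, size)  # assumption, not read from the cluster
    if not k <= min_size <= size:
        raise ValueError("min_size must be between k and k + m")
    return {
        "size": size,
        "min_size": min_size,
        # Data can still be reconstructed as long as k chunks survive.
        "max_lost_without_data_loss": m,
        # The PG keeps serving I/O only while min_size chunks are available.
        "max_lost_while_active": size - min_size,
    }


if __name__ == "__main__":
    # Assumed layout of the 12/11 pool: k=10, m=2.
    print(ec_pool_tolerance(k=10, m=2))
```

Under these assumptions, a placement group in the 12/11 pool stays active with one chunk missing and remains recoverable with two missing. These limits apply per placement group, so what matters is how many of the affected OSDs hold chunks of the same group.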


I realise that an EC pool this big was not a good solution, but it is what it is now. Any hints on how I could improve the state of my CephFS would be much appreciated. As a last resort I do have the option to restore everything from my cloud backup, but I would like to avoid this if possible.

Best Regards,
Jakob

Complete Error Logs

Code:
Mar 15 19:06:32 backup kernel: [  605.867967] INFO: task bstore_kv_sync:11549 blocked for more than 362 seconds.
Mar 15 19:06:32 backup kernel: [  605.868192]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.868908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.869623] task:bstore_kv_sync  state:D stack:    0 pid:11549 ppid:     1 flags:0x00000224
Mar 15 19:06:32 backup kernel: [  605.869626] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.869628]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.869629]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.869633]  ? blk_flush_plug_list+0xdd/0x110
Mar 15 19:06:32 backup kernel: [  605.869636]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.869637]  io_schedule+0x46/0x80
Mar 15 19:06:32 backup kernel: [  605.869638]  wait_on_page_bit_common+0x114/0x3e0
Mar 15 19:06:32 backup kernel: [  605.869642]  ? filemap_invalidate_unlock_two+0x50/0x50
Mar 15 19:06:32 backup kernel: [  605.869644]  wait_on_page_bit+0x3f/0x50
Mar 15 19:06:32 backup kernel: [  605.869646]  wait_on_page_writeback+0x26/0x80
Mar 15 19:06:32 backup kernel: [  605.869647]  __filemap_fdatawait_range+0x97/0x120
Mar 15 19:06:32 backup kernel: [  605.869649]  ? filemap_fdatawrite_wbc+0x94/0xe0
Mar 15 19:06:32 backup kernel: [  605.869651]  ? __filemap_fdatawrite_range+0x54/0x70
Mar 15 19:06:32 backup kernel: [  605.869653]  file_fdatawait_range+0x1a/0x30
Mar 15 19:06:32 backup kernel: [  605.869655]  sync_file_range+0xca/0x100
Mar 15 19:06:32 backup kernel: [  605.869658]  __x64_sys_sync_file_range+0x44/0x90
Mar 15 19:06:32 backup kernel: [  605.869660]  do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [  605.869661]  ? handle_mm_fault+0xd8/0x2c0
Mar 15 19:06:32 backup kernel: [  605.869664]  ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [  605.869666]  ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [  605.869668]  ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [  605.869670]  ? exc_page_fault+0x89/0x170
Mar 15 19:06:32 backup kernel: [  605.869671]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.869674] RIP: 0033:0x7fbaf2cf8598
Mar 15 19:06:32 backup kernel: [  605.869675] RSP: 002b:00007fbae3003930 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
Mar 15 19:06:32 backup kernel: [  605.869677] RAX: ffffffffffffffda RBX: 000055fd1f74cfc0 RCX: 00007fbaf2cf8598
Mar 15 19:06:32 backup kernel: [  605.869678] RDX: 0000000000001000 RSI: 000000001b163000 RDI: 000000000000002d
Mar 15 19:06:32 backup kernel: [  605.869679] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 15 19:06:32 backup kernel: [  605.869679] R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000000001
Mar 15 19:06:32 backup kernel: [  605.869680] R13: 0000000000000001 R14: 000000001b163000 R15: 000055fd1f7d2c00
Mar 15 19:06:32 backup kernel: [  605.869682]  </TASK>
Mar 15 19:06:32 backup kernel: [  605.869749] INFO: task vgs:15284 blocked for more than 241 seconds.
Mar 15 19:06:32 backup kernel: [  605.870328]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.871030] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.871735] task:vgs             state:D stack:    0 pid:15284 ppid:  3651 flags:0x00000000
Mar 15 19:06:32 backup kernel: [  605.871737] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.871738]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.871738]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.871740]  ? __smp_call_single_queue+0x59/0x90
Mar 15 19:06:32 backup kernel: [  605.871742]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.871744]  schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.871745]  __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [  605.871747]  __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [  605.871748]  mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [  605.871750]  blkdev_put+0x3a/0x210
Mar 15 19:06:32 backup kernel: [  605.871752]  blkdev_close+0x27/0x40
Mar 15 19:06:32 backup kernel: [  605.871754]  __fput+0x9c/0x280
Mar 15 19:06:32 backup kernel: [  605.871757]  ____fput+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.871758]  task_work_run+0x6a/0xb0
Mar 15 19:06:32 backup kernel: [  605.871760]  exit_to_user_mode_prepare+0x1a8/0x1b0
Mar 15 19:06:32 backup kernel: [  605.871762]  syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [  605.871764]  ? __x64_sys_close+0x12/0x50
Mar 15 19:06:32 backup kernel: [  605.871765]  do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.871766]  ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [  605.871768]  ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [  605.871769]  ? syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [  605.871771]  ? __x64_sys_close+0x12/0x50
Mar 15 19:06:32 backup kernel: [  605.871772]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.871773]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.871774]  ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [  605.871776]  ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [  605.871778]  ? sysvec_apic_timer_interrupt+0x4e/0x90
Mar 15 19:06:32 backup kernel: [  605.871779]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.871782] RIP: 0033:0x7f6a44e6efc3
Mar 15 19:06:32 backup kernel: [  605.871783] RSP: 002b:00007fffab3fd318 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Mar 15 19:06:32 backup kernel: [  605.871784] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f6a44e6efc3
Mar 15 19:06:32 backup kernel: [  605.871785] RDX: 0000558752624020 RSI: 000055875269dac0 RDI: 0000000000000007
Mar 15 19:06:32 backup kernel: [  605.871786] RBP: 00007fffab3fd340 R08: 000055875269dac0 R09: 0000558751dda010
Mar 15 19:06:32 backup kernel: [  605.871786] R10: 00007f6a44f51b80 R11: 0000000000000246 R12: 000055875127af00
Mar 15 19:06:32 backup kernel: [  605.871787] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 15 19:06:32 backup kernel: [  605.871788]  </TASK>
Mar 15 19:06:32 backup kernel: [  605.871789] INFO: task systemd-udevd:15529 blocked for more than 362 seconds.
Mar 15 19:06:32 backup kernel: [  605.872448]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.873162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.873874] task:systemd-udevd   state:D stack:    0 pid:15529 ppid:   773 flags:0x00000224
Mar 15 19:06:32 backup kernel: [  605.873876] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.873877]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.873877]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.873879]  ? page_counter_uncharge+0x22/0x40
Mar 15 19:06:32 backup kernel: [  605.873882]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.873883]  schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.873884]  __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [  605.873886]  ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [  605.873887]  __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [  605.873889]  mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [  605.873890]  blkdev_get_by_dev.part.0+0x55/0x350
Mar 15 19:06:32 backup kernel: [  605.873892]  blkdev_get_by_dev+0x55/0x70
Mar 15 19:06:32 backup kernel: [  605.873894]  ? blkdev_close+0x40/0x40
Mar 15 19:06:32 backup kernel: [  605.873895]  blkdev_open+0x50/0x90
Mar 15 19:06:32 backup kernel: [  605.873897]  do_dentry_open+0x167/0x3f0
Mar 15 19:06:32 backup kernel: [  605.873899]  vfs_open+0x2d/0x40
Mar 15 19:06:32 backup kernel: [  605.873900]  path_openat+0xb69/0x12f0
Mar 15 19:06:32 backup kernel: [  605.873902]  do_filp_open+0xb6/0x160
Mar 15 19:06:32 backup kernel: [  605.873904]  ? __check_object_size+0x14f/0x160
Mar 15 19:06:32 backup kernel: [  605.873906]  do_sys_openat2+0x9f/0x160
Mar 15 19:06:32 backup kernel: [  605.873908]  __x64_sys_openat+0x56/0xa0
Mar 15 19:06:32 backup kernel: [  605.873909]  do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [  605.873911]  ? handle_mm_fault+0xd8/0x2c0
Mar 15 19:06:32 backup kernel: [  605.873913]  ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [  605.873915]  ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [  605.873916]  ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [  605.873918]  ? exc_page_fault+0x89/0x170
Mar 15 19:06:32 backup kernel: [  605.873919]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.873922] RIP: 0033:0x7f979d188767
Mar 15 19:06:32 backup kernel: [  605.873922] RSP: 002b:00007ffe4f2dfe90 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Mar 15 19:06:32 backup kernel: [  605.873923] RAX: ffffffffffffffda RBX: 00007ffe4f2dffc4 RCX: 00007f979d188767
Mar 15 19:06:32 backup kernel: [  605.873924] RDX: 00000000000a0800 RSI: 0000564af4ccc930 RDI: 00000000ffffff9c
Mar 15 19:06:32 backup kernel: [  605.873925] RBP: 0000564af4ccc930 R08: 0000564af4b37600 R09: 0000564af4c38110
Mar 15 19:06:32 backup kernel: [  605.873926] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
Mar 15 19:06:32 backup kernel: [  605.873926] R13: 0000000000000000 R14: 00007ffe4f2dff20 R15: 00007ffe4f2dffc4
Mar 15 19:06:32 backup kernel: [  605.873928]  </TASK>
Mar 15 19:06:32 backup kernel: [  605.873928] INFO: task vgs:16253 blocked for more than 241 seconds.
Mar 15 19:06:32 backup kernel: [  605.874596]       Tainted: P           O      5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [  605.875320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [  605.876126] task:vgs             state:D stack:    0 pid:16253 ppid:  4359 flags:0x00000000
Mar 15 19:06:32 backup kernel: [  605.876128] Call Trace:
Mar 15 19:06:32 backup kernel: [  605.876128]  <TASK>
Mar 15 19:06:32 backup kernel: [  605.876129]  __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [  605.876131]  schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [  605.876132]  schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [  605.876134]  __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [  605.876135]  ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [  605.876137]  __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [  605.876138]  mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [  605.876140]  blkdev_get_by_dev.part.0+0x55/0x350
Mar 15 19:06:32 backup kernel: [  605.876142]  blkdev_get_by_dev+0x55/0x70
Mar 15 19:06:32 backup kernel: [  605.876144]  ? blkdev_close+0x40/0x40
Mar 15 19:06:32 backup kernel: [  605.876145]  blkdev_open+0x50/0x90
Mar 15 19:06:32 backup kernel: [  605.876147]  do_dentry_open+0x167/0x3f0
Mar 15 19:06:32 backup kernel: [  605.876148]  vfs_open+0x2d/0x40
Mar 15 19:06:32 backup kernel: [  605.876149]  path_openat+0xb69/0x12f0
Mar 15 19:06:32 backup kernel: [  605.876151]  ? filename_lookup+0xcb/0x1d0
Mar 15 19:06:32 backup kernel: [  605.876152]  do_filp_open+0xb6/0x160
Mar 15 19:06:32 backup kernel: [  605.876153]  ? __check_object_size+0x14f/0x160
Mar 15 19:06:32 backup kernel: [  605.876156]  do_sys_openat2+0x9f/0x160
Mar 15 19:06:32 backup kernel: [  605.876158]  __x64_sys_openat+0x56/0xa0
Mar 15 19:06:32 backup kernel: [  605.876159]  do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [  605.876160]  ? syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [  605.876162]  ? __x64_sys_newstat+0x16/0x20
Mar 15 19:06:32 backup kernel: [  605.876163]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.876165]  ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [  605.876166]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [  605.876168] RIP: 0033:0x7f208eb964e7
Mar 15 19:06:32 backup kernel: [  605.876169] RSP: 002b:00007ffc85497db0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Mar 15 19:06:32 backup kernel: [  605.876170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f208eb964e7
Mar 15 19:06:32 backup kernel: [  605.876171] RDX: 0000000000044000 RSI: 00005626e096da00 RDI: 00000000ffffff9c
Mar 15 19:06:32 backup kernel: [  605.876171] RBP: 00005626e096da00 R08: 0000000000000001 R09: 00007f208ec79be0
Mar 15 19:06:32 backup kernel: [  605.876172] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000044000
Mar 15 19:06:32 backup kernel: [  605.876173] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 15 19:06:32 backup kernel: [  605.876174]  </TASK>
 