Hey People,
I have some devastating issues with my backup Ceph filesystem. It started with a failed disk a few days ago, which put the cluster into recovery/rebuild mode. It's based on two erasure-coded pools, one being 8/7 and the other 12/11, with a total of 16 disks. I didn't worry about anything, as I had more than enough spare capacity and a single disk failure still fits within my failure domain. But for the past two days the system has been crashing shortly (10-20 min) after booting, with the following errors (pulled from `/var/log/syslog` after restarting):
Code:
Mar 14 18:21:28 backup kernel: [ 282.441797] Buffer I/O error on dev dm-4, logical block 13945792, async page read
Mar 12 11:26:45 backup kernel: [ 1050.328095] Buffer I/O error on dev dm-5, logical block 13945792, async page read
Mar 15 18:49:38 backup kernel: [ 203.866160] Buffer I/O error on dev dm-6, logical block 13945792, async page read
Mar 15 18:36:44 backup kernel: [ 561.143378] Buffer I/O error on dev dm-7, logical block 13945792, async page read
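To see the pattern in these four lines at a glance, a small Python sketch like the one below could group the errors by device. The regular expression and the function name `group_io_errors` are assumptions made for illustration, derived only from the lines quoted above; this is not part of any Ceph or Proxmox tool.

```python
import re
from collections import defaultdict

# Pattern assumed from the four syslog lines quoted above.
LINE_RE = re.compile(
    r"Buffer I/O error on dev (?P<dev>[\w-]+), logical block (?P<block>\d+)"
)


def group_io_errors(lines):
    """Map each device name to the sorted list of failing logical blocks."""
    errors = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            errors[match.group("dev")].add(int(match.group("block")))
    return {dev: sorted(blocks) for dev, blocks in errors.items()}


# The four lines from the syslog excerpt, copied verbatim.
SAMPLE = [
    "Mar 14 18:21:28 backup kernel: [ 282.441797] Buffer I/O error on dev dm-4, logical block 13945792, async page read",
    "Mar 12 11:26:45 backup kernel: [ 1050.328095] Buffer I/O error on dev dm-5, logical block 13945792, async page read",
    "Mar 15 18:49:38 backup kernel: [ 203.866160] Buffer I/O error on dev dm-6, logical block 13945792, async page read",
    "Mar 15 18:36:44 backup kernel: [ 561.143378] Buffer I/O error on dev dm-7, logical block 13945792, async page read",
]

if __name__ == "__main__":
    for dev, blocks in sorted(group_io_errors(SAMPLE).items()):
        print(dev, blocks)
```

Run against the excerpt, this lists the same logical block, 13945792, for each of dm-4 through dm-7.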
And this is what my dashboard looks like before Proxmox crashes...
So, from what I understand, I have 4 OSDs with corrupted sectors.
- Damn, how did this happen?
- But more importantly, can I fix this, and how?
- Removing all 4 corrupted OSDs is not an option, as my 12/11 EC pool can only lose 2 OSDs
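The constraint in the last point can be written out as a small calculation. This is only a sketch under stated assumptions: it reads 12/11 and 8/7 as size/min_size, and it assumes min_size is Ceph's default of k+1 for erasure-coded pools, which would give k=10, m=2 and k=6, m=2. The actual erasure-code profiles are not shown in this post, and the class `EcPool` is a hypothetical helper, not a Ceph API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EcPool:
    size: int      # total chunks per object (k + m)
    min_size: int  # chunks a placement group needs to keep serving I/O

    @property
    def k(self) -> int:
        # ASSUMPTION: min_size was left at the default of k + 1.
        return self.min_size - 1

    @property
    def m(self) -> int:
        return self.size - self.k

    def max_lost_while_active(self) -> int:
        """Chunks that may be missing before a PG stops serving I/O."""
        return self.size - self.min_size

    def survives(self, lost_osds: int) -> bool:
        """Worst case: every lost OSD holds a chunk of the same PG."""
        return lost_osds <= self.m


big = EcPool(size=12, min_size=11)
small = EcPool(size=8, min_size=7)

if __name__ == "__main__":
    for name, pool in (("12/11", big), ("8/7", small)):
        print(name, "k =", pool.k, "m =", pool.m,
              "survives 4 lost OSDs:", pool.survives(4))
```

Under these assumptions both pools tolerate at most two lost chunks per placement group, so in the worst case losing all four OSDs would be unrecoverable for either pool.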
I realise that an EC pool this big was not a good solution, but it is what it is now. Any hints on how I could improve the state of my CephFS would be much appreciated. As a last resort I do have the option to restore everything from my cloud backup, but I would like to avoid this if possible.
Best Regards,
Jakob
Complete Error Logs
Code:
Mar 15 19:06:32 backup kernel: [ 605.867967] INFO: task bstore_kv_sync:11549 blocked for more than 362 seconds.
Mar 15 19:06:32 backup kernel: [ 605.868192] Tainted: P O 5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [ 605.868908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [ 605.869623] task:bstore_kv_sync state:D stack: 0 pid:11549 ppid: 1 flags:0x00000224
Mar 15 19:06:32 backup kernel: [ 605.869626] Call Trace:
Mar 15 19:06:32 backup kernel: [ 605.869628] <TASK>
Mar 15 19:06:32 backup kernel: [ 605.869629] __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [ 605.869633] ? blk_flush_plug_list+0xdd/0x110
Mar 15 19:06:32 backup kernel: [ 605.869636] schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [ 605.869637] io_schedule+0x46/0x80
Mar 15 19:06:32 backup kernel: [ 605.869638] wait_on_page_bit_common+0x114/0x3e0
Mar 15 19:06:32 backup kernel: [ 605.869642] ? filemap_invalidate_unlock_two+0x50/0x50
Mar 15 19:06:32 backup kernel: [ 605.869644] wait_on_page_bit+0x3f/0x50
Mar 15 19:06:32 backup kernel: [ 605.869646] wait_on_page_writeback+0x26/0x80
Mar 15 19:06:32 backup kernel: [ 605.869647] __filemap_fdatawait_range+0x97/0x120
Mar 15 19:06:32 backup kernel: [ 605.869649] ? filemap_fdatawrite_wbc+0x94/0xe0
Mar 15 19:06:32 backup kernel: [ 605.869651] ? __filemap_fdatawrite_range+0x54/0x70
Mar 15 19:06:32 backup kernel: [ 605.869653] file_fdatawait_range+0x1a/0x30
Mar 15 19:06:32 backup kernel: [ 605.869655] sync_file_range+0xca/0x100
Mar 15 19:06:32 backup kernel: [ 605.869658] __x64_sys_sync_file_range+0x44/0x90
Mar 15 19:06:32 backup kernel: [ 605.869660] do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [ 605.869661] ? handle_mm_fault+0xd8/0x2c0
Mar 15 19:06:32 backup kernel: [ 605.869664] ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [ 605.869666] ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [ 605.869668] ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [ 605.869670] ? exc_page_fault+0x89/0x170
Mar 15 19:06:32 backup kernel: [ 605.869671] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [ 605.869674] RIP: 0033:0x7fbaf2cf8598
Mar 15 19:06:32 backup kernel: [ 605.869675] RSP: 002b:00007fbae3003930 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
Mar 15 19:06:32 backup kernel: [ 605.869677] RAX: ffffffffffffffda RBX: 000055fd1f74cfc0 RCX: 00007fbaf2cf8598
Mar 15 19:06:32 backup kernel: [ 605.869678] RDX: 0000000000001000 RSI: 000000001b163000 RDI: 000000000000002d
Mar 15 19:06:32 backup kernel: [ 605.869679] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Mar 15 19:06:32 backup kernel: [ 605.869679] R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000000001
Mar 15 19:06:32 backup kernel: [ 605.869680] R13: 0000000000000001 R14: 000000001b163000 R15: 000055fd1f7d2c00
Mar 15 19:06:32 backup kernel: [ 605.869682] </TASK>
Mar 15 19:06:32 backup kernel: [ 605.869749] INFO: task vgs:15284 blocked for more than 241 seconds.
Mar 15 19:06:32 backup kernel: [ 605.870328] Tainted: P O 5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [ 605.871030] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [ 605.871735] task:vgs state:D stack: 0 pid:15284 ppid: 3651 flags:0x00000000
Mar 15 19:06:32 backup kernel: [ 605.871737] Call Trace:
Mar 15 19:06:32 backup kernel: [ 605.871738] <TASK>
Mar 15 19:06:32 backup kernel: [ 605.871738] __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [ 605.871740] ? __smp_call_single_queue+0x59/0x90
Mar 15 19:06:32 backup kernel: [ 605.871742] schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [ 605.871744] schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [ 605.871745] __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [ 605.871747] __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [ 605.871748] mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [ 605.871750] blkdev_put+0x3a/0x210
Mar 15 19:06:32 backup kernel: [ 605.871752] blkdev_close+0x27/0x40
Mar 15 19:06:32 backup kernel: [ 605.871754] __fput+0x9c/0x280
Mar 15 19:06:32 backup kernel: [ 605.871757] ____fput+0xe/0x20
Mar 15 19:06:32 backup kernel: [ 605.871758] task_work_run+0x6a/0xb0
Mar 15 19:06:32 backup kernel: [ 605.871760] exit_to_user_mode_prepare+0x1a8/0x1b0
Mar 15 19:06:32 backup kernel: [ 605.871762] syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [ 605.871764] ? __x64_sys_close+0x12/0x50
Mar 15 19:06:32 backup kernel: [ 605.871765] do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [ 605.871766] ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [ 605.871768] ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [ 605.871769] ? syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [ 605.871771] ? __x64_sys_close+0x12/0x50
Mar 15 19:06:32 backup kernel: [ 605.871772] ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [ 605.871773] ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [ 605.871774] ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [ 605.871776] ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [ 605.871778] ? sysvec_apic_timer_interrupt+0x4e/0x90
Mar 15 19:06:32 backup kernel: [ 605.871779] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [ 605.871782] RIP: 0033:0x7f6a44e6efc3
Mar 15 19:06:32 backup kernel: [ 605.871783] RSP: 002b:00007fffab3fd318 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Mar 15 19:06:32 backup kernel: [ 605.871784] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f6a44e6efc3
Mar 15 19:06:32 backup kernel: [ 605.871785] RDX: 0000558752624020 RSI: 000055875269dac0 RDI: 0000000000000007
Mar 15 19:06:32 backup kernel: [ 605.871786] RBP: 00007fffab3fd340 R08: 000055875269dac0 R09: 0000558751dda010
Mar 15 19:06:32 backup kernel: [ 605.871786] R10: 00007f6a44f51b80 R11: 0000000000000246 R12: 000055875127af00
Mar 15 19:06:32 backup kernel: [ 605.871787] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 15 19:06:32 backup kernel: [ 605.871788] </TASK>
Mar 15 19:06:32 backup kernel: [ 605.871789] INFO: task systemd-udevd:15529 blocked for more than 362 seconds.
Mar 15 19:06:32 backup kernel: [ 605.872448] Tainted: P O 5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [ 605.873162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [ 605.873874] task:systemd-udevd state:D stack: 0 pid:15529 ppid: 773 flags:0x00000224
Mar 15 19:06:32 backup kernel: [ 605.873876] Call Trace:
Mar 15 19:06:32 backup kernel: [ 605.873877] <TASK>
Mar 15 19:06:32 backup kernel: [ 605.873877] __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [ 605.873879] ? page_counter_uncharge+0x22/0x40
Mar 15 19:06:32 backup kernel: [ 605.873882] schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [ 605.873883] schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [ 605.873884] __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [ 605.873886] ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [ 605.873887] __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [ 605.873889] mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [ 605.873890] blkdev_get_by_dev.part.0+0x55/0x350
Mar 15 19:06:32 backup kernel: [ 605.873892] blkdev_get_by_dev+0x55/0x70
Mar 15 19:06:32 backup kernel: [ 605.873894] ? blkdev_close+0x40/0x40
Mar 15 19:06:32 backup kernel: [ 605.873895] blkdev_open+0x50/0x90
Mar 15 19:06:32 backup kernel: [ 605.873897] do_dentry_open+0x167/0x3f0
Mar 15 19:06:32 backup kernel: [ 605.873899] vfs_open+0x2d/0x40
Mar 15 19:06:32 backup kernel: [ 605.873900] path_openat+0xb69/0x12f0
Mar 15 19:06:32 backup kernel: [ 605.873902] do_filp_open+0xb6/0x160
Mar 15 19:06:32 backup kernel: [ 605.873904] ? __check_object_size+0x14f/0x160
Mar 15 19:06:32 backup kernel: [ 605.873906] do_sys_openat2+0x9f/0x160
Mar 15 19:06:32 backup kernel: [ 605.873908] __x64_sys_openat+0x56/0xa0
Mar 15 19:06:32 backup kernel: [ 605.873909] do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [ 605.873911] ? handle_mm_fault+0xd8/0x2c0
Mar 15 19:06:32 backup kernel: [ 605.873913] ? exit_to_user_mode_prepare+0x37/0x1b0
Mar 15 19:06:32 backup kernel: [ 605.873915] ? irqentry_exit_to_user_mode+0x9/0x20
Mar 15 19:06:32 backup kernel: [ 605.873916] ? irqentry_exit+0x1d/0x30
Mar 15 19:06:32 backup kernel: [ 605.873918] ? exc_page_fault+0x89/0x170
Mar 15 19:06:32 backup kernel: [ 605.873919] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [ 605.873922] RIP: 0033:0x7f979d188767
Mar 15 19:06:32 backup kernel: [ 605.873922] RSP: 002b:00007ffe4f2dfe90 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Mar 15 19:06:32 backup kernel: [ 605.873923] RAX: ffffffffffffffda RBX: 00007ffe4f2dffc4 RCX: 00007f979d188767
Mar 15 19:06:32 backup kernel: [ 605.873924] RDX: 00000000000a0800 RSI: 0000564af4ccc930 RDI: 00000000ffffff9c
Mar 15 19:06:32 backup kernel: [ 605.873925] RBP: 0000564af4ccc930 R08: 0000564af4b37600 R09: 0000564af4c38110
Mar 15 19:06:32 backup kernel: [ 605.873926] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000a0800
Mar 15 19:06:32 backup kernel: [ 605.873926] R13: 0000000000000000 R14: 00007ffe4f2dff20 R15: 00007ffe4f2dffc4
Mar 15 19:06:32 backup kernel: [ 605.873928] </TASK>
Mar 15 19:06:32 backup kernel: [ 605.873928] INFO: task vgs:16253 blocked for more than 241 seconds.
Mar 15 19:06:32 backup kernel: [ 605.874596] Tainted: P O 5.15.85-1-pve #1
Mar 15 19:06:32 backup kernel: [ 605.875320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 19:06:32 backup kernel: [ 605.876126] task:vgs state:D stack: 0 pid:16253 ppid: 4359 flags:0x00000000
Mar 15 19:06:32 backup kernel: [ 605.876128] Call Trace:
Mar 15 19:06:32 backup kernel: [ 605.876128] <TASK>
Mar 15 19:06:32 backup kernel: [ 605.876129] __schedule+0x34e/0x1740
Mar 15 19:06:32 backup kernel: [ 605.876131] schedule+0x69/0x110
Mar 15 19:06:32 backup kernel: [ 605.876132] schedule_preempt_disabled+0xe/0x20
Mar 15 19:06:32 backup kernel: [ 605.876134] __mutex_lock.constprop.0+0x255/0x480
Mar 15 19:06:32 backup kernel: [ 605.876135] ? __cond_resched+0x1a/0x50
Mar 15 19:06:32 backup kernel: [ 605.876137] __mutex_lock_slowpath+0x13/0x20
Mar 15 19:06:32 backup kernel: [ 605.876138] mutex_lock+0x38/0x50
Mar 15 19:06:32 backup kernel: [ 605.876140] blkdev_get_by_dev.part.0+0x55/0x350
Mar 15 19:06:32 backup kernel: [ 605.876142] blkdev_get_by_dev+0x55/0x70
Mar 15 19:06:32 backup kernel: [ 605.876144] ? blkdev_close+0x40/0x40
Mar 15 19:06:32 backup kernel: [ 605.876145] blkdev_open+0x50/0x90
Mar 15 19:06:32 backup kernel: [ 605.876147] do_dentry_open+0x167/0x3f0
Mar 15 19:06:32 backup kernel: [ 605.876148] vfs_open+0x2d/0x40
Mar 15 19:06:32 backup kernel: [ 605.876149] path_openat+0xb69/0x12f0
Mar 15 19:06:32 backup kernel: [ 605.876151] ? filename_lookup+0xcb/0x1d0
Mar 15 19:06:32 backup kernel: [ 605.876152] do_filp_open+0xb6/0x160
Mar 15 19:06:32 backup kernel: [ 605.876153] ? __check_object_size+0x14f/0x160
Mar 15 19:06:32 backup kernel: [ 605.876156] do_sys_openat2+0x9f/0x160
Mar 15 19:06:32 backup kernel: [ 605.876158] __x64_sys_openat+0x56/0xa0
Mar 15 19:06:32 backup kernel: [ 605.876159] do_syscall_64+0x59/0xc0
Mar 15 19:06:32 backup kernel: [ 605.876160] ? syscall_exit_to_user_mode+0x27/0x50
Mar 15 19:06:32 backup kernel: [ 605.876162] ? __x64_sys_newstat+0x16/0x20
Mar 15 19:06:32 backup kernel: [ 605.876163] ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [ 605.876165] ? do_syscall_64+0x69/0xc0
Mar 15 19:06:32 backup kernel: [ 605.876166] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Mar 15 19:06:32 backup kernel: [ 605.876168] RIP: 0033:0x7f208eb964e7
Mar 15 19:06:32 backup kernel: [ 605.876169] RSP: 002b:00007ffc85497db0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
Mar 15 19:06:32 backup kernel: [ 605.876170] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f208eb964e7
Mar 15 19:06:32 backup kernel: [ 605.876171] RDX: 0000000000044000 RSI: 00005626e096da00 RDI: 00000000ffffff9c
Mar 15 19:06:32 backup kernel: [ 605.876171] RBP: 00005626e096da00 R08: 0000000000000001 R09: 00007f208ec79be0
Mar 15 19:06:32 backup kernel: [ 605.876172] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000044000
Mar 15 19:06:32 backup kernel: [ 605.876173] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Mar 15 19:06:32 backup kernel: [ 605.876174] </TASK>