Hi, we are now seeing a kernel panic under medium load on a node, which results in a complete crash of the device.
Sadly it is not reproducible; it happens 3-4 times a week at random times (full load, no load ...).
This never happened before the upgrade from 6.4 -> 7.1 (two weeks ago), so chances are good this is somehow connected.
Code:
[Fri Dec 10 08:38:19 2021] libceph: osd15 down
[Fri Dec 10 08:38:35 2021] libceph: osd13 down
[Fri Dec 10 08:38:36 2021] rbd: rbd6: encountered watch error: -107
[Fri Dec 10 08:38:37 2021] rbd: rbd4: encountered watch error: -107
[Fri Dec 10 08:39:43 2021] INFO: task jbd2/dm-9-8:649 blocked for more than 120 seconds.
[Fri Dec 10 08:39:43 2021] Tainted: P O 5.13.19-1-pve #1
[Fri Dec 10 08:39:43 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Dec 10 08:39:43 2021] task:jbd2/dm-9-8 state:D stack: 0 pid: 649 ppid: 2 flags:0x00004000
[Fri Dec 10 08:39:43 2021] Call Trace:
[Fri Dec 10 08:39:43 2021] ? wbt_cleanup_cb+0x20/0x20
[Fri Dec 10 08:39:43 2021] __schedule+0x2fa/0x910
[Fri Dec 10 08:39:43 2021] ? wbt_cleanup_cb+0x20/0x20
[Fri Dec 10 08:39:43 2021] schedule+0x4f/0xc0
[Fri Dec 10 08:39:43 2021] io_schedule+0x46/0x70
[Fri Dec 10 08:39:43 2021] rq_qos_wait+0xbd/0x150
[Fri Dec 10 08:39:43 2021] ? sysv68_partition+0x280/0x280
[Fri Dec 10 08:39:43 2021] ? wbt_cleanup_cb+0x20/0x20
[Fri Dec 10 08:39:43 2021] wbt_wait+0x9b/0xe0
[Fri Dec 10 08:39:43 2021] __rq_qos_throttle+0x28/0x40
[Fri Dec 10 08:39:43 2021] blk_mq_submit_bio+0x119/0x590
[Fri Dec 10 08:39:43 2021] submit_bio_noacct+0x2dc/0x4f0
[Fri Dec 10 08:39:43 2021] submit_bio+0x4f/0x1b0
[Fri Dec 10 08:39:43 2021] ? bio_add_page+0x6a/0x90
[Fri Dec 10 08:39:43 2021] submit_bh_wbc+0x18d/0x1c0
[Fri Dec 10 08:39:43 2021] submit_bh+0x13/0x20
[Fri Dec 10 08:39:43 2021] jbd2_journal_commit_transaction+0x8ee/0x1910
[Fri Dec 10 08:39:43 2021] kjournald2+0xa9/0x280
[Fri Dec 10 08:39:43 2021] ? wait_woken+0x80/0x80
[Fri Dec 10 08:39:43 2021] ? load_superblock.part.0+0xb0/0xb0
[Fri Dec 10 08:39:43 2021] kthread+0x12b/0x150
[Fri Dec 10 08:39:43 2021] ? set_kthread_struct+0x50/0x50
[Fri Dec 10 08:39:43 2021] ret_from_fork+0x22/0x30
[Fri Dec 10 08:39:43 2021] INFO: task cfs_loop:1691 blocked for more than 120 seconds.
[Fri Dec 10 08:39:43 2021] Tainted: P O 5.13.19-1-pve #1
[Fri Dec 10 08:39:43 2021] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Dec 10 08:39:43 2021] task:cfs_loop state:D stack: 0 pid: 1691 ppid: 1 flags:0x00000000
...
Full log from that time
http://ix.io/3HyP
The trace pointed me to dm-9, so I replaced the OS SSD (different size, different model) just in case, but I still get the same errors.
Code:
root@server:~# dmsetup info /dev/dm-9
Name: pve-root
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 9
Number of targets: 1
UUID: LVM-oAidPr2xUSz6O90Zc1Exu5QLWukpvuTRFAhhr8ejUmP0mnBaYvW8SpZ210OkURKI
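For reference, here is a rough sketch of how dm-9 can be mapped back to the physical disk behind it (assuming the default pve/root LVM naming; the paths are placeholders, not output from my node):
Code:
# Show which physical volume(s) back the root LV
lvs -o lv_name,vg_name,devices pve/root
# Or walk the block device tree upwards from dm-9
lsblk -s /dev/dm-9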
I am not sure whether pve-root is handled differently, or why it could trigger a kernel panic.
Is this something known? Is there something I could test?
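Since the stack trace shows jbd2 stuck in wbt_wait (write-back throttling), this is roughly what I was planning to check myself in the meantime (sdX is only a placeholder for the disk backing pve-root, not my actual device):
Code:
# Current write-back throttling latency target on the SSD backing pve-root
cat /sys/block/sdX/queue/wbt_lat_usec
# echo 0 > /sys/block/sdX/queue/wbt_lat_usec   # would disable WBT entirely, as a test only
# SMART health of the (already replaced) OS SSD
smartctl -a /dev/sdX
# Kernel messages from the boot before the crash, filtered to that disk
journalctl -k -b -1 | grep -iE 'sdX|dm-9'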