PVE 9.1 with Kernel 6.17 - Unstable?

Dear Community,

I've experienced many issues with PVE 9.1 and the new kernel 6.17.
I can't name them all exactly, but there were I/O hangs, kernel stack traces, and so on.

6.17.2-1-pve was the worst; it got a bit better with 6.17.2-2-pve.

Yesterday I had I/O timeouts while the PBS backup was running.
Today RBD started to have issues during a normal migration:

Code:
[34409.788606] INFO: task kworker/u128:3:723688 blocked for more than 245 seconds.
[34409.788932]       Tainted: P           O        6.17.2-2-pve #1
[34409.789223] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[34409.789525] task:kworker/u128:3  state:D stack:0     pid:723688 tgid:723688 ppid:2      task_flags:0x4208060 flags:0x00004000
[34409.789825] Workqueue: rbd28-tasks rbd_release_lock_work [rbd]
[34409.790128] Call Trace:
[34409.790406]  <TASK>
[34409.790707]  __schedule+0x468/0x1310
[34409.790986]  ? srso_alias_return_thunk+0x5/0xfbef5
[34409.791258]  ? sched_clock_noinstr+0x9/0x10
[34409.791578]  ? srso_alias_return_thunk+0x5/0xfbef5
[34409.791850]  ? sched_clock+0x10/0x30
[34409.792119]  ? srso_alias_return_thunk+0x5/0xfbef5
[34409.792392]  schedule+0x27/0xf0
[34409.792671]  schedule_timeout+0xcf/0x110
[34409.792941]  __wait_for_common+0x98/0x1b0
[34409.793211]  ? __pfx_schedule_timeout+0x10/0x10
[34409.793499]  wait_for_completion+0x24/0x40
[34409.793771]  rbd_quiesce_lock+0xa6/0xf0 [rbd]
[34409.794050]  rbd_release_lock_work+0x2f/0xc0 [rbd]
[34409.794327]  process_one_work+0x18b/0x370
[34409.794606]  worker_thread+0x33a/0x480
[34409.794872]  ? __pfx_worker_thread+0x10/0x10
[34409.795134]  kthread+0x10b/0x220
[34409.795391]  ? __pfx_kthread+0x10/0x10
[34409.795652]  ret_from_fork+0x208/0x240
[34409.795935]  ? __pfx_kthread+0x10/0x10
[34409.796246]  ret_from_fork_asm+0x1a/0x30
[34409.796522]  </TASK>


My current workaround is to pin the previous kernel, 6.14.11-4-pve, which was working fine in my environment.
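
In case it helps anyone, pinning is straightforward with proxmox-boot-tool; this is roughly what I ran (the version string is the one from my environment, check proxmox-boot-tool kernel list for yours):

Code:
# list the kernels currently known to the bootloader
proxmox-boot-tool kernel list

# pin the known-good 6.14 kernel so it stays the boot default
proxmox-boot-tool kernel pin 6.14.11-4-pve

# once a fixed 6.17 arrives, the pin can be removed again
proxmox-boot-tool kernel unpin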

I wonder if others are experiencing similar issues in their environments.

Could it be related to my having KRBD enabled? Is it advisable to disable this option?
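
If disabling KRBD does turn out to be advisable, it should just be a per-storage toggle; a sketch of what I would try (the storage ID "ceph-vm" is only a placeholder for my RBD storage entry, and as far as I know running guests only pick the change up after a restart or migration):

Code:
# check how the RBD storage is currently configured (look for "krbd 1")
grep -A 6 '^rbd:' /etc/pve/storage.cfg

# switch the storage to librbd instead of the kernel RBD client
pvesm set ceph-vm --krbd 0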

Please excuse the crudeness of this post; I'm trying to collect information that might lead to the root cause of the issue.
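
For comparison, this is how I'm pulling the traces, simply out of the kernel log:

Code:
# hung-task messages and stack traces from the current boot
journalctl -k -b | grep -i -A 40 'blocked for more than'

# or straight from dmesg with readable timestamps
dmesg -T | grep -i 'blocked for more than'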

Thanks!
Best regards,
Bernhard
 
Could be related to this:
I have the same situation. The only solution I've seen is going back to 6.14, but in my case, starting from a fresh Proxmox 9.1 install, I don't even have 6.14 installed, and installing it always triggers this bug and freezes.
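
For anyone else hitting this on a fresh 9.1 install, the 6.14 series should in principle still be installable from the standard repos (package name assumed from the proxmox-kernel-X.Y naming scheme); your mileage may vary, as described above:

Code:
# install the older opt-in kernel series and pin it
apt update
apt install proxmox-kernel-6.14
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin <6.14-version-shown-by-kernel-list>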