Proxmox Oops / page fault

AlpsView

New Member
Apr 1, 2025
I ran into a page fault that left PVE in an only partly responsive state and completely crashed the involved VM. Recovery was only possible after a hard reboot.

The crash seems to correlate with a backup job started in a VM. An external USB disk is passed through to the VM, and the VM is performing a simple local rsync.
From what I think I can read from the logs, it seems PVE (kvm) tried to access the memory address of the device, but that pointer had been lost or corrupted (for whatever reason)?

I can't rule out an HDD hardware issue (as I haven't done the related analysis yet). However, even if that led to an unresponsive storage device, shouldn't PVE be able to handle it gracefully? I'll keep investigating, but it would be nice to get some feedback on whether my interpretation of the logs is correct so far.
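
For the disk and RAM analysis I still have to do, I'm planning something along these lines (device names are placeholders):

Code:
# SMART health and a long self-test on the suspect disk (replace /dev/sdX)
smartctl -a /dev/sdX
smartctl -t long /dev/sdX      # check the result later with: smartctl -l selftest /dev/sdX

# quick userspace RAM check (a thorough test needs memtest86+ from the boot menu)
apt install memtester
memtester 4G 1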

[Edit] -> Also happening with the related USB device disconnected


Code:
May 22 07:35:20 proxmox kernel: BUG: unable to handle page fault for address: 0000000000b8838b
May 22 07:35:20 proxmox kernel: #PF: supervisor write access in kernel mode
May 22 07:35:20 proxmox kernel: #PF: error_code(0x0002) - not-present page
May 22 07:35:20 proxmox kernel: PGD 0 P4D 0
May 22 07:35:20 proxmox kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
May 22 07:35:20 proxmox kernel: CPU: 4 PID: 214740 Comm: kvm Tainted: P W O 6.8.12-10-pve #1
May 22 07:35:20 proxmox kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 09/25/2024
May 22 07:35:20 proxmox kernel: RIP: 0010:blk_mq_delay_run_hw_queue+0x24/0x140
May 22 07:35:20 proxmox kernel: Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 47 18 48 89 fb a8 01 0f 85 07 01 00 00 89 f7 e8 fc d2 a6 ff <49> 89 84 48 8b 83 b8 00 00 00 83 78 34 01 0f 84 ad 00 00 00 8b 83
May 22 07:35:20 proxmox kernel: RSP: 0018:ffffae0d08a0f730 EFLAGS: 00010246
May 22 07:35:20 proxmox kernel: RAX: 0000000000000000 RBX: ffff9a0f13788000 RCX: 0000000000000000
May 22 07:35:20 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 22 07:35:20 proxmox kernel: RBP: ffffae0d08a0f748 R08: 0000000000000000 R09: 0000000000000000
May 22 07:35:20 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
May 22 07:35:20 proxmox kernel: R13: ffff9a0f16d2b0e0 R14: 0000000000000000 R15: 0000000000000001
May 22 07:35:20 proxmox kernel: FS: 00007d1a16c1f5c0(0000) GS:ffff9a165fa00000(0000) knlGS:0000000000000000
May 22 07:35:20 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 07:35:20 proxmox kernel: CR2: 0000000000b8838b CR3: 0000000480112000 CR4: 0000000000f52ef0
May 22 07:35:20 proxmox kernel: PKRU: 55555554
May 22 07:35:20 proxmox kernel: Call Trace:
May 22 07:35:20 proxmox kernel: <TASK>
May 22 07:35:20 proxmox kernel: ? show_regs+0x6d/0x80
May 22 07:35:20 proxmox kernel: ? __die+0x24/0x80
May 22 07:35:20 proxmox kernel: ? page_fault_oops+0x176/0x500
May 22 07:35:20 proxmox kernel: ? raw_spin_rq_unlock+0x10/0x40
May 22 07:35:20 proxmox kernel: ? load_balance+0x96d/0xfd0
May 22 07:35:20 proxmox kernel: ? do_user_addr_fault+0x2f5/0x660
May 22 07:35:20 proxmox kernel: ? exc_page_fault+0x83/0x1b0
May 22 07:35:20 proxmox kernel: ? asm_exc_page_fault+0x27/0x30
May 22 07:35:20 proxmox kernel: ? blk_mq_delay_run_hw_queue+0x24/0x140
May 22 07:35:20 proxmox kernel: blk_mq_run_hw_queue+0x1fa/0x350
May 22 07:35:20 proxmox kernel: blk_mq_submit_bio+0x280/0x690
May 22 07:35:20 proxmox kernel: __submit_bio+0xb3/0x1c0
May 22 07:35:20 proxmox kernel: submit_bio_noacct_nocheck+0x2b7/0x390
May 22 07:35:20 proxmox kernel: submit_bio_noacct+0x1f3/0x650
May 22 07:35:20 proxmox kernel: submit_bio+0xb2/0x110
May 22 07:35:20 proxmox kernel: blkdev_direct_IO.part.0+0x23b/0x5c0
May 22 07:35:20 proxmox kernel: ? current_time+0x3c/0xf0
May 22 07:35:20 proxmox kernel: ? atime_needs_update+0xa8/0x130
May 22 07:35:20 proxmox kernel: blkdev_read_iter+0xbd/0x160
May 22 07:35:20 proxmox kernel: ? rw_verify_area+0xc7/0x140
May 22 07:35:20 proxmox kernel: __io_read+0xf6/0x590
May 22 07:35:20 proxmox kernel: io_read+0x17/0x50
May 22 07:35:20 proxmox kernel: io_issue_sqe+0x61/0x400
May 22 07:35:20 proxmox kernel: ? io_prep_rwv+0x27/0xd0
May 22 07:35:20 proxmox kernel: io_submit_sqes+0x207/0x6d0
May 22 07:35:20 proxmox kernel: __do_sys_io_uring_enter+0x465/0xc10
May 22 07:35:20 proxmox kernel: __x64_sys_io_uring_enter+0x22/0x40
May 22 07:35:20 proxmox kernel: x64_sys_call+0x2312/0x2480
May 22 07:35:20 proxmox kernel: do_syscall_64+0x81/0x170
May 22 07:35:20 proxmox kernel: ? io_clean_op+0xdf/0x1b0
May 22 07:35:20 proxmox kernel: ? __io_submit_flush_completions+0x181/0x410
May 22 07:35:20 proxmox kernel: ? ctx_flush_and_put+0x50/0xd0
May 22 07:35:20 proxmox kernel: ? tctx_task_work+0x122/0x210
May 22 07:35:20 proxmox kernel: ? task_work_run+0x66/0xa0
May 22 07:35:20 proxmox kernel: ? get_signal+0xa6/0xab0
May 22 07:35:20 proxmox kernel: ? arch_do_signal_or_restart+0x42/0x280
May 22 07:35:20 proxmox kernel: ? irqentry_exit_to_user_mode+0x7b/0x260
May 22 07:35:20 proxmox kernel: ? irqentry_exit+0x43/0x50
May 22 07:35:20 proxmox kernel: ? common_interrupt+0x54/0xb0
May 22 07:35:20 proxmox kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
May 22 07:35:20 proxmox kernel: RIP: 0033:0x7d1a1a973b95
May 22 07:35:20 proxmox kernel: Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
May 22 07:35:20 proxmox kernel: RSP: 002b:00007ffe4fa64d68 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
May 22 07:35:20 proxmox kernel: RAX: ffffffffffffffda RBX: 0000575848cf9370 RCX: 00007d1a1a973b95
May 22 07:35:20 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000011b
May 22 07:35:20 proxmox kernel: RBP: 0000575848cf9378 R08: 0000000000000000 R09: 0000000000000008
May 22 07:35:20 proxmox kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000575848cf9460
May 22 07:35:20 proxmox kernel: R13: 0000000000000000 R14: 000057582465b0c8 R15: 00005758496d68c0
May 22 07:35:20 proxmox kernel: </TASK>
May 22 07:35:20 proxmox kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables nvme_fabrics nvme_keyring bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common xe drm_gpuvm drm_exec x86_pkg_temp_thermal gpu_sched intel_powerclamp drm_suballoc_helper drm_ttm_helper kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_hdcp mei_pxp rapl i915 cmdlinepart spi_nor mei_me intel_cstate mtd wmi_bmof ov13858 drm_buddy pcspkr v4l2_fwnode mei ttm v4l2_async drm_display_helper videodev cec mc intel_pmc_core rc_core i2c_algo_bit intel_vsec igen6_edac pmt_telemetry acpi_pad pmt_class acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio
May 22 07:35:20 proxmox kernel: libcrc32c nvme nvme_core xhci_pci xhci_pci_renesas crc32_pclmul ixgbe nvme_auth xhci_hcd xfrm_algo igc i2c_i801 spi_intel_pci ahci dca spi_intel i2c_smbus mdio libahci video wmi
May 22 07:35:20 proxmox kernel: CR2: 0000000000b8838b
May 22 07:35:20 proxmox kernel: ---[ end trace 0000000000000000 ]---

After this initial event, the following error message repeats constantly:

Code:
May 22 07:35:20 proxmox kernel: RIP: 0010:blk_mq_delay_run_hw_queue+0x24/0x140
May 22 07:35:20 proxmox kernel: Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 47 18 48 89 fb a8 01 0f 85 07 01 00 00 89 f7 e8 fc d2 a6 ff <49> 89 c4 48 8b 83 b8 00 00 00 83 78 34 01 0f 84 ad 00 00 00 8b 83
May 22 07:35:20 proxmox kernel: RSP: 0018:ffffae0d08a0f730 EFLAGS: 00010246
May 22 07:35:20 proxmox kernel: RAX: 0000000000000000 RBX: ffff9a0f13788000 RCX: 0000000000000000
May 22 07:35:20 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 22 07:35:20 proxmox kernel: RBP: ffffae0d08a0f748 R08: 0000000000000000 R09: 0000000000000000
May 22 07:35:20 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
May 22 07:35:20 proxmox kernel: R13: ffff9a0f16d2b0e0 R14: 0000000000000000 R15: 0000000000000001
May 22 07:35:20 proxmox kernel: FS: 00007d1a16c1f5c0(0000) GS:ffff9a165fb00000(0000) knlGS:0000000000000000
May 22 07:35:20 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 07:35:20 proxmox kernel: CR2: 0000000000b8838b CR3: 0000000480112000 CR4: 0000000000f52ef0
May 22 07:35:20 proxmox kernel: PKRU: 55555554
May 22 07:35:20 proxmox kernel: note: iou-wrk-214740[824986] exited with irqs disabled
May 22 07:35:21 proxmox pvedaemon[807254]: VM 1007 qmp command failed - VM 1007 qmp command 'guest-ping' failed - got timeout

VM 1007 is the VM executing the backup job. It seems the VM died after the first issue; however, I think the VM died as a result of the error condition on PVE, not the other way round.

In the logs on VM 1007 I can't find anything indicating an error prior to the VM crashing. I can see the rsync running and all drives mounted and working. The last file copy started on 22.05.2025 at 07:35:15, five seconds before the kernel crashed.
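
To line the two logs up, I'm simply comparing the same narrow time window on the host and inside the VM, roughly like this:

Code:
# on the PVE host: kernel messages around the crash
journalctl -k --since "2025-05-22 07:34:00" --until "2025-05-22 07:37:00"

# inside VM 1007: everything in the same window
journalctl --since "2025-05-22 07:34:00" --until "2025-05-22 07:37:00"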
 
That's what I found in the VM 1007 logs. It fully corresponds to the PVE logs and is another indication that the error started on PVE (referring to the sequence of events and their timestamps):

Code:
May 22 07:35:20 proxmox kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables nvme_fabrics nvme_keyring bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common xe drm_gpuvm drm_exec x86_pkg_temp_thermal gpu_sched intel_powerclamp drm_suballoc_helper drm_ttm_helper kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_hdcp mei_pxp rapl i915 cmdlinepart spi_nor mei_me intel_cstate mtd wmi_bmof ov13858 drm_buddy pcspkr v4l2_fwnode mei ttm v4l2_async drm_display_helper videodev cec mc intel_pmc_core rc_core i2c_algo_bit intel_vsec igen6_edac pmt_telemetry acpi_pad pmt_class acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio
May 22 07:35:20 proxmox kernel: CPU: 3 PID: 214539 Comm: usb-storage Tainted: P      D W  O       6.8.12-10-pve #1
May 22 07:35:20 proxmox kernel:  usb_stor_control_thread+0x24e/0x2b0 [usb_storage]
May 22 07:35:20 proxmox kernel:  ? __pfx_usb_stor_control_thread+0x10/0x10 [usb_storage]
May 22 07:35:20 proxmox kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables nvme_fabrics nvme_keyring bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common xe drm_gpuvm drm_exec x86_pkg_temp_thermal gpu_sched intel_powerclamp drm_suballoc_helper drm_ttm_helper kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_hdcp mei_pxp rapl i915 cmdlinepart spi_nor mei_me intel_cstate mtd wmi_bmof ov13858 drm_buddy pcspkr v4l2_fwnode mei ttm v4l2_async drm_display_helper videodev cec mc intel_pmc_core rc_core i2c_algo_bit intel_vsec igen6_edac pmt_telemetry acpi_pad pmt_class acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio
May 22 07:35:20 proxmox kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables nvme_fabrics nvme_keyring bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common xe drm_gpuvm drm_exec x86_pkg_temp_thermal gpu_sched intel_powerclamp drm_suballoc_helper drm_ttm_helper kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_hdcp mei_pxp rapl i915 cmdlinepart spi_nor mei_me intel_cstate mtd wmi_bmof ov13858 drm_buddy pcspkr v4l2_fwnode mei ttm v4l2_async drm_display_helper videodev cec mc intel_pmc_core rc_core i2c_algo_bit intel_vsec igen6_edac pmt_telemetry acpi_pad pmt_class acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio
May 22 07:35:20 proxmox kernel: note: usb-storage[214539] exited with irqs disabled
 
It just happened again. Same situation, same behaviour, same log entries.
However, the USB drive I suspected earlier as being involved was no longer connected to the host, so we can rule it out.

Code:
May 22 20:30:11 proxmox kernel: perf: interrupt took too long (4125 > 4117), lowering kernel.perf_event_max_sample_rate to 48000
May 22 20:30:28 proxmox kernel: BUG: unable to handle page fault for address: 0000000060de7b98
May 22 20:30:28 proxmox kernel: #PF: supervisor read access in kernel mode
May 22 20:30:28 proxmox kernel: #PF: error_code(0x0000) - not-present page
May 22 20:30:28 proxmox kernel: PGD 0 P4D 0
May 22 20:30:28 proxmox kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 22 20:30:28 proxmox kernel: CPU: 7 PID: 101924 Comm: iou-wrk-2030 Tainted: P O 6.8.12-10-pve #1
May 22 20:30:28 proxmox kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 09/25/2024
May 22 20:30:28 proxmox kernel: RIP: 0010:blkdev_direct_IO.part.0+0x124/0x5c0
May 22 20:30:28 proxmox kernel: Code: 40 38 a0 6d b5 9b 41 0f b7 46 24 66 41 89 45 16 80 3b 02 0f 84 84 04 00 00 e8 68 3f 00 00 85 c0 0f 85 88 03 00 00 41 8b 45 28 <80> 7d 98 00 49 89 45 c8 0f 84 c6 00 00 00 65 48 8b 14 25 c0 43 03
May 22 20:30:28 proxmox kernel: RSP: 0018:ffffae9260de7b70 EFLAGS: 00010246
May 22 20:30:28 proxmox kernel: RAX: 0000000000007000 RBX: ffff89733d3d1c00 RCX: 0000000000000000
May 22 20:30:28 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 22 20:30:28 proxmox kernel: RBP: 0000000060de7c00 R08: 0000000000000000 R09: 0000000000000000
May 22 20:30:28 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000221ca90
May 22 20:30:28 proxmox kernel: R13: ffff8971d0782900 R14: ffff89761b8d4000 R15: 0000000000800000
May 22 20:30:28 proxmox kernel: FS: 000077c669fc56c0(0000) GS:ffff89791fb80000(0000) knlGS:0000000000000000
May 22 20:30:28 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 20:30:28 proxmox kernel: CR2: 0000000060de7b98 CR3: 0000000323fd6000 CR4: 0000000000f52ef0
May 22 20:30:28 proxmox kernel: PKRU: 55555554
May 22 20:30:28 proxmox kernel: Call Trace:
May 22 20:30:28 proxmox kernel: <TASK>
May 22 20:30:28 proxmox kernel: ? show_regs+0x6d/0x80
May 22 20:30:28 proxmox kernel: ? __die+0x24/0x80
May 22 20:30:28 proxmox kernel: ? page_fault_oops+0x176/0x500
May 22 20:30:28 proxmox kernel: ? do_user_addr_fault+0x2f5/0x660
May 22 20:30:28 proxmox kernel: ? exc_page_fault+0x83/0x1b0
May 22 20:30:28 proxmox kernel: ? asm_exc_page_fault+0x27/0x30
May 22 20:30:28 proxmox kernel: ? blkdev_direct_IO.part.0+0x124/0x5c0
May 22 20:30:28 proxmox kernel: ? current_time+0x3c/0xf0
May 22 20:30:28 proxmox kernel: ? atime_needs_update+0xa8/0x130
May 22 20:30:28 proxmox kernel: ? blkdev_read_iter+0xbd/0x160
May 22 20:30:28 proxmox kernel: ? __io_read+0xf6/0x590
May 22 20:30:28 proxmox kernel: ? io_read+0x17/0x50
May 22 20:30:28 proxmox kernel: ? io_issue_sqe+0x61/0x400
May 22 20:30:28 proxmox kernel: ? lock_timer_base+0x72/0xa0
May 22 20:30:28 proxmox kernel: ? io_wq_submit_work+0xe2/0x360
May 22 20:30:28 proxmox kernel: ? __timer_delete_sync+0x8c/0x100
May 22 20:30:28 proxmox kernel: ? io_worker_handle_work+0x149/0x580
May 22 20:30:28 proxmox kernel: ? io_wq_worker+0x136/0x3f0
May 22 20:30:28 proxmox kernel: ? raw_spin_rq_unlock+0x10/0x40
May 22 20:30:28 proxmox kernel: ? finish_task_switch.isra.0+0x8c/0x310
May 22 20:30:28 proxmox kernel: ? __pfx_io_wq_worker+0x10/0x10
May 22 20:30:28 proxmox kernel: ? ret_from_fork+0x44/0x70
May 22 20:30:28 proxmox kernel: ? __pfx_io_wq_worker+0x10/0x10
May 22 20:30:28 proxmox kernel: ? ret_from_fork_asm+0x1b/0x30
May 22 20:30:28 proxmox kernel: </TASK>
May 22 20:30:28 proxmox kernel: Modules linked in: veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables nvme_fabrics nvme_keyring bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common xe x86_pkg_temp_thermal intel_powerclamp kvm_intel drm_gpuvm drm_exec gpu_sched drm_suballoc_helper drm_ttm_helper kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_pxp mei_hdcp i915 rapl cmdlinepart drm_buddy spi_nor ttm wmi_bmof mei_me intel_cstate pcspkr mtd drm_display_helper mei cec rc_core ov13858 i2c_algo_bit v4l2_fwnode igen6_edac v4l2_async videodev intel_pmc_core intel_vsec mc pmt_telemetry pmt_class acpi_pad acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvme ixgbe
May 22 20:30:28 proxmox kernel: xfrm_algo nvme_core xhci_pci dca xhci_pci_renesas crc32_pclmul mdio nvme_auth xhci_hcd ahci i2c_i801 spi_intel_pci igc spi_intel i2c_smbus libahci video wmi
May 22 20:30:28 proxmox kernel: CR2: 0000000060de7b98
May 22 20:30:28 proxmox kernel: ---[ end trace 0000000000000000 ]---
 
And again.
This time, no log entries at all. PVE just froze all of a sudden, with no indication at all in the logs. The guests were idling, the host was idling, no backups were running, ...

In this state, PVE isn't usable. I had been using Hyper-V for more than 12 years without a single crash or freeze. I have been using PVE for a couple of weeks and am running into 2+ crashes a day...
 
Hi,

have you tried the 6.14 kernel yet? https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.164497/
Often this can solve problems, especially in combination with newer hardware.
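
Roughly, installing and booting the opt-in kernel looks like this (see the linked thread for details; the pinned version below is only a placeholder):

Code:
apt update
apt install proxmox-kernel-6.14
reboot
# optional: pin it so it stays the default
# proxmox-boot-tool kernel list
# proxmox-boot-tool kernel pin <6.14.x-y-pve>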

I'd also recommend updating to the latest UEFI/firmware available, as these are - especially on consumer hardware - unfortunately often quite shoddy.

What hardware are you running on? You don't provide any information regarding that.
Can you also please provide the output of pveversion -v?

I can't rule out an HDD hardware issue (as I haven't done the related analysis yet). However, even if that led to an unresponsive storage device, shouldn't PVE be able to handle it gracefully?
Generally, this is not something that can be controlled. If a storage device hangs/takes long, the kernel will wait for it to respond - as it cannot really know whether it's just slow or faulty. Especially USB storage devices are somewhat problematic, as USB implementations vary in quality and of course the electrical interface also isn't as reliable as e.g. SATA/SAS.
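
If you want to see how long the kernel is currently prepared to wait for such a device, the per-device command timeout is visible in sysfs, e.g. (sdX is a placeholder):

Code:
# SCSI command timeout in seconds for a disk behind the USB bridge
cat /sys/block/sdX/device/timeout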
 
pveversion -v

proxmox-ve: 8.4.0 (running kernel: 6.8.12-10-pve)
pve-manager: 8.4.1 (running version: 8.4.1/2a5fa54a8503f96d)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8.12-10-pve-signed: 6.8.12-10
proxmox-kernel-6.8: 6.8.12-10
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
ceph-fuse: 18.2.6-pve1
corosync: 3.1.9-pve1
criu: 3.17.1-2+deb12u1
frr-pythontools: 10.2.2-1+pve1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.30-pve2
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.2
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.1.0
libpve-cluster-perl: 8.1.0
libpve-common-perl: 8.3.1
libpve-guest-common-perl: 5.2.2
libpve-http-server-perl: 5.2.2
libpve-network-perl: 0.11.2
libpve-rs-perl: 0.9.4
libpve-storage-perl: 8.3.6
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.6.0-2
proxmox-backup-client: 3.4.1-1
proxmox-backup-file-restore: 3.4.1-1
proxmox-firewall: 0.7.1
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.2
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.10
pve-cluster: 8.1.0
pve-container: 5.2.6
pve-docs: 8.4.0
pve-edk2-firmware: 4.2025.02-3
pve-esxi-import-tools: 0.7.4
pve-firewall: 5.1.1
pve-firmware: 3.15-3
pve-ha-manager: 4.0.7
pve-i18n: 3.4.2
pve-qemu-kvm: 9.2.0-5
pve-xtermjs: 5.5.0-2
qemu-server: 8.3.12
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve2
 
have you tried the 6.14 kernel yet? https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.164497/
Often this can solve problems, especially in combination with newer hardware.

I haven't tried it yet; I will after I've ruled out as many hardware components as possible.

What hardware are you running on? You don't provide any information regarding that.

It's one of these embedded industrial PC boxes:
i3-N305
1x 32 GB RAM DDR5-4800 SO-DIMM (Crucial CT32G48C40S5)
2x 2 TB M.2 WD_BLACK SN850X NVMe
USB-attached storage, JMicron controller
-> more info to follow; I'm just about to remove components one after the other to start ruling them out

Generally, this is not something that can be controlled. If a storage device hangs/takes long, the kernel will wait for it to respond - as it cannot really know whether it's just slow or faulty. Especially USB storage devices are somewhat problematic, as USB implementations vary in quality and of course the electrical interface also isn't as reliable as e.g. SATA/SAS.

From what I see, it looks strongly related to USB. I'm about to detach components one by one to hopefully get a clearer picture.
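
One thing I'll check along the way is whether the JMicron bridge is driven by uas or by plain usb-storage, and I might force the fallback driver for a test - roughly like this (the product ID has to come from lsusb; xxxx is a placeholder):

Code:
lsusb -t                    # shows Driver=uas or Driver=usb-storage per port
lsusb | grep -i jmicron     # note the vendor:product ID, e.g. 152d:xxxx

# force usb-storage instead of uas for that ID (placeholder ID!), then reboot
echo "options usb-storage quirks=152d:xxxx:u" > /etc/modprobe.d/jmicron-uas.conf
update-initramfs -u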
 
Additional info:
  • The last configuration change on PVE (it had been running stable for about 2-3 weeks prior to this) was to set up the guest agents in the guests (Ubuntu and Debian; the VM involved runs Debian). However, the crashes did not start until 2 or 3 days later, so there is no direct correlation, if any at all.
  • The VM involved is running OMV, with 3 disks passed through: 2 via a USB/SATA bridge, one directly via USB, running at SuperSpeed (USB 3.0, 5 Gbit/s) - quick check sketched below.
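
As a quick sanity check of that passthrough setup (VM ID as in my case):

Code:
qm config 1007 | grep -Ei 'usb|scsi|sata|virtio'   # how the disks/USB devices are mapped into the VM
lsblk -o NAME,TRAN,MODEL,SIZE                      # on the host: which block devices sit behind USB vs. NVMe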
 
It's one of these embedded industrial PC boxes:
i3-N305
1x 32 GB RAM DDR5-4800 SO-DIMM (Crucial CT32G48C40S5)
2x 2 TB M.2 WD_BLACK SN850X NVMe
USB-attached storage, JMicron controller
Is it fanless and/or ventilated enough? Maybe it's also a heat issue, not too uncommon with these boxes.

Do you know if someone already had a look into this one, as it seems to be the same problem?
https://bugzilla.proxmox.com/show_bug.cgi?id=6288
Doesn't look (strongly) correlated to this problem, IMHO. But again, I'd try installing and booting the current 6.14 kernel, since it's easy and might just solve the problem, before investigating something that might have been fixed already.
 
Is it fanless and/or ventilated enough? Maybe it's also a heat issue, not too uncommon with these boxes.

It's both: a big aluminium casing with large fins and an integrated fan.

Heat isn't an issue from what I see. I'm monitoring the temps; neither the CPU nor the NVMe drives have ever come close to a critical condition.
I also did a load test with s-tui to see where the temps peak, and there was no problem.
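
For reference, the monitoring and the load test were nothing fancy, roughly:

Code:
apt install lm-sensors s-tui stress nvme-cli
sensors                                    # CPU package temperatures
nvme smart-log /dev/nvme0 | grep -i temp   # NVMe temperatures
s-tui                                      # interactive load test with temperature graphs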
 
But again, I'd try installing and booting the current 6.14 kernel, since it's easy and might just solve the problem, before investigating something that might have been fixed already.

I will. I just thought it might be of interest to find the root cause and maybe even be able to reproduce it, to help improve the kernel.

One more thing I can rule out is power management on the USB/SATA bridge, as there simply is none :-)

This again points in another direction: the JMicron bridge does not implement power management, but one of the drives is a SATA SSD, and SSDs do their own power management in terms of going into an idle state. However, AFAIK that is completely encapsulated functionality, meaning there is no need to control it from outside the disk. Which again makes me think that even if the disk is "sleeping" when accessed, this shouldn't lead to negative side effects; it should simply wake up.
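
If I want to verify that assumption, this is roughly what I'd look at (whether the ATA commands make it through the JMicron bridge is another question; sdX is a placeholder):

Code:
hdparm -B /dev/sdX                          # APM level, if the bridge forwards the command
hdparm -C /dev/sdX                          # current power state (active/idle/standby)
smartctl -x -d sat /dev/sdX | grep -i power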
 