I ran into a page fault that left PVE in an only partly responsive state and completely crashed the involved VM. Recovery was only possible after a hard reboot.
The crash seems to correlate with a backup job started in a VM. There is an external USB disk passed through to the VM, and the VM is performing a simple local rsync.
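For context, the setup looks roughly like this (a minimal sketch; the USB vendor:product ID, rsync options and paths below are placeholders, not my exact values):
Code:
# on the PVE host: the external USB disk is passed through to VM 1007
# (vendor:product ID is just an example)
qm set 1007 -usb0 host=0951:1666

# inside the VM: plain local rsync onto the passed-through USB disk
rsync -aHx --delete /srv/data/ /mnt/usb-backup/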
From what I think I can read from the logs, it seems PVE (kvm) tried to access the memory address of the device, but that pointer had been lost or corrupted (for whatever reason)?
I can't rule out an HDD hardware issue (as I haven't done the related analysis yet). However, even if that led to an unresponsive storage device, shouldn't PVE be able to handle it gracefully? I'll keep investigating, but it would be nice to get some feedback on whether my interpretation of the logs is correct so far.
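For the disk side, the check I still have to do would be something along these lines on the host (device names are just examples; the USB bridge may need the -d sat hint):
Code:
# SMART health and error log of the internal disk
smartctl -a /dev/sdb

# same for the external USB disk, via the SAT bridge driver
smartctl -d sat -a /dev/sdc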
[Edit] -> This also happens with the related USB device disconnected.
Code:
May 22 07:35:20 proxmox kernel: BUG: unable to handle page fault for address: 0000000000b8838b
May 22 07:35:20 proxmox kernel: #PF: supervisor write access in kernel mode
May 22 07:35:20 proxmox kernel: #PF: error_code(0x0002) - not-present page
May 22 07:35:20 proxmox kernel: PGD 0 P4D 0
May 22 07:35:20 proxmox kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
May 22 07:35:20 proxmox kernel: CPU: 4 PID: 214740 Comm: kvm Tainted: P W O 6.8.12-10-pve #1
May 22 07:35:20 proxmox kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 09/25/2024
May 22 07:35:20 proxmox kernel: RIP: 0010:blk_mq_delay_run_hw_queue+0x24/0x140
May 22 07:35:20 proxmox kernel: Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 47 18 48 89 fb a8 01 0f 85 07 01 00 00 89 f7 e8 fc d2 a6 ff <49> 89 84 48 8b 83 b8 00 00 00 83 78 34 01 0f 84 ad 00 00 00 8b 83
May 22 07:35:20 proxmox kernel: RSP: 0018:ffffae0d08a0f730 EFLAGS: 00010246
May 22 07:35:20 proxmox kernel: RAX: 0000000000000000 RBX: ffff9a0f13788000 RCX: 0000000000000000
May 22 07:35:20 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 22 07:35:20 proxmox kernel: RBP: ffffae0d08a0f748 R08: 0000000000000000 R09: 0000000000000000
May 22 07:35:20 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
May 22 07:35:20 proxmox kernel: R13: ffff9a0f16d2b0e0 R14: 0000000000000000 R15: 0000000000000001
May 22 07:35:20 proxmox kernel: FS: 00007d1a16c1f5c0(0000) GS:ffff9a165fa00000(0000) knlGS:0000000000000000
May 22 07:35:20 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 07:35:20 proxmox kernel: CR2: 0000000000b8838b CR3: 0000000480112000 CR4: 0000000000f52ef0
May 22 07:35:20 proxmox kernel: PKRU: 55555554
May 22 07:35:20 proxmox kernel: Call Trace:
May 22 07:35:20 proxmox kernel: <TASK>
May 22 07:35:20 proxmox kernel: ? show_regs+0x6d/0x80
May 22 07:35:20 proxmox kernel: ? __die+0x24/0x80
May 22 07:35:20 proxmox kernel: ? page_fault_oops+0x176/0x500
May 22 07:35:20 proxmox kernel: ? raw_spin_rq_unlock+0x10/0x40
May 22 07:35:20 proxmox kernel: ? load_balance+0x96d/0xfd0
May 22 07:35:20 proxmox kernel: ? do_user_addr_fault+0x2f5/0x660
May 22 07:35:20 proxmox kernel: ? exc_page_fault+0x83/0x1b0
May 22 07:35:20 proxmox kernel: ? asm_exc_page_fault+0x27/0x30
May 22 07:35:20 proxmox kernel: ? blk_mq_delay_run_hw_queue+0x24/0x140
May 22 07:35:20 proxmox kernel: blk_mq_run_hw_queue+0x1fa/0x350
May 22 07:35:20 proxmox kernel: blk_mq_submit_bio+0x280/0x690
May 22 07:35:20 proxmox kernel: __submit_bio+0xb3/0x1c0
May 22 07:35:20 proxmox kernel: submit_bio_noacct_nocheck+0x2b7/0x390
May 22 07:35:20 proxmox kernel: submit_bio_noacct+0x1f3/0x650
May 22 07:35:20 proxmox kernel: submit_bio+0xb2/0x110
May 22 07:35:20 proxmox kernel: blkdev_direct_IO.part.0+0x23b/0x5c0
May 22 07:35:20 proxmox kernel: ? current_time+0x3c/0xf0
May 22 07:35:20 proxmox kernel: ? atime_needs_update+0xa8/0x130
May 22 07:35:20 proxmox kernel: blkdev_read_iter+0xbd/0x160
May 22 07:35:20 proxmox kernel: ? rw_verify_area+0xc7/0x140
May 22 07:35:20 proxmox kernel: __io_read+0xf6/0x590
May 22 07:35:20 proxmox kernel: io_read+0x17/0x50
May 22 07:35:20 proxmox kernel: io_issue_sqe+0x61/0x400
May 22 07:35:20 proxmox kernel: ? io_prep_rwv+0x27/0xd0
May 22 07:35:20 proxmox kernel: io_submit_sqes+0x207/0x6d0
May 22 07:35:20 proxmox kernel: __do_sys_io_uring_enter+0x465/0xc10
May 22 07:35:20 proxmox kernel: __x64_sys_io_uring_enter+0x22/0x40
May 22 07:35:20 proxmox kernel: x64_sys_call+0x2312/0x2480
May 22 07:35:20 proxmox kernel: do_syscall_64+0x81/0x170
May 22 07:35:20 proxmox kernel: ? io_clean_op+0xdf/0x1b0
May 22 07:35:20 proxmox kernel: ? __io_submit_flush_completions+0x181/0x410
May 22 07:35:20 proxmox kernel: ? ctx_flush_and_put+0x50/0xd0
May 22 07:35:20 proxmox kernel: ? tctx_task_work+0x122/0x210
May 22 07:35:20 proxmox kernel: ? task_work_run+0x66/0xa0
May 22 07:35:20 proxmox kernel: ? get_signal+0xa6/0xab0
May 22 07:35:20 proxmox kernel: ? arch_do_signal_or_restart+0x42/0x280
May 22 07:35:20 proxmox kernel: ? irqentry_exit_to_user_mode+0x7b/0x260
May 22 07:35:20 proxmox kernel: ? irqentry_exit+0x43/0x50
May 22 07:35:20 proxmox kernel: ? common_interrupt+0x54/0xb0
May 22 07:35:20 proxmox kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
May 22 07:35:20 proxmox kernel: RIP: 0033:0x7d1a1a973b95
May 22 07:35:20 proxmox kernel: Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
May 22 07:35:20 proxmox kernel: RSP: 002b:00007ffe4fa64d68 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
May 22 07:35:20 proxmox kernel: RAX: ffffffffffffffda RBX: 0000575848cf9370 RCX: 00007d1a1a973b95
May 22 07:35:20 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000011b
May 22 07:35:20 proxmox kernel: RBP: 0000575848cf9378 R08: 0000000000000000 R09: 0000000000000008
May 22 07:35:20 proxmox kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000575848cf9460
May 22 07:35:20 proxmox kernel: R13: 0000000000000000 R14: 000057582465b0c8 R15: 00005758496d68c0
May 22 07:35:20 proxmox kernel: </TASK>
May 22 07:35:20 proxmox kernel: Modules linked in: tcp_diag inet_diag veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables nvme_fabrics nvme_keyring bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink intel_rapl_msr intel_rapl_common xe drm_gpuvm drm_exec x86_pkg_temp_thermal gpu_sched intel_powerclamp drm_suballoc_helper drm_ttm_helper kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd mei_hdcp mei_pxp rapl i915 cmdlinepart spi_nor mei_me intel_cstate mtd wmi_bmof ov13858 drm_buddy pcspkr v4l2_fwnode mei ttm v4l2_async drm_display_helper videodev cec mc intel_pmc_core rc_core i2c_algo_bit intel_vsec igen6_edac pmt_telemetry acpi_pad pmt_class acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq uas usb_storage dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio
May 22 07:35:20 proxmox kernel: libcrc32c nvme nvme_core xhci_pci xhci_pci_renesas crc32_pclmul ixgbe nvme_auth xhci_hcd xfrm_algo igc i2c_i801 spi_intel_pci ahci dca spi_intel i2c_smbus mdio libahci video wmi
May 22 07:35:20 proxmox kernel: CR2: 0000000000b8838b
May 22 07:35:20 proxmox kernel: ---[ end trace 0000000000000000 ]---
After this initial event, the following error message repeats constantly:
Code:
May 22 07:35:20 proxmox kernel: RIP: 0010:blk_mq_delay_run_hw_queue+0x24/0x140
May 22 07:35:20 proxmox kernel: Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 47 18 48 89 fb a8 01 0f 85 07 01 00 00 89 f7 e8 fc d2 a6 ff <49> 89 c4 48 8b 83 b8 00 00 00 83 78 34 01 0f 84 ad 00 00 00 8b 83
May 22 07:35:20 proxmox kernel: RSP: 0018:ffffae0d08a0f730 EFLAGS: 00010246
May 22 07:35:20 proxmox kernel: RAX: 0000000000000000 RBX: ffff9a0f13788000 RCX: 0000000000000000
May 22 07:35:20 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
May 22 07:35:20 proxmox kernel: RBP: ffffae0d08a0f748 R08: 0000000000000000 R09: 0000000000000000
May 22 07:35:20 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
May 22 07:35:20 proxmox kernel: R13: ffff9a0f16d2b0e0 R14: 0000000000000000 R15: 0000000000000001
May 22 07:35:20 proxmox kernel: FS: 00007d1a16c1f5c0(0000) GS:ffff9a165fb00000(0000) knlGS:0000000000000000
May 22 07:35:20 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 07:35:20 proxmox kernel: CR2: 0000000000b8838b CR3: 0000000480112000 CR4: 0000000000f52ef0
May 22 07:35:20 proxmox kernel: PKRU: 55555554
May 22 07:35:20 proxmox kernel: note: iou-wrk-214740[824986] exited with irqs disabled
May 22 07:35:21 proxmox pvedaemon[807254]: VM 1007 qmp command failed - VM 1007 qmp command 'guest-ping' failed - got timeout
VM 1007 is the VM executing the backup job. It seems the VM died after the first issue. However, I think the VM died as a result of the error condition on PVE, not the other way round.
In the logs of VM 1007 I can't find any information indicating an error prior to the VM crashing. I can see the rsync running and all drives mounted and working. The last file copy started at 22.5.2025, 07:35:15, which is 5 seconds before the kernel crashed.