[SOLVED] Getting pagefaults in every boot after power outage

rgzr

I had some power outages at home today, and since then my Proxmox machine boots but after a while hits a page fault and the LXCs stop working.

I have a UPS, but it does not seem to have helped.
I ran memtest and it passed.
I tried booting a previous kernel and the page fault still happened.
I have a ZFS pool as a mirror of two NVMe drives. I ran a scrub on the ZFS pools and it didn't find any errors.
SMART looks good too.

I'm thinking maybe the CPU or motherboard got damaged? Or maybe something on the filesystem got corrupted and a reinstall would help?
Tomorrow I will run a CPU stress test booted from a USB drive to see if it also crashes.

I am not an expert and would appreciate any help or guidance on how to diagnose what is happening.

Here is the page fault:

Code:
Feb 15 01:03:29 proxmox pveproxy[12941]: got inotify poll request in wrong process - disabling inotify
Feb 15 01:08:31 proxmox systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Feb 15 01:08:31 proxmox systemd-tmpfiles[15967]: /usr/lib/tmpfiles.d/legacy.conf:14: Duplicate line for path "/run/lock", ignoring.
Feb 15 01:08:31 proxmox systemd-tmpfiles[15967]: /usr/lib/tmpfiles.d/nut-common-tmpfiles.conf:8: Duplicate line for path "/run/nut", ignoring.
Feb 15 01:08:31 proxmox systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Feb 15 01:08:31 proxmox systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Feb 15 01:09:02 proxmox pvedaemon[1909]: <root@pam> successful auth for user 'root@pam'
Feb 15 01:10:11 proxmox zed[16878]: eid=9 class=scrub_finish pool='rpool'
Feb 15 01:11:00 proxmox kernel: BUG: unable to handle page fault for address: 00000000b3a80000
Feb 15 01:11:00 proxmox kernel: #PF: supervisor write access in kernel mode
Feb 15 01:11:00 proxmox kernel: #PF: error_code(0x0002) - not-present page
Feb 15 01:11:00 proxmox kernel: PGD 0 P4D 0
Feb 15 01:11:00 proxmox kernel: Oops: Oops: 0002 [#1] SMP NOPTI
Feb 15 01:11:00 proxmox kernel: CPU: 15 UID: 0 PID: 335 Comm: kworker/u80:5 Tainted: P S         O        6.17.9-1-pve #1 PREEMPT(voluntary)
Feb 15 01:11:00 proxmox kernel: Tainted: [P]=PROPRIETARY_MODULE, [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
Feb 15 01:11:00 proxmox kernel: Hardware name: To Be Filled By O.E.M. Z690 Pro RS/Z690 Pro RS, BIOS 9.02 06/06/2022
Feb 15 01:11:00 proxmox kernel: Workqueue: xprtiod xs_stream_data_receive_workfn [sunrpc]
Feb 15 01:11:00 proxmox kernel: RIP: 0010:__pfx_memcpy_orig+0x1/0x10
Feb 15 01:11:00 proxmox kernel: Code: cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 f3 a4 c3 cc cc cc cc 90 90 <90> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20
Feb 15 01:11:00 proxmox kernel: RSP: 0018:ffffcdf54154f980 EFLAGS: 00010286
Feb 15 01:11:00 proxmox kernel: RAX: ffff8d00b3a80000 RBX: 0000000000006f94 RCX: 0000000000001000
Feb 15 01:11:00 proxmox kernel: RDX: 0000000000001000 RSI: ffff8cff24a1006c RDI: 00000000b3a80000
Feb 15 01:11:00 proxmox kernel: RBP: ffffcdf54154fa20 R08: ffff8cff24a1006c R09: 0000000000000000
Feb 15 01:11:00 proxmox kernel: R10: 0000000000000000 R11: ffff8cf78d6a0a00 R12: ffffcdf54154fd68
Feb 15 01:11:00 proxmox kernel: R13: ffff8d0060b41600 R14: 0000000000001000 R15: 0000000000001000
Feb 15 01:11:00 proxmox kernel: FS:  0000000000000000(0000) GS:ffff8d0274106000(0000) knlGS:0000000000000000
Feb 15 01:11:00 proxmox kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 15 01:11:00 proxmox kernel: CR2: 00000000b3a80000 CR3: 000000038803a000 CR4: 0000000000f52ef0
Feb 15 01:11:00 proxmox kernel: PKRU: 55555554
Feb 15 01:11:00 proxmox kernel: Call Trace:
Feb 15 01:11:00 proxmox kernel:  <TASK>
Feb 15 01:11:00 proxmox kernel:  ? _copy_to_iter+0x27f/0x610
Feb 15 01:11:00 proxmox kernel:  ? __ip_queue_xmit+0x1ce/0x560
Feb 15 01:11:00 proxmox kernel:  ? __check_object_size+0xb4/0x240
Feb 15 01:11:00 proxmox kernel:  ? __pfx_simple_copy_to_iter+0x10/0x10
Feb 15 01:11:00 proxmox kernel:  simple_copy_to_iter+0x3e/0x70
Feb 15 01:11:00 proxmox kernel:  __skb_datagram_iter+0x1b8/0x2f0
Feb 15 01:11:00 proxmox kernel:  ? __pfx_simple_copy_to_iter+0x10/0x10
Feb 15 01:11:00 proxmox kernel:  skb_copy_datagram_iter+0x37/0xa0
Feb 15 01:11:00 proxmox kernel:  tcp_recvmsg_locked+0x847/0xaf0
Feb 15 01:11:00 proxmox kernel:  ? __tcp_send_ack.part.0+0xdc/0x1c0
Feb 15 01:11:00 proxmox kernel:  tcp_recvmsg+0x83/0x210
Feb 15 01:11:00 proxmox kernel:  inet_recvmsg+0x51/0x130
Feb 15 01:11:00 proxmox kernel:  ? security_socket_recvmsg+0x44/0x80
Feb 15 01:11:00 proxmox kernel:  sock_recvmsg+0xc6/0xf0
Feb 15 01:11:00 proxmox kernel:  xs_sock_recvmsg.constprop.0+0x2c/0xa0 [sunrpc]
Feb 15 01:11:00 proxmox kernel:  xs_read_stream_request.constprop.0+0x255/0x4f0 [sunrpc]
Feb 15 01:11:00 proxmox kernel:  xs_read_stream.constprop.0+0x2b3/0x440 [sunrpc]
Feb 15 01:11:00 proxmox kernel:  xs_stream_data_receive_workfn+0x71/0x150 [sunrpc]
Feb 15 01:11:00 proxmox kernel:  process_one_work+0x188/0x370
Feb 15 01:11:00 proxmox kernel:  worker_thread+0x33a/0x480
Feb 15 01:11:00 proxmox kernel:  ? __pfx_worker_thread+0x10/0x10
Feb 15 01:11:00 proxmox kernel:  kthread+0x108/0x220
Feb 15 01:11:00 proxmox kernel:  ? __pfx_kthread+0x10/0x10
Feb 15 01:11:00 proxmox kernel:  ret_from_fork+0x205/0x240
Feb 15 01:11:00 proxmox kernel:  ? __pfx_kthread+0x10/0x10
Feb 15 01:11:00 proxmox kernel:  ret_from_fork_asm+0x1a/0x30
Feb 15 01:11:00 proxmox kernel:  </TASK>
Feb 15 01:11:00 proxmox kernel: Modules linked in: tcp_diag inet_diag act_police cls_basic sch_ingress sch_htb cfg80211 veth rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls sunrpc binfmt_misc nfnetlink_log xe gpu_sched drm_gpuvm drm_gpusvm_helper drm_ttm_helper drm_exec drm_suballoc_helper snd_hda_codec_intelhdmi snd_hda_codec_alc662 snd_hda_codec_realtek_lib snd_hda_codec_generic snd_hda_intel snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel snd_sof_intel_hda_sdw_bpt intel_rapl_msr intel_rapl_common snd_sof_intel_hda_common intel_uncore_frequency snd_soc_hdac_hda intel_uncore_frequency_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_hda_codec_hdmi soundwire_cadence snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_acpi_intel_match snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_acpi
Feb 15 01:11:00 proxmox kernel:  soundwire_bus snd_soc_sdca crc8 snd_soc_avs snd_soc_hda_codec snd_hda_ext_core x86_pkg_temp_thermal intel_powerclamp snd_hda_codec snd_hda_core snd_intel_dspcfg snd_intel_sdw_acpi kvm_intel snd_hwdep i915 snd_soc_core kvm snd_compress ac97_bus snd_pcm_dmaengine drm_buddy ttm snd_pcm irqbypass polyval_clmulni snd_timer drm_display_helper cmdlinepart ghash_clmulni_intel aesni_intel snd mei_hdcp mei_pxp spi_nor cec rapl mtd ee1004 soundcore intel_cstate wmi_bmof pcspkr mei_me rc_core mei i2c_algo_bit intel_pmc_core pmt_telemetry pmt_discovery pmt_class input_leds intel_pmc_ssram_telemetry intel_vsec acpi_pad acpi_tad joydev mac_hid sch_fq_codel vhost_net vhost vhost_iotlb tap nct6775 nct6775_core hwmon_vid coretemp efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq hid_logitech_hidpp hid_logitech_dj hid_generic usbkbd usbmouse usbhid uas hid usb_storage nvme xhci_pci r8169 intel_lpss_pci nvme_core ahci i2c_i801 spi_intel_pci xhci_hcd intel_lpss i2c_mux
Feb 15 01:11:00 proxmox kernel:  realtek nvme_keyring libahci spi_intel idma64 i2c_smbus nvme_auth video wmi
Feb 15 01:11:00 proxmox kernel: CR2: 00000000b3a80000
Feb 15 01:11:00 proxmox kernel: ---[ end trace 0000000000000000 ]---
Feb 15 01:11:00 proxmox kernel: RIP: 0010:__pfx_memcpy_orig+0x1/0x10
Feb 15 01:11:00 proxmox kernel: Code: cc cc cc cc cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 f3 a4 c3 cc cc cc cc 90 90 <90> 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20
Feb 15 01:11:00 proxmox kernel: RSP: 0018:ffffcdf54154f980 EFLAGS: 00010286
Feb 15 01:11:00 proxmox kernel: RAX: ffff8d00b3a80000 RBX: 0000000000006f94 RCX: 0000000000001000
Feb 15 01:11:00 proxmox kernel: RDX: 0000000000001000 RSI: ffff8cff24a1006c RDI: 00000000b3a80000
Feb 15 01:11:00 proxmox kernel: RBP: ffffcdf54154fa20 R08: ffff8cff24a1006c R09: 0000000000000000
Feb 15 01:11:00 proxmox kernel: R10: 0000000000000000 R11: ffff8cf78d6a0a00 R12: ffffcdf54154fd68
Feb 15 01:11:00 proxmox kernel: R13: ffff8d0060b41600 R14: 0000000000001000 R15: 0000000000001000
Feb 15 01:11:00 proxmox kernel: FS:  0000000000000000(0000) GS:ffff8d0274106000(0000) knlGS:0000000000000000
Feb 15 01:11:00 proxmox kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 15 01:11:00 proxmox kernel: CR2: 00000000b3a80000 CR3: 000000038803a000 CR4: 0000000000f52ef0
Feb 15 01:11:00 proxmox kernel: PKRU: 55555554
Feb 15 01:11:00 proxmox kernel: note: kworker/u80:5[335] exited with irqs disabled
 
tcp_recvmsg

If the kernel is touching an unmapped address in this path, the cause might be the network stack or a kernel driver, but I'm not familiar enough with the details to say.

If disabling (or removing) the NIC fixes it, you should suspect either a NIC failure or data corruption.

*Since it started after a power loss, it's either hardware or data. The fault occurs in the network-related stack, but that's all I can tell.
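
For anyone who wants to check the registers in the trace above, here is a minimal Python sketch. The register lines are copied from the oops; the helper name `parse_registers` and the variable names are made up for illustration. It assumes the usual x86-64 convention that RDI carries the first argument (the destination for a memcpy-style copy) and that CR2 holds the faulting address. It only shows that the faulting address equals RDI and that RDI matches RAX except for the upper 32 bits. By itself it does not tell whether hardware, on-disk data, or a software bug is responsible.

```python
import re

# Register lines copied from the oops above (journal prefix stripped).
OOPS = """\
RAX: ffff8d00b3a80000 RBX: 0000000000006f94 RCX: 0000000000001000
RDX: 0000000000001000 RSI: ffff8cff24a1006c RDI: 00000000b3a80000
CR2: 00000000b3a80000 CR3: 000000038803a000 CR4: 0000000000f52ef0
"""

def parse_registers(text):
    """Return {name: int} for every 'NAME: <16 hex digits>' pair in text."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}

regs = parse_registers(OOPS)
LOW32 = 0xFFFFFFFF

# CR2 is the address that faulted; compare it with the copy destination.
fault_is_dest = regs["CR2"] == regs["RDI"]

# RAX and RDI agree in the low 32 bits ...
low_bits_match = (regs["RAX"] & LOW32) == (regs["RDI"] & LOW32)

# ... but only RAX still has non-zero upper 32 bits.
upper_bits_lost = (regs["RDI"] >> 32) == 0 and (regs["RAX"] >> 32) != 0

print(f"CR2 == RDI:                 {fault_is_dest}")
print(f"low 32 bits of RAX == RDI:  {low_bits_match}")
print(f"upper 32 bits only in RAX:  {upper_bits_lost}")
```

To check a different crash, paste the RAX/RDI/CR2 lines from that oops into `OOPS`.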
 
I disabled the network device in the BIOS and the page fault did not happen for 5 hours. But I think that makes sense, because no network code in the kernel was being exercised. So maybe it's not conclusive that the NIC is the culprit.

Since I had Unraid virtualized in a Proxmox VM, I booted directly from the Unraid USB and the system seems stable. It hasn't failed in a few hours, even with network traffic.

I suspect the Proxmox install may have been corrupted by the unclean shutdowns or something like that? I have backups of the PVE host on PBS made with proxmox-backup-client, but I am not sure how to restore from them.

As I understand it: boot from the Proxmox ISO USB into recovery mode, mount the ZFS pool (the host root dataset under /mnt, for example), and then run the proxmox-backup-client restore command into /mnt??

More detailed guidance would be really appreciated, since I haven't been able to find any by searching the forum, Google, or guides.
 
So maybe it's not conclusive that the NIC is the culprit.
That's right. It isn't conclusive; I suggested it only to narrow things down to either the data or the NIC.

I keep my VMs on a separate SSD so I can restore the environment quickly. For the boot disk, I reinstall using the same procedure as the initial setup and then restore the boot configuration from backup.
Because that is my procedure, I'm not very familiar with recovering an existing installation.

If the driver is included in the kernel, why not try rolling back the kernel to see if that fixes it?

Reinstalling the kernel is also an option to consider, if that resolves the issue.

*If there's any suspicion of data corruption, I would choose to rebuild the boot configuration from scratch rather than attempt repairs.
 
I replaced the motherboard with another one and the same page fault happened, so that narrows down the options a bit.

I had already tried booting an earlier kernel, and it happened then too.

I will try a reinstall and see if that works.
 
I reinstalled Proxmox and the page fault is gone. I managed to recover the previous configuration by copying files from the old backup, and the CTs and VMs from ZFS snapshots copied to another drive.

Thank you very much!
 