Ideas about "general protection fault"

RudyBzh · Jun 12, 2024

Hi,

My Proxmox VE host has crashed 3 or 4 times now, after anywhere from one to several days of uptime, and it is hard to bring back up each time (several reboots needed, ...).
I managed to capture the following dmesg log showing the "general protection fault" crash:

Code:

[  146.688704] NFSD: all clients done reclaiming, ending NFSv4 grace period (net f0000545)
[53612.986411] hrtimer: interrupt took 4848 ns
[91048.989565] usb 1-3: USB disconnect, device number 3
[96621.312241] perf: interrupt took too long (2517 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[196838.201854] CIFS: VFS: \\192.168.1.2 has not responded in 15 seconds. Reconnecting...
[196860.472386] CIFS: VFS: \\192.168.1.2 has not responded in 15 seconds. Reconnecting...
[204008.243315] CIFS: VFS: \\192.168.1.2 No task to wake, unknown frame received! NumMids 3
[204008.243331] 00000000: 424d53fe 00000040 00000000 00000012  .SMB@...........
[204008.243332] 00000010: 00000001 00000000 ffffffff ffffffff  ................
[204008.243333] 00000020: 00000000 00000000 00000000 00000000  ................
[204008.243333] 00000030: 00000000 00000000 00000000 00000000  ................
[204067.022925] general protection fault, probably for non-canonical address 0x99abbb4e1e2aa20: 0000 [#1] PREEMPT SMP NOPTI
[204067.022988] CPU: 8 PID: 195 Comm: ksmd Tainted: P           O       6.8.4-3-pve #1
[204067.023033] Hardware name: ASUS System Product Name/ProArt Z790-CREATOR WIFI, BIOS 2302 05/28/2024
[204067.023085] RIP: 0010:get_ksm_page+0x32/0x2b0
[204067.023112] Code: e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 fc 53 49 83 cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 46 30 49 39 c5 0f 84 29 01 00 00 <4d> 8b 6e 30 4c 89 eb 48 c1 e3 06 48 03 1d 44 9c 66 01 48 8b 43 18
[204067.023222] RSP: 0018:ffffac4fc0853db0 EFLAGS: 00010282
[204067.023254] RAX: 93806840772f0eae RBX: ffff98348e0f34c0 RCX: 0000000000000000
[204067.023297] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 099abbb4e1e2a9f0
[204067.023340] RBP: ffffac4fc0853de0 R08: 0000000000000000 R09: 0000000000000000
[204067.023382] R10: 0000000000000000 R11: 0000000000000000 R12: 099abbb4e1e2a9f3
[204067.023425] R13: 099abbb4e1e2a9f0 R14: 099abbb4e1e2a9f0 R15: ffffac4fc0853e88
[204067.023468] FS:  0000000000000000(0000) GS:ffff9839bee00000(0000) knlGS:0000000000000000
[204067.023516] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[204067.023551] CR2: 00007f859b6140e4 CR3: 0000001c95636000 CR4: 0000000000f52ef0
[204067.023594] PKRU: 55555554
[204067.023611] Call Trace:
[204067.023627]  <TASK>
[204067.023641]  ? show_regs+0x6d/0x80
[204067.023661]  ? die_addr+0x37/0xa0
[204067.023682]  ? exc_general_protection+0x1db/0x480
[204067.023711]  ? asm_exc_general_protection+0x27/0x30
[204067.023741]  ? get_ksm_page+0x32/0x2b0
[204067.023763]  remove_rmap_item_from_tree+0x74/0x1d0
[204067.023792]  ksm_scan_thread+0x994/0x2300
[204067.023817]  ? __pfx_ksm_scan_thread+0x10/0x10
[204067.023844]  kthread+0xef/0x120
[204067.023863]  ? __pfx_kthread+0x10/0x10
[204067.023885]  ret_from_fork+0x44/0x70
[204067.023907]  ? __pfx_kthread+0x10/0x10
[204067.023930]  ret_from_fork_asm+0x1b/0x30
[204067.023954]  </TASK>
[204067.023967] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport nft_compat cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core cifs_md4 netfs nfsd auth_rpcgss nfs_acl lockd grace veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink snd_hda_codec_hdmi xe drm_gpuvm drm_exec gpu_sched snd_hda_codec_realtek drm_suballoc_helper drm_ttm_helper snd_hda_codec_generic snd_sof_pci_intel_tgl snd_sof_intel_hda_common intel_rapl_msr soundwire_intel intel_rapl_common iwlmvm snd_sof_intel_hda_mlink soundwire_cadence intel_uncore_frequency snd_sof_intel_hda intel_uncore_frequency_common intel_tcc_cooling snd_sof_pci snd_sof_xtensa_dsp x86_pkg_temp_thermal intel_powerclamp snd_sof coretemp mac80211 snd_sof_utils snd_soc_hdac_hda kvm_intel snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus i915 libarc4 kvm snd_soc_core snd_compress ac97_bus
[204067.023994]  snd_pcm_dmaengine irqbypass snd_hda_intel crct10dif_pclmul snd_intel_dspcfg polyval_clmulni btusb polyval_generic snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_hda_codec sha256_ssse3 btintel sha1_ssse3 btbcm drm_buddy btmtk aesni_intel snd_hda_core ttm snd_hwdep crypto_simd bluetooth drm_display_helper cryptd iwlwifi snd_pcm cmdlinepart cec snd_timer ecdh_generic ucsi_acpi input_leds joydev ecc cdc_acm spi_nor rc_core typec_ucsi intel_pmc_core mei_hdcp tps6598x snd mei_pxp rapl cfg80211 intel_cstate pcspkr eeepc_wmi asus_nb_wmi wmi_bmof soundcore pmt_telemetry mtd i2c_algo_bit intel_vsec typec serial_multi_instantiate pmt_class acpi_tad acpi_pad mei_me mei mac_hid sch_fq tcp_htcp vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c hid_logitech ff_memless hid_generic usbmouse usbkbd usbhid hid mfd_aaeon asus_wmi nvme xhci_pci ledtrig_audio sparse_keymap xhci_pci_renesas platform_profile crc32_pclmul nvme_core
[204067.024547]  atlantic ahci i2c_i801 spi_intel_pci thunderbolt xhci_hcd intel_lpss_pci igc i2c_smbus spi_intel libahci nvme_auth intel_lpss macsec idma64 vmd video wmi pinctrl_alderlake
[204067.025165] ---[ end trace 0000000000000000 ]---
[204069.537891] RIP: 0010:get_ksm_page+0x32/0x2b0
[204069.537905] Code: e5 41 57 41 56 49 89 fe 41 55 41 54 49 89 fc 53 49 83 cc 03 48 83 ec 08 89 75 d4 eb 0d 49 8b 46 30 49 39 c5 0f 84 29 01 00 00 <4d> 8b 6e 30 4c 89 eb 48 c1 e3 06 48 03 1d 44 9c 66 01 48 8b 43 18
[204069.537919] RSP: 0018:ffffac4fc0853db0 EFLAGS: 00010282
[204069.537926] RAX: 93806840772f0eae RBX: ffff98348e0f34c0 RCX: 0000000000000000
[204069.537934] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 099abbb4e1e2a9f0
[204069.537943] RBP: ffffac4fc0853de0 R08: 0000000000000000 R09: 0000000000000000
[204069.537951] R10: 0000000000000000 R11: 0000000000000000 R12: 099abbb4e1e2a9f3
[204069.537959] R13: 099abbb4e1e2a9f0 R14: 099abbb4e1e2a9f0 R15: ffffac4fc0853e88
[204069.538649] FS:  0000000000000000(0000) GS:ffff9839bee00000(0000) knlGS:0000000000000000
rudybzh@pve:~$

Any idea what it could be, please? To be honest, I really don't know where to start.

Thanks a lot.

Regards.
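
For readers following the trace, here is a minimal sketch of what "non-canonical address" means for the values in the log above. The register values are copied from the oops; the decoding of the faulting instruction bytes (`<4d> 8b 6e 30` as `mov r13, [r14 + 0x30]`) is my own reading of the `Code:` line, and the helper name `is_canonical` is made up for illustration.

```python
# Values copied from the oops above. The instruction decoding
# (4d 8b 6e 30 -> mov r13, [r14 + 0x30]) is an assumption from reading
# the Code: line by hand, not something the log states.
FAULT_ADDR = 0x099ABBB4E1E2AA20  # address reported on the GPF line
R14 = 0x099ABBB4E1E2A9F0         # R14 (RDI and R13 hold the same value)
DISPLACEMENT = 0x30              # offset used by the faulting load


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    On x86-64, a virtual address must be sign-extended from its top
    implemented bit (48-bit by default, 57-bit with 5-level paging).
    """
    top = addr >> (va_bits - 1)               # bits 63..va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1  # same width, all set
    return top == 0 or top == all_ones


if __name__ == "__main__":
    # The faulting load address is R14 plus the displacement.
    print(hex(R14 + DISPLACEMENT), R14 + DISPLACEMENT == FAULT_ADDR)
    # A valid kernel pointer from the same dump (RBX) for comparison.
    print("RBX canonical:", is_canonical(0xFFFF98348E0F34C0))
    print("fault address canonical:", is_canonical(FAULT_ADDR))
```

Under these assumptions, the pointer that `get_ksm_page` was handed (in RDI/R14) does not look like a kernel address at all, unlike RBX in the same dump. That only says the pointer value was garbage at the time of the crash; it does not by itself tell whether the cause is hardware or software.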

fiona · Jun 12, 2024

Hi,
what errors do you get during these unsuccessful reboots? Do you have the same issue when you boot into a 6.5 kernel? I'd also check the RAM with memtest86+ (can be selected in the advanced options during boot or via the Proxmox VE installer ISO).

RudyBzh · Jun 12, 2024

fiona said:
Hi,
what errors do you get during these unsuccessful reboots? Do you have the same issue when you boot into a 6.5 kernel? I'd also check the RAM with memtest86+ (can be selected in the advanced options during boot or via the Proxmox VE installer ISO).

Hi, thanks for the reply.

Sorry, I don't know exactly which issues/errors prevent the reboot; I have no display on this server, so when I lose SSH access (really not often, except these last few days), I just reset the server (without cutting power) and it comes back.
The last few times that did not work, so I plugged in a display to understand why: it was stuck at the boot loader, and when I went into the BIOS, my M.2 (boot drive) was not there. After a cold reboot (power cut), it came back totally normal (perhaps because of the several resets made before?! I don't know...).
Because I was in the BIOS (and the M.2 had "strangely disappeared"), I took the opportunity to upgrade my motherboard BIOS (from 1501 to 2302). I don't know if this info is relevant.
In any case, the kernel issues still reappear.

Please find attached the latest dmesg log files. I hope you'll have the patience and time to take a look, because I do not understand them myself.
Note that my /dev/sdb has seemed to have some issues for a long time (but smartctl does not show or log anything), and that the crashes have been more disruptive since 6.8. There were others on 6.5, but with no impact on my VMs as far as I can remember.

journalctl -b -1 (kernel 6.8.4-3) :

The error in my 1st message
Crash (no shutdown at the end)

journalctl -b -2 (kernel 6.8.4-3) :

BUG: Bad page state in process swapper
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
get_swap_device: Bad swap offset entry 3ffffffffffff
BUG: Bad page map in process kvm pte:00000080 pmd:2d724e067
...
Crash (no shutdown at the end)

journalctl -b -5 (kernel 6.8.4-3) :

Code:

mai 21 08:09:13 pve kernel: perf: interrupt took too long (2506 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
mai 23 02:05:43 pve kernel: ------------[ cut here ]------------
mai 23 02:05:43 pve kernel: WARNING: CPU: 10 PID: 193 at mm/gup.c:229 try_grab_page+0xc2/0x120
mai 23 02:05:43 pve kernel: Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport nft_compat cmac nls_utf8 cifs cifs_arc4 nls_ucs2_ut>
mai 23 02:05:43 pve kernel: snd_pcm_dmaengine btusb btrtl btintel snd_hda_intel i915 btbcm irqbypass btmtk crct10dif_pclmul snd_intel_ds>
mai 23 02:05:43 pve kernel: macsec nvme_core ahci intel_lpss_pci xhci_hcd intel_lpss libahci nvme_auth idma64 vmd video wmi pinctrl_alde>
mai 23 02:05:43 pve kernel: CPU: 10 PID: 193 Comm: ksmd Tainted: P O 6.8.4-3-pve #1
mai 23 02:05:43 pve kernel: Hardware name: ASUS System Product Name/ProArt Z790-CREATOR WIFI, BIOS 1501 10/06/2023
mai 23 02:05:43 pve kernel: RIP: 0010:try_grab_page+0xc2/0x120
mai 23 02:05:43 pve kernel: Code: 01 00 00 00 be 23 00 00 00 48 c1 e8 36 48 8b 3c c5 a0 db f7 ab e8 3e 5f fe ff 31 c0 5d 31 d2 31 c9 31 f>
mai 23 02:05:43 pve kernel: RSP: 0018:ffffb5ad80843d30 EFLAGS: 00010246
mai 23 02:05:43 pve kernel: RAX: fffff3b572a64600 RBX: 0000000000000002 RCX: 0000000000000000
mai 23 02:05:43 pve kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: fffff3b572a64600
mai 23 02:05:43 pve kernel: RBP: ffffb5ad80843d80 R08: 0000000000000000 R09: 0000000000000000
mai 23 02:05:43 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff99bf1c04c3c0
mai 23 02:05:43 pve kernel: R13: ffff99c8732e4468 R14: fffff3b572a64600 R15: 8000001ca9918867
mai 23 02:05:43 pve kernel: FS: 0000000000000000(0000) GS:ffff99ddbef00000(0000) knlGS:0000000000000000
mai 23 02:05:43 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
mai 23 02:05:43 pve kernel: CR2: 00007fe33705aeb8 CR3: 0000000c88036000 CR4: 0000000000f52ef0
mai 23 02:05:43 pve kernel: PKRU: 55555554
mai 23 02:05:43 pve kernel: Call Trace:
mai 23 02:05:43 pve kernel: <TASK>
mai 23 02:05:43 pve kernel: ? show_regs+0x6d/0x80
mai 23 02:05:43 pve kernel: ? __warn+0x89/0x160
mai 23 02:05:43 pve kernel: ? try_grab_page+0xc2/0x120
mai 23 02:05:43 pve kernel: ? report_bug+0x17e/0x1b0
mai 23 02:05:43 pve kernel: ? handle_bug+0x46/0x90
mai 23 02:05:43 pve kernel: ? exc_invalid_op+0x18/0x80
mai 23 02:05:43 pve kernel: ? asm_exc_invalid_op+0x1b/0x20
mai 23 02:05:43 pve kernel: ? try_grab_page+0xc2/0x120
mai 23 02:05:43 pve kernel: ? follow_page_pte+0xf9/0x5a0
mai 23 02:05:43 pve kernel: follow_page_mask+0x38c/0x5b0
mai 23 02:05:43 pve kernel: follow_page+0x5e/0xe0
mai 23 02:05:43 pve kernel: ksm_scan_thread+0x22b/0x2300
mai 23 02:05:43 pve kernel: ? __pfx_ksm_scan_thread+0x10/0x10
mai 23 02:05:43 pve kernel: kthread+0xef/0x120
mai 23 02:05:43 pve kernel: ? __pfx_kthread+0x10/0x10
mai 23 02:05:43 pve kernel: ret_from_fork+0x44/0x70
mai 23 02:05:43 pve kernel: ? __pfx_kthread+0x10/0x10
mai 23 02:05:43 pve kernel: ret_from_fork_asm+0x1b/0x30
mai 23 02:05:43 pve kernel: </TASK>
mai 23 02:05:43 pve kernel: ---[ end trace 0000000000000000 ]---
mai 25 22:46:04 pve kernel: usb 1-2: USB disconnect, device number 6

Crash (no shutdown at the end)

journalctl -b -7 (kernel 6.5.13-5) :

Up for several days
Issues with /dev/sdb (revalidation failed, I/O error, critical target error, device offline error, ...), but it stayed up
Perhaps a crash (no shutdown?!), but no error just before it.

journalctl -b -8 (kernel 6.5.13-5) :

BUG: Dentry 000000008c50960f{i=d000000001de8,n=/} still in use (2) [unmount of cifs cifs]
RIP: 0010:umount_check+0x6b/0x90
But no crash (shutdown)

journalctl -b -9 (kernel 6.5.13-5) :

Same CIFS error
But no crash (shutdown)

journalctl -b -12 (kernel 6.5.13-5) :

Same CIFS error
But no crash (shutdown)

journalctl -b -13 (kernel 6.5.11-8) :

Issues with /dev/sdb
But no crash (shutdown)

I will run a memtest when I have time, thanks for the advice.
For now, I'm just waiting to see, because it has been up since this morning...

I really don't know if it's hardware, software, kernel, ... and I don't know where to start. That's why I'm here.
All I know is that it was working correctly (up for days without issues) until my relatively recent first crash.

Thanks again for your help/advice.

Regards.
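
When comparing traces across several boots like this, one thing that can help is tallying which task was running when each oops or warning fired. The sketch below is only an illustration: the two sample lines are copied from the traces in this thread, while the function name `tally_tasks` and the simplified pattern (it assumes the task name contains no spaces) are my own.

```python
import re
from collections import Counter

# Header lines copied from the two full traces posted in this thread.
SAMPLE = [
    "[204067.022988] CPU: 8 PID: 195 Comm: ksmd Tainted: P           O       6.8.4-3-pve #1",
    "mai 23 02:05:43 pve kernel: CPU: 10 PID: 193 Comm: ksmd Tainted: P O 6.8.4-3-pve #1",
]

# Simplification: capture the first whitespace-free token after "Comm:".
COMM_RE = re.compile(r"\bComm: (\S+)")


def tally_tasks(lines):
    """Count how often each task name appears in oops/warning header lines."""
    counts = Counter()
    for line in lines:
        match = COMM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for task, count in tally_tasks(SAMPLE).most_common():
        print(f"{task}: {count}")
```

On the two traces posted here, both headers name the same task, `ksmd`. To run this over a whole boot, the output of `journalctl -k -b -1` could be fed in line by line instead of `SAMPLE`; the summaries for the other boots do not include the `Comm:` line, so nothing can be said about those from this thread alone.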

fiona · Jun 13, 2024

From what you said, a good guess is that the crash is not related to CIFS and that it is related to the 6.8.4 kernel, but it's just a guess. There also is kernel 6.8.8 on the pvetest repository currently. You might want to give that one a shot too. Otherwise, pinning kernel 6.5 might be the way to go and see if it really is related to the kernel.
