Server becomes unresponsive with error: unable to handle page fault. error_code(0x0003) - permissions violation

jhm
Apr 14, 2023
I am setting up a new Supermicro server in an existing cluster. After setting things up and starting a VM, at some point the node becomes unworkably slow or even entirely unresponsive, sometimes immediately after deploying the VM, sometimes a bit later. When the VM is started, this shows up in dmesg:

Code:
[ 8324.835350] BUG: unable to handle page fault for address: ff4cd41d9b7dbcff
[ 8324.835610] #PF: supervisor write access in kernel mode
[ 8324.835840] #PF: error_code(0x0003) - permissions violation
[ 8324.836066] PGD d31e01067 P4D d31e02067 PUD 102bd1063 PMD 11b65e063 PTE 800000011b7db161
[ 8324.836283] Oops: 0003 [#2] PREEMPT SMP NOPTI
[ 8324.836510] CPU: 19 PID: 186338 Comm: z_wr_iss_h Tainted: P      D    O       6.2.16-14-pve #1
[ 8324.836743] Hardware name: Supermicro SYS-511E-WR/X13SEW-F, BIOS 1.3a 06/02/2023
[ 8324.836977] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.837211] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 35 60 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
[ 8324.837691] RSP: 0018:ff8c9468c81e7930 EFLAGS: 00010082
[ 8324.837943] RAX: 00000000ffffffff RBX: ff8c946909ee3000 RCX: ff4cd41d9b7d9000
[ 8324.838247] RDX: 00000000ffffffff RSI: ff8c946909ee3000 RDI: ff8c9468c81e7a80
[ 8324.838592] RBP: ff8c9468c81e7930 R08: 0000000000000000 R09: 0000000000000000
[ 8324.838939] R10: ff8c9468c81e7ca0 R11: 0000000000000000 R12: ff8c946909f03000
[ 8324.839285] R13: ff8c9468c81e7a80 R14: 0000000000020000 R15: 0000000000000000
[ 8324.839633] FS:  0000000000000000(0000) GS:ff4cd42cbfec0000(0000) knlGS:0000000000000000
[ 8324.839985] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8324.840338] CR2: ff4cd41d9b7dbcff CR3: 00000006fab6a005 CR4: 0000000000773ee0
[ 8324.840689] PKRU: 55555554
[ 8324.841030] Call Trace:
[ 8324.841364]  <TASK>
[ 8324.841689]  ? show_regs+0x6d/0x80
[ 8324.842012]  ? __die+0x24/0x80
[ 8324.842326]  ? page_fault_oops+0x176/0x500
[ 8324.842631]  ? kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.842931]  ? kernelmode_fixup_or_oops+0xb2/0x140
[ 8324.843219]  ? __bad_area_nosemaphore+0x1a5/0x2c0
[ 8324.843501]  ? bad_area_nosemaphore+0x16/0x30
[ 8324.843774]  ? do_kern_addr_fault+0x7b/0xa0
[ 8324.844040]  ? exc_page_fault+0x10a/0x1b0
[ 8324.844303]  ? asm_exc_page_fault+0x27/0x30
[ 8324.844565]  ? kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.844828]  fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
[ 8324.845086]  abd_fletcher_4_iter+0x71/0xe0 [zcommon]
[ 8324.845340]  abd_iterate_func+0x104/0x1e0 [zfs]
[ 8324.845671]  ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[ 8324.845917]  abd_fletcher_4_native+0x89/0xd0 [zfs]
[ 8324.846268]  ? abd_copy_from_buf_off+0x39/0x60 [zfs]
[ 8324.846583]  ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[ 8324.846914]  zio_checksum_compute+0x35f/0x550 [zfs]
[ 8324.847242]  zio_checksum_generate+0x4d/0x80 [zfs]
[ 8324.847566]  zio_execute+0x94/0x170 [zfs]
[ 8324.847883]  taskq_thread+0x2ac/0x4d0 [spl]
[ 8324.848122]  ? __pfx_default_wake_function+0x10/0x10
[ 8324.848353]  ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 8324.848670]  ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 8324.848904]  kthread+0xe6/0x110
[ 8324.849130]  ? __pfx_kthread+0x10/0x10
[ 8324.849352]  ret_from_fork+0x29/0x50
[ 8324.849573]  </TASK>
[ 8324.849786] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter bonding softdog tls sunrpc binfmt_misc nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel pmt_crashlog crypto_simd pmt_telemetry cryptd ast intel_sdsi pmt_class drm_shmem_helper cmdlinepart rapl drm_kms_helper intel_cstate syscopyarea spi_nor idxd isst_if_mbox_pci mei_me pcspkr isst_if_mmio sysfillrect mtd isst_if_common intel_vsec sysimgblt idxd_bus mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad joydev input_leds pfr_update pfr_telemetry mac_hid
[ 8324.849831]  vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid raid6_pq libcrc32c uas usb_storage simplefb xhci_pci xhci_pci_renesas igb ahci i2c_i801 i2c_algo_bit spi_intel_pci crc32_pclmul xhci_hcd libahci spi_intel i2c_ismt i2c_smbus dca wmi pinctrl_emmitsburg
[ 8324.852545] CR2: ff4cd41d9b7dbcff
[ 8324.852787] ---[ end trace 0000000000000000 ]---
[ 8324.900133] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.900404] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 35 60 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
[ 8324.900873] RSP: 0018:ff8c9468e55c7930 EFLAGS: 00010082
[ 8324.901113] RAX: 00000000ffffffff RBX: ff8c9468e516d000 RCX: ff4cd41d9b7d9000
[ 8324.901356] RDX: 00000000ffffffff RSI: ff8c9468e516d000 RDI: ff8c9468e55c7a80
[ 8324.901600] RBP: ff8c9468e55c7930 R08: 0000000000000000 R09: 0000000000000000
[ 8324.901845] R10: 0000000000000000 R11: 0000000000000000 R12: ff8c9468e5176000
[ 8324.902101] R13: ff8c9468e55c7a80 R14: 0000000000009000 R15: 0000000000000000
[ 8324.902368] FS:  0000000000000000(0000) GS:ff4cd42cbfec0000(0000) knlGS:0000000000000000
[ 8324.902639] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8324.902910] CR2: ff4cd41d9b7dbcff CR3: 00000006fab6a005 CR4: 0000000000773ee0
[ 8324.903185] PKRU: 55555554
[ 8324.903458] note: z_wr_iss_h[186338] exited with irqs disabled
[ 8324.903749] note: z_wr_iss_h[186338] exited with preempt_count 1
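For context on where this crashes: every faulting frame (kfpu_begin, fletcher_4_avx512f_native) is in the AVX-512 Fletcher-4 checksum path of ZFS. Assuming a stock OpenZFS build, the selected checksum implementation is exposed as a module parameter, which makes it possible to check whether the problem follows the AVX-512 code path (a diagnostic sketch, not a fix):

```shell
# List the Fletcher-4 implementations ZFS knows about; the bracketed
# entry is the one currently selected (usually "fastest").
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl

# As an experiment, force the scalar implementation so the AVX-512
# code path is no longer exercised (reverts on reboot):
echo scalar > /sys/module/zcommon/parameters/zfs_fletcher_4_impl
```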

We are using ZFS, with arc_max lowered to 16 GiB. /etc/modprobe.d/zfs.conf contains:
Code:
options zfs zfs_arc_max=17179869184

The machine has 64 GiB of RAM and two mirrored 1 TB SSDs.
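As a quick sanity check, the modprobe value really is 16 GiB in bytes (just arithmetic, plus the sysfs path OpenZFS exposes for reading the live value):

```shell
# 16 GiB expressed in bytes, the unit zfs_arc_max expects:
echo $((16 * 1024 * 1024 * 1024))   # prints 17179869184

# The value currently in effect can be read back at runtime:
# cat /sys/module/zfs/parameters/zfs_arc_max
```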

Output of pveversion:
Code:
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-14-pve)

The other nodes in the cluster are still on proxmox 7.

The slowdown doesn't always happen immediately after deploying a VM, though I think it does most of the time, and I'm not sure the oops above and the unresponsiveness are even related. Any clue what the problem could be, or how I can figure out what's going on?
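One way to narrow down whether the oops and the hang coincide is to watch the kernel log while deploying the VM (standard systemd tooling, nothing Proxmox-specific):

```shell
# Follow kernel messages live while starting the VM:
journalctl -k -f

# After a forced reset, pull the kernel log of the previous boot:
journalctl -k -b -1
```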
 
