I am setting up a new Supermicro server in an existing cluster. After the setup, once I start a VM, the VM at some point becomes unworkably slow or even entirely unresponsive: sometimes immediately after deployment, sometimes a bit later. When the VM is started, this shows up in dmesg:
Code:
[ 8324.835350] BUG: unable to handle page fault for address: ff4cd41d9b7dbcff
[ 8324.835610] #PF: supervisor write access in kernel mode
[ 8324.835840] #PF: error_code(0x0003) - permissions violation
[ 8324.836066] PGD d31e01067 P4D d31e02067 PUD 102bd1063 PMD 11b65e063 PTE 800000011b7db161
[ 8324.836283] Oops: 0003 [#2] PREEMPT SMP NOPTI
[ 8324.836510] CPU: 19 PID: 186338 Comm: z_wr_iss_h Tainted: P D O 6.2.16-14-pve #1
[ 8324.836743] Hardware name: Supermicro SYS-511E-WR/X13SEW-F, BIOS 1.3a 06/02/2023
[ 8324.836977] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.837211] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 35 60 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
[ 8324.837691] RSP: 0018:ff8c9468c81e7930 EFLAGS: 00010082
[ 8324.837943] RAX: 00000000ffffffff RBX: ff8c946909ee3000 RCX: ff4cd41d9b7d9000
[ 8324.838247] RDX: 00000000ffffffff RSI: ff8c946909ee3000 RDI: ff8c9468c81e7a80
[ 8324.838592] RBP: ff8c9468c81e7930 R08: 0000000000000000 R09: 0000000000000000
[ 8324.838939] R10: ff8c9468c81e7ca0 R11: 0000000000000000 R12: ff8c946909f03000
[ 8324.839285] R13: ff8c9468c81e7a80 R14: 0000000000020000 R15: 0000000000000000
[ 8324.839633] FS: 0000000000000000(0000) GS:ff4cd42cbfec0000(0000) knlGS:0000000000000000
[ 8324.839985] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8324.840338] CR2: ff4cd41d9b7dbcff CR3: 00000006fab6a005 CR4: 0000000000773ee0
[ 8324.840689] PKRU: 55555554
[ 8324.841030] Call Trace:
[ 8324.841364] <TASK>
[ 8324.841689] ? show_regs+0x6d/0x80
[ 8324.842012] ? __die+0x24/0x80
[ 8324.842326] ? page_fault_oops+0x176/0x500
[ 8324.842631] ? kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.842931] ? kernelmode_fixup_or_oops+0xb2/0x140
[ 8324.843219] ? __bad_area_nosemaphore+0x1a5/0x2c0
[ 8324.843501] ? bad_area_nosemaphore+0x16/0x30
[ 8324.843774] ? do_kern_addr_fault+0x7b/0xa0
[ 8324.844040] ? exc_page_fault+0x10a/0x1b0
[ 8324.844303] ? asm_exc_page_fault+0x27/0x30
[ 8324.844565] ? kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.844828] fletcher_4_avx512f_native+0x1d/0xb0 [zcommon]
[ 8324.845086] abd_fletcher_4_iter+0x71/0xe0 [zcommon]
[ 8324.845340] abd_iterate_func+0x104/0x1e0 [zfs]
[ 8324.845671] ? __pfx_abd_fletcher_4_iter+0x10/0x10 [zcommon]
[ 8324.845917] abd_fletcher_4_native+0x89/0xd0 [zfs]
[ 8324.846268] ? abd_copy_from_buf_off+0x39/0x60 [zfs]
[ 8324.846583] ? __pfx_abd_fletcher_4_native+0x10/0x10 [zfs]
[ 8324.846914] zio_checksum_compute+0x35f/0x550 [zfs]
[ 8324.847242] zio_checksum_generate+0x4d/0x80 [zfs]
[ 8324.847566] zio_execute+0x94/0x170 [zfs]
[ 8324.847883] taskq_thread+0x2ac/0x4d0 [spl]
[ 8324.848122] ? __pfx_default_wake_function+0x10/0x10
[ 8324.848353] ? __pfx_zio_execute+0x10/0x10 [zfs]
[ 8324.848670] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 8324.848904] kthread+0xe6/0x110
[ 8324.849130] ? __pfx_kthread+0x10/0x10
[ 8324.849352] ret_from_fork+0x29/0x50
[ 8324.849573] </TASK>
[ 8324.849786] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter sctp ip6_udp_tunnel udp_tunnel nf_tables iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter bonding softdog tls sunrpc binfmt_misc nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel pmt_crashlog crypto_simd pmt_telemetry cryptd ast intel_sdsi pmt_class drm_shmem_helper cmdlinepart rapl drm_kms_helper intel_cstate syscopyarea spi_nor idxd isst_if_mbox_pci mei_me pcspkr isst_if_mmio sysfillrect mtd isst_if_common intel_vsec sysimgblt idxd_bus mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad joydev input_leds pfr_update pfr_telemetry mac_hid
[ 8324.849831] vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid raid6_pq libcrc32c uas usb_storage simplefb xhci_pci xhci_pci_renesas igb ahci i2c_i801 i2c_algo_bit spi_intel_pci crc32_pclmul xhci_hcd libahci spi_intel i2c_ismt i2c_smbus dca wmi pinctrl_emmitsburg
[ 8324.852545] CR2: ff4cd41d9b7dbcff
[ 8324.852787] ---[ end trace 0000000000000000 ]---
[ 8324.900133] RIP: 0010:kfpu_begin+0x31/0xa0 [zcommon]
[ 8324.900404] Code: 3f 48 89 e5 fa 0f 1f 44 00 00 48 8b 15 88 89 00 00 65 8b 05 6d 35 60 3f 48 98 48 8b 0c c2 0f 1f 44 00 00 b8 ff ff ff ff 89 c2 <0f> c7 29 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3 cc cc cc cc 0f 1f 44
[ 8324.900873] RSP: 0018:ff8c9468e55c7930 EFLAGS: 00010082
[ 8324.901113] RAX: 00000000ffffffff RBX: ff8c9468e516d000 RCX: ff4cd41d9b7d9000
[ 8324.901356] RDX: 00000000ffffffff RSI: ff8c9468e516d000 RDI: ff8c9468e55c7a80
[ 8324.901600] RBP: ff8c9468e55c7930 R08: 0000000000000000 R09: 0000000000000000
[ 8324.901845] R10: 0000000000000000 R11: 0000000000000000 R12: ff8c9468e5176000
[ 8324.902101] R13: ff8c9468e55c7a80 R14: 0000000000009000 R15: 0000000000000000
[ 8324.902368] FS: 0000000000000000(0000) GS:ff4cd42cbfec0000(0000) knlGS:0000000000000000
[ 8324.902639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8324.902910] CR2: ff4cd41d9b7dbcff CR3: 00000006fab6a005 CR4: 0000000000773ee0
[ 8324.903185] PKRU: 55555554
[ 8324.903458] note: z_wr_iss_h[186338] exited with irqs disabled
[ 8324.903749] note: z_wr_iss_h[186338] exited with preempt_count 1
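For reading a trace like the one above, a small script can pull out the fields that matter most and decode the page-fault error code. This is only a sketch: the bit meanings follow the x86 page-fault error code layout, the function names `decode_pf_error_code` and `summarize_oops` are made up for illustration, and comparing CR2 with RCX rests on the assumption that RCX holds the buffer the faulting instruction was writing to, which the log itself does not state.

```python
import re

# x86 page-fault error code bits: (meaning if clear, meaning if set).
PF_BITS = {
    0: ("not-present page", "protection violation"),
    1: ("read", "write"),
    2: ("supervisor mode", "user mode"),
    3: (None, "reserved bit set"),
    4: (None, "instruction fetch"),
}


def decode_pf_error_code(code: int) -> list:
    """Translate a page-fault error code into readable flags."""
    flags = []
    for bit, (if_clear, if_set) in PF_BITS.items():
        text = if_set if code & (1 << bit) else if_clear
        if text:
            flags.append(text)
    return flags


def summarize_oops(log: str) -> dict:
    """Extract a few key fields from a kernel oops excerpt (first match wins)."""

    def grab(pattern):
        match = re.search(pattern, log)
        return match.group(1) if match else None

    cr2 = grab(r"CR2: ([0-9a-f]{16})")
    rcx = grab(r"RCX: ([0-9a-f]{16})")
    err = grab(r"error_code\((0x[0-9a-f]+)\)")

    summary = {
        "comm": grab(r"Comm: (\S+)"),                # task that faulted
        "rip": grab(r"RIP: \S+:(\S+ \[\w+\])"),      # faulting function [module]
        "error_flags": decode_pf_error_code(int(err, 16)) if err else None,
        "cr2_minus_rcx": None,
    }
    # Assumption: RCX is the base of the buffer being written.
    if cr2 and rcx:
        summary["cr2_minus_rcx"] = hex(int(cr2, 16) - int(rcx, 16))
    return summary
```

Fed the trace above, this reports a supervisor-mode write that hit a protection violation in `kfpu_begin+0x31/0xa0 [zcommon]`, raised by the task `z_wr_iss_h`, with the faulting address 0x2cff bytes past the value in RCX.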
We are using ZFS and have lowered zfs_arc_max to 16 GiB.
/etc/modprobe.d/zfs.conf
contains:
Code:
options zfs zfs_arc_max=17179869184
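To rule out a mismatch between the configured limit and what the module is actually using, the value can be checked at runtime. A minimal sketch, assuming the ZFS module exposes the parameter at `/sys/module/zfs/parameters/zfs_arc_max` (the usual location on Linux); the helper names are hypothetical. This is worth checking because, with root on ZFS, a change in `/etc/modprobe.d/zfs.conf` typically only takes effect after the initramfs is regenerated and the machine is rebooted.

```python
from pathlib import Path

GIB = 1024 ** 3  # bytes per GiB


def gib_to_bytes(gib: int) -> int:
    """Convert GiB to the byte value expected by zfs_arc_max."""
    return gib * GIB


def read_runtime_arc_max(
    param: Path = Path("/sys/module/zfs/parameters/zfs_arc_max"),
):
    """Return the live zfs_arc_max in bytes, or None if it cannot be read."""
    try:
        return int(param.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    wanted = gib_to_bytes(16)
    live = read_runtime_arc_max()
    print(f"configured: {wanted} bytes")
    if live is None:
        print("runtime value not readable (is the zfs module loaded?)")
    elif live != wanted:
        print(f"runtime value differs: {live} bytes")
    else:
        print("runtime value matches the configuration")
```

For 16 GiB, `gib_to_bytes` yields 17179869184, which is the value in the file above.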
The machine has 64 GiB of RAM and two mirrored 1 TB SSDs.
Output of pveversion:
Code:
pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-14-pve)
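Since the nodes do not all run the same release, it can help to collect this line from every node and compare. A rough sketch that parses the one-line `pveversion` output shown above; the node names in the usage note and the function names are hypothetical, and the regular expression assumes the `pve-manager/<version>/<hash> (running kernel: <kernel>)` layout seen here.

```python
import re

PVE_LINE = re.compile(
    r"pve-manager/(\d+)\.(\d+)[.-](\d+)/\w+ \(running kernel: (\S+)\)"
)


def parse_pveversion(line: str):
    """Return ((major, minor, patch), kernel) or None if the line does not match."""
    match = PVE_LINE.match(line.strip())
    if not match:
        return None
    major, minor, patch, kernel = match.groups()
    return (int(major), int(minor), int(patch)), kernel


def major_versions(lines_by_node: dict) -> dict:
    """Map each node name to its major version (None if unparsable)."""
    result = {}
    for node, line in lines_by_node.items():
        parsed = parse_pveversion(line)
        result[node] = parsed[0][0] if parsed else None
    return result
```

Calling `parse_pveversion` on the line above gives `((8, 0, 4), "6.2.16-14-pve")`. Passing a dict such as `{"new-node": "<line from the new node>", "old-node": "<line from an old node>"}` to `major_versions` makes a mixed-version cluster easy to spot.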
The other nodes in the cluster are still on Proxmox VE 7.
The slowdown doesn't always start immediately after deploying a VM, but I think it does most of the time. I'm also not sure that the error above and the unresponsiveness are related. Any clue what the problem could be, or how I can figure out what's going on?