ZFS issue?

ddscentral

New Member
Jun 17, 2024
Eastern Europe
I recently installed Proxmox VE 8.2 on my new homelab build, using 3x 2 TB NVMe drives in ZFS RAIDZ1 for storage.
I noticed some worrying messages in the kernel log:

[39966.260656] BUG: Bad page state in process arc_evict pfn:2b3978
[39966.260671] page:00000000ef7e38ba refcount:0 mapcount:0 mapping:0000000000000000 index:0x18 pfn:0x2b3978
[39966.260682] head:000000007f52ca0b order:5 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[39966.260691] flags: 0x17ffffc0010000(head|node=0|zone=2|lastcpupid=0x1fffff)
[39966.260699] page_type: 0xffffffff()
[39966.260704] raw: 0017ffffc0000000 ffffee8dcace5801 dead000040000122 dead000080000400
[39966.260712] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[39966.260720] head: 0017ffffc0010000 dead000000000100 dead000000000122 0000000000000000
[39966.260727] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[39966.260734] page dumped because: corrupted mapping in tail page
[39966.260740] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw ip6table_filter ip6_tables iptable_filter nf_tables iptable_raw xt_CT iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter bonding tls softdog sunrpc binfmt_misc nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match x86_pkg_temp_thermal intel_powerclamp snd_soc_acpi soundwire_generic_allocation coretemp soundwire_bus kvm_intel snd_soc_core iwlmvm i915 kvm snd_compress ac97_bus mac80211 snd_pcm_dmaengine irqbypass btusb snd_hda_intel crct10dif_pclmul polyval_clmulni btrtl snd_intel_dspcfg polyval_generic snd_intel_sdw_acpi btbcm ghash_clmulni_intel
[39966.260774] btintel sha256_ssse3 drm_buddy btmtk libarc4 snd_hda_codec sha1_ssse3 ttm bluetooth snd_hda_core aesni_intel drm_display_helper iwlwifi snd_hwdep crypto_simd snd_pcm cec cryptd rc_core pmt_telemetry cmdlinepart snd_timer ecdh_generic input_leds joydev hid_multitouch drm_kms_helper ecc spi_nor snd pmt_class rapl mei_hdcp mei_pxp cfg80211 intel_cstate pcspkr serio_raw soundcore wmi_bmof mtd i2c_algo_bit intel_vsec acpi_tad acpi_pad mei_me mei mac_hid vhost_net vhost vhost_iotlb tap drm efi_pstore dmi_sysfs ip_tables x_tables autofs4 hid_generic usbkbd usbhid hid zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c simplefb xhci_pci nvme xhci_pci_renesas crc32_pclmul r8169 psmouse xhci_hcd nvme_core spi_intel_pci video i2c_i801 ahci realtek spi_intel i2c_smbus libahci nvme_common wmi
[39966.260939] CPU: 16 PID: 497 Comm: arc_evict Tainted: P O 6.5.13-5-pve #1
[39966.261567] Hardware name: ERYING Polestar H770/Polestar HX ATX D5, BIOS 5.27 04/26/2024
[39966.262208] Call Trace:
[39966.262836] <TASK>
[39966.263474] dump_stack_lvl+0x48/0x70
[39966.264115] dump_stack+0x10/0x20
[39966.264748] bad_page+0x76/0x120
[39966.265385] free_tail_page_prepare+0x150/0x190
[39966.266019] __free_pages_ok+0x4bd/0x5b0
[39966.266653] __free_pages+0x105/0x140
[39966.267290] abd_free_chunks+0x71/0x1e0 [zfs]
[39966.268091] ? mutex_lock+0x12/0x50
[39966.268715] abd_free_linear_page+0x23/0x40 [zfs]
[39966.269455] abd_free+0x1f3/0x200 [zfs]
[39966.270185] ? arc_free_data_impl.constprop.0+0x9a/0x160 [zfs]
[39966.270926] arc_hdr_free_abd+0x1cc/0x2f0 [zfs]
[39966.271666] ? arc_change_state+0x228/0x510 [zfs]
[39966.272396] arc_evict_state+0x39d/0x880 [zfs]
[39966.273127] arc_evict_cb+0x564/0x8b0 [zfs]
[39966.273853] zthr_procedure+0x138/0x150 [zfs]
[39966.274579] ? __pfx_zthr_procedure+0x10/0x10 [zfs]
[39966.275305] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[39966.275932] thread_generic_wrapper+0x5c/0x70 [spl]
[39966.276555] kthread+0xef/0x120
[39966.277164] ? __pfx_kthread+0x10/0x10
[39966.277771] ret_from_fork+0x44/0x70
[39966.278378] ? __pfx_kthread+0x10/0x10
[39966.278985] ret_from_fork_asm+0x1b/0x30
[39966.279586] </TASK>

"arc_evict" looks to be ZFS-related.
The system hasn't locked up or shown any other issues, but this doesn't look normal to me.
What could be the cause of this? Are there any possible fixes or workarounds? (Kernel tweaking, compiling, etc. is not a problem for me.)
If you need more info, I will post it.
Switching storage type is an option since this is a fresh installation, but I would prefer to keep ZFS.

Hardware specs, if that matters:
CPU: i9-13980HX (MoDT board)
Memory: 96 GB DDR5-4800
Storage: 3x 2 TB NVMe in ZFS RAIDZ1
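
For anyone who wants to see how often a message like the one above recurs, and which kernel modules appear in the call trace, here is a minimal Python sketch. It only assumes the log line layout shown in the trace above; the function name `summarize` and the short `sample` text are made up for illustration, and the regular expressions have not been checked against other kernel versions.

```python
import re

# Matches lines like:
# [39966.260656] BUG: Bad page state in process arc_evict pfn:2b3978
BAD_PAGE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: Bad page state in process "
    r"(?P<proc>\S+)\s+pfn:(?P<pfn>[0-9a-f]+)"
)

# Matches call-trace frames that end with a module tag such as "[zfs]".
FRAME = re.compile(
    r"\]\s+(?:\?\s+)?(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s+\[(?P<mod>\w+)\]"
)


def summarize(log_text):
    """Return (bad-page events, per-module frame counts) for a kernel log."""
    events = []   # (timestamp, process name, page frame number)
    modules = {}  # module name -> number of call-trace frames seen
    for line in log_text.splitlines():
        m = BAD_PAGE.search(line)
        if m:
            events.append((float(m["ts"]), m["proc"], int(m["pfn"], 16)))
            continue
        f = FRAME.search(line)
        if f:
            modules[f["mod"]] = modules.get(f["mod"], 0) + 1
    return events, modules


# A few lines copied from the trace above, used as sample input.
sample = """\
[39966.260656] BUG: Bad page state in process arc_evict pfn:2b3978
[39966.266653] __free_pages+0x105/0x140
[39966.267290] abd_free_chunks+0x71/0x1e0 [zfs]
[39966.270185] ? arc_free_data_impl.constprop.0+0x9a/0x160 [zfs]
[39966.275932] thread_generic_wrapper+0x5c/0x70 [spl]
"""

events, modules = summarize(sample)
print(events)
print(modules)
```

In practice the input would be the saved output of `dmesg` or `journalctl -k` rather than the inline sample. Seeing `zfs` frames in the trace only shows where the bad page was noticed, not necessarily what corrupted it.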
 
Indeed, Memtest shows errors. This is a RAM issue.
Good thing I found it before it messed up the ZFS pool with my VMs on it.
You never know without ECC RAM. Even a scrub that reports no checksum errors might be wrong if the data was corrupted in RAM before it was written.
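
The point about scrubs can be illustrated with a toy model. This is only a sketch of the reasoning, not how ZFS actually lays out or checksums blocks: SHA-256 stands in for the checksum, and `write_block` and `scrub` are hypothetical helper names. If a bit flips in RAM before the checksum is computed, the checksum is taken over the already-corrupted data, so a later verification passes.

```python
import hashlib


def write_block(data, corrupt_in_ram=False):
    """Simulate a write: optionally flip one bit in RAM, then checksum and store."""
    buf = bytearray(data)
    if corrupt_in_ram:
        buf[0] ^= 0x01  # single bit flip before the checksum is computed
    checksum = hashlib.sha256(buf).digest()
    return bytes(buf), checksum


def scrub(stored, checksum):
    """Simulate a scrub: recompute the checksum of the stored data and compare."""
    return hashlib.sha256(stored).digest() == checksum


original = b"important VM data"
stored, checksum = write_block(original, corrupt_in_ram=True)

print("scrub passes:", scrub(stored, checksum))  # True: checksum matches stored data
print("data intact:", stored == original)        # False: stored data is already wrong
```

The checksum protects data from the moment it is computed onward. Anything that goes wrong in memory before that point is invisible to it, which is the gap ECC RAM is meant to close.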
 
There's nothing of importance on the system right now except for one test VM. I will reinstall if necessary.
If I were building a production system, I would definitely go with server-grade hardware and ECC memory, no discussion. But I'm not willing to spend extra on Xeons for a home server setup.

Returning to the topic at hand, I will try swapping slots and testing the modules individually to determine whether it's a RAM issue or a mainboard compatibility issue. One thing I already know is that it's my hardware that's at fault, not Proxmox.
 
By "supported", do you mean it works as ECC RAM, or just that ECC RAM will work with consumer chips as regular RAM?
AFAIK, Intel only truly supports ECC on Xeons. ECC support is one major advantage AMD Ryzen chips have over Intel.
OK, let's end it here. I really do not want to start an "ECC vs. non-ECC" discussion; it has nothing to do with my original issue.

Will report back on how memory testing goes.
 
Oh, very interesting, that's a new one to me. I always thought only Xeons had ECC.
Well, you still need a mainboard with a chipset which supports ECC. The one in my MoDT board is H770 (I believe it's a part of the CPU package), so no ECC for me.
Will consider "W" chipset boards for my future Intel home server/work rig builds. They seem to have good expansion options too.

As for RAM testing: interestingly, swapping the RAM sticks around seems to have fixed the issue. I no longer get any errors in Memtest. Maybe the sticks just weren't seated properly, which is very possible as this board only has retaining clips on one side of its RAM slots. First mainboard I've ever seen with such a "feature".

Now back to Proxmox tinkering.
 
Update: the same issue showed up again. I ran Memtest again, this time overnight, and errors showed up again. Not thousands, only a few miscompares, but they are there.
One or both of the RAM modules are faulty. It's RMA time. Since it's a set, I'm not sure I should bother testing which of the sticks is faulty; they both have to be returned anyway.
 
OK, I bumped the RAM voltage a bit and ran a 48-hour Memtest at 4800 MHz: no more errors. The memory is most likely OK after all.
But the host would still crash, and always with ZFS in the backtrace. No workarounds (like disabling ASPM, C-states, etc.) helped.
I upgraded the ZFS module to bleeding edge (don't do this at home...); so far, no more host crashes for 2 days.
I'm still getting strange BSODs in Windows VMs though, but the Linux machines seem fine.
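
After swapping in a different ZFS build, it is worth confirming which module version the running kernel actually loaded. Below is a minimal Python sketch, assuming the module exposes its version at `/sys/module/zfs/version`, as OpenZFS on Linux normally does; the function name `loaded_zfs_version` is made up for illustration.

```python
from pathlib import Path


def loaded_zfs_version(path="/sys/module/zfs/version"):
    """Return the version string of the loaded zfs kernel module, or None."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Module not loaded, or the sysfs file is not where we assumed.
        return None


version = loaded_zfs_version()
print(version if version else "zfs module not loaded (or version file missing)")
```

If the printed version is still the old one, the new module was probably built but not loaded, for example because the host has not been rebooted since the upgrade.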

I did order a couple more 48 GB sticks from a different vendor, just to be absolutely sure about the RAM situation.
 
It was the memory. I replaced the RAM with sticks from a different vendor and everything is stable.
Either the old RAM was faulty or it was incompatible with the board. I will try to get a refund.

Issue resolved.
 
