BUG: Bad page state in process

KelianSB

Jul 16, 2024
Hello,

My server running PVE 8.2 (kernel version 6.8.8-1) just crashed: all containers/VMs stopped responding and SSH to the host no longer worked either.
I had to disconnect the power supply because a graceful shutdown via the button on the case didn't work.

Please note that this is the second time this kind of crash has happened; last time the kernel version was 6.8.4-3.

Does anyone have any idea what the problem could be?

Here is the stack trace:

Code:
Jul 16 13:31:30 kernel: BUG: Bad page state in process iou-wrk-3367  pfn:2b8e2c
Jul 16 13:31:30 kernel: page:00000000a717700a refcount:0 mapcount:0 mapping:0000000044154b98 index:0x3f5a0 pfn:0x2b8e2c
Jul 16 13:31:30 kernel: aops:btree_aops [btrfs] ino:1
Jul 16 13:31:30 kernel: flags: 0x17ffffe000020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x3fffff)
Jul 16 13:31:30 kernel: page_type: 0xffffffff()
Jul 16 13:31:30 kernel: raw: 0017ffffe000020c dead000000000100 dead000000000122 ffff8eae08757a38
Jul 16 13:31:30 kernel: raw: 000000000003f5a0 0000000000000000 00000000ffffffff 0000000000000000
Jul 16 13:31:30 kernel: page dumped because: non-NULL mapping
Jul 16 13:31:30 kernel: Modules linked in: nf_conntrack_netlink xt_nat nft_chain_nat xt_MASQUERADE nf_nat xfrm_user overlay tcp_diag inet_diag xt_recent cmac nls_utf8 cifs cifs_arc4 nls_ucs2_utils rdma_cm iw_cm ib_cm ib_core cifs_md4 netfs nft_limit nft_compat veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw sctp ip6_udp_tunnel udp_tunnel nf_tables 8021q garp mrp bonding tls ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_multiport xt_comment xt_limit xt_addrtype xt_tcpudp xt_conntrack softdog nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc ip6table_filter ip6_tables nfnetlink_log iptable_filter nfnetlink binfmt_misc snd_hda_codec_hdmi xe drm_gpuvm drm_exec snd_hda_codec_realtek gpu_sched drm_suballoc_helper snd_hda_codec_generic drm_ttm_helper snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp intel_uncore_frequency snd_sof intel_uncore_frequency_common intel_tcc_cooling
Jul 16 13:31:30 kernel:  snd_sof_utils snd_soc_hdac_hda x86_pkg_temp_thermal snd_hda_ext_core intel_powerclamp snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation kvm_intel soundwire_bus iwlmvm mac80211 snd_soc_core libarc4 i915 kvm snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg btusb irqbypass snd_intel_sdw_acpi crct10dif_pclmul btrtl polyval_clmulni snd_hda_codec polyval_generic btintel ghash_clmulni_intel processor_thermal_device_pci processor_thermal_device btbcm sha256_ssse3 snd_hda_core processor_thermal_wt_hint sha1_ssse3 btmtk processor_thermal_rfim drm_buddy snd_hwdep ttm aesni_intel processor_thermal_rapl mei_pxp mei_hdcp intel_rapl_msr snd_pcm drm_display_helper crypto_simd intel_rapl_common bluetooth iwlwifi cmdlinepart snd_timer processor_thermal_wt_req intel_pmc_core cryptd cec joydev processor_thermal_power_floor snd spi_nor mei_me think_lmi processor_thermal_mbox input_leds int3403_thermal int3400_thermal rapl rc_core ecdh_generic pmt_telemetry cfg80211 intel_cstate pcspkr
Jul 16 13:31:30 kernel:  firmware_attributes_class mtd soundcore wmi_bmof ecc i2c_algo_bit mei intel_vsec pmt_class acpi_thermal_rel int340x_thermal_zone acpi_tad acpi_pad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap coretemp efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c hid_generic usbmouse usbkbd usbhid hid ixgbe xhci_pci xhci_pci_renesas nvme crc32_pclmul nvme_core e1000e intel_lpss_pci xfrm_algo xhci_hcd i2c_i801 ahci spi_intel_pci dca i2c_smbus intel_lpss spi_intel nvme_auth libahci mdio idma64 video wmi
Jul 16 13:31:30 kernel: CPU: 19 PID: 1371895 Comm: iou-wrk-3367 Tainted: P           O       6.8.8-1-pve #1
Jul 16 13:31:30 kernel: Hardware name: LENOVO 30FA000PFR/330E, BIOS M4GKT30A 04/10/2024
Jul 16 13:31:30 kernel: Call Trace:
Jul 16 13:31:30 kernel:  <TASK>
Jul 16 13:31:30 kernel:  dump_stack_lvl+0x76/0xa0
Jul 16 13:31:30 kernel:  dump_stack+0x10/0x20
Jul 16 13:31:30 kernel:  bad_page+0x76/0x120
Jul 16 13:31:30 kernel:  free_page_is_bad_report+0x86/0xa0
Jul 16 13:31:30 kernel:  free_unref_page_prepare+0x279/0x3d0
Jul 16 13:31:30 kernel:  free_unref_page+0x34/0x140
Jul 16 13:31:30 kernel:  ? __mem_cgroup_uncharge+0x96/0xc0
Jul 16 13:31:30 kernel:  __folio_put+0x3c/0x90
Jul 16 13:31:30 kernel:  btrfs_release_extent_buffer_pages+0x54/0x70 [btrfs]
Jul 16 13:31:30 kernel:  release_extent_buffer+0x49/0xf0 [btrfs]
Jul 16 13:31:30 kernel:  free_extent_buffer_stale.part.0+0x2b/0x60 [btrfs]
Jul 16 13:31:30 kernel:  free_extent_buffer_stale+0x13/0x30 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_force_cow_block+0x32e/0x7c0 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_cow_block+0xcc/0x290 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_search_slot+0x567/0xcb0 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_lookup_csum+0x6f/0x170 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_csum_file_blocks+0x1d0/0x7f0 [btrfs]
Jul 16 13:31:30 kernel:  log_csums.isra.0+0xe5/0x110 [btrfs]
Jul 16 13:31:30 kernel:  log_one_extent+0x5e5/0x640 [btrfs]
Jul 16 13:31:30 kernel:  ? btrfs_search_slot+0x8d4/0xcb0 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_log_inode+0x1393/0x1bd0 [btrfs]
Jul 16 13:31:30 kernel:  ? btrfs_get_alloc_profile+0x3f/0x70 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_log_inode_parent+0x308/0xf10 [btrfs]
Jul 16 13:31:30 kernel:  ? wait_current_trans+0x53/0x160 [btrfs]
Jul 16 13:31:30 kernel:  ? start_transaction+0xd4/0x850 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_log_dentry_safe+0x40/0x70 [btrfs]
Jul 16 13:31:30 kernel:  btrfs_sync_file+0x350/0x5a0 [btrfs]
Jul 16 13:31:30 kernel:  vfs_fsync_range+0x48/0xa0
Jul 16 13:31:30 kernel:  ? __schedule+0x409/0x15e0
Jul 16 13:31:30 kernel:  io_fsync+0x3d/0x60
Jul 16 13:31:30 kernel:  io_issue_sqe+0x61/0x400
Jul 16 13:31:30 kernel:  ? lock_timer_base+0x72/0xa0
Jul 16 13:31:30 kernel:  io_wq_submit_work+0xe2/0x360
Jul 16 13:31:30 kernel:  ? __timer_delete_sync+0x8c/0x100
Jul 16 13:31:30 kernel:  io_worker_handle_work+0x153/0x590
Jul 16 13:31:30 kernel:  io_wq_worker+0x112/0x3c0
Jul 16 13:31:30 kernel:  ? raw_spin_rq_unlock+0x10/0x40
Jul 16 13:31:30 kernel:  ? finish_task_switch.isra.0+0x8c/0x310
Jul 16 13:31:30 kernel:  ? __pfx_io_wq_worker+0x10/0x10
Jul 16 13:31:30 kernel:  ret_from_fork+0x44/0x70
Jul 16 13:31:30 kernel:  ? __pfx_io_wq_worker+0x10/0x10
Jul 16 13:31:30 kernel:  ret_from_fork_asm+0x1b/0x30
Jul 16 13:31:30 kernel: RIP: 0033:0x0
Jul 16 13:31:30 kernel: Code: Unable to access opcode bytes at 0xffffffffffffffd6.
Jul 16 13:31:30 kernel: RSP: 002b:0000000000000000 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
Jul 16 13:31:30 kernel: RAX: 0000000000000000 RBX: 00005e0e13216d70 RCX: 00007fbeca8e3b95
Jul 16 13:31:30 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000017
Jul 16 13:31:30 kernel: RBP: 00005e0e13216d78 R08: 0000000000000000 R09: 0000000000000008
Jul 16 13:31:30 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00005e0e13216e60
Jul 16 13:31:30 kernel: R13: 0000000000000001 R14: 00005e0e130c7ae8 R15: 0000000000000000
Jul 16 13:31:30 kernel:  </TASK>

Thanks
 
The sad truth is that it's unlikely anyone will want to help troubleshoot this - instead you'll be advised to try everything at random (e.g. a memtest on an ECC RAM system). I'm not saying it isn't the RAM, but it could be anything; it could be a kernel bug. Also, anything Btrfs-related would be considered "unsupported" ... alas, PVE does not ship a debug kernel to help with these.

I would try another kernel. It might also be useful to report details about your hardware and when this started happening. Attaching the output of journalctl -b -1 > attachment.log might help too.
 
Maybe try a filesystem other than Btrfs? Or a different kernel version such as 6.5 (which ships with older drivers)? It's all guesswork unless someone recognizes the trace or can debug your system.
 
Maybe try a filesystem other than Btrfs?

Before going down the path of replacing the entire filesystem stack, I would at least check it for integrity:
https://btrfs.readthedocs.io/en/latest/btrfs-check.html
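For reference, an integrity check could look roughly like this. This is a hedged sketch: the device path is a placeholder, and btrfs check must only be run on an unmounted filesystem (e.g. from a rescue/live environment), while a scrub can run on a mounted one.

```shell
# Assumption: /dev/nvme0n1p3 is your Btrfs device - substitute your own.

# Read-only metadata check; safe, makes no changes.
# The filesystem MUST be unmounted for this.
btrfs check --readonly /dev/nvme0n1p3

# A scrub verifies data/metadata checksums and can run on a mounted fs.
# -B runs it in the foreground so the exit status reflects the result.
btrfs scrub start -B /path/to/mountpoint
btrfs scrub status /path/to/mountpoint
```

If either reports corruption, that would point away from a pure kernel bug and toward storage or memory problems.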

Or a different kernel version such as 6.5 (which ships with older drivers)?

This is the cheapest to try out.
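On PVE 8 this could be done roughly as follows. A hedged sketch: it assumes boot entries are managed by proxmox-boot-tool (the default on recent installs), and the exact kernel version string is an example - use whatever `kernel list` shows on your host.

```shell
# List the kernels proxmox-boot-tool knows about:
proxmox-boot-tool kernel list

# Install an older opt-in kernel series if it's not already present:
apt install proxmox-kernel-6.5

# Pin it so the host keeps booting it until you unpin
# (version string below is an example - take it from 'kernel list'):
proxmox-boot-tool kernel pin 6.5.13-6-pve
proxmox-boot-tool refresh
reboot
```

If the crashes stop on the older kernel, that strongly suggests a regression in the 6.8 series rather than failing hardware.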

I would try another kernel. It might also be useful to report details about your hardware and when this started happening. Attaching the output of journalctl -b -1 > attachment.log might help too.

Nothing else interesting in the logs preceding these events?
 
