I/O errors in LXC containers on ZFS. Backups fail, requiring restore.

PhantexTech
Member
Jun 7, 2020
I've been chasing a ghost since upgrading to Proxmox 8.3 (correlation is not causation, so I can't point to 8.3 just yet). Random LXC containers start to exhibit I/O errors. This is only noticed when the nightly backup runs: it fails on those specific containers. The storage is local, on ZFS. ZFS scrubs are clean, with no checksum or I/O errors.

The backup tells me which file is giving the I/O error. If I go into the container and try to read the file, I get the same I/O error. Initially I thought it was a disk error, but ZFS is not complaining, and SMART status is fine with no media or read errors. I restored the container from backup and didn't dive deeper. The next day another LXC container started to have the same issue, and then another one. I am now on my third container with this same problem.

I finally started to dig a bit deeper, running scrubs, etc., but could not find anything. Since LXC containers on ZFS are just datasets, I can navigate them from the host. The same file that gives an I/O error from inside the LXC container reads just fine from the host. The issue seems to be isolated to the Proxmox services and the processes inside the LXC container itself. On my other containers that had this issue, random files started to return I/O errors.

I know this looks like a disk error, but the fact that the file can be read from the host, with no ZFS errors and no SMART errors, leads me to believe something else is going on. QEMU VMs on the same storage medium are not impacted. Has anyone seen this?
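For anyone following along, the health checks described above can be run from the host; a minimal sketch (the pool name `rpool` and the disk device `/dev/sda` are examples, adjust to your setup):

```shell
# Pool health: scrub results plus per-vdev read/write/checksum error counters
zpool status -v rpool

# Kick off a fresh scrub if the last one is stale
zpool scrub rpool

# SMART overall health and the attributes that usually flag a failing disk
smartctl -H /dev/sda
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect|media'
```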

I/O error
[Attachment: Screenshot 2024-11-28 201131.png]
[Attachment: Screenshot 2024-11-28 192856.png]


The file is actually readable from the host's perspective. From the host, there are no I/O errors when interacting with any of the data that errors inside the LXC or that the backup complains about. If I try reading the same file from inside the LXC container, I get an I/O error.
[Attachment: Screenshot 2024-11-28 192940.png]
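The host-versus-container comparison above can be reproduced like this; the container ID 101, the file path, and the `subvol-101-disk-0` dataset path are examples, not the OP's actual values:

```shell
# Read the file from inside the container (this is where the I/O error appears):
pct exec 101 -- md5sum /etc/nginx/nginx.conf

# Read the same file directly from the host via the container's ZFS dataset
# mountpoint, where it reads back cleanly:
md5sum /rpool/data/subvol-101-disk-0/etc/nginx/nginx.conf
```

If the host read succeeds while the in-container read fails, the data on disk is intact and the fault is somewhere between the kernel and the container's view of the filesystem.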
 
I have noticed a similar issue, also on ZFS and in LXCs, for things like Nginx Proxy Manager, as well as (in spite of it being frowned upon) a Docker LXC spun up from the community-scripts process on Debian 12. I was using replication among 3 devices, and the replications carried the same issue, though ZFS showed no problems with the content. Thankfully, I always take a manual backup before tinkering and was able to roll back to a day-old backup stored on another device. This is the first instance I've seen of this behavior. I am presently on 8.2.7, so it does not seem to be isolated to 8.3.

Nothing is taxed in terms of processor, RAM, or otherwise.
 
Alright. Had another system lock up and managed to capture the logs. This is related to the I/O errors: if I try to migrate any of the data that shows I/O errors, system I/O goes to 100%, the machine locks up, and then it reboots. This appears to be a ZFS-related crash, specifically something in the ARC. I can more or less replicate it with really intensive I/O. Today I migrated a running VM's storage between two local ZFS pools, and the host locked up shortly after the migration started.

Here are some important snippets from the logs:
Code:
Nov 30 09:22:09 pve kernel: general protection fault, probably for non-canonical address 0xffe39de18901eb78: 0000 [#1] PREEMPT SMP NOPTI
Nov 30 09:22:09 pve kernel: RIP: 0010:arc_change_state+0x2ca/0x530 [zfs]

Nov 30 09:22:09 pve kernel: Call Trace:
Nov 30 09:22:09 pve kernel:  <TASK>
Nov 30 09:22:09 pve kernel:  ? show_regs+0x6d/0x80
Nov 30 09:22:09 pve kernel:  ? die_addr+0x37/0xa0
Nov 30 09:22:09 pve kernel:  ? exc_general_protection+0x1db/0x480
Nov 30 09:22:09 pve kernel:  ? asm_exc_general_protection+0x27/0x30
Nov 30 09:22:09 pve kernel:  ? arc_change_state+0x2ca/0x530 [zfs]
Nov 30 09:22:09 pve kernel:  arc_access+0x1cc/0x4c0 [zfs]
Nov 30 09:22:09 pve kernel:  arc_write_done+0x2be/0x550 [zfs]
Nov 30 09:22:09 pve kernel:  zio_done+0x289/0x10b0 [zfs]
Nov 30 09:22:09 pve kernel:  zio_execute+0x88/0x130 [zfs]
Nov 30 09:22:09 pve kernel:  taskq_thread+0x27f/0x4c0 [spl]
Nov 30 09:22:09 pve kernel:  ? __pfx_default_wake_function+0x10/0x10
Nov 30 09:22:09 pve kernel:  ? __pfx_zio_execute+0x10/0x10 [zfs]
Nov 30 09:22:09 pve kernel:  ? __pfx_taskq_thread+0x10/0x10 [spl]
Nov 30 09:22:09 pve kernel:  kthread+0xef/0x120
Nov 30 09:22:09 pve kernel:  ? __pfx_kthread+0x10/0x10
Nov 30 09:22:09 pve kernel:  ret_from_fork+0x44/0x70
Nov 30 09:22:09 pve kernel:  ? __pfx_kthread+0x10/0x10
Nov 30 09:22:09 pve kernel:  ret_from_fork_asm+0x1b/0x30
Nov 30 09:22:09 pve kernel:  </TASK>
Nov 30 09:22:09 pve kernel: Modules linked in: tcp_diag inet_diag xt_multiport xt_mark xt_comment cfg80211 xt_nat xt_tcpudp nft_chain_nat nft_compat rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace netfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter ipmi_devintf ipmi_msghandler scsi_transport_iscsi nf_tables nvme_fabrics overlay 8021q garp mrp veth softdog sunrpc binfmt_misc bonding tls nfnetlink_log nfnetlink snd_hda_codec_realtek snd_hda_codec_generic xe drm_gpuvm drm_exec gpu_sched drm_suballoc_helper drm_ttm_helper snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils intel_rapl_msr snd_soc_hdac_hda intel_rapl_common snd_hda_ext_core intel_uncore_frequency snd_soc_acpi_intel_match
Nov 30 09:22:09 pve kernel:  intel_uncore_frequency_common snd_soc_acpi x86_pkg_temp_thermal soundwire_generic_allocation intel_powerclamp soundwire_bus coretemp snd_soc_core kvm_intel snd_compress ac97_bus i915 snd_pcm_dmaengine kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec crct10dif_pclmul polyval_clmulni snd_hda_core polyval_generic ghash_clmulni_intel mei_hdcp snd_hwdep sha256_ssse3 mei_pxp snd_pcm sha1_ssse3 aesni_intel drm_buddy cmdlinepart snd_timer apex(OE) crypto_simd spi_nor ttm cryptd snd intel_cstate soundcore gasket(OE) mtd mei_me pcspkr drm_display_helper wmi_bmof ee1004 mei intel_pmc_core cec intel_vsec rc_core pmt_telemetry joydev pmt_class acpi_pad acpi_tad mac_hid vhost_net vhost vhost_iotlb tap vfio_pci vfio_pci_core irqbypass vfio_iommu_type1 vfio iommufd efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c mlx4_ib ib_uverbs ib_core mlx4_en hid_generic usbhid hid xhci_pci nvme xhci_pci_renesas crc32_pclmul igb mlx4_core e1000e nvme_core
Nov 30 09:22:09 pve kernel:  xhci_hcd i2c_i801 spi_intel_pci i2c_algo_bit ahci spi_intel i2c_smbus dca libahci nvme_auth video wmi
Nov 30 09:22:09 pve kernel: general protection fault, probably for non-canonical address 0xff159de2e3a13f58: 0000 [#2] PREEMPT SMP NOPTI
Nov 30 09:22:09 pve kernel: ---[ end trace 0000000000000000 ]---

Kernel version: 6.8.12-4-pve
Process: z_wr_int_h (a ZFS write-interrupt taskq thread)
Faulting function: arc_change_state

I am going to run a memory test on the server and see whether any errors are detected. If not, this is a software bug.
 
Update: this is a hardware issue with bad memory. As usual :) It does not appear to be a software issue. If anyone comes across something similar, test your RAM.
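A few ways to test RAM as suggested above; the `memtester` package and the EDAC counter path are common options on Debian-based hosts, not necessarily what was used here:

```shell
# Most thorough: boot Memtest86+ from the boot menu and let it run
# several full passes with the host offline.

# Online spot check with the userspace "memtester" tool
# (it can only exercise memory that is currently free):
apt install memtester
memtester 4G 3          # lock and test 4 GiB of RAM for 3 passes

# On ECC systems, the kernel tracks corrected error counts per
# memory controller; nonzero values point at failing DIMMs:
grep -r . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null
```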
 
Joy. Looks like I'm going to have to migrate stuff and start running some hardware tests on what should all have been good RAM in that setup.
 