Host freeze on PBS container/VM backup with kernel error

atwa

New Member
Jun 25, 2023
Hey guys!
I'm quite lost about the current situation with one node in my Proxmox cluster. Maybe someone here can help me out.

Situation:
2x Proxmox servers (ASRock X470D4U, AMD Ryzen 5 3600 6-core processor, 32 GB RAM)
1x another small server, not worth mentioning :) used primarily for quorum

One of them, let's call it pve01, had some weird issues and segfaults. The host was unresponsive via GUI/SSH; only IPMI still worked, but even there nothing responded. The only way to bring it back to life was a restart.
While trying to track down the issue I saw some segfaults in perl. I read a lot in the forums, and the usual advice is to check your disks and run memtest86, so I did:

  • smartctl long self-test on all disks -> no errors found (see the sketch below)
  • memtest86 for 15 hours, 10 full passes, all passed
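
For reference, the long self-tests were started roughly like this (the device name is just a placeholder, repeated for each disk):

Bash:
# start the extended SMART self-test on a disk
smartctl -t long /dev/sda
# once the test has finished, review the self-test log and overall health
smartctl -a /dev/sda
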
A day later, a disk went unavailable, so I replaced it with a spare one. The replacement went smoothly and the server was back up with no issues left.
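The replacement itself was nothing special, roughly like this (pool and device names below are placeholders):

Bash:
# swap the failed disk for the spare and let the pool resilver
zpool replace rpool wwn-0xOLD_DISK /dev/disk/by-id/wwn-0xNEW_DISK
# follow the resilver progress until it completes
zpool status rpool
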
In the same "maintenance window" I added a Proxmox Backup Server on another physical host. Now, whenever I try to back up a VM/container from this host, nothing works and the system becomes completely unresponsive. The last thing I saw was this:

Bash:
Jun 25 17:18:22 pve01 kernel: [26705.376902] BUG: kernel NULL pointer dereference, address: 000000000000004c
Jun 25 17:18:22 pve01 kernel: [26705.377931] #PF: supervisor read access in kernel mode
Jun 25 17:18:22 pve01 kernel: [26705.378891] #PF: error_code(0x0000) - not-present page
Jun 25 17:18:22 pve01 kernel: [26705.379841] PGD 0 P4D 0
Jun 25 17:18:22 pve01 kernel: [26705.380775] Oops: 0000 [#1] SMP NOPTI
Jun 25 17:18:22 pve01 kernel: [26705.381708] CPU: 9 PID: 862016 Comm: tokio-runtime-w Tainted: P        W  O      5.15.108-1-pve #1
Jun 25 17:18:22 pve01 kernel: [26705.382643] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS P3.20 08/12/2019
Jun 25 17:18:22 pve01 kernel: [26705.383578] RIP: 0010:lz4_decompress_zfs+0x10e/0x330 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.384565] Code: 0f 0f 84 de 00 00 00 66 83 f8 07 0f 86 01 01 00 00 48 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 4a 8d 7c 03 fc 48 39 f9 72 31 <48> 39 fb 72 16 4d 39 ee 0f 87 47 ff ff ff 4c 29 e7 89 f8 c1 e8 1f
Jun 25 17:18:22 pve01 kernel: [26705.386484] RSP: 0018:ffffbd342821f6f8 EFLAGS: 00010202
Jun 25 17:18:22 pve01 kernel: [26705.387449] RAX: 8f100e4600000000 RBX: ffffbd341e491685 RCX: ffffbd341e492ff8
Jun 25 17:18:22 pve01 kernel: [26705.388407] RDX: ffffbd341e491639 RSI: 000000000000004c RDI: ffffbd341e491681
Jun 25 17:18:22 pve01 kernel: [26705.389351] RBP: ffffbd342821f748 R08: 0000000000000000 R09: 0000000000000010
Jun 25 17:18:22 pve01 kernel: [26705.390277] R10: ffffffffc0683460 R11: 0000000000013000 R12: ffffbd341e473000
Jun 25 17:18:22 pve01 kernel: [26705.391188] R13: ffffbd340d807426 R14: ffffbd340d8082cc R15: ffffbd341e493000
Jun 25 17:18:22 pve01 kernel: [26705.392077] FS:  00007f663a497700(0000) GS:ffff9b636ec40000(0000) knlGS:0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.392959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 17:18:22 pve01 kernel: [26705.393824] CR2: 000000000000004c CR3: 0000000247df2000 CR4: 0000000000350ee0
Jun 25 17:18:22 pve01 kernel: [26705.394678] Call Trace:
Jun 25 17:18:22 pve01 kernel: [26705.395518]  <TASK>
Jun 25 17:18:22 pve01 kernel: [26705.396330]  zio_decompress_data_buf+0x8e/0x100 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.397186]  ? abd_borrow_buf_copy+0x86/0x90 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.398005]  zio_decompress_data+0x60/0xf0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.398826]  arc_buf_fill+0x171/0xd10 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.399610]  ? aggsum_add+0x1a9/0x1d0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.400373]  arc_buf_alloc_impl.isra.0+0x2fc/0x4b0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.401123]  ? mutex_lock+0x13/0x50
Jun 25 17:18:22 pve01 kernel: [26705.401823]  arc_read+0x106b/0x1560 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.402540]  ? dbuf_rele_and_unlock+0x7e0/0x7e0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.403242]  ? kmem_cache_alloc+0x1ab/0x2f0
Jun 25 17:18:22 pve01 kernel: [26705.403902]  ? spl_kmem_cache_alloc+0x79/0x790 [spl]
Jun 25 17:18:22 pve01 kernel: [26705.404570]  dbuf_read_impl.constprop.0+0x44f/0x760 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.405273]  dbuf_read+0xda/0x5c0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.405972]  dmu_buf_hold_array_by_dnode+0x13a/0x610 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.406676]  dmu_read_uio_dnode+0x4b/0x140 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.407376]  ? zfs_rangelock_enter_impl+0x271/0x690 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.408097]  dmu_read_uio_dbuf+0x47/0x70 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.408808]  zfs_read+0x13a/0x3d0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.409506]  zpl_iter_read+0xdf/0x180 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.410196]  new_sync_read+0x10d/0x1a0
Jun 25 17:18:22 pve01 kernel: [26705.410809]  vfs_read+0x104/0x1a0
Jun 25 17:18:22 pve01 kernel: [26705.411403]  ksys_read+0x67/0xf0
Jun 25 17:18:22 pve01 kernel: [26705.411976]  __x64_sys_read+0x1a/0x20
Jun 25 17:18:22 pve01 kernel: [26705.412531]  do_syscall_64+0x5c/0xc0
Jun 25 17:18:22 pve01 kernel: [26705.413068]  ? _copy_to_user+0x20/0x30
Jun 25 17:18:22 pve01 kernel: [26705.413588]  ? zpl_ioctl+0x1b2/0x1c0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.414139]  ? exit_to_user_mode_prepare+0x37/0x1b0
Jun 25 17:18:22 pve01 kernel: [26705.414637]  ? syscall_exit_to_user_mode+0x27/0x50
Jun 25 17:18:22 pve01 kernel: [26705.415127]  ? do_syscall_64+0x69/0xc0
Jun 25 17:18:22 pve01 kernel: [26705.415613]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jun 25 17:18:22 pve01 kernel: [26705.416104] RIP: 0033:0x7f663b72740c
Jun 25 17:18:22 pve01 kernel: [26705.416593] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 55 f9 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff 55 f9 ff 48
Jun 25 17:18:22 pve01 kernel: [26705.417609] RSP: 002b:00007f663a493650 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.418116] RAX: ffffffffffffffda RBX: 00007f663a493750 RCX: 00007f663b72740c
Jun 25 17:18:22 pve01 kernel: [26705.418609] RDX: 0000000000400000 RSI: 00007f6639121010 RDI: 0000000000000012
Jun 25 17:18:22 pve01 kernel: [26705.419090] RBP: 00007f662c02cd28 R08: 0000000000000000 R09: 4bc7d5cc8df873e3
Jun 25 17:18:22 pve01 kernel: [26705.419564] R10: c582369e6ec05127 R11: 0000000000000246 R12: 0000560a506b2220
Jun 25 17:18:22 pve01 kernel: [26705.420035] R13: 00007f662c02ced4 R14: 00007f662c02ccf0 R15: 0000560a50aa6120
Jun 25 17:18:22 pve01 kernel: [26705.420506]  </TASK>
Jun 25 17:18:22 pve01 kernel: [26705.420970] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm ast drm_vram_helper snd_hda_codec drm_ttm_helper ttm irqbypass crct10dif_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep drm_kms_helper snd_pcm aesni_intel cec snd_timer rc_core crypto_simd fb_sys_fops snd syscopyarea cryptd sysfillrect k10temp soundcore rapl sysimgblt wmi_bmof pcspkr efi_pstore ccp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO)
Jun 25 17:18:22 pve01 kernel: [26705.421020]  zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb crc32_pclmul i2c_piix4 igb xhci_pci xhci_pci_renesas i2c_algo_bit dca ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jun 25 17:18:22 pve01 kernel: [26705.426611] CR2: 000000000000004c
Jun 25 17:18:22 pve01 kernel: [26705.427234] ---[ end trace 162cab7295ccde32 ]---
Jun 25 17:18:22 pve01 kernel: [26705.590194] RIP: 0010:lz4_decompress_zfs+0x10e/0x330 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.590938] Code: 0f 0f 84 de 00 00 00 66 83 f8 07 0f 86 01 01 00 00 48 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 4a 8d 7c 03 fc 48 39 f9 72 31 <48> 39 fb 72 16 4d 39 ee 0f 87 47 ff ff ff 4c 29 e7 89 f8 c1 e8 1f
Jun 25 17:18:22 pve01 kernel: [26705.592320] RSP: 0018:ffffbd342821f6f8 EFLAGS: 00010202
Jun 25 17:18:22 pve01 kernel: [26705.593024] RAX: 8f100e4600000000 RBX: ffffbd341e491685 RCX: ffffbd341e492ff8
Jun 25 17:18:22 pve01 kernel: [26705.593744] RDX: ffffbd341e491639 RSI: 000000000000004c RDI: ffffbd341e491681
Jun 25 17:18:22 pve01 kernel: [26705.594456] RBP: ffffbd342821f748 R08: 0000000000000000 R09: 0000000000000010
Jun 25 17:18:22 pve01 kernel: [26705.595161] R10: ffffffffc0683460 R11: 0000000000013000 R12: ffffbd341e473000
Jun 25 17:18:22 pve01 kernel: [26705.595874] R13: ffffbd340d807426 R14: ffffbd340d8082cc R15: ffffbd341e493000
Jun 25 17:18:22 pve01 kernel: [26705.596596] FS:  00007f663a497700(0000) GS:ffff9b636ec40000(0000) knlGS:0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.597313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 17:18:22 pve01 kernel: [26705.598017] CR2: 000000000000004c CR3: 0000000247df2000 CR4: 0000000000350ee0

I honestly don't know what to try next. I did another memtest86 run and smartctl checks on the disks. Could I have a problem with my PSU?
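The only quick PSU-related check I can think of is reading the board's voltage sensors via IPMI, roughly like this (assuming ipmitool is installed and can reach the local BMC):

Bash:
# read all voltage sensor values reported by the on-board BMC
ipmitool sdr type Voltage
# or the full sensor list including thresholds
ipmitool sensor
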
Hope someone has an idea :)

All other nodes are fine and can back up to the new backup server. When I move the VMs from this node to another one, they back up instantly, in about 1 min.
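For completeness: the failing backup can also be triggered manually from the node's shell instead of the GUI; a minimal sketch (VM ID and PBS storage name are placeholders):

Bash:
# back up a single guest to the PBS storage, same as the scheduled/GUI job
vzdump 100 --storage pbs01 --mode snapshot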

Thanks for every response or bit of help!

Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-4
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
Since yesterday I've been seeing a lot of segfaults in perl:

Code:
Jun 25 01:01:27 pve01 kernel: [83922.943217] pvestatd[1531700]: segfault at 40 ip 00007fbbbab503b6 sp 00007fff3cb86888 error 4 in libc-2.31.so[7fbbbaaee000+159000]
Jun 25 04:30:06 pve01 kernel: [96441.713195] pvesm[2607005]: segfault at 55e5decf1911 ip 000055e658ce8800 sp 00007ffd6f512d40 error 6 in perl[55e658c87000+185000]
Jun 25 05:00:06 pve01 kernel: [98241.349161] pvesm[2654901]: segfault at 40 ip 00007f0a7136824b sp 00007ffdbdfd8b38 error 6 in libc-2.31.so[7f0a7122a000+159000]
Jun 25 07:29:58 pve01 kernel: [107233.511499] pvesm[2889795]: segfault at 0 ip 0000559833838800 sp 00007ffdfb03fa30 error 6 in perl[5598337d7000+185000]
Jun 25 09:00:03 pve01 kernel: [112639.198194] pvesr[3031908]: segfault at f7 ip 000055dea48217d0 sp 00007ffd232cea50 error 6 in perl[55dea47c4000+185000]
Jun 25 09:53:25 pve01 kernel: [    0.220534] pcieport 0000:00:01.3: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
Jun 25 09:53:25 pve01 kernel: [    0.783870] i8042: probe of i8042 failed with error -5
Jun 25 13:00:01 pve01 kernel: [11204.365354] pvesm[378641]: segfault at ffffffff840f67fa ip 00005568e4db613e sp 00007fff3da34c70 error 5 in perl[5568e4d7d000+185000]
Jun 25 13:05:57 pve01 kernel: [11561.146980] traps: pvesm[390849] trap invalid opcode ip:556f6082411d sp:7fff2168fc00 error:0 in perl[556f60820000+185000]
Jun 25 16:00:04 pve01 kernel: [22007.531911] pvesm[715071]: segfault at 559ff665fce0 ip 000055a0e2621012 sp 00007fff461601e0 error 4 in perl[55a0e2612000+185000]
Jun 25 17:51:05 pve01 kernel: [    0.221033] pcieport 0000:00:01.3: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
Jun 25 17:51:05 pve01 kernel: [    0.788689] i8042: probe of i8042 failed with error -5
Jun 25 17:52:59 pve01 kernel: [  123.400224] traps: pvesm[9416] trap invalid opcode ip:55dd632aa4bf sp:7fffa9c77950 error:0 in perl[55dd631ef000+185000]
Jun 25 18:18:34 pve01 kernel: [ 1658.830180] traps: pvesr[58496] general protection fault ip:55dbf8ddfb43 sp:7fff5406de90 error:0 in perl[55dbf8d8a000+185000]
Jun 25 22:23:12 pve01 kernel: [16336.690367] traps: pvestatd[3805] trap invalid opcode ip:7f897cde8304 sp:7ffd0006e2a8 error:0 in libc-2.31.so[7f897ccaa000+159000]
Jun 26 00:59:56 pve01 kernel: [25740.400907] pvesm[840731]: segfault at 44 ip 0000556154daf4fc sp 00007ffe52bf2720 error 6 in perl[556154da8000+185000]
Jun 26 08:59:59 pve01 kernel: [54543.373670] pvesm[1913807]: segfault at 2dd24670 ip 0000558972caec24 sp 00007ffd2dd244f0 error 4 in perl[558972c4e000+185000]
Jun 26 09:00:02 pve01 kernel: [54546.641689] pvesr[1915005]: segfault at 6a ip 000055f38a78b248 sp 00007ffe2a048880 error 4 in perl[55f38a77f000+185000]
Jun 26 09:30:04 pve01 kernel: [56349.048372] pvesm[1986474]: segfault at 0 ip 000056416c405419 sp 00007ffe1d92a730 error 6 in perl[56416c3a4000+185000]
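
Since the segfaults hit different processes at seemingly random addresses, my next idea is to look for machine check / memory (EDAC) errors in the kernel log; roughly like this (the sysfs path may differ per kernel, this is just a sketch):

Bash:
# search the kernel log for machine check exceptions and EDAC reports
journalctl -k | grep -iE 'mce|machine check|edac'
# dump the corrected/uncorrected error counters exposed by the EDAC driver
grep -r . /sys/devices/system/edac/mc/ 2>/dev/null | grep -i count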

Does anyone have an idea?
 
OK, now things are getting really weird.

As said above, I have two identical Proxmox servers.
Now, three days later, the second server also has faulted disks and a degraded ZFS pool:

Code:
  pool: tank0
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:04:52 with 0 errors on Sun Jun 11 00:28:53 2023
config:

        NAME                                               STATE     READ WRITE CKSUM
        tank0                                              DEGRADED     0     0     0
          raidz1-0                                         DEGRADED   383    10     0
            wwn-0x5002538ee030fc2a                         DEGRADED   169     9     1  too many errors
            wwn-0x5002538e90400a34                         DEGRADED   311     1     1  too many errors
            ata-Samsung_SSD_860_PRO_512GB_S42YNX0N402577P  FAULTED     58     0     0  too many errors
            wwn-0x5002538ee030fc82                         ONLINE       0     0     1

I already ran memtest86 on this server as well, 5 cycles, no errors at all. Can it really be that two servers develop the same issue within 3 days?
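
Before replacing anything on this server, I want to re-verify the pool first; a minimal sketch of what I plan to run (pool name as above):

Bash:
# re-read all data on the pool and verify checksums
zpool scrub tank0
# watch the scrub progress, per-device error counters and any damaged files
zpool status -v tank0
# recent ZFS error events, to see which vdev reported what
zpool events -v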