Host freezes with kernel error on PBS container/VM backup

atwa

New Member
Jun 25, 2023
Hey guys!
I'm quite lost about the current situation with one node in my Proxmox cluster. Maybe one of you can help me out here.

Situation:
2x Proxmox servers (ASRock X470D4U, AMD Ryzen 5 3600 6-core processor, 32GB RAM)
1x another small server, nothing worth mentioning :) used primarily for quorum

Let's call the affected node pve01. It had some weird issues and segfaults: the host was unresponsive via GUI/SSH, and only IPMI still worked, but even there nothing responded. The only way to bring it back to life was a restart.
I tried to track down the issue and saw some segfaults in perl. I read a lot in the forums, and the common advice was: check your disks and run memtest86. So I did (the disk check was roughly what's sketched below the list):

  • smartctl long self-test on all disks -> no errors found
  • memtest86 for 15 hours, 10 full runs, all passed
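For reference, the smartctl part was something along these lines (device names are just examples):

Bash:
# start a long SMART self-test on each disk
smartctl -t long /dev/sda
# wait for the duration smartctl reports, then check the result
smartctl -a /dev/sda | grep -i -A1 "self-test"
# also look at the drive's own error log
smartctl -l error /dev/sda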
A day later, a disk became unavailable, so I replaced it with a spare one. Everything went smoothly, the server was back up, and no issues were left.
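(The replace itself was basically the standard procedure; pool and device names below are placeholders, assuming the pool is called tank0 like on my second server:)

Bash:
# tell ZFS to rebuild onto the spare disk
zpool replace tank0 wwn-0xOLDDISK /dev/disk/by-id/wwn-0xNEWDISK
# watch the resilver until it finishes
zpool status -v tank0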
In the same "maintenance window" I added a Proxmox Backup Server on another physical host. Now, when I try to back up a VM/container from this host, nothing works and the system becomes fully unresponsive. The last thing I saw was this:

Bash:
Jun 25 17:18:22 pve01 kernel: [26705.376902] BUG: kernel NULL pointer dereference, address: 000000000000004c
Jun 25 17:18:22 pve01 kernel: [26705.377931] #PF: supervisor read access in kernel mode
Jun 25 17:18:22 pve01 kernel: [26705.378891] #PF: error_code(0x0000) - not-present page
Jun 25 17:18:22 pve01 kernel: [26705.379841] PGD 0 P4D 0
Jun 25 17:18:22 pve01 kernel: [26705.380775] Oops: 0000 [#1] SMP NOPTI
Jun 25 17:18:22 pve01 kernel: [26705.381708] CPU: 9 PID: 862016 Comm: tokio-runtime-w Tainted: P        W  O      5.15.108-1-pve #1
Jun 25 17:18:22 pve01 kernel: [26705.382643] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS P3.20 08/12/2019
Jun 25 17:18:22 pve01 kernel: [26705.383578] RIP: 0010:lz4_decompress_zfs+0x10e/0x330 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.384565] Code: 0f 0f 84 de 00 00 00 66 83 f8 07 0f 86 01 01 00 00 48 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 4a 8d 7c 03 fc 48 39 f9 72 31 <48> 39 fb 72 16 4d 39 ee 0f 87 47 ff ff ff 4c 29 e7 89 f8 c1 e8 1f
Jun 25 17:18:22 pve01 kernel: [26705.386484] RSP: 0018:ffffbd342821f6f8 EFLAGS: 00010202
Jun 25 17:18:22 pve01 kernel: [26705.387449] RAX: 8f100e4600000000 RBX: ffffbd341e491685 RCX: ffffbd341e492ff8
Jun 25 17:18:22 pve01 kernel: [26705.388407] RDX: ffffbd341e491639 RSI: 000000000000004c RDI: ffffbd341e491681
Jun 25 17:18:22 pve01 kernel: [26705.389351] RBP: ffffbd342821f748 R08: 0000000000000000 R09: 0000000000000010
Jun 25 17:18:22 pve01 kernel: [26705.390277] R10: ffffffffc0683460 R11: 0000000000013000 R12: ffffbd341e473000
Jun 25 17:18:22 pve01 kernel: [26705.391188] R13: ffffbd340d807426 R14: ffffbd340d8082cc R15: ffffbd341e493000
Jun 25 17:18:22 pve01 kernel: [26705.392077] FS:  00007f663a497700(0000) GS:ffff9b636ec40000(0000) knlGS:0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.392959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 17:18:22 pve01 kernel: [26705.393824] CR2: 000000000000004c CR3: 0000000247df2000 CR4: 0000000000350ee0
Jun 25 17:18:22 pve01 kernel: [26705.394678] Call Trace:
Jun 25 17:18:22 pve01 kernel: [26705.395518]  <TASK>
Jun 25 17:18:22 pve01 kernel: [26705.396330]  zio_decompress_data_buf+0x8e/0x100 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.397186]  ? abd_borrow_buf_copy+0x86/0x90 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.398005]  zio_decompress_data+0x60/0xf0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.398826]  arc_buf_fill+0x171/0xd10 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.399610]  ? aggsum_add+0x1a9/0x1d0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.400373]  arc_buf_alloc_impl.isra.0+0x2fc/0x4b0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.401123]  ? mutex_lock+0x13/0x50
Jun 25 17:18:22 pve01 kernel: [26705.401823]  arc_read+0x106b/0x1560 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.402540]  ? dbuf_rele_and_unlock+0x7e0/0x7e0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.403242]  ? kmem_cache_alloc+0x1ab/0x2f0
Jun 25 17:18:22 pve01 kernel: [26705.403902]  ? spl_kmem_cache_alloc+0x79/0x790 [spl]
Jun 25 17:18:22 pve01 kernel: [26705.404570]  dbuf_read_impl.constprop.0+0x44f/0x760 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.405273]  dbuf_read+0xda/0x5c0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.405972]  dmu_buf_hold_array_by_dnode+0x13a/0x610 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.406676]  dmu_read_uio_dnode+0x4b/0x140 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.407376]  ? zfs_rangelock_enter_impl+0x271/0x690 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.408097]  dmu_read_uio_dbuf+0x47/0x70 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.408808]  zfs_read+0x13a/0x3d0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.409506]  zpl_iter_read+0xdf/0x180 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.410196]  new_sync_read+0x10d/0x1a0
Jun 25 17:18:22 pve01 kernel: [26705.410809]  vfs_read+0x104/0x1a0
Jun 25 17:18:22 pve01 kernel: [26705.411403]  ksys_read+0x67/0xf0
Jun 25 17:18:22 pve01 kernel: [26705.411976]  __x64_sys_read+0x1a/0x20
Jun 25 17:18:22 pve01 kernel: [26705.412531]  do_syscall_64+0x5c/0xc0
Jun 25 17:18:22 pve01 kernel: [26705.413068]  ? _copy_to_user+0x20/0x30
Jun 25 17:18:22 pve01 kernel: [26705.413588]  ? zpl_ioctl+0x1b2/0x1c0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.414139]  ? exit_to_user_mode_prepare+0x37/0x1b0
Jun 25 17:18:22 pve01 kernel: [26705.414637]  ? syscall_exit_to_user_mode+0x27/0x50
Jun 25 17:18:22 pve01 kernel: [26705.415127]  ? do_syscall_64+0x69/0xc0
Jun 25 17:18:22 pve01 kernel: [26705.415613]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jun 25 17:18:22 pve01 kernel: [26705.416104] RIP: 0033:0x7f663b72740c
Jun 25 17:18:22 pve01 kernel: [26705.416593] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 55 f9 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff 55 f9 ff 48
Jun 25 17:18:22 pve01 kernel: [26705.417609] RSP: 002b:00007f663a493650 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.418116] RAX: ffffffffffffffda RBX: 00007f663a493750 RCX: 00007f663b72740c
Jun 25 17:18:22 pve01 kernel: [26705.418609] RDX: 0000000000400000 RSI: 00007f6639121010 RDI: 0000000000000012
Jun 25 17:18:22 pve01 kernel: [26705.419090] RBP: 00007f662c02cd28 R08: 0000000000000000 R09: 4bc7d5cc8df873e3
Jun 25 17:18:22 pve01 kernel: [26705.419564] R10: c582369e6ec05127 R11: 0000000000000246 R12: 0000560a506b2220
Jun 25 17:18:22 pve01 kernel: [26705.420035] R13: 00007f662c02ced4 R14: 00007f662c02ccf0 R15: 0000560a50aa6120
Jun 25 17:18:22 pve01 kernel: [26705.420506]  </TASK>
Jun 25 17:18:22 pve01 kernel: [26705.420970] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm ast drm_vram_helper snd_hda_codec drm_ttm_helper ttm irqbypass crct10dif_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep drm_kms_helper snd_pcm aesni_intel cec snd_timer rc_core crypto_simd fb_sys_fops snd syscopyarea cryptd sysfillrect k10temp soundcore rapl sysimgblt wmi_bmof pcspkr efi_pstore ccp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO)
Jun 25 17:18:22 pve01 kernel: [26705.421020]  zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb crc32_pclmul i2c_piix4 igb xhci_pci xhci_pci_renesas i2c_algo_bit dca ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jun 25 17:18:22 pve01 kernel: [26705.426611] CR2: 000000000000004c
Jun 25 17:18:22 pve01 kernel: [26705.427234] ---[ end trace 162cab7295ccde32 ]---
Jun 25 17:18:22 pve01 kernel: [26705.590194] RIP: 0010:lz4_decompress_zfs+0x10e/0x330 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.590938] Code: 0f 0f 84 de 00 00 00 66 83 f8 07 0f 86 01 01 00 00 48 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 4a 8d 7c 03 fc 48 39 f9 72 31 <48> 39 fb 72 16 4d 39 ee 0f 87 47 ff ff ff 4c 29 e7 89 f8 c1 e8 1f
Jun 25 17:18:22 pve01 kernel: [26705.592320] RSP: 0018:ffffbd342821f6f8 EFLAGS: 00010202
Jun 25 17:18:22 pve01 kernel: [26705.593024] RAX: 8f100e4600000000 RBX: ffffbd341e491685 RCX: ffffbd341e492ff8
Jun 25 17:18:22 pve01 kernel: [26705.593744] RDX: ffffbd341e491639 RSI: 000000000000004c RDI: ffffbd341e491681
Jun 25 17:18:22 pve01 kernel: [26705.594456] RBP: ffffbd342821f748 R08: 0000000000000000 R09: 0000000000000010
Jun 25 17:18:22 pve01 kernel: [26705.595161] R10: ffffffffc0683460 R11: 0000000000013000 R12: ffffbd341e473000
Jun 25 17:18:22 pve01 kernel: [26705.595874] R13: ffffbd340d807426 R14: ffffbd340d8082cc R15: ffffbd341e493000
Jun 25 17:18:22 pve01 kernel: [26705.596596] FS:  00007f663a497700(0000) GS:ffff9b636ec40000(0000) knlGS:0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.597313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 17:18:22 pve01 kernel: [26705.598017] CR2: 000000000000004c CR3: 0000000247df2000 CR4: 0000000000350ee0

I actually don't know what to go for next. I did another memtest86 run and smartctl checks on the disks. Could I have a problem with my PSU?
Hope someone has an idea :)
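In the meantime I'm watching the kernel log for machine-check (hardware) errors, roughly like this:

Bash:
# look for machine-check exceptions or other hardware complaints
journalctl -k | grep -iE "mce|machine check|hardware error"
# many different processes segfaulting (perl, pvesm, pvestatd, ...) usually points at RAM/CPU/power
journalctl -k | grep -ci segfault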

All other nodes are fine and can back up to the new backup server. When I move the VMs to one of them, they back up instantly, in about 1 min.
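To narrow it down, my next test is to trigger the same read path without PBS by forcing a full read of the data (the dataset path is just an example):

Bash:
# a scrub forces a full read of all data on the pool
zpool scrub tank0
zpool status tank0
# or read a guest's data through the filesystem, like the backup client does
tar cf - /tank0/subvol-100-disk-0 > /dev/null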

Thanks in advance for any response or help!

Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-4
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1
 
Since yesterday I've been seeing a lot of segfaults in perl:

Code:
Jun 25 01:01:27 pve01 kernel: [83922.943217] pvestatd[1531700]: segfault at 40 ip 00007fbbbab503b6 sp 00007fff3cb86888 error 4 in libc-2.31.so[7fbbbaaee000+159000]
Jun 25 04:30:06 pve01 kernel: [96441.713195] pvesm[2607005]: segfault at 55e5decf1911 ip 000055e658ce8800 sp 00007ffd6f512d40 error 6 in perl[55e658c87000+185000]
Jun 25 05:00:06 pve01 kernel: [98241.349161] pvesm[2654901]: segfault at 40 ip 00007f0a7136824b sp 00007ffdbdfd8b38 error 6 in libc-2.31.so[7f0a7122a000+159000]
Jun 25 07:29:58 pve01 kernel: [107233.511499] pvesm[2889795]: segfault at 0 ip 0000559833838800 sp 00007ffdfb03fa30 error 6 in perl[5598337d7000+185000]
Jun 25 09:00:03 pve01 kernel: [112639.198194] pvesr[3031908]: segfault at f7 ip 000055dea48217d0 sp 00007ffd232cea50 error 6 in perl[55dea47c4000+185000]
Jun 25 09:53:25 pve01 kernel: [    0.220534] pcieport 0000:00:01.3: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
Jun 25 09:53:25 pve01 kernel: [    0.783870] i8042: probe of i8042 failed with error -5
Jun 25 13:00:01 pve01 kernel: [11204.365354] pvesm[378641]: segfault at ffffffff840f67fa ip 00005568e4db613e sp 00007fff3da34c70 error 5 in perl[5568e4d7d000+185000]
Jun 25 13:05:57 pve01 kernel: [11561.146980] traps: pvesm[390849] trap invalid opcode ip:556f6082411d sp:7fff2168fc00 error:0 in perl[556f60820000+185000]
Jun 25 16:00:04 pve01 kernel: [22007.531911] pvesm[715071]: segfault at 559ff665fce0 ip 000055a0e2621012 sp 00007fff461601e0 error 4 in perl[55a0e2612000+185000]
Jun 25 17:51:05 pve01 kernel: [    0.221033] pcieport 0000:00:01.3: DPC: error containment capabilities: Int Msg #0, RPExt+ PoisonedTLP+ SwTrigger+ RP PIO Log 6, DL_ActiveErr+
Jun 25 17:51:05 pve01 kernel: [    0.788689] i8042: probe of i8042 failed with error -5
Jun 25 17:52:59 pve01 kernel: [  123.400224] traps: pvesm[9416] trap invalid opcode ip:55dd632aa4bf sp:7fffa9c77950 error:0 in perl[55dd631ef000+185000]
Jun 25 18:18:34 pve01 kernel: [ 1658.830180] traps: pvesr[58496] general protection fault ip:55dbf8ddfb43 sp:7fff5406de90 error:0 in perl[55dbf8d8a000+185000]
Jun 25 22:23:12 pve01 kernel: [16336.690367] traps: pvestatd[3805] trap invalid opcode ip:7f897cde8304 sp:7ffd0006e2a8 error:0 in libc-2.31.so[7f897ccaa000+159000]
Jun 26 00:59:56 pve01 kernel: [25740.400907] pvesm[840731]: segfault at 44 ip 0000556154daf4fc sp 00007ffe52bf2720 error 6 in perl[556154da8000+185000]
Jun 26 08:59:59 pve01 kernel: [54543.373670] pvesm[1913807]: segfault at 2dd24670 ip 0000558972caec24 sp 00007ffd2dd244f0 error 4 in perl[558972c4e000+185000]
Jun 26 09:00:02 pve01 kernel: [54546.641689] pvesr[1915005]: segfault at 6a ip 000055f38a78b248 sp 00007ffe2a048880 error 4 in perl[55f38a77f000+185000]
Jun 26 09:30:04 pve01 kernel: [56349.048372] pvesm[1986474]: segfault at 0 ip 000056416c405419 sp 00007ffe1d92a730 error 6 in perl[56416c3a4000+185000]

Anyone have an idea?
 
OK, now things are getting really weird.

As said above, I have two identical Proxmox servers.
Now, three days later, the second server also has faulting disks and a degraded ZFS pool.

Code:
  pool: tank0
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:04:52 with 0 errors on Sun Jun 11 00:28:53 2023
config:

        NAME                                               STATE     READ WRITE CKSUM
        tank0                                              DEGRADED     0     0     0
          raidz1-0                                         DEGRADED   383    10     0
            wwn-0x5002538ee030fc2a                         DEGRADED   169     9     1  too many errors
            wwn-0x5002538e90400a34                         DEGRADED   311     1     1  too many errors
            ata-Samsung_SSD_860_PRO_512GB_S42YNX0N402577P  FAULTED     58     0     0  too many errors
            wwn-0x5002538ee030fc82                         ONLINE       0     0     1

I already ran another memtest86, 5 cycles, no errors at all. Can it really be that two servers develop the same issue within 3 days?
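The "action" line suggests replacing the faulted devices or clearing the errors. Since the disks test fine in SMART, my plan is to clear and scrub and watch whether the counters climb again:

Bash:
# reset the error counters, then re-read and verify all data
zpool clear tank0
zpool scrub tank0
# if READ/WRITE/CKSUM errors come back after the scrub, the problem is still live
zpool status -v tank0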
 
