Hey Guys!
I'm quite lost about the current situation with one node in my Proxmox cluster. Maybe someone here can help me out.
Situation:
2x Proxmox servers (ASRock X470D4U, AMD Ryzen 5 3600 6-core), 32 GB RAM
1x another small server, not worth mentioning in detail, mainly used for quorum
Let's call the affected node pve01. It started having weird issues and segfaults: the host was unresponsive via GUI/SSH, only IPMI still worked, but even on the IPMI console there was no response. The only way to bring it back to life was a reboot.
Trying to find the cause, I saw some perl segfaults in the logs. I read a lot in the forums, and the common advice was to check the disks and run memtest86. So I did both:
- smartctl long self-test on all disks (roughly the commands shown below) -> no errors found
- memtest86 for 15 hours, 10 full passes, all passed
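For reference, this is roughly what I ran per disk; /dev/sda is just an example, I repeated it for every drive in the box:
Bash:
# start the long/extended SMART self-test (runs inside the drive)
smartctl -t long /dev/sda
# after it finished, check the self-test log and overall health
smartctl -a /dev/sda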
During the same "maintenance window" I added a Proxmox Backup Server on another physical host. Now whenever I try to back up a VM/container from this host, nothing works and the system becomes completely unresponsive. The last thing I saw was this:
Bash:
Jun 25 17:18:22 pve01 kernel: [26705.376902] BUG: kernel NULL pointer dereference, address: 000000000000004c
Jun 25 17:18:22 pve01 kernel: [26705.377931] #PF: supervisor read access in kernel mode
Jun 25 17:18:22 pve01 kernel: [26705.378891] #PF: error_code(0x0000) - not-present page
Jun 25 17:18:22 pve01 kernel: [26705.379841] PGD 0 P4D 0
Jun 25 17:18:22 pve01 kernel: [26705.380775] Oops: 0000 [#1] SMP NOPTI
Jun 25 17:18:22 pve01 kernel: [26705.381708] CPU: 9 PID: 862016 Comm: tokio-runtime-w Tainted: P W O 5.15.108-1-pve #1
Jun 25 17:18:22 pve01 kernel: [26705.382643] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS P3.20 08/12/2019
Jun 25 17:18:22 pve01 kernel: [26705.383578] RIP: 0010:lz4_decompress_zfs+0x10e/0x330 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.384565] Code: 0f 0f 84 de 00 00 00 66 83 f8 07 0f 86 01 01 00 00 48 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 4a 8d 7c 03 fc 48 39 f9 72 31 <48> 39 fb 72 16 4d 39 ee 0f 87 47 ff ff ff 4c 29 e7 89 f8 c1 e8 1f
Jun 25 17:18:22 pve01 kernel: [26705.386484] RSP: 0018:ffffbd342821f6f8 EFLAGS: 00010202
Jun 25 17:18:22 pve01 kernel: [26705.387449] RAX: 8f100e4600000000 RBX: ffffbd341e491685 RCX: ffffbd341e492ff8
Jun 25 17:18:22 pve01 kernel: [26705.388407] RDX: ffffbd341e491639 RSI: 000000000000004c RDI: ffffbd341e491681
Jun 25 17:18:22 pve01 kernel: [26705.389351] RBP: ffffbd342821f748 R08: 0000000000000000 R09: 0000000000000010
Jun 25 17:18:22 pve01 kernel: [26705.390277] R10: ffffffffc0683460 R11: 0000000000013000 R12: ffffbd341e473000
Jun 25 17:18:22 pve01 kernel: [26705.391188] R13: ffffbd340d807426 R14: ffffbd340d8082cc R15: ffffbd341e493000
Jun 25 17:18:22 pve01 kernel: [26705.392077] FS: 00007f663a497700(0000) GS:ffff9b636ec40000(0000) knlGS:0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.392959] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 17:18:22 pve01 kernel: [26705.393824] CR2: 000000000000004c CR3: 0000000247df2000 CR4: 0000000000350ee0
Jun 25 17:18:22 pve01 kernel: [26705.394678] Call Trace:
Jun 25 17:18:22 pve01 kernel: [26705.395518] <TASK>
Jun 25 17:18:22 pve01 kernel: [26705.396330] zio_decompress_data_buf+0x8e/0x100 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.397186] ? abd_borrow_buf_copy+0x86/0x90 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.398005] zio_decompress_data+0x60/0xf0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.398826] arc_buf_fill+0x171/0xd10 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.399610] ? aggsum_add+0x1a9/0x1d0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.400373] arc_buf_alloc_impl.isra.0+0x2fc/0x4b0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.401123] ? mutex_lock+0x13/0x50
Jun 25 17:18:22 pve01 kernel: [26705.401823] arc_read+0x106b/0x1560 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.402540] ? dbuf_rele_and_unlock+0x7e0/0x7e0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.403242] ? kmem_cache_alloc+0x1ab/0x2f0
Jun 25 17:18:22 pve01 kernel: [26705.403902] ? spl_kmem_cache_alloc+0x79/0x790 [spl]
Jun 25 17:18:22 pve01 kernel: [26705.404570] dbuf_read_impl.constprop.0+0x44f/0x760 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.405273] dbuf_read+0xda/0x5c0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.405972] dmu_buf_hold_array_by_dnode+0x13a/0x610 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.406676] dmu_read_uio_dnode+0x4b/0x140 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.407376] ? zfs_rangelock_enter_impl+0x271/0x690 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.408097] dmu_read_uio_dbuf+0x47/0x70 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.408808] zfs_read+0x13a/0x3d0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.409506] zpl_iter_read+0xdf/0x180 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.410196] new_sync_read+0x10d/0x1a0
Jun 25 17:18:22 pve01 kernel: [26705.410809] vfs_read+0x104/0x1a0
Jun 25 17:18:22 pve01 kernel: [26705.411403] ksys_read+0x67/0xf0
Jun 25 17:18:22 pve01 kernel: [26705.411976] __x64_sys_read+0x1a/0x20
Jun 25 17:18:22 pve01 kernel: [26705.412531] do_syscall_64+0x5c/0xc0
Jun 25 17:18:22 pve01 kernel: [26705.413068] ? _copy_to_user+0x20/0x30
Jun 25 17:18:22 pve01 kernel: [26705.413588] ? zpl_ioctl+0x1b2/0x1c0 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.414139] ? exit_to_user_mode_prepare+0x37/0x1b0
Jun 25 17:18:22 pve01 kernel: [26705.414637] ? syscall_exit_to_user_mode+0x27/0x50
Jun 25 17:18:22 pve01 kernel: [26705.415127] ? do_syscall_64+0x69/0xc0
Jun 25 17:18:22 pve01 kernel: [26705.415613] entry_SYSCALL_64_after_hwframe+0x61/0xcb
Jun 25 17:18:22 pve01 kernel: [26705.416104] RIP: 0033:0x7f663b72740c
Jun 25 17:18:22 pve01 kernel: [26705.416593] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 c9 55 f9 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff 55 f9 ff 48
Jun 25 17:18:22 pve01 kernel: [26705.417609] RSP: 002b:00007f663a493650 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.418116] RAX: ffffffffffffffda RBX: 00007f663a493750 RCX: 00007f663b72740c
Jun 25 17:18:22 pve01 kernel: [26705.418609] RDX: 0000000000400000 RSI: 00007f6639121010 RDI: 0000000000000012
Jun 25 17:18:22 pve01 kernel: [26705.419090] RBP: 00007f662c02cd28 R08: 0000000000000000 R09: 4bc7d5cc8df873e3
Jun 25 17:18:22 pve01 kernel: [26705.419564] R10: c582369e6ec05127 R11: 0000000000000246 R12: 0000560a506b2220
Jun 25 17:18:22 pve01 kernel: [26705.420035] R13: 00007f662c02ced4 R14: 00007f662c02ccf0 R15: 0000560a50aa6120
Jun 25 17:18:22 pve01 kernel: [26705.420506] </TASK>
Jun 25 17:18:22 pve01 kernel: [26705.420970] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp ip6_udp_tunnel udp_tunnel iptable_filter bpfilter softdog nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm ast drm_vram_helper snd_hda_codec drm_ttm_helper ttm irqbypass crct10dif_pclmul ghash_clmulni_intel snd_hda_core snd_hwdep drm_kms_helper snd_pcm aesni_intel cec snd_timer rc_core crypto_simd fb_sys_fops snd syscopyarea cryptd sysfillrect k10temp soundcore rapl sysimgblt wmi_bmof pcspkr efi_pstore ccp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO)
Jun 25 17:18:22 pve01 kernel: [26705.421020] zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb crc32_pclmul i2c_piix4 igb xhci_pci xhci_pci_renesas i2c_algo_bit dca ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jun 25 17:18:22 pve01 kernel: [26705.426611] CR2: 000000000000004c
Jun 25 17:18:22 pve01 kernel: [26705.427234] ---[ end trace 162cab7295ccde32 ]---
Jun 25 17:18:22 pve01 kernel: [26705.590194] RIP: 0010:lz4_decompress_zfs+0x10e/0x330 [zfs]
Jun 25 17:18:22 pve01 kernel: [26705.590938] Code: 0f 0f 84 de 00 00 00 66 83 f8 07 0f 86 01 01 00 00 48 8b 02 48 83 c3 08 48 83 c2 08 48 89 43 f8 4a 8d 7c 03 fc 48 39 f9 72 31 <48> 39 fb 72 16 4d 39 ee 0f 87 47 ff ff ff 4c 29 e7 89 f8 c1 e8 1f
Jun 25 17:18:22 pve01 kernel: [26705.592320] RSP: 0018:ffffbd342821f6f8 EFLAGS: 00010202
Jun 25 17:18:22 pve01 kernel: [26705.593024] RAX: 8f100e4600000000 RBX: ffffbd341e491685 RCX: ffffbd341e492ff8
Jun 25 17:18:22 pve01 kernel: [26705.593744] RDX: ffffbd341e491639 RSI: 000000000000004c RDI: ffffbd341e491681
Jun 25 17:18:22 pve01 kernel: [26705.594456] RBP: ffffbd342821f748 R08: 0000000000000000 R09: 0000000000000010
Jun 25 17:18:22 pve01 kernel: [26705.595161] R10: ffffffffc0683460 R11: 0000000000013000 R12: ffffbd341e473000
Jun 25 17:18:22 pve01 kernel: [26705.595874] R13: ffffbd340d807426 R14: ffffbd340d8082cc R15: ffffbd341e493000
Jun 25 17:18:22 pve01 kernel: [26705.596596] FS: 00007f663a497700(0000) GS:ffff9b636ec40000(0000) knlGS:0000000000000000
Jun 25 17:18:22 pve01 kernel: [26705.597313] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 17:18:22 pve01 kernel: [26705.598017] CR2: 000000000000004c CR3: 0000000247df2000 CR4: 0000000000350ee0
I honestly don't know what to try next. I did another memtest86 run and ran smartctl on the disks again, still without findings. Could I maybe have a problem with my PSU?
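Since the trace dies in ZFS lz4 decompression, the only other thing I can think of is checking the pool itself. A minimal sketch of what I have in mind, assuming the pool is simply called rpool (mine might be named differently):
Bash:
# look for read/write/checksum errors on the pool and its devices
zpool status -v rpool
# start a scrub and re-check the status once it is done
zpool scrub rpool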
I hope someone has an idea.
All the other nodes are fine and can back up to the new backup server without problems. When I migrate the VMs from this node to one of the others, they back up right away, in about one minute.
Thanks for every response or bit of help!
Code:
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-15 (running version: 7.4-15/a5d2a31e)
pve-kernel-5.15: 7.4-4
pve-kernel-5.4: 6.4-20
pve-kernel-5.15.108-1-pve: 5.15.108-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.4.203-1-pve: 5.4.203-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: 0.8.36+pve2
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1