[SOLVED] PVE 6.1: ZFS Segfault upon system boot

Hi, upon booting the system (only during boot so far), I get a segfault in ZFS:

Code:
[   17.905941] ZFS: Loaded module v0.8.3-pve1, ZFS pool version 5000, ZFS filesystem version 5
[...][   20.804544] zfs[4387]: segfault at 0 ip 00007f565ddde694 sp 00007f5656ff5420 error 4
[   20.804546] zfs[4379]: segfault at 0 ip 00007f565deba01c sp 00007f565d509478 error 4
[   20.804547]  in libc-2.28.so[7f565dd84000+148000]
[   20.805298]  in libc-2.28.so[7f565dd84000+148000]
[   20.805846] Code: 29 f2 41 ff 55 70 48 85 c0 7e 3b 48 8b 93 90 00 00 00 48 01 43 10 48 83 fa ff 74 0a 48 01 d0 48 89 83 90 00 00 00 48 8b 43 08 <0f> b6 00 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 66 2e 0f 1f
[   20.806365] Code: 29 c8 c5 f8 77 c3 0f 1f 84 00 00 00 00 00 48 85 d2 0f 84 5a 02 00 00 89 f9 c5 f9 6e c6 c4 e2 7d 78 c0 83 e1 3f 83 f9 20 77 44 <c5> fd 74 0f c5 fd d7 c1 85 c0 0f 85 c4 01 00 00 48 83 ea 20 0f 86

I haven't yet noticed any data corruption, but who knows.
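For what it's worth, the kernel's "error 4" in the log above can be decoded: on x86 it is a bitmask (bit 0 = protection violation vs. non-present page, bit 1 = write vs. read, bit 2 = user vs. kernel mode), so "segfault at 0 ... error 4" is a user-space read of unmapped address 0, i.e. a NULL pointer dereference inside libc. A small sketch of the decoding:

```shell
# Decode an x86 page-fault error code as printed by the kernel
# ("segfault at 0 ip ... error 4" above). Standard bit meanings:
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
err=4
[ $((err & 1)) -ne 0 ] && echo "protection violation" || echo "page not present"
[ $((err & 2)) -ne 0 ] && echo "write access"         || echo "read access"
[ $((err & 4)) -ne 0 ] && echo "user mode"            || echo "kernel mode"
# -> page not present / read access / user mode: a NULL-pointer read in user space
```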
 
Hi,

please send the output of:

Code:
pveversion -v
 
Sure, here you go:

Code:
# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.18-2-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-helper: 6.1-7
pve-kernel-5.3: 6.1-5
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-22
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-6
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
I do not see this here on my servers.
Do you use ZFS as the rootfs?
What hardware do you use with ZFS (SSD/HDD, CPU, ...)?
 
Yes. I haven't found out yet how to reproduce this bug reliably, either. It's just weird that it happens from time to time, and it gives me a bit of a bad feeling.

The machine here is an EPYC 7502P on a TYAN platform with 256 GB RAM. It uses NVMe and SATA SSDs. Boot is from two Kingston DC1000B SSDs in a ZFS mirror (I had meant to use md, but that is not officially supported). There are also two mirrored P4800X, 6 Micron 9300 Max in striped mirrors, and 4 SATA SSDs (Samsung, Kingston) in a striped mirror in the system.
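Since I can't rule out corruption yet, I've been checking the pools by hand. A sketch of the read-only checks plus a scrub ('rpool' is the default PVE root pool name; substitute your own pool names, and the guard just keeps the script harmless on hosts without ZFS):

```shell
#!/bin/sh
# Integrity checks for the pools described above.
# Assumption: the root pool is named 'rpool' (the Proxmox VE default).
if command -v zpool >/dev/null 2>&1; then
    zpool status -v    # per-vdev read/write/checksum error counters
    zpool scrub rpool  # walk all data in the pool and verify checksums
    zpool status rpool # scrub progress; repaired or permanent errors
else
    echo "zpool not available on this host"
fi
```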
 
I can't reproduce it here, but my setup is quite different.
Please try pve-kernel-5.4; maybe it solves it.
If not, I can hopefully test it next week with similar HW.
 
@wolfgang: I'm sorry to report that the segfault upon boot is back.

Code:
# dmesg | grep zfs
[   23.442792] traps: zfs[10790] general protection fault ip:7fbe057054a6 sp:7fbdf7ff7310 error:0 in libc-2.28.so[7fbe056a3000+148000]
[   23.462145] systemd[1]: zfs-mount.service: Main process exited, code=killed, status=11/SEGV
[   23.485228] systemd[1]: zfs-mount.service: Failed with result 'signal'.

pveversion -v:
Code:
proxmox-ve: 6.2-1 (running kernel: 5.4.41-1-pve)
pve-manager: 6.2-4 (running version: 6.2-4/9824574a)
pve-kernel-5.4: 6.2-2
pve-kernel-helper: 6.2-2
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.27-1-pve: 5.4.27-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-1
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.1-2
libpve-guest-common-perl: 3.0-10
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.2-1
lxcfs: 4.0.3-pve2
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-1
pve-cluster: 6.1-8
pve-container: 3.1-6
pve-docs: 6.2-4
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.1-2
pve-firmware: 3.1-1
pve-ha-manager: 3.0-9
pve-i18n: 2.1-2
pve-qemu-kvm: 5.0.0-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.2-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.4-pve1
 
This is not the same class of error.

The error now comes from zfs-mount.service.
Is there more information in journald?
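For example (a sketch; coredumpctl needs the systemd-coredump package, which may not be installed, and the guards keep the script harmless on non-systemd hosts):

```shell
#!/bin/sh
# Pull the full context for the zfs-mount.service crash.
if command -v journalctl >/dev/null 2>&1; then
    # everything the unit logged during the current boot
    journalctl -b -u zfs-mount.service --no-pager
else
    echo "journalctl not available"
fi
if command -v coredumpctl >/dev/null 2>&1; then
    # details of the crashed zfs process, if systemd-coredump caught it
    coredumpctl info zfs || echo "no core dump recorded for zfs"
else
    echo "coredumpctl not available"
fi
```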