Panic when importing zpool (invalid TYPE 66)

Etique57

Hello,

I run Proxmox v6.3 and I created a raidz-1 (3 disks with 1 parity) pool from these 3 SSDs:
[screenshot listing the three SSDs: PNY CS900 240 GB, SanDisk SSD PLUS 240 GB, Crucial M550 256 GB]

(I had to create it manually because the disks differ in size.)
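
For reference, I created it with something along these lines (reconstructed from memory, so treat the exact command as approximate; the disk IDs match the pool layout shown further down):

Code:
# raidz1 pool from the three SSDs, using stable /dev/disk/by-id names
# (-f may be needed if zpool complains about the devices differing in size)
zpool create -f vmdisks raidz1 \
  /dev/disk/by-id/ata-PNY_CS900_240GB_SSD_PNY43191910250107DA1 \
  /dev/disk/by-id/ata-SanDisk_SSD_PLUS_240GB_1835AF801426 \
  /dev/disk/by-id/ata-Crucial_CT256M550SSD1_14150DF074A0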

It initially worked well and I was able to move my VM storage to it. But after a few days of running, it kernel panicked and hung the system, which forced me to reboot.

Upon reboot, during the import of the zpool, I now consistently get the following panic:

Code:
Dec 29 22:29:23 pve1 kernel: PANIC: blkptr at 000000004c4feb84 has invalid TYPE 66
Dec 29 22:29:23 pve1 kernel: Showing stack for process 9353
Dec 29 22:29:23 pve1 kernel: CPU: 6 PID: 9353 Comm: zpool Tainted: P           O      5.4.78-2-pve #1
Dec 29 22:29:23 pve1 kernel: Hardware name: System manufacturer System Product Name/Z170-A, BIOS 3802 03/15/2018
Dec 29 22:29:23 pve1 kernel: Call Trace:
Dec 29 22:29:23 pve1 kernel:  dump_stack+0x6d/0x9a
Dec 29 22:29:23 pve1 kernel:  spl_dumpstack+0x29/0x2b [spl]
Dec 29 22:29:23 pve1 kernel:  vcmn_err.cold.1+0x60/0x94 [spl]
Dec 29 22:29:23 pve1 kernel:  ? zio_execute+0x99/0xf0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? _cond_resched+0x19/0x30
Dec 29 22:29:23 pve1 kernel:  ? __kmalloc+0x197/0x280
Dec 29 22:29:23 pve1 kernel:  ? sg_kmalloc+0x19/0x30
Dec 29 22:29:23 pve1 kernel:  zfs_panic_recover+0x6f/0x90 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? spa_sync_allpools+0x130/0x130 [zfs]
Dec 29 22:29:23 pve1 kernel:  zfs_blkptr_verify+0x265/0x400 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? abd_alloc+0x280/0x480 [zfs]
Dec 29 22:29:23 pve1 kernel:  zio_read+0x42/0xc0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? spa_sync_allpools+0x130/0x130 [zfs]
Dec 29 22:29:23 pve1 kernel:  spa_load_verify_cb+0x186/0x1d0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x1f3/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? arc_read+0x475/0x1020 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_dnode+0xb6/0x1d0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x824/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x359/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? arc_read+0x475/0x1020 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_dnode+0xb6/0x1d0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_visitbp+0x6ab/0x9e0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_impl+0x1e3/0x480 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_dataset_resume+0x46/0x50 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? spa_sync+0xfa0/0xfa0 [zfs]
Dec 29 22:29:23 pve1 kernel:  traverse_pool+0x181/0x1b0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? spa_sync+0xfa0/0xfa0 [zfs]
Dec 29 22:29:23 pve1 kernel:  spa_load+0x1159/0x13b0 [zfs]
Dec 29 22:29:23 pve1 kernel:  spa_load_best+0x57/0x2d0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? zpool_get_load_policy+0x1aa/0x1c0 [zcommon]
Dec 29 22:29:23 pve1 kernel:  spa_import+0x1ea/0x7f0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? nvpair_value_common.part.13+0x14d/0x170 [znvpair]
Dec 29 22:29:23 pve1 kernel:  zfs_ioc_pool_import+0x12d/0x150 [zfs]
Dec 29 22:29:23 pve1 kernel:  zfsdev_ioctl+0x6db/0x8f0 [zfs]
Dec 29 22:29:23 pve1 kernel:  ? lru_cache_add_active_or_unevictable+0x39/0xb0
Dec 29 22:29:23 pve1 kernel:  do_vfs_ioctl+0xa9/0x640
Dec 29 22:29:23 pve1 kernel:  ? handle_mm_fault+0xc9/0x1f0
Dec 29 22:29:23 pve1 kernel:  ksys_ioctl+0x67/0x90
Dec 29 22:29:23 pve1 kernel:  __x64_sys_ioctl+0x1a/0x20
Dec 29 22:29:23 pve1 kernel:  do_syscall_64+0x57/0x190
Dec 29 22:29:23 pve1 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 29 22:29:23 pve1 kernel: RIP: 0033:0x7f058c627427
Dec 29 22:29:23 pve1 kernel: Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48
Dec 29 22:29:23 pve1 kernel: RSP: 002b:00007ffcb5f23498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 29 22:29:23 pve1 kernel: RAX: ffffffffffffffda RBX: 00007ffcb5f23510 RCX: 00007f058c627427
Dec 29 22:29:23 pve1 kernel: RDX: 00007ffcb5f23510 RSI: 0000000000005a02 RDI: 0000000000000003
Dec 29 22:29:23 pve1 kernel: RBP: 00007ffcb5f27400 R08: 0000556970d4ac40 R09: 0000000000000079
Dec 29 22:29:23 pve1 kernel: R10: 0000556970d1a010 R11: 0000000000000246 R12: 0000556970d1b430
Dec 29 22:29:23 pve1 kernel: R13: 0000556970d306c0 R14: 0000000000000000 R15: 0000000000000000

pvesm then hangs and I need to reboot again.

I scoured the forum and disabled the zfs-import service, which lets me avoid the kernel panic and regain control of my Proxmox host. I restored my VM backups, so nothing is lost.
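
In case it helps anyone else, what I ran was roughly the following (the exact unit names may differ on other setups, so double-check before copying):

Code:
# keep the pool from being imported automatically at boot
systemctl disable zfs-import-cache.service zfs-import-scan.service
# verify nothing else will re-import it
systemctl list-unit-files | grep zfs-import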

The zpool seems healthy:
Code:
   pool: vmdisks
     id: 7025696728242074529
  state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        vmdisks                                           ONLINE
          raidz1-0                                        ONLINE
            ata-PNY_CS900_240GB_SSD_PNY43191910250107DA1  ONLINE
            ata-SanDisk_SSD_PLUS_240GB_1835AF801426       ONLINE
            ata-Crucial_CT256M550SSD1_14150DF074A0        ONLINE

I'm about to recreate the zpool, but I would like to know if there is something obvious I'm doing wrong, and whether I'm likely to hit the same problem again.

Here is the pveversion output:
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Thanks a lot for any hints!

Some info about my system. It's all consumer grade.
[screenshot of the system's hardware summary (consumer-grade Z170-A build)]
 
Some info about my system. It's all consumer grade.
... Which might actually be the problem.
ECC memory is highly recommended for ZFS for a reason.
Maybe you have a misbehaving memory module you are not aware of.

Your SSDs and/or SATA controller might also be misbehaving.
 
ECC memory has nothing to do with it.

It seems more like a broken disk; ZFS is software-based and sometimes can't prevent kernel panics if the drive controller goes nuts.

Please post the SMART values of every drive, e.g. "smartctl -a /dev/sda". This can tell whether a disk is bad; otherwise it's cables, the RAID controller, etc.
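
If you want to grab them all in one go, a quick loop like this does it (assuming every drive shows up as /dev/sd?):

Code:
# dump full SMART data for every SATA drive
for d in /dev/sd?; do
    echo "===== $d ====="
    smartctl -a "$d"
done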
 
Thanks @H4R0, I'll check the cables.

Here's the smartctl -a output for each of the disks; it seems there's no issue:
https://pastebin.com/wip5Uwan

How are the drives connected? Via an HBA / RAID controller, or directly to the motherboard?

They show an unexpected power loss count that is quite high.

The Crucial also has some read errors.


Power off the server and replug the drives on both ends.

Try to import the pool; if it doesn't work, unplug the Crucial SSD and try again.
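
Something like this from the shell (pool name taken from your output above; the read-only attempt is just a precaution, not required):

Code:
zpool import                          # list pools the system can see
zpool import -o readonly=on vmdisks   # optional: try a read-only import first
zpool export vmdisks                  # export again if the read-only import worked
zpool import vmdisks                  # normal import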
 
Thanks @H4R0. I tried reconnecting everything but got the same kernel panic. I'll try disconnecting the Crucial. In the meantime I've ordered another SSD, just in case.

I'll update the post after more tests.

Thanks and happy new year!
 
Hello,

I run Proxmox v6.3 and I created a raidz-1 (3 disks with 1 parity) pool from these 3 SSDs:
[screenshot of the SSDs from the first post, showing the wearout column]
46% wearout on your sdp drive? That looks really bad; AFAIK consumer SSDs only have a reserve of around 15% before they die.

In the past I ran a small Proxmox server with four Samsung 850s and they wore out in about 4 months with just a few containers running. I switched to Intel D3-S4510 SSDs and the wearout is still at 0% after 5 months :)
 
Happy new year!

OK, I removed the Crucial but still got the kernel panic. I'll delete the pool, try to recreate one, and investigate further.

Thanks!
 
Just a quick update.

I ordered a replacement for the suspect Crucial, but the system eventually still panicked when using raidz.
I tried rotating a raidz1 across 3 of my 4 disks, but it always led to the same panic.

Standalone zpools work fine with good performance. ZFS RAID10 also worked fine, but was very, very slow.
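
(By ZFS RAID10 I mean striped mirrors, i.e. a layout roughly like the one below; the device names are just placeholders, not my actual disks.)

Code:
# two mirrored pairs striped together ("RAID10" in ZFS terms)
zpool create tank \
  mirror /dev/disk/by-id/DISK1 /dev/disk/by-id/DISK2 \
  mirror /dev/disk/by-id/DISK3 /dev/disk/by-id/DISK4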

I eventually decided to fall back to non-ZFS storage and played around with an LVM RAID5 thin pool; it was fun but too complicated for my taste.

In the end I spun up an mdadm RAID5 with LVM on top, and that's what I settled on.
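
For completeness, the setup is essentially the following (device names and disk count are placeholders, not my exact layout):

Code:
# software RAID5 across the SSDs
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
# LVM on top of the md array
pvcreate /dev/md0
vgcreate vmdata /dev/md0
lvcreate -l 100%FREE -n vmdisks vmdata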

So, bad luck. I'll still try the other ZFS pools I had planned, but maybe ZFS just isn't for my setup.

Thank you very much.
 
