PBS freezes, where to start debugging? (ZFS scrub)

Fra

The hardware has been checked and is fine; any suggestions on how to debug?

The server is fine for 4-5 hours, then freezes: it responds to ping and to telnet on port 22, but I cannot SSH in.
Nothing in the ordinary logs.

Backup Server 2.3-2, upgrading it to the latest now.

Some links to external Debian debugging resources would help too.

4 HDDs, 16 TB each, all in ZFS.

thanks
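One generic first step on Debian for freezes like this is making the journal persistent, so whatever gets logged right before the hang survives the reboot. A minimal sketch (plain systemd tooling, nothing PBS-specific):

Code:
# keep journald logs across reboots so the next freeze leaves a trace
mkdir -p /var/log/journal
systemctl restart systemd-journald
# after the next freeze + reset, inspect the previous boot:
journalctl -b -1 -p warning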
 
Several of these show up in the kernel log:


Code:
Feb 16 11:56:57 pbs-fi kernel: [ 2704.671645] watchdog: BUG: soft lockup - CPU#4 stuck for 756s! [spl_kmem_cache:355]


Code:
Feb 16 11:43:52 pbs-fi kernel: [ 1919.473268] ------------[ cut here ]------------
Feb 16 11:43:52 pbs-fi kernel: [ 1919.474121] NETDEV WATCHDOG: enp7s0 (igb): transmit queue 2 timed out
Feb 16 11:43:52 pbs-fi kernel: [ 1919.474804] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x277/0x280
Feb 16 11:43:52 pbs-fi kernel: [ 1919.475485] Modules linked in: binfmt_misc wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bonding tls intel_rapl_msr intel_rapl_common ast drm_vram_helper amd64_edac drm_ttm_helper edac_mce_amd ttm kvm_amd drm_kms_helper kvm cec rc_core fb_sys_fops syscopyarea sysfillrect irqbypass crct10dif_pclmul sysimgblt ghash_clmulni_intel ccp k10temp aesni_intel crypto_simd cryptd eeepc_wmi serio_raw mac_hid rapl asus_wmi platform_profile efi_pstore sparse_keymap video pcspkr wmi_bmof ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO)
Feb 16 11:43:52 pbs-fi kernel: [ 1919.475528]  znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb crc32_pclmul psmouse igb xhci_pci i2c_piix4 xhci_pci_renesas i2c_algo_bit dca ahci libahci xhci_hcd wmi gpio_amdpt gpio_generic
Feb 16 11:43:52 pbs-fi kernel: [ 1919.482607] CPU: 1 PID: 0 Comm: swapper/1 Tainted: P           O L    5.15.85-1-pve #1
Feb 16 11:43:52 pbs-fi kernel: [ 1919.483296] Hardware name: ASUS System Product Name/Pro WS 565-ACE, BIOS 2424 09/29/2021
Feb 16 11:43:52 pbs-fi kernel: [ 1919.483969] RIP: 0010:dev_watchdog+0x277/0x280
Feb 16 11:43:52 pbs-fi kernel: [ 1919.484642] Code: eb 97 48 8b 5d d0 c6 05 26 82 4d 01 01 48 89 df e8 5e 53 f9 ff 44 89 e1 48 89 de 48 c7 c7 f0 a9 0a 84 48 89 c2 e8 cd 99 1c 00 <0f> 0b eb 80 e9 45 ca 25 00 0f 1f 44 00 00 55 49 89 ca 48 89 e5 41
Feb 16 11:43:52 pbs-fi kernel: [ 1919.486016] RSP: 0018:ffffb53900284e70 EFLAGS: 00010282
Feb 16 11:43:52 pbs-fi kernel: [ 1919.486718] RAX: 0000000000000000 RBX: ffff9ed3c1afc000 RCX: ffff9ee2aea60588
Feb 16 11:43:52 pbs-fi kernel: [ 1919.487386] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9ee2aea60580
Feb 16 11:43:52 pbs-fi kernel: [ 1919.488048] RBP: ffffb53900284ea8 R08: 0000000000000003 R09: 0000000000000001
Feb 16 11:43:52 pbs-fi kernel: [ 1919.488701] R10: ffff9ed3c0f95230 R11: ffffffffc0ebb0c0 R12: 0000000000000002
Feb 16 11:43:52 pbs-fi kernel: [ 1919.489365] R13: ffff9ed3d05f7940 R14: 0000000000000008 R15: ffff9ed3c1afc4c0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.490054] FS:  0000000000000000(0000) GS:ffff9ee2aea40000(0000) knlGS:0000000000000000
Feb 16 11:43:52 pbs-fi kernel: [ 1919.490722] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 16 11:43:52 pbs-fi kernel: [ 1919.491390] CR2: 00007f606bf65390 CR3: 000000011f708000 CR4: 0000000000350ee0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.492063] Call Trace:
Feb 16 11:43:52 pbs-fi kernel: [ 1919.492725]  <IRQ>
Feb 16 11:43:52 pbs-fi kernel: [ 1919.493384]  ? pfifo_fast_enqueue+0x160/0x160
Feb 16 11:43:52 pbs-fi kernel: [ 1919.494071]  call_timer_fn+0x2b/0x120
Feb 16 11:43:52 pbs-fi kernel: [ 1919.494729]  __run_timers.part.0+0x1e1/0x270
Feb 16 11:43:52 pbs-fi kernel: [ 1919.495383]  ? ktime_get+0x46/0xc0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.496034]  ? native_x2apic_icr_read+0x20/0x20
Feb 16 11:43:52 pbs-fi kernel: [ 1919.496690]  ? lapic_next_event+0x21/0x30
Feb 16 11:43:52 pbs-fi kernel: [ 1919.497348]  ? clockevents_program_event+0xab/0x130
Feb 16 11:43:52 pbs-fi kernel: [ 1919.498038]  run_timer_softirq+0x2a/0x60
Feb 16 11:43:52 pbs-fi kernel: [ 1919.498691]  __do_softirq+0xd9/0x2ea
Feb 16 11:43:52 pbs-fi kernel: [ 1919.499354]  irq_exit_rcu+0x94/0xc0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.500008]  sysvec_apic_timer_interrupt+0x80/0x90
Feb 16 11:43:52 pbs-fi kernel: [ 1919.500670]  </IRQ>
Feb 16 11:43:52 pbs-fi kernel: [ 1919.501329]  <TASK>
Feb 16 11:43:52 pbs-fi kernel: [ 1919.502008]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
Feb 16 11:43:52 pbs-fi kernel: [ 1919.502667] RIP: 0010:cpuidle_enter_state+0xd9/0x620
Feb 16 11:43:52 pbs-fi kernel: [ 1919.503330] Code: 3d 44 3f bf 7c e8 77 ff 6d ff 49 89 c7 0f 1f 44 00 00 31 ff e8 b8 0c 6e ff 80 7d d0 00 0f 85 5e 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6a 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e5 03 00 00
Feb 16 11:43:52 pbs-fi kernel: [ 1919.504707] RSP: 0018:ffffb53900157e38 EFLAGS: 00000246
Feb 16 11:43:52 pbs-fi kernel: [ 1919.505420] RAX: ffff9ee2aea70bc0 RBX: ffff9ed3c4751400 RCX: 0000000000000000
Feb 16 11:43:52 pbs-fi kernel: [ 1919.506159] RDX: 00000000000000e6 RSI: 00000000239f52d0 RDI: 0000000000000000
Feb 16 11:43:52 pbs-fi kernel: [ 1919.506867] RBP: ffffb53900157e88 R08: 000001bee985a76d R09: 0000000000008ca0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.507568] R10: 0000000000000003 R11: 071c71c71c71c71c R12: ffffffff848e6ec0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.508259] R13: 0000000000000001 R14: 0000000000000001 R15: 000001bee985a76d
Feb 16 11:43:52 pbs-fi kernel: [ 1919.508930]  ? cpuidle_enter_state+0xc8/0x620
Feb 16 11:43:52 pbs-fi kernel: [ 1919.509589]  cpuidle_enter+0x2e/0x50
Feb 16 11:43:52 pbs-fi kernel: [ 1919.510261]  do_idle+0x20d/0x2b0
Feb 16 11:43:52 pbs-fi kernel: [ 1919.510880]  cpu_startup_entry+0x20/0x30
Feb 16 11:43:52 pbs-fi kernel: [ 1919.511480]  start_secondary+0x12a/0x180
Feb 16 11:43:52 pbs-fi kernel: [ 1919.512062]  secondary_startup_64_no_verify+0xc2/0xcb
Feb 16 11:43:52 pbs-fi kernel: [ 1919.512631]  </TASK>
Feb 16 11:43:52 pbs-fi kernel: [ 1919.513175] ---[ end trace 249b1209234ec1c7 ]---
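If the box keeps locking up like this, one option (assuming kdump-tools is installed and configured, which is not shown here) is to turn these lockups into a panic so a crash dump gets captured for later analysis; a sketch:

Code:
# turn soft lockups / hung tasks into a panic so kdump can capture a crash dump
sysctl -w kernel.softlockup_panic=1
sysctl -w kernel.hung_task_panic=1
# dumps written by kdump-tools end up under /var/crash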
 
I see a lot of `wa` (I/O wait) in top.

Backups are pretty slow too.

I also see that there has been a "scrub in progress" since 4 days ago (!):


Code:
root:~# zpool status -v rpool
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Sun Feb 12 00:24:02 2023
    2.60T scanned at 3.48G/s, 6.67M issued at 8.92K/s, 12.2T total
    0B repaired, 0.00% done, no estimated completion time
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      raidz2-0                                       ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0S0FVNG-part3  ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1E0A009FVNG-part3  ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0FPFVNG-part3  ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0UWFVNG-part3  ONLINE       0     0     0

errors: No known data errors
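High `wa` means the CPUs are mostly waiting on I/O; with the sysstat package installed, iostat shows which disks are saturated while the scrub runs, for example:

Code:
# per-device utilization, 5-second intervals, 3 samples; watch %util and await
iostat -xz 5 3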
 
dmesg is not showing anything, and the SMART output is fine.

Do we just need to wait for the scrub to finish?
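For reference, a running scrub does not have to be waited out: OpenZFS lets you pause it while backups run and resume it later, e.g.:

Code:
zpool scrub -p rpool   # pause the running scrub
zpool scrub rpool      # resume it later from where it left off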
 
omg

Code:
root:~# zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Feb 12 00:24:02 2023
    5.42T scanned at 2.52G/s, 236G issued at 110M/s, 12.2T total
    192K repaired, 1.89% done, 1 days 07:40:17 to go
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      raidz2-0                                       ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0S0FVNG-part3  ONLINE       0     0     2  (repairing)
        ata-TOSHIBA_MG08ACA16TEY_X1E0A009FVNG-part3  ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0FPFVNG-part3  ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0UWFVNG-part3  ONLINE       0     0     1  (repairing)

errors: No known data errors
 
There are some tips in the suggested https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P

Btw, 30 minutes after my last post I now see:

Code:
root@pbs-fi:/var/log# zpool status -x
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Feb 12 00:24:02 2023
    6.06T scanned at 1.49G/s, 841G issued at 206M/s, 12.2T total
    704K repaired, 6.75% done, 16:01:40 to go
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      raidz2-0                                       ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0S0FVNG-part3  ONLINE       0     0     4  (repairing)
        ata-TOSHIBA_MG08ACA16TEY_X1E0A009FVNG-part3  ONLINE       0     0     2  (repairing)
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0FPFVNG-part3  ONLINE       0     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0UWFVNG-part3  ONLINE       0     0     5  (repairing)

errors: No known data errors
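The ZFS-8000-9P page basically boils down to: let the scrub finish, then clear the error counters and watch whether they come back; roughly:

Code:
# once the scrub has finished and the affected blocks are repaired:
zpool clear rpool
# keep an eye on the counters afterwards
zpool status -v rpool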
 
btw

Code:
# lscpu  | grep -i model
Model:                           113
Model name:                      AMD Ryzen 5 3600 6-Core Processor
 
With checksum errors spread across multiple disks I would run memtest86+ to see if your RAM is corrupting data, which could also explain the system instability.
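Besides a full memtest86+ pass, a quick check from the running system is possible via the kernel's EDAC counters (assuming the board/CPU expose them, and, for the second command, that the edac-utils package is installed):

Code:
# look for ECC/EDAC related messages
dmesg | grep -iE 'edac|ecc'
# per-DIMM corrected/uncorrected error counters
edac-util -v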
 
Thanks.

Btw, I got this and then it crashed again:

Code:
root@rescue ~ # zpool status
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Feb 12 00:24:02 2023
    8.46T scanned at 1.64G/s, 3.31T issued at 656M/s, 12.2T total
    3.13M repaired, 27.25% done, 03:55:43 to go
config:

    NAME                                             STATE     READ WRITE CKSUM
    rpool                                            ONLINE       0     0     0
      raidz2-0                                       ONLINE       1     0     0
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0S0FVNG-part3  ONLINE       0     0    16  (repairing)
        ata-TOSHIBA_MG08ACA16TEY_X1E0A009FVNG-part3  ONLINE       0     0    13  (repairing)
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0FPFVNG-part3  ONLINE       0     0    12  (repairing)
        ata-TOSHIBA_MG08ACA16TEY_X1D0A0UWFVNG-part3  ONLINE       0     0    10  (repairing)

errors: No known data errors
 
So, we ran some CPU load tests and some memory tests (in Hetzner's rescue OS) and checked for ECC errors: the server stayed up for several hours with no issues. Then we installed ZFS and, while installing it, it froze again (it still responded to ping, but SSH timed out). So this is definitely not Proxmox-related... and I'm not even sure it is ZFS-related.
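For reference, a combined CPU + memory load test from the rescue system can be done with stress-ng; this is just one possible invocation, not necessarily what was run here:

Code:
# load all CPUs and two memory workers (~40% of RAM each) for an hour, then print a short summary
stress-ng --cpu 0 --vm 2 --vm-bytes 40% --timeout 1h --metrics-brief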
 
So, pretty weird so far: it turns out the server does not freeze; it's just the SSH login that stops working.


Indeed, I was able to log in to the local console (using KVM);
then I started an SSH server on a *custom* port and could log in over SSH (!)


So, SSH on port 22 does not work (even if restarted),

but another SSH server started by me works and lets me SSH into the server, e.g.

/usr/sbin/sshd -p 999

So, somehow the server does not completely freeze, it just blocks SSH.
Now at least I can wait for the zpool scrub to finish.
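To narrow down why the stock sshd on port 22 stops answering while a second instance works, something along these lines can help (port 2222 and the hostname are just placeholders):

Code:
# is anything still listening on 22, and which process owns it?
ss -tlnp | grep ':22 '
systemctl status ssh
# run a one-off sshd in debug mode on a spare port and watch the handshake
/usr/sbin/sshd -d -p 2222
# then connect verbosely from outside
ssh -vvv -p 2222 root@<server>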
 
Maybe this has something to do with it:

Code:
Feb 17 07:39:01 rescue systemd-udevd[1527]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Feb 17 07:39:01 rescue kernel: znvpair: module license 'CDDL' taints kernel.
Feb 17 07:39:01 rescue kernel: Disabling lock debugging due to kernel taint
Feb 17 07:39:02 rescue kernel: ZFS: Loaded module v2.1.9-1, ZFS pool version 5000, ZFS filesystem version 5


Update: https://itsfoss.com/linus-torvalds-zfs/ (so the "taints kernel" message is 'just' a license issue)
 
Code:
root@rescue /var/log # uname -r
6.1.4

root@rescue /var/log # cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

root@rescue /var/log #  lsmod | grep -i znvpair
znvpair                61440  2 zfs,zcommon
spl                    81920  6 zfs,icp,zzstd,znvpair,zcommon,zavl

root@rescue /var/log # modinfo znvpair
filename:       /lib/modules/6.1.4/extra/nvpair/znvpair.ko
version:        2.1.9-1
license:        CDDL
author:         OpenZFS
description:    Generic name/value pair implementation
srcversion:     0A3BE0F1F05389ACC190C95
depends:        spl
retpoline:      Y
name:           znvpair
vermagic:       6.1.4 SMP mod_unload modversions
 
So, the scrub was almost done, and then the server really did freeze (kernel panic).

[screenshot: kernel panic on the console]

So I rebooted, stopped the scrub, and will now migrate everything to a new server :(
It will be the same hardware, so let's hope it's not a BIOS/Ryzen/ZFS issue, or we'll hit this again :(
(Sorry for the noise.)
 
Luckily the new hardware has different disks: Seagate Exos X18 vs. Toshiba MG08.

Btw, this is the zpool history on the "faulty" server: the setup was done with an ordinary PBS ISO install, then we created the storage pbs-local:

Code:
# zpool history
History for 'rpool':
2022-12-15.10:23:27 zpool create -f -o cachefile=none -o ashift=12 rpool raidz2 /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TEY_X1D0A0S0FVNG-part3 /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TEY_X1E0A009FVNG-part3 /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TEY_X1D0A0FPFVNG-part3 /dev/disk/by-id/ata-TOSHIBA_MG08ACA16TEY_X1D0A0UWFVNG-part3
2022-12-15.10:23:27 zfs create rpool/ROOT
2022-12-15.10:23:27 zfs create rpool/ROOT/pbs-1
2022-12-15.10:23:27 zfs set atime=on relatime=on rpool
2022-12-15.10:23:27 zfs set compression=on rpool
2022-12-15.10:23:27 zfs set sync=disabled rpool
2022-12-15.10:24:34 zfs set sync=standard rpool
2022-12-15.10:24:34 zfs set mountpoint=/ rpool/ROOT/pbs-1
2022-12-15.10:24:34 zpool set bootfs=rpool/ROOT/pbs-1 rpool
2022-12-15.10:24:34 zpool export rpool
2022-12-15.10:25:44 zpool import -N rpool
2022-12-15.11:14:47 zpool import -N rpool
2022-12-15.11:28:58 zfs create rpool/pbs-local
2022-12-15.11:32:14 zfs set mountpoint=/rpool/pbs-local rpool/pbs-local
2022-12-15.11:35:52 zfs set quota=500G rpool/ROOT
2022-12-15.11:37:04 zfs set quota=27T rpool/pbs-local
2022-12-15.11:46:05 zpool import -N rpool
2022-12-16.11:13:13 zpool import -N rpool
2022-12-16.11:40:20 zpool import -N rpool
2022-12-21.12:24:26 zpool import -N rpool
2022-12-21.14:25:17 zpool import -N rpool
2022-12-22.10:35:54 zpool import -N rpool
2023-01-08.00:24:02 zpool scrub rpool
2023-01-08.08:27:00 zpool import -N rpool
2023-01-08.08:31:08 zpool import -N rpool
2023-01-08.08:47:17 zpool import -N rpool
2023-02-12.00:24:06 zpool scrub rpool
2023-02-12.14:32:44 zpool import -N rpool
2023-02-16.00:04:22 zpool import -N rpool
2023-02-16.08:50:23 zpool import -N rpool
2023-02-16.09:16:47 zpool import -N rpool
2023-02-16.09:29:58 zpool import -N rpool
2023-02-16.11:12:27 zpool import -N rpool
2023-02-16.12:14:14 zpool import -N rpool
2023-02-16.14:14:45 zpool import -N rpool
2023-02-16.16:02:44 zpool import -N rpool
2023-02-16.20:29:12 zpool import -N rpool
2023-02-16.22:07:09 zpool import -N rpool
2023-02-16.22:58:33 zpool import -f rpool
2023-02-17.00:08:41 zpool import -f rpool
2023-02-17.09:16:19 zpool import -f rpool
2023-02-17.14:25:56 zpool import -f rpool
2023-02-17.14:34:52 zpool scrub -s rpool
2023-02-17.14:36:20 zpool import -N rpool
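For completeness, the pbs-local part of that history corresponds to roughly these manual steps; the last command, registering the dataset as a PBS datastore, is assumed, since datastore creation does not show up in zpool history:

Code:
zfs create rpool/pbs-local
zfs set mountpoint=/rpool/pbs-local rpool/pbs-local
zfs set quota=27T rpool/pbs-local
# make it a PBS datastore (assumed step, not recorded in the pool history)
proxmox-backup-manager datastore create pbs-local /rpool/pbs-local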
 
Hmm... if the memory seemed okay, and quite a few of the disks showed problems, it could have been a few things:
- disk themselves
- controller
- cables to the disk
- too old BIOS / firmware
...

I hope it won't happen on the new machine :)
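Quick ways to check a few of those items from a live or rescue system (a rough sketch; /dev/sdX stands for each member disk):

Code:
# disk health and cabling (UDMA CRC errors usually point at cables/backplane)
smartctl -a /dev/sdX | grep -iE 'reallocated|pending|crc'
# controller / link problems show up as ata resets or I/O errors
dmesg | grep -iE 'ata[0-9]+|i/o error'
# current BIOS level
dmidecode -s bios-version; dmidecode -s bios-release-date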
 
The BIOS has been updated

Code:
[    0.000000] DMI: ASUS System Product Name/Pro WS 565-ACE, BIOS 9901 10/13/2022

but the issue is still there, so now we are really moving to new hardware. We'll keep exactly the same setup; let's hope we don't hit the same issue again.
 
So, we moved our data to a new server (same hardware model at Hetzner, but I see it has different disks: Seagate Exos X18 vs. Toshiba). The scrub there was fine, without any corrections.

So, apparently the reason was some faulty hardware we were not able to spot.


We'll give the hardware back to Hetzner in a couple of days: if any of you want to suggest tests we could run on the old server (even destructive ones) to dig into this issue, we're available.
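For anyone doing a similar move, one straightforward way to copy a datastore between servers is a recursive zfs send/receive (dataset and host names below are just placeholders, not necessarily what was done here):

Code:
zfs snapshot -r rpool/pbs-local@migrate
zfs send -R rpool/pbs-local@migrate | ssh root@new-server zfs receive -F rpool/pbs-local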
 