Wanted: best practice to find root cause of system crash during ZFS scrub

stefre

New Member
Sep 10, 2024
I am looking for documentation on best practices for investigating system crashes during ZFS scrubs.

Context: about 5% into the scrub the whole system becomes unresponsive and needs a power off/on to recover. Once the system is up again, the scrub continues and succeeds. The ZFS pool remains valid, with no errors found. During the scrub, core temps rise from about 40°C to about 60°C, but that is within a reasonable range.
I have configured the scrub to run once a quarter for now, but I would like to find the root cause.

Any idea/link on a best practice?
 
becomes unresponsive and needs a power off/on to recover
Basically, this should not happen. Please tell us a little more about your hardware:

  • CPU
  • RAM
  • ZFS RAID level, how many drives?
  • SSDs and/or HDDs?
  • Which manufacturer/type are the drives?
  • Has it ever worked normally before?

And the output of:
Code:
zpool status
zpool list
pveversion -v
 
CPU: i9-12900H
RAM: 96GB (2x48GB) Crucial CT2K48G56C46S5

ZFS: striped mirrors (RAID10) plus a mirrored SLOG (ZIL)
SSDs:
a) System on Samsung 990 Pro 2TB
b) SLOG on 2x WD Red SN700 2TB
Disks: 4x10TB HGST Ultrastar He10-512e in QNAP TL-R400S

Yes, it did work normally before. I noticed "just" 2 crashes, both a few percent into the scrub. Therefore, I concluded there is a chance this is not just coincidence. The system had a CPU load of about 5% when the scrub started. Memory usage was at about 70%; most of it likely ZFS cache.
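(For what it's worth, I believe the ARC share of that memory could be checked with something like the following; the values from arcstats are in bytes:)

Code:
free -h                                                             # overall memory picture
awk '/^size|^c_max/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats    # current ARC size vs. ARC limit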

The pool looks completely healthy to me. But see below for yourself.

I do not really expect to get a solution for my problem here, just some hint on which direction to investigate, e.g. any logs I could look at.

-------

$ zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pile  18.2T  6.23T  12.0T        -         -     0%    34%  1.00x    ONLINE  -


-------

$ zpool status
  pool: pile
 state: ONLINE
  scan: scrub repaired 0B in 1 days 00:54:18 with 0 errors on Mon Sep 9 01:18:19 2024
config:

        NAME                                           STATE     READ WRITE CKSUM
        pile                                           ONLINE       0     0     0
          mirror-0                                     ONLINE       0     0     0
            wwn-0x5000cca266e0a76c                     ONLINE       0     0     0
            wwn-0x5000cca266e18655                     ONLINE       0     0     0
          mirror-2                                     ONLINE       0     0     0
            wwn-0x5000cca251f47b27                     ONLINE       0     0     0
            wwn-0x5000cca266c86d66                     ONLINE       0     0     0
        logs
          mirror-1                                     ONLINE       0     0     0
            nvme-eui.e8238fa6bf530001001b448b47ad75c0  ONLINE       0     0     0
            nvme-eui.e8238fa6bf530001001b448b47ad759a  ONLINE       0     0     0

errors: No known data errors

--------

$ pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.12-1-pve)
pve-manager: 8.2.4 (running version: 8.2.4/faa83925c9641325)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.12-1
proxmox-kernel-6.8.12-1-pve-signed: 6.8.12-1
proxmox-kernel-6.8.8-2-pve-signed: 6.8.8-2
proxmox-kernel-6.8.8-1-pve-signed: 6.8.8-1
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.2
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.9
libpve-storage-perl: 8.2.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.1.12
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.1
pve-firewall: 5.0.7
pve-firmware: 3.13-1
pve-ha-manager: 4.0.5
pve-i18n: 3.2.2
pve-qemu-kvm: 9.0.2-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.4-pve1
 
Thanks for the info. ZFS basically looks good. Since the system is running on the 990 Pro, there shouldn't be any slowdowns/crashes during the scrub (even if the 2nd pool has a problem).

Have you checked the SMART values of your drives manually?

Code:
smartctl -a /dev/disk/by-id/<your devices>

Please also check the journal; maybe you will find something there, e.g.:

Code:
journalctl -f # live log during the scrub
journalctl -r -p3 # reverse log, priority 3 and up (errors and worse)
dmesg -l warn
dmesg -l err
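Since the box needs a hard power-off, the interesting messages will be in the journal of the boot before the crash. Assuming journald keeps persistent logs (i.e. /var/log/journal exists, which is the Debian default), something like this should show them:

Code:
journalctl --list-boots       # list known boots and their indices
journalctl -b -1 -e           # jump to the end of the previous boot's log
journalctl -b -1 -p warning   # warnings and worse from the previous boot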

man journalctl:

-p, --priority=
Filter output by message priorities or priority ranges. Takes either a single numeric or textual log level (i.e. between 0/"emerg" and 7/"debug"), or a
range of numeric/text log levels in the form FROM..TO. The log levels are the usual syslog log levels as documented in syslog(3), i.e. "emerg" (0),
"alert" (1), "crit" (2), "err" (3), "warning" (4), "notice" (5), "info" (6), "debug" (7). If a single log level is specified, all messages with this
log level or a lower (hence more important) log level are shown. If a range is specified, all messages within the range are shown, including both the
start and the end value of the range. This will add "PRIORITY=" matches for the specified priorities.

into the scrub the whole system becomes unresponsive and needs a power off/on to recover.

What does “top” look like at that time? E.g. I/O wait or other values. Are there perhaps zombie processes?
 
OK, thanks. I was hoping for some "ZFS-specific" log files, but will check the standard logs. Should have done that anyway, sorry ... :-/
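(If I read the man pages correctly, ZFS does keep a couple of logs of its own, though they are not crash-focused:)

Code:
zpool events -v         # in-kernel ZFS event log (not persistent across reboots)
zpool history pile      # administrative command history for the pool
journalctl -u zfs-zed   # messages from the ZFS event daemon, if zed is running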

Regarding "top": Once the system becomes unresponsive, I can't ssh into it anymore. Means: no idea if there are any I/O waits or zombie processes. But ... thinking about it ... MeshCommander might still be able to connect ... and I could write an ugly little debug script to report/publish these things frequently. Like to an external MQTT broker. hmmm ...
 
dmesg -l err:
no output

dmesg -l warn:
loads of these:
Code:
[   38.081618] i40e 0000:03:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
[   38.096762] i40e 0000:03:00.0: Error I40E_AQ_RC_ENOSPC adding RX filters on PF, promiscuous mode forced on
apparently I need to work on /etc/network/interfaces ... but likely not related to my initial problem

a few of those:
Code:
[  499.651664] EXT4-fs warning (device loop0): ext4_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.

journalctl -r -p3:
loads of these:
Code:
Sep 11 09:59:51 pve smartd[1951]: Device: /dev/sdc [SAT], 6 Offline uncorrectable sectors
So I guess that is a hint that one of the 4 disks has bad sectors which have been taken offline.

smartctl -a /dev/disk/by-id/... :
basically shows they are old disks but in an "OK" state:

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       266
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       39857
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       145
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
 45 Unknown_Attribute       0x0023   100   100   001    Pre-fail  Always       -       1095233372415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       876
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       876
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 25/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       3954151115908
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       5898489953220

SMART Error Log Version: 1
No Errors Logged
 
apparently I need to work on /etc/network/interfaces ... but likely not related to my initial problem
Right.

[ 499.651664] EXT4-fs warning (device loop0): ext4_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.
Looks to me as if this is caused by the system freezing. Apparently from a container running there? But probably not relevant to our topic here.

The smartctl output helps us more... Offline uncorrectable sectors are areas on a hard drive that have been identified as unusable during self-tests, but have not yet been remapped to spare sectors. These sectors are marked as "offline" because they cannot be read or written, and the drive has not attempted to recover data from them.
Unreadable sectors are a major sign that the drive is on its way out. Drives can die without showing bad sectors beforehand, but once a drive starts showing this kind of error, it's almost guaranteed that it's not long for this world.

Are you sure this smartctl output is from the right drive? Because all of these values:
Current_Pending_Sector, Reallocated_Sector_Ct and Offline_Uncorrectable
show a raw value of 0.
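Something like this (just a sketch; adjust the glob to your device names) would quickly confirm which by-id path /dev/sdc actually is and compare those three attributes across all four pool disks:

Code:
ls -l /dev/disk/by-id/ | grep -w sdc   # map /dev/sdc to its wwn-/ata- name
for d in /dev/disk/by-id/wwn-0x5000cca*; do
    echo "== $d"
    smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done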

Please run a long SMART test on all drives in the zpool.

Code:
smartctl -t long /dev/disk/by-id/<your device>

The tests will take a while. You can check the progress status with this command:
Code:
smartctl -a /dev/disk/by-id/<your device> | grep remaining
 
Thanks a lot.
So yes, below is the complete output of the short smartctl for the drive in question, and it shows the 6 offline sectors but also 2 errors that occurred a very long time ago. I think these 2 errors are actually not related to the issue, though.

The long SMART test is still running for a few more hours and is at 50%... but I wanted to share the short output already:

Code:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)

Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HUH721010ALE601
Serial Number:    7JGLJPYC
LU WWN Device Id: 5 000cca 266c86d66
Firmware Version: LHGL0003
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 11 17:29:14 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   93) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   134   134   054    Pre-fail  Offline      -       96
  3 Spin_Up_Time            0x0007   148   148   024    Pre-fail  Always       -       444 (Average 447)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       179
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       45272
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       100
 45 Unknown_Attribute       0x0023   100   100   001    Pre-fail  Always       -       1095233372415
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       886
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       886
194 Temperature_Celsius     0x0002   146   146   000    Old_age   Always       -       41 (Min/Max 21/56)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       4247860164822
242 Total_LBAs_Read         0x0012   100   100   000    Old_age   Always       -       6037479380466

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 1260 hours (52 days + 12 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c8 28 1d 7c 40  Error: UNC 200 sectors at LBA = 0x007c1d28 = 8133928

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 f0 1a 7c e0 08  14d+14:11:35.829  READ DMA EXT
  25 00 00 f0 1e 7c e0 08  14d+14:11:35.819  READ DMA EXT
  27 00 00 00 00 00 e0 08  14d+14:11:35.812  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 08  14d+14:11:35.812  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  14d+14:11:35.762  SET FEATURES [Set transfer mode]

Error 1 occurred at disk power-on lifetime: 1260 hours (52 days + 12 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c8 28 1d 7c 40  Error: UNC 200 sectors at LBA = 0x007c1d28 = 8133928

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 f0 1a 7c e0 08  14d+14:11:35.746  READ DMA EXT
  25 00 08 30 9c c5 e0 08  14d+14:11:35.730  READ DMA EXT
  25 00 00 30 98 c5 e0 08  14d+14:11:35.719  READ DMA EXT
  25 00 00 30 94 c5 e0 08  14d+14:11:35.708  READ DMA EXT
  35 00 08 c8 5d bd e0 08  14d+14:11:35.705  WRITE DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive       Completed without error       00%     42461         -
# 2  Short captive       Completed without error       00%     42455         -
# 3  Vendor (0x70)       Completed without error       00%     42406         -
# 4  Vendor (0x71)       Completed without error       00%     42406         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay
 
Even if the long test does not find any faults, that does not mean the drive is OK. Very often the manual tests find no errors even though the drive is already damaged.
The output above does show errors, so I would replace the drive as soon as possible.
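When the replacement disk is in, the mirror can be rebuilt with zpool replace, roughly like this (assuming wwn-0x5000cca266c86d66 is indeed the suspect drive; the new device name is a placeholder):

Code:
zpool replace pile wwn-0x5000cca266c86d66 /dev/disk/by-id/<new-disk-id>
zpool status pile   # watch the resilver progress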
 
