Complete system lockup - Seems to be ZFS, but no persistent logs?

ahorner

Member
Dec 21, 2023
Hi there,

I am running Proxmox on a dedicated server with OVH, and under high disk load, the whole system seems to be taken out. Sometimes just for a few seconds, sometimes a few minutes, and more often than not completely requiring a hard reset of the system.

When this happens, I can't see any specific logs which indicate what happened, just a spike in disk IO on the graphs on the dashboard before losing graph history. When the system is locked up I can't even log in to view logs, as typing the username causes it to hang trying to read from disk before timing out.

How can I go about diagnosing this? Thanks
 
Sounds like a disk issue. What is the underlying hardware, and are you booting from ZFS? Is the disk fabric dedicated or shared?
Dual HDDs connected via SATA, used as the boot disks and configured in a ZFS mirror, on completely dedicated hardware.

Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz (1 Socket)

Linux 6.8.12-8-pve (2025-01-24T12:32Z)

pve-manager/8.3.3/f157a38b211595d6

16GB RAM
 
That’s some ancient hardware. I’m assuming you’re not running anything when this happens? You don’t have enough memory. I would check the SMART status; most likely these drives are as old as the CPU.
 
There's a lot running when it happens, and it takes everything down with it, which is frustrating.

The hardware is one of the currently available options from OVH; it's their KS-4 offering. Finding a dedi at a manageable price is quite difficult.

I've included a cut-down version of the SMART output for both disks. I provided this to OVH in a support ticket because some of the values concern me even though the disks pass SMART, but they refuse my output unless I run it from their dedicated recovery image instead of from within Proxmox, which I cannot do until I schedule downtime.

Code:
/dev/sda:

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       112
  3 Spin_Up_Time            0x0007   180   180   024    Pre-fail  Always       -       261 (Average 332)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       52
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       46048
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1521
193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1521
194 Temperature_Celsius     0x0002   162   162   000    Old_age   Always       -       37 (Min/Max 19/53)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 45600 hours (1900 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 43 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 30 80 97 ce 40 08  27d+18:11:51.281  READ FPDMA QUEUED
  61 08 08 08 a2 e2 40 08  27d+18:11:50.880  WRITE FPDMA QUEUED
  61 78 48 38 c5 df 40 08  27d+18:11:48.531  WRITE FPDMA QUEUED
  61 08 00 38 6e 4a 40 08  27d+18:11:48.531  WRITE FPDMA QUEUED
  61 40 f8 68 54 0b 40 08  27d+18:11:48.530  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36952         -
# 2  Short offline       Completed without error       00%     36938         -
# 3  Short offline       Completed without error       00%     36937         -
# 4  Short offline       Completed without error       00%     19198         -
# 5  Short offline       Completed without error       00%     19183         -
# 6  Short offline       Completed without error       00%     19183         -
# 7  Short offline       Completed without error       00%     18291         -
# 8  Short offline       Completed without error       00%     18277         -
# 9  Short offline       Completed without error       00%     18277         -
#10  Short offline       Completed without error       00%     13611         -
#11  Short offline       Completed without error       00%     13597         -
#12  Short offline       Completed without error       00%     13596         -
#13  Short offline       Completed without error       00%       168         -
#14  Short offline       Completed without error       00%       166         -
#15  Short offline       Completed without error       00%       166         -
#16  Short offline       Completed without error       00%       124         -
#17  Short offline       Completed without error       00%         0         -


/dev/sdb:
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       108
  3 Spin_Up_Time            0x0007   181   181   024    Pre-fail  Always       -       257 (Average 331)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       52
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
  9 Power_On_Hours          0x0012   094   094   000    Old_age   Always       -       46047
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   095   095   000    Old_age   Always       -       6425
193 Load_Cycle_Count        0x0012   095   095   000    Old_age   Always       -       6425
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 19/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     36952         -
# 2  Short offline       Completed without error       00%     36937         -
# 3  Short offline       Completed without error       00%     36937         -
# 4  Short offline       Completed without error       00%     19198         -
# 5  Short offline       Completed without error       00%     19183         -
# 6  Short offline       Completed without error       00%     19183         -
# 7  Short offline       Completed without error       00%     18291         -
# 8  Short offline       Completed without error       00%     18277         -
# 9  Short offline       Completed without error       00%     18276         -
#10  Short offline       Completed without error       00%     13611         -
#11  Short offline       Completed without error       00%     13596         -
#12  Short offline       Completed without error       00%     13596         -
#13  Short offline       Completed without error       00%       168         -
#14  Short offline       Completed without error       00%       166         -
#15  Short offline       Completed without error       00%       166         -
#16  Short offline       Completed without error       00%       124         -
#17  Short offline       Completed without error       00%         0         -
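
For context, the hour counters in the output above are easier to compare once converted to days. A minimal sketch, using only the /dev/sda values shown in the listing; the helper name hours_to_days is made up for illustration:

```shell
# Hypothetical helper: convert a SMART hour counter to whole days.
hours_to_days() { echo $(( $1 / 24 )); }

# Values copied from the /dev/sda output above.
power_on=46048    # attribute 9, Power_On_Hours
error_at=45600    # lifetime hours at which ATA "Error 1" was logged
last_test=36952   # lifetime hours of the most recent short self-test

echo "drive age:            $(hours_to_days "$power_on") days"
echo "since ATA error:      $(hours_to_days $((power_on - error_at))) days"
echo "since last self-test: $(hours_to_days $((power_on - last_test))) days"
```

That works out to roughly 1918 days of power-on time, with the logged UNC error about 18 days before this reading and the last self-test about 379 days before it.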
 
/dev/sda is obviously not in perfect condition.
This
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
(the number in the last column is not 0)
can mean a problem.

And
Error 1 occurred at disk power-on lifetime: 45600 hours (1900 days + 0 hours)
...
Have you got a backup of this host? :)

I can see that only "short" tests have been executed on these disks.
You can run a more thorough "long" test with smartctl -t long /dev/sda and see whether it completes without error.
If it ends with an error, OVH can hardly reject it.

Edit: also note that completing _without_ an error is NOT proof that the disk is OK!

But first of all have a backup on other media :)
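
To make the "last column is not 0" check repeatable, here is a minimal sketch of a filter over the smartctl attribute table. The function name flag_smart_attrs and the choice of IDs (5, 196, 197, 198) are my own assumptions about which attributes matter for sector health, not something OVH or smartmontools prescribes:

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints every
# sector-health attribute (IDs 5, 196, 197, 198) whose RAW_VALUE is non-zero.
# Column 1 is the attribute ID, column 2 the name, column 10 the raw value.
flag_smart_attrs() {
    awk '$1 ~ /^(5|196|197|198)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}
```

On the host this would be used as smartctl -A /dev/sda | flag_smart_attrs (as root). After starting the long test with smartctl -t long /dev/sda, the result can be read later with smartctl -l selftest /dev/sda. An empty result from the filter is not proof of a healthy disk either.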
 
I do, but I'll take another daily now and run that test, thanks.

It is weird that everything locks up entirely, because I thought ZFS would at least drop to a degraded state and use only one disk. The likelihood of both disks dropping out simultaneously, many times over, seems low?

Regardless, it does track that the zpool becomes unavailable and therefore no logs are written, which makes this hard to diagnose.
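
One thing worth trying after the next forced reboot is to check whether anything from the previous boot did reach the journal. A minimal sketch, assuming the journal is persistent; the function name filter_storage_stalls and the pattern list are my own guesses at which kernel messages usually accompany stalled storage:

```shell
# Hypothetical helper: keeps only kernel log lines that commonly accompany
# stalled storage: hung-task warnings, block-layer I/O errors, ATA link resets.
filter_storage_stalls() {
    grep -E 'blocked for more than [0-9]+ seconds|I/O error|hard resetting link'
}
```

Usage would be journalctl -k -b -1 | filter_storage_stalls, where -k selects kernel messages and -b -1 the previous boot. If the pool holding the journal was the thing that stalled, this may well come back empty, which is consistent with the missing logs described above.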
 
What are the actual drives (brand and model)? The SMART output is severely limited, which leads me to think these are “desktop” drives that will “hang” while attempting to recover broken data. It could also be another issue that locks up your SATA bus; you’d need to physically remove one drive at a time to test that. Do you have access to the console (IPMI)? During the hangs you should see kernel log output there.

In a cluster, Proxmox also tries to reset a node that hangs for longer than its watchdog timeout.
 
This is not a clustered system. I was mistaken: it's a SAS3 bus, not SATA. Both drives are HGST HUS726040AL, and I cannot remove them as they're somewhere in a datacentre.

I do have IPMI, but no kernel output is shown on it. Any attempt to log in on this console accepts the username but never reaches the password prompt; the 60s login timeout is then reached (and announced) and the prompt goes back to requesting a username. Presumably it hangs waiting for disk IO to read /etc/passwd or something similar to validate the username.

I have attempted to back up the system to do SMART tests, but somewhat ironically the system I was sending the backup to had a PSU failure during the backup process, so I am currently stuck fixing a target system for the backup at this time before I can run tests.