PVE stops working after a few hours


May 11, 2021

I'm new to Proxmox and PVE.
I installed PVE 6.3 on an Intel NUC with an Intel(R) Core(TM) i5-7260U CPU @ 2.20GHz and 8 GB of RAM.
After boot, everything works well, but after a few hours the web GUI stops responding, either completely or after the login window.
I can SSH into the host, but most commands fail with an Input/output error. As I write this post it's 10 pm, and most of the logs stopped being written before 1 pm today.
I cannot even reboot the host: the command hangs forever, so I have to power the host off and restart it.
If you have any idea what the issue could be, I'd be very grateful.
Please do not hesitate to ask for any logs; I'll provide whatever I can.
This sounds like a disk error. I would check the output of dmesg and take a look at the smartmon output for the disks.
Thanks for your fast reply.
dmesg also fails with an Input/output error, but I can still read other files such as kern.log; its last lines are attached at the end of this message.
smartctl (not smartmon) fails in the same way. I can restart the host and check its output then.

May 11 12:36:57 proxmox kernel: [160044.016897] ata1.00: failed command: READ FPDMA QUEUED
May 11 12:36:57 proxmox kernel: [160044.016899] ata1.00: cmd 60/00:70:00:00:00/01:00:00:00:00/40 tag 14 ncq dma 131072 in
May 11 12:36:57 proxmox kernel: [160044.016899]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 12:36:57 proxmox kernel: [160044.016902] ata1.00: status: { DRDY }
May 11 12:36:57 proxmox kernel: [160044.016903] ata1.00: failed command: READ FPDMA QUEUED
May 11 12:36:57 proxmox kernel: [160044.016905] ata1.00: cmd 60/00:78:00:08:00/01:00:00:00:00/40 tag 15 ncq dma 131072 in
May 11 12:36:57 proxmox kernel: [160044.016905]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 12:36:57 proxmox kernel: [160044.016908] ata1.00: status: { DRDY }
May 11 12:36:57 proxmox kernel: [160044.016909] ata1.00: failed command: READ FPDMA QUEUED
May 11 12:36:57 proxmox kernel: [160044.016911] ata1.00: cmd 60/00:80:00:08:10/01:00:00:00:00/40 tag 16 ncq dma 131072 in
May 11 12:36:57 proxmox kernel: [160044.016911]          res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 12:36:57 proxmox kernel: [160044.016914] ata1.00: status: { DRDY }
May 11 12:36:57 proxmox kernel: [160044.016915] ata1.00: failed command: WRITE FPDMA QUEUED
May 11 12:36:57 proxmox kernel: [160044.016917] ata1.00: cmd 61/08:88:38:57:e4/00:00:0f:00:00/40 tag 17 ncq dma 4096 out
May 11 12:36:57 proxmox kernel: [160044.016917]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 12:36:57 proxmox kernel: [160044.016920] ata1.00: status: { DRDY }
May 11 12:36:57 proxmox kernel: [160044.016924] ata1: hard resetting link
May 11 12:37:02 proxmox kernel: [160049.380770] ata1: link is slow to respond, please be patient (ready=0)
May 11 12:37:07 proxmox kernel: [160054.072746] ata1: COMRESET failed (errno=-16)
May 11 12:37:07 proxmox kernel: [160054.072754] ata1: hard resetting link
May 11 12:37:12 proxmox kernel: [160059.440590] ata1: link is slow to respond, please be patient (ready=0)
May 11 12:37:17 proxmox kernel: [160064.120627] ata1: COMRESET failed (errno=-16)
May 11 12:37:17 proxmox kernel: [160064.120635] ata1: hard resetting link
May 11 12:37:22 proxmox kernel: [160069.484507] ata1: link is slow to respond, please be patient (ready=0)
May 11 12:37:52 proxmox kernel: [160099.160145] ata1: COMRESET failed (errno=-16)
May 11 12:37:52 proxmox kernel: [160099.160156] ata1: limiting SATA link speed to 3.0 Gbps
May 11 12:37:52 proxmox kernel: [160099.160157] ata1: hard resetting link
It seems that the disk is failing. I would also check the disk cabling, just in case.
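To get a quick overview of a longer kern.log, the failed commands and link resets can be counted per ATA port. The following is only a rough sketch: the function name summarize_ata_errors and the regular expression are assumptions based on the line format quoted in this thread, not part of any Proxmox or kernel tool.

```python
import re
from collections import Counter

# Pattern is an assumption derived from the kern.log lines quoted in this thread.
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] (?P<port>ata\d+)(?:\.\d+)?: (?P<msg>.*)"
)

def summarize_ata_errors(lines):
    """Count failed commands and reset events per ATA port.

    Returns (counts, first_uptime), where first_uptime is the kernel
    timestamp in seconds of the first counted event, or None.
    """
    counts = Counter()
    first_uptime = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # e.g. the indented "res ..." continuation lines
        port, msg = m.group("port"), m.group("msg")
        if msg.startswith("failed command:"):
            counts[(port, "failed command")] += 1
        elif msg.startswith("hard resetting link"):
            counts[(port, "hard reset")] += 1
        elif msg.startswith("COMRESET failed"):
            counts[(port, "COMRESET failed")] += 1
        else:
            continue
        if first_uptime is None:
            first_uptime = float(m.group("uptime"))
    return counts, first_uptime

# Four lines copied from the excerpt above.
sample = [
    "May 11 12:36:57 proxmox kernel: [160044.016903] ata1.00: failed command: READ FPDMA QUEUED",
    "May 11 12:36:57 proxmox kernel: [160044.016915] ata1.00: failed command: WRITE FPDMA QUEUED",
    "May 11 12:36:57 proxmox kernel: [160044.016924] ata1: hard resetting link",
    "May 11 12:37:07 proxmox kernel: [160054.072746] ata1: COMRESET failed (errno=-16)",
]
counts, first_uptime = summarize_ata_errors(sample)
for (port, event), n in sorted(counts.items()):
    print(port, event, n)
print("first error at about", round(first_uptime / 3600, 1), "hours of uptime")
```

On the real file you would pass open("/var/log/kern.log") instead of the sample list. The kernel timestamp also shows how long the host had been up when the errors started (here roughly 44 hours), which helps when comparing several crashes.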
Here is the full output:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.4.106-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37480G
Serial Number:    50026B7682F925B1
LU WWN Device Id: 5 0026b7 682f925b1
Firmware Version: SBFKB1D1
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue May 11 23:09:56 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (65535) seconds.
Offline data collection
capabilities:              (0x79) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  30) minutes.
Conveyance self-test routine
recommended polling time:      (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     -O--CK   000   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    3615
 12 Power_Cycle_Count       -O--CK   100   100   000    -    32
148 Unknown_Attribute       ------   100   100   000    -    0
149 Unknown_Attribute       ------   100   100   000    -    0
167 Write_Protect_Mode      ------   100   100   000    -    0
168 SATA_Phy_Error_Count    -O--C-   100   100   000    -    0
169 Bad_Block_Rate          ------   100   100   000    -    20
170 Bad_Blk_Ct_Erl/Lat      ------   100   100   010    -    0/14
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 MaxAvgErase_Ct          ------   100   100   000    -    4 (Average 2)
181 Program_Fail_Count      -O--CK   100   100   000    -    0
182 Erase_Fail_Count        ------   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
192 Unsafe_Shutdown_Count   -O--C-   100   100   000    -    26
194 Temperature_Celsius     -O---K   071   063   000    -    29 (Min/Max 21/37)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    0
218 CRC_Error_Count         -O--CK   100   100   000    -    0
231 SSD_Life_Left           ------   001   001   000    -    99
233 Flash_Writes_GiB        -O--CK   100   100   000    -    628
241 Lifetime_Writes_GiB     -O--CK   100   100   000    -    321
242 Lifetime_Reads_GiB      -O--CK   100   100   000    -    4178
244 Average_Erase_Count     ------   100   100   000    -    2
245 Max_Erase_Count         ------   100   100   000    -    4
246 Total_Erase_Count       ------   100   100   000    -    38048
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O     64  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              32  ---  Lifetime Power-On Resets
0x01  0x010  4            3615  ---  Power-on Hours
0x01  0x018  6       673982702  ---  Logical Sectors Written
0x01  0x028  6      8762500827  ---  Logical Sectors Read
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              29  ---  Current Temperature
0x05  0x020  1              37  ---  Highest Temperature
0x05  0x028  1              21  ---  Lowest Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               0  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  4          120  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4            4  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
This is very strange. There are no CRC errors, so it's most likely not the cable. 99% of the SSD lifetime is remaining, but attribute 169 (Bad_Block_Rate) reports 20, which suggests about 20% of the reserved blocks are already used.
To be honest, I doubt it's the cable. It's a new NUC that has seen very little use, so there is no reason for the cable to be the issue.
So is there something I can do? Or do you think I need to replace the drive?
Thanks anyway for all the time spent looking into my issue.
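If you want to keep an eye on those few values over time, the attribute table can be pulled out of saved smartctl output with a few lines of Python. This is just a sketch: parse_smart_attributes is a made-up helper name, it assumes the column layout shown in the output above (ID, name, six flag characters, value, worst, threshold, fail, raw), and how to read a raw value such as Bad_Block_Rate is vendor-specific.

```python
def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} from smartctl attribute rows.

    Assumes rows shaped like the table in this thread:
    ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE...
    """
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a 6-character flag field.
        if len(parts) >= 8 and parts[0].isdigit() and len(parts[2]) == 6:
            attrs[int(parts[0])] = (parts[1], " ".join(parts[7:]))
    return attrs

# Rows copied from the smartctl output above.
sample = """\
169 Bad_Block_Rate          ------   100   100   000    -    20
173 MaxAvgErase_Ct          ------   100   100   000    -    4 (Average 2)
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
199 SATA_CRC_Error_Count    -O--CK   100   100   000    -    0
231 SSD_Life_Left           ------   001   001   000    -    99
"""
attrs = parse_smart_attributes(sample)
for attr_id in (169, 187, 199, 231):
    name, raw = attrs[attr_id]
    print(attr_id, name, raw)
```

Saving the smartctl output after each reboot and comparing these raw values would show whether the bad block figure is still growing.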
Since I rebooted it yesterday, it has been working fine. Is there anything I should check now while it is working?
"failed command: WRITE FPDMA QUEUED" most likely means a RAID controller issue. Rebooting resets the controller. If possible, check the controller log to see whether the controller reported any problems.