Hi,
I noticed that the Proxmox web interface sometimes freezes for a few seconds. Looking at /var/log/syslog and filtering on ata2, I see:
Jul 16 18:39:36 ganymede kernel: [1805235.264998] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 16 18:39:36 ganymede kernel: [1805235.265019] ata2.00: failed command: FLUSH CACHE EXT
Jul 16 18:39:36 ganymede kernel: [1805235.265024] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 20
Jul 16 18:39:36 ganymede kernel: [1805235.265037] ata2.00: status: { DRDY }
Jul 16 18:39:36 ganymede kernel: [1805235.265044] ata2: hard resetting link
Jul 16 18:39:46 ganymede kernel: [1805245.281141] ata2: softreset failed (device not ready)
Jul 16 18:39:46 ganymede kernel: [1805245.281159] ata2: hard resetting link
Jul 16 18:39:56 ganymede kernel: [1805255.305151] ata2: softreset failed (device not ready)
Jul 16 18:39:56 ganymede kernel: [1805255.305168] ata2: hard resetting link
Jul 16 18:39:59 ganymede kernel: [1805258.605285] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 16 18:39:59 ganymede kernel: [1805258.610051] ata2.00: configured for UDMA/133
Jul 16 18:39:59 ganymede kernel: [1805258.610057] ata2.00: retrying FLUSH 0xea Emask 0x4
Jul 16 18:39:59 ganymede kernel: [1805258.620222] ata2: EH complete
...
Jul 21 22:50:38 ganymede kernel: [2252297.175925] ata2: EH complete
Jul 21 22:58:37 ganymede kernel: [2252776.517166] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jul 21 22:58:37 ganymede kernel: [2252776.517187] ata2.00: failed command: FLUSH CACHE EXT
Jul 21 22:58:37 ganymede kernel: [2252776.517192] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 23
Jul 21 22:58:37 ganymede kernel: [2252776.517206] ata2.00: status: { DRDY }
Jul 21 22:58:37 ganymede kernel: [2252776.517213] ata2: hard resetting link
Jul 21 22:58:47 ganymede kernel: [2252786.537147] ata2: softreset failed (device not ready)
Jul 21 22:58:47 ganymede kernel: [2252786.537165] ata2: hard resetting link
Jul 21 22:58:48 ganymede kernel: [2252787.012899] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 21 22:58:48 ganymede kernel: [2252787.017818] ata2.00: configured for UDMA/133
Jul 21 22:58:48 ganymede kernel: [2252787.017824] ata2.00: retrying FLUSH 0xea Emask 0x4
Jul 21 22:58:48 ganymede kernel: [2252787.028258] ata2: EH complete
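To get a rough idea of how long each freeze lasts, the episodes can be summarized from the kernel timestamps. This is only a sketch: the function name ata_freeze_summary is made up for illustration, and it assumes that every episode starts with an "exception ... frozen" line and ends with the next "EH complete" line, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: summarize libata error-handling episodes for one port.
# Reads kernel log lines on stdin; $1 is the port name (defaults to ata2).
# Assumption: an episode runs from "exception ... frozen" to the next "EH complete".
ata_freeze_summary() {
  port="${1:-ata2}"
  awk -v port="$port" '
    # skip lines that do not mention the port at all
    index($0, port) == 0 { next }
    {
      # pull out the kernel timestamp between square brackets
      if (match($0, /\[ *[0-9]+\.[0-9]+\]/) == 0) next
      ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
    }
    /exception Emask/ && /frozen/ { start = ts; in_eh = 1; next }
    /EH complete/ && in_eh {
      printf "%s frozen for %.1f s\n", port, ts - start
      in_eh = 0
    }
  '
}
```

It could be run as ata_freeze_summary ata2 < /var/log/syslog. For the first episode shown above (1805235.264998 to 1805258.620222) it reports about 23.4 s, which matches the length of the freezes I see in the web interface.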
According to dmesg | grep ata2 | head, ata2 is the NVMe disk on which I installed Proxmox:
I initially thought of a wrong SATA cable (I also have a SATA disk) but I guess the NVMe disk is simply faulty.
[ 1.015008] ata2: SATA max UDMA/133 abar m2048@0xfcc00000 port 0xfcc00100 irq 37
[ 1.476849] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1.479311] ata2.00: ATA-11: Lexar SSD NQ100 480GB, SN11873, max UDMA/133
[ 1.479388] ata2.00: 937703088 sectors, multi 1: LBA48 NCQ (depth 32), AA
[ 1.481923] ata2.00: configured for UDMA/133
However, the SMART report is:
root@ganymede:~# smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.108-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: Lexar SSD NQ100 480GB
Serial Number: NAC205R0054910S30D
LU WWN Device Id: 5 3a5a27 2050b082b
Firmware Version: SN11873
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 21 23:22:22 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 33) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 85) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 20
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x1300 100 100 010 Old_age Offline - 0
9 Power_On_Hours 0x1200 100 100 000 Old_age Offline - 876
12 Power_Cycle_Count 0x1200 100 100 000 Old_age Offline - 3
164 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 17180983304
165 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 17
166 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 4
167 Unknown_Attribute 0x2200 100 100 000 Old_age Offline - 8
194 Temperature_Celsius 0x2200 038 038 000 Old_age Offline - 38 (Min/Max 16/48)
199 UDMA_CRC_Error_Count 0x1200 100 100 000 Old_age Offline - 0
241 Total_LBAs_Written 0x3200 100 100 000 Old_age Offline - 4428
242 Total_LBAs_Read 0x3200 100 100 000 Old_age Offline - 7596
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
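Since I first suspected the cable, the attribute worth watching is probably 199 (UDMA_CRC_Error_Count), which is 0 in the report above. A minimal sketch for pulling a single raw value out of the attribute table follows; the function name smart_raw_value is hypothetical, and it assumes the usual ten-column layout of smartctl -A with RAW_VALUE in the tenth column.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE of one SMART attribute.
# Reads a `smartctl -A` style attribute table on stdin; $1 is the attribute ID.
# Assumption: RAW_VALUE is the 10th whitespace-separated field of the row.
smart_raw_value() {
  awk -v id="$1" '$1 == id { print $10; exit }'
}
```

For example, smartctl -A /dev/sda | smart_raw_value 199 prints the CRC error count, so it can be compared before and after the next freeze.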
And when running a short self-test:
root@ganymede:~# smartctl -t short /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.108-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Jul 21 23:26:23 2023 CEST
Use smartctl -X to abort test.
root@ganymede:~# smartctl --log selftest /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.108-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 877 -
That does not look too bad. Any ideas?
Full logs/tests attached.
Thanks,
Jean