Hello,
I am getting SMART alerts for drives that are passed through to a VM (infact, the entire controller is passed through!)
Here's the passthrough config on my VM.
Now, I get SMART alerts that look like the following:
The problem is, Proxmox can't even see that disk.
Additionally /dev/sdx is an 800GB SSD
See:
Now, if I go into the VM that has the disks passed through and check the SMART output of the actual disk with the serial number it's complaining about;
Anyone able to tell me what I'm missing?
Thanks
Nick
I am getting SMART alerts for drives that are passed through to a VM (infact, the entire controller is passed through!)
Here's the passthrough config on my VM.
Now, I get SMART alerts that look like the following:
Code:
The following warning/error was logged by the smartd daemon:
Device: /dev/sdx [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Device info:
WDC WD40EFRX-68N32N0, S/N:WD-[REDACTED], WWN:5-0014ee-2b98eead1, FW:82.00A82, 4.00 TB
The problem is, Proxmox can't even see that disk.
Additionally /dev/sdx is an 800GB SSD
See:
Code:
root@pve:~# smartctl -a /dev/sdx
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.8-4-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model: INTEL SSDSC2BB800G6
Serial Number: [REDACTED]
LU WWN Device Id: 5 5cd2e4 04c7e9bed
Firmware Version: G2010140
User Capacity: 800,166,076,416 bytes [800 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Aug 27 17:19:42 2024 NZST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 2) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 64870
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 107
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 77
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 7110 (358 1584)
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 090 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Temperature_Case 0x0022 077 070 000 Old_age Always - 23 (Min/Max 14/30)
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 77
194 Temperature_Internal 0x0022 100 100 000 Old_age Always - 23
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 5720427
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 21872
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 61
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 3891797
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 079 079 000 Old_age Always - 0
234 Thermal_Throttle 0x0032 100 100 000 Old_age Always - 0/0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 5720427
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 9276677
243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 17573935
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 56455 -
# 2 Extended offline Completed without error 00% 8 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Now, if I go into the VM that has the disks passed through and check the SMART output of the actual disk with the serial number it's complaining about;
Code:
root@Store01:~# smartctl -a /dev/sdd
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.79-Unraid] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68N32N0
Serial Number: [REDACTED]
LU WWN Device Id: 5 0014ee 2b98eead1
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5601
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Tue Aug 27 17:27:08 2024 NZST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (44040) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 467) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 209 181 021 Pre-fail Always - 4533
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 15
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 032 032 000 Old_age Always - 50006
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 15
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 10
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 46
194 Temperature_Celsius 0x0022 124 111 000 Old_age Always - 26
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 50006 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
The above only provides legacy SMART information - try 'smartctl -x' for more
Anyone able to tell me what I'm missing?
Thanks
Nick