[SOLVED] Strange disks problems

dpi

New Member
Mar 29, 2024
Hi, as I've stated in the title, I'm having some problems with a server that I can't really pinpoint.
The server was off for around 10 days during the summer vacation, and when I turned it back on, one of the VMs wouldn't start.
After a reboot, the VM in question booted without problems, but now I'm looking at the storage graphs and they are shrinking
every day, as you can see in the attached screenshots.
HDD_Backup is formatted as ext4.

This morning I found the web UI unresponsive; I couldn't log into it.
I successfully logged in via SSH, looked at journalctl, and found the following error after the scheduled backup onto HDD_Backup.

Aug 27 12:04:35 pve pve-ha-lrm[1748]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pve/lrm_status.tmp.1748' - Input/output error
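
For reference, I dug that out of the journal with something like this (not the exact command I ran, but it should surface the same errors):

Code:
# Show all error-priority messages from the current boot
journalctl -b -p err

# Or follow the journal live while the backup job runs
journalctl -f -p warning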

Does anyone have an idea of what to look at, or what to try, to fix this?

I guess one of the disks is failing, but I can't tell whether it's the local-lvm (a mirror of 2 SSDs) or the HDD used for backups.
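
In case it helps, this is roughly how I tried to map the storages to the physical disks (the device names are just what they happen to be on my box):

Code:
# Block devices with size, filesystem and mountpoint
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT

# Which physical volumes back the LVM (local-lvm lives on the pve volume group)
pvs
lvs pve

# Storage status as Proxmox sees it
pvesm status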

Thanks in advance to everybody! ^^

Have a great day
 

Attachments

  • 1756289342884.png (40.9 KB)
  • 1756289363484.png (36.8 KB)
Your I/O error is on the OS disk (where the pve config database lives), and if it's a RAID1 mirror, it looks like both disks are already at their end, as otherwise you wouldn't get this error. Take a look with smartctl -x /dev/sd<X> (assuming sda + sdb here?). I've never seen capacity shrink before a disk dies, which doesn't mean it can't happen, as seen here ...
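
Something along these lines (adjust the device names; sda + sdb is just my assumption):

Code:
# Full SMART report incl. vendor attributes and the device error log
smartctl -x /dev/sda
smartctl -x /dev/sdb

# Cross-check the kernel log for ATA resets / I/O errors
dmesg -T | grep -iE 'ata[0-9]|i/o error'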
 
Attached is the smartctl output of both disks.

/dev/sda
Code:
Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    6765
 12 Power_Cycle_Count       -O--CK   100   100   000    -    14
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   071   071   000    -    299
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    0
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    72
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   062   047   000    -    38 (Min/Max 21/53)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain ----CK   071   071   001    -    29
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    62214973302
247 Host_Program_Page_Count -O--CK   100   100   000    -    711138704
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    3835017701
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

/dev/sdc
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    6742
 12 Power_Cycle_Count       -O--CK   100   100   000    -    14
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   080   080   000    -    206
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    0
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    59
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   062   049   000    -    38 (Min/Max 21/51)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_ECC_Cnt -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain ----CK   080   080   001    -    20
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    62215121947
247 Host_Program_Page_Count -O--CK   100   100   000    -    1431646864
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    2289632782
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
 
Power-on hours is only about 9 months, and even though the drives still show 71..80 % lifetime remaining (the raw value of 20..29 % is the lifetime used), the I/O error on further writes is already there. I assume you have consumer SSDs; in any case, even though SMART looks quite good, you need to get new ones immediately if you don't want a total failure of your VMs and of PVE altogether.
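
If you want to double-check before swapping them out, you can run a long self-test on each drive (again assuming sda, adjust as needed):

Code:
# Start an extended self-test; the drive stays usable while it runs
smartctl -t long /dev/sda

# Afterwards, check the result and the drive's own error log
smartctl -l selftest /dev/sda
smartctl -l error /dev/sda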
 
Thank you for your swift answer.
I will be replacing the disks and marking this as solved.
Best regards