Periodic randomly stuck service

next40

Member
May 31, 2021
Hi all, I have several servers running the latest Proxmox 6.4.
From time to time, a random server gets stuck (kernel panic), and the only fix is a reset via IPMI.
Hardware:
Server: 1014S-WTRT
RAM: 256 GB
OS installed on a small SSD-DM128-SMCMVN1 device (fs: ext4)

On the IPMI console I see a rapidly repeating error:
sda - access beyond end of device

I tried replacing the boot device, but that did not fix the problem. At first I suspected a hardware fault, but the problem shows up on random servers, possibly the more heavily loaded ones.
Any ideas on how to troubleshoot or fix this problem?
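
One way to narrow this down is to check whether the message also appears in saved kernel logs before the hang, and how often. Below is a minimal Python sketch under stated assumptions: the marker text is copied from the console message quoted above, the function name `find_beyond_end_errors` is made up for illustration, and it assumes the kernel log has been saved as plain text (for example from `/var/log/kern.log` or the output of `journalctl -k`, if either is available on the affected host).

```python
# Marker text copied from the error shown on the IPMI console.
ERROR_MARKER = "access beyond end of device"


def find_beyond_end_errors(log_text):
    """Return every line of log_text that contains the error marker.

    log_text is the plain-text content of a saved kernel log.
    Hypothetical helper for illustration, not part of any Proxmox tool.
    """
    return [line for line in log_text.splitlines() if ERROR_MARKER in line]


def summarize(log_text):
    """Return (number of matching lines, first matching line or None)."""
    hits = find_beyond_end_errors(log_text)
    return len(hits), (hits[0] if hits else None)
```

Calling `summarize(open("/var/log/kern.log", errors="replace").read())` on each server would show whether the message was logged before the hang. The log path is an assumption and may differ on a given system.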
 

Attachments

  • 2022-01-05_03-21.png (337 KB)
  • 2022-01-05_03-23.png (396.4 KB)
Well, it looks like a hardware problem... AFAIU you did replace that disk? What about the cable/controller?
 
Yes, I replaced the faulty disk on the first server that had this problem, but after some time the same problem appeared on a server with a new disk.
There is no cable for a SATADOM disk; it plugs directly into the motherboard.
For example:
https://www.supermicro.com/products/nfo/SATADOM.cfm

Server platform spec:
https://www.supermicro.com/en/Aplus/system/1U/1014/AS-1014S-WTRT.cfm
 
The problem is not fixed. It occurs randomly 1-2 times per month. I have tested many kernel versions, without success. Any ideas?
 
Replace your Supermicro SATA-DOM again. We have had several that reported SMART errors very soon after first use, and some that are now years old. A SATA-DOM is a cool, small SATA SSD, but many of them just fail very early. So maybe you got 2 or 3 bad ones in a row. It happens sometimes.

Also, if you have 2 of them and 2 DOM connectors on your mainboard, you can go with a ZFS mirror and see whether ZFS detects errors, without your whole system crashing all the time.
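
Since SMART errors are mentioned above, a quick health check of the boot device before and after swapping it may help. The following is a rough Python sketch, not a definitive procedure: it assumes `smartctl` (from smartmontools) is installed, that the SATA-DOM is `/dev/sda` as in the console message, and that the script runs as root. The function names are made up for illustration. A passing result does not prove the device is healthy.

```python
import subprocess


def smart_health_passed(smartctl_output):
    """Interpret the text printed by `smartctl -H`.

    Returns True if the overall-health line ends with PASSED,
    False if that line is present with any other result,
    and None if no overall-health line is found.
    """
    for line in smartctl_output.splitlines():
        if "overall-health" in line:
            return line.strip().endswith("PASSED")
    return None


def check_device(device="/dev/sda"):
    """Run `smartctl -H` on the device and interpret its output.

    The device path is an assumption based on the `sda` in the error message.
    """
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health_passed(result.stdout)
```

Running `check_device()` periodically on each affected server and recording the result would show whether the SMART verdict changes before a hang.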
 
Ohh, OK. I will try replacing it again and report back afterwards.