Multiple Offline uncorrectable sectors

rbjohnson78

I'm finding this odd, or it's just sheer bad luck with our drives. We have a cluster of 8 Proxmox hosts, and 2 of the 8 keep showing errors. We've replaced the drives with new ones, and a few months later the same errors are reported again. Does anyone know of smartd parameters I can look at? Firmware and BIOS are at the latest versions. What puzzles me is that it's only 2 servers out of the bunch, and all of them have the same type and brand of SSDs.

Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors

Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sdb [SAT], 16 Offline uncorrectable sectors

Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sdd [SAT], 16 Offline uncorrectable sectors

Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sde [SAT], 24 Offline uncorrectable sectors

Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sdg [SAT], 8 Offline uncorrectable sectors
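
With 8 hosts it can help to pull these reports into one short list per host. The following is only a sketch: `summarize_smartd` is a hypothetical helper name, and it assumes the syslog-style layout shown above, with the host name in the fourth field.

```shell
#!/bin/sh
# Sketch: list host, device and sector count for every smartd
# "Offline uncorrectable sectors" report read from standard input.
# Assumes the syslog-style layout shown above (host name in field 4).
summarize_smartd() {
    awk '
        /Offline uncorrectable sectors/ {
            dev = ""; count = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "Device:") dev = $(i + 1)    # word after "Device:"
                if ($i == "Offline") count = $(i - 1)  # number before "Offline"
            }
            if (dev != "" && count != "") print $4, dev, count
        }
    '
}

# Example run against two of the lines from the post above.
summarize_smartd <<'EOF'
Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Jan 31 08:25:58 prxoms02 smartd[2967]: Device: /dev/sde [SAT], 24 Offline uncorrectable sectors
EOF
```

Each matching line is reduced to three fields, for example `prxoms02 /dev/sde 24`. On a live host the input could come from `journalctl -t smartd --no-pager | summarize_smartd`, assuming the journal still holds the messages.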
 
Maybe check with smartctl -a /dev/sda whether anything else is wrong? Maybe run a short and a long self-test? Also check journalctl for I/O errors around the same time.
Maybe swap their position in the server rack with some of the other servers? It might be a location-specific vibration that the drives can't handle. Or put the new drives in another server and that server's known-good drives in the two problematic ones?
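
The self-test suggestion could be scripted roughly as follows for all affected drives at once. This is a sketch, not a tested procedure: `start_selftests`, `show_results` and the `SMARTCTL` override variable are hypothetical names introduced here, while the `smartctl` options used (`-t`, `-H`, `-A`, `-l error`, `-l selftest`) are standard ones.

```shell
#!/bin/sh
# Sketch: start a self-test on several drives, then read the results later.
# SMARTCTL is a hypothetical override for trying the functions without
# hardware (for example SMARTCTL=echo); by default the real smartctl is
# used, which needs root.

# usage: start_selftests short|long DEVICE...
start_selftests() {
    kind=$1
    shift
    for dev in "$@"; do
        ${SMARTCTL:-smartctl} -t "$kind" "$dev"
    done
}

# usage: show_results DEVICE...
# Prints overall health, the attribute table, and the error and self-test logs.
show_results() {
    for dev in "$@"; do
        ${SMARTCTL:-smartctl} -H -A -l error -l selftest "$dev"
    done
}

# Example (commented out so nothing runs by accident):
# start_selftests long /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdg
# ...wait for the duration smartctl reports, then:
# show_results /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdg
```

Only one test kind is started per call because a drive runs one self-test at a time. For the journal side, `journalctl -k` with `--since` and `--until` narrows kernel messages to the time window around the smartd report.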
 
Nothing looked out of the ordinary in journalctl. I ran both the short and the long test; both came back as passed, yet the uncorrectable error counts are still there. Nothing shows up in the server's own SMART logs, only within Proxmox.
 
I only just noticed your drives are SSDs. The attribute may not represent offline uncorrectable sectors the way it once did for HDDs: SMART attributes are not standardized, and smartd is not perfect. Maybe try to find out what the attribute means for these particular drives by contacting Dell support, then update the smartd configuration accordingly (or tell it to ignore the attribute).

EDIT: Looks like other people have also run into this: https://forum.proxmox.com/threads/s...but-no-smart-error-on-test.139489/post-623192
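
If the attribute turns out to be harmless on these drives, the smartd change mentioned above might look like the entries this sketch prints. It assumes smartd's default attribute ID for this check (198) applies to these SSDs, which is worth confirming with the vendor first; `smartd_entry` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print one smartd.conf entry per drive.
# -a       enables smartd's standard set of checks
# -U 198+  reports attribute 198 only when its raw value has increased
# (use -U 0 instead to stop tracking the attribute altogether)
smartd_entry() {
    printf '%s -a -U 198+\n' "$1"
}

# Device names taken from the log lines in the first post.
for dev in /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdg; do
    smartd_entry "$dev"
done
```

The printed lines would go into /etc/smartd.conf above any DEVICESCAN line, since smartd ignores everything after DEVICESCAN, and the smartd service needs a restart afterwards. Review the output before using it, and carry over any mail or schedule directives already present on the host.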
 