SATA issues

dejhost

Member
Dec 13, 2020
Hello !
I've been struggling with one of my servers for several months now.

There seems to be a hardware conflict related to SATA. My ZFS RAID (4x 10TB HDDs in 2 mirrors) reported 1 disk failure. I replaced the disk, and shortly after, 2 more disks seemed to fail (one in each of the mirrors). I assumed that this was not a disk problem and started to search for other causes.

[Attached screenshot: dmesg slow sata.jpeg]

Over the following weeks, I replaced:
1) All SATA cables
2) The PSU*
3) The mainboard, including the chassis
4) The CPU

*I should mention that since changing the PSU did not help, I reinstalled the original PSU.

So now, I have pretty much a new server. Nothing seems to help. Here are further things I tried:
1) I ran 4 hours of RAM test. No errors found.
2) Booted from a live-USB: Linux Mint. Same issues found.
3) Upgraded Proxmox to 8.2.4.
4) Systematically tried all Linux kernels that are available on the server.
5) Ran smartmontools self-tests on all HDDs. Some of the short tests and all of the long tests ended with "Aborted by the host".
6) Attached all HDDs to a workstation and ran smartmontools long tests (between 8 and 14 hours each). No errors found, all healthy.
7) Ran many, many scrubs on the ZFS pool on the server. In the beginning, inconsistent data was found and dealt with. By now, all data is deleted.
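
For anyone repeating step 5, here is a minimal sketch of how the self-test log could be checked for interrupted runs. The helper name count_aborted_selftests is made up for illustration, and the match string assumes smartctl words the status as "Aborted by host" (or "Aborted by the host", as quoted above). /dev/sda is only a placeholder device.

```shell
# Count self-test log entries that were interrupted by the host.
# Reads the text of a smartctl self-test log on stdin.
# Assumption: the status column contains "Aborted by host" or "Aborted by the host".
count_aborted_selftests() {
    # grep -c prints 0 and returns non-zero when nothing matches; "|| true" hides that.
    grep -Eci 'aborted by (the )?host' || true
}

# Usage (needs root; /dev/sda is a placeholder):
#   smartctl -t long /dev/sda        # start an extended self-test
#   smartctl -l selftest /dev/sda | count_aborted_selftests
```

A count above zero on the server but not on the workstation would point away from the disks themselves, for example at link resets or something else on the host interrupting the tests.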


Even though it seemed unlikely, I then considered that several of the HDDs might actually be broken, so I bought 2 more. Including an older HDD I had lying around, I now have 7x 10TB. I attached them to the server one at a time, generating several hours of load on each with the tool "fio". 4 disks showed no errors of any kind. I used those to create the ZFS RAIDZ and started restoring from backups. 2 days later, while still restoring, I got the zpool error "disk unavailable". I removed the troublemaker, created another type of RAID with the remaining 3 HDDs, and started restoring VMs again. Shortly after, I got the error about slow SATA response. The restore process was cancelled, but the zpool seems healthy.
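
One way to tell whether the errors follow a disk or a SATA port is to count the kernel's libata messages per port. The sketch below is only an illustration: the helper name ata_errors_by_port is made up, and the message patterns are the usual libata wordings (such as "link is slow to respond" and "hard resetting link"), which may not cover everything in this server's log.

```shell
# Summarise libata error-style messages per ATA port.
# Reads kernel log text on stdin, prints "ataN: <count>" for each affected port.
# Assumption: the patterns below are typical libata messages, not a complete list.
ata_errors_by_port() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: .*(exception|failed command|hard resetting link|link is slow|limiting SATA link speed)' \
        | sed -E 's/.*(ata[0-9]+)(\.[0-9]+)?: .*/\1/' \
        | sort | uniq -c \
        | awk '{ print $2 ": " $1 }'
}

# Usage:
#   dmesg | ata_errors_by_port
```

If the same port keeps showing up after swapping disks between ports, the controller, backplane, or power to that bay is a more likely suspect than the drive.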


This is just a quick summary of what has happened. Thanks for reading, and even more thanks if you can suggest a solution.
 
Hi,

Test each of your problem HDDs with badblocks (it writes every block and then checks that the data reads back correctly).

Good luck!
 
If you use the same HDD model, and maybe even the same production series, it's not unlikely that additional drives will fail shortly after. Rebuilding RAIDs (especially with parity) puts a lot of stress on the disks.
 
Some of the badblocks tests have been running for about a week:

Code:
root@s301:/# sudo badblocks -wsv -b 4096 -c 1024 -o /sda.txt /dev/sda &

Code:
root@s301:/# ls -lh *.txt
-rw-r--r-- 1 root root    0 Sep 17 21:09 sda.txt
-rw-r--r-- 1 root root    0 Sep 17 21:09 sdb.txt
-rw-r--r-- 1 root root 288M Sep 18 10:15 sdc.txt
-rw-r--r-- 1 root root 1.3M Sep 24 16:58 sdd.txt
-rw-r--r-- 1 root root 7.3K Sep 18 13:35 sde.txt
-rw-r--r-- 1 root root 4.2K Sep 21 12:28 sdf.txt
-rw-r--r-- 1 root root    0 Sep 17 21:09 sdg.txt

I guess it is safe to say that sdc is seriously broken. I will remove and destroy it.
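
The file sizes above only hint at the damage. Since badblocks writes one bad block number per line to the -o file, counting lines gives the number of bad blocks per disk. A minimal sketch, with bad_block_counts as a made-up helper name:

```shell
# Print "<file>: <number of bad blocks>" for each badblocks output file.
# Assumption: each file was written by "badblocks -o" and holds one block number per line.
bad_block_counts() {
    for f in "$@"; do
        # grep -c . counts non-empty lines; an empty file yields 0.
        printf '%s: %s\n' "$f" "$(grep -c . "$f")"
    done
}

# Usage:
#   bad_block_counts /sd?.txt
```

With -b 4096, each listed block corresponds to 4 KiB on the disk.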

smartctl info for the HDDs:
Code:
root@s301:~# sudo smartctl -i /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0008-2JM101
Serial Number:    ZPW0JS0Q
LU WWN Device Id: 5 000c50 0c7afd62d
Firmware Version: EN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 12:57:04 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@s301:~# sudo smartctl -i /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf
Device Model:     ST10000VN0004-1ZD101
Serial Number:    ZA24X1HP
LU WWN Device Id: 5 000c50 0afc2c1c5
Firmware Version: SC60
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 12:58:31 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@s301:~# sudo smartctl -i /dev/sdc
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (CMR)
Device Model:     ST10000DM0004-2GR11L
Serial Number:    ZJV6AN7J
LU WWN Device Id: 5 000c50 0c41cd201
Firmware Version: DN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 12:58:52 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@s301:~# sudo smartctl -i /dev/sdd
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (CMR)
Device Model:     ST10000DM0004-1ZC101
Serial Number:    ZA2D93PP
LU WWN Device Id: 5 000c50 0c5eb4a09
Firmware Version: DN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 12:59:17 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@s301:~#

root@s301:~# sudo smartctl -i /dev/sde
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0008-2JM101
Serial Number:    ZPW0NNDB
LU WWN Device Id: 5 000c50 0c838142b
Firmware Version: EN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 12:59:35 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@s301:~# sudo smartctl -i /dev/sdf
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST10000NE0008-2JM101
Serial Number:    ZPW0H5GC
LU WWN Device Id: 5 000c50 0c78835e3
Firmware Version: EN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 12:59:50 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@s301:~# sudo smartctl -i /dev/sdg
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-1-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5 (CMR)
Device Model:     ST10000DM0004-1ZC101
Serial Number:    ZA2DT6ZS
LU WWN Device Id: 5 000c50 0c76b1910
Firmware Version: DN01
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Sep 25 13:00:02 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
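
To make it easier to line the badblocks results up against drive models, the fields above can be condensed to one line per disk. This is only a sketch: smart_field is a made-up helper, and it assumes the "Name:   value" layout shown in the smartctl -i output above.

```shell
# Extract one field from "smartctl -i" output on stdin.
# $1 is the field name exactly as printed, e.g. "Device Model".
# Assumption: the field name contains no regex special characters.
smart_field() {
    sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}

# Usage (needs root; device names as in this thread):
#   for d in /dev/sd[a-g]; do
#       info=$(smartctl -i "$d")
#       printf '%s  %s  %s\n' "$d" \
#           "$(printf '%s\n' "$info" | smart_field 'Device Model')" \
#           "$(printf '%s\n' "$info" | smart_field 'Serial Number')"
#   done
```

Note that /dev/sdX names can change between boots, so the serial number is the safer key when tracking which physical drive produced which result.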

Also, I want to mention that I am using a PCIe Ethernet card:
Code:
06:00.0 Ethernet controller: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 02)
        Subsystem: ASUSTeK Computer Inc. XG-C100C
        Flags: bus master, fast devsel, latency 0, IRQ 16, IOMMU group 57
        Memory at fb840000 (64-bit, non-prefetchable) [size=64K]
        Memory at fb850000 (64-bit, non-prefetchable) [size=4K]
        Memory at fb400000 (64-bit, non-prefetchable) [size=4M]
        Expansion ROM at fb800000 [disabled] [size=256K]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
        Capabilities: [a0] MSI: Enable- Count=1/32 Maskable- 64bit+
I don't think it is likely, but maybe it is causing some kind of issue.
 
