ZFS striped RAID degradation under high load

openaspace

Hello.
I have a striped ZFS pool made of 2 mirrors, and when Proxmox performs an intensive operation such as a backup, it sometimes reports read and write errors.

But a manual scrub run afterwards always reports 0 errors and 0 data repaired.

The pool runs on an HBA PCI SAS card, and I think the problem is in how the PCI card handles the read/write throughput.

Does anyone have experience with SAS PCI cards?
Thank you.
 
Hi.
I have a Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03).

The degradation happens only under heavy load, such as the nightly backups.
A scrub run at any time reports 0 errors. SMART is also OK, and I'm now running a long SMART test.

Therefore I think the problem is the controller... or the power.

prx1-Proxmox-Virtual-Environment.png
Code:
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50062e2f18f
Serial number:        Z1Z6RY3L0000W5198LL3
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Sun Oct  8 12:20:51 2023 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     49 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 48008:59
Manufactured in week 51 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  285
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2330
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 3795369
  Blocks received from initiator = 44132940
  Blocks read from cache and sent to initiator = 16224
  Number of read and write commands whose size <= segment size = 21842
  Number of read and write commands whose size > segment size = 944

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 48008.98
  number of minutes until next internal SMART test = 19

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4036639        0         0   4036639          0          1.943           0
write:         0        0         0         0          0         27.981           0

Non-medium error count:        0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Self-test execution status:             90% of test remaining
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]

Long (extended) Self-test duration: 32700 seconds [9.1 hours]
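For anyone who wants to compare these counters across all four disks, here is a minimal Python sketch that pulls the read/write rows out of the "Error counter log" section. The function name `parse_error_counters` and the field names are my own labels for the columns shown above, not anything provided by smartmontools, and the sketch assumes the seven-column layout from this output.

```python
import re

# Column labels are assumptions based on the header printed by smartctl above.
FIELDS = ["ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
          "algorithm_invocations", "gigabytes_processed", "total_uncorrected"]

def parse_error_counters(smart_text):
    """Return {'read': {...}, 'write': {...}} from a SCSI error counter log."""
    counters = {}
    for line in smart_text.splitlines():
        m = re.match(r"\s*(read|write|verify):\s+(.*)", line)
        if not m:
            continue
        values = m.group(2).split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout: skip the row rather than guess
        row = {}
        for name, value in zip(FIELDS, values):
            # Only the "Gigabytes processed" column is fractional.
            row[name] = float(value) if name == "gigabytes_processed" else int(value)
        counters[m.group(1)] = row
    return counters

# The two rows copied from the output above.
sample = """\
read:    4036639        0         0   4036639          0          1.943           0
write:         0        0         0         0          0         27.981           0
"""
counters = parse_error_counters(sample)
for op, row in counters.items():
    print(op, "uncorrected:", row["total_uncorrected"])
```

For this disk both rows show 0 total uncorrected errors, which matches the clean scrubs.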
 
"Therefore I think the problem is the controller... or the power .."
Or the cables, or the RAM, or the disks (even if SMART is OK). Try swapping (out) everything ...

First, I'd try swapping the cable ends at the disks, so that the two disks inside each mirror trade places, and see whether the error moves or not.
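The reasoning behind the swap test can be written down as a small decision function. This is only a sketch of the logic: `interpret_swap`, the disk IDs and the port names are hypothetical, and a real diagnosis would need several backup runs before trusting any branch.

```python
def interpret_swap(failing_disk_before, port_before, layout_after, failing_disk_after):
    """
    failing_disk_before: disk that showed errors before the swap
    port_before:         port/cable that disk was attached to
    layout_after:        dict mapping disk -> port after the swap
    failing_disk_after:  disk showing errors after the swap, or None
    """
    if failing_disk_after is None:
        return "inconclusive: no errors after the swap yet, keep monitoring"
    if failing_disk_after == failing_disk_before:
        return "errors followed the disk: suspect the disk itself"
    if layout_after.get(failing_disk_after) == port_before:
        return "errors stayed on the port: suspect cable, backplane or controller port"
    return "errors appeared elsewhere: suspect something shared (controller, power, RAM)"

# Hypothetical example: disk-A and disk-B traded places.
layout = {"disk-A": "port-2", "disk-B": "port-1"}
print(interpret_swap("disk-A", "port-1", layout, "disk-B"))
```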

Broadcom / LSI SAS2008
I have a lot of them and have never had any problem with them.
 
"I have a lot of them and have never had any problem with them."
They are quite old. I have 3 of them in my homelab, and one of them started to fail, so I replaced it. It was causing I/O errors and sometimes was no longer even recognized after a few minutes. They also tend to get very hot and shouldn't be used in a normal tower case without adding a fan, as they are designed to be passively cooled by the powerful, noisy fans of a rack server. I also repasted them this year, and the thermal paste was already very brittle.
 
The pool was degraded, so I powered the machine off.
Then, as a test, following the advice of @LnxBil, I reversed the order of the SAS connectors, from 1 to 4 to 4 to 1.

After powering on, the pool was online without errors, even without running the clear command.. o_O
Searching online, I also found that many users of this model add mpt3sas.max_queue_depth=10000, so I added it to the GRUB configuration.
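To check that the option really reached the running kernel after the reboot, the kernel command line in /proc/cmdline can be inspected. A minimal Python sketch follows; `get_kernel_param` is a hypothetical helper and the example command line is made up, apart from the mpt3sas option itself. Finding the option there only shows that it was passed at boot, not that the driver honours it.

```python
def get_kernel_param(cmdline, name):
    """Return the value of `name` from a kernel command line string, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name and sep:
            value = val  # if the option appears twice, keep the last occurrence
    return value

# Made-up command line for illustration; only the last token comes from the post.
example = ("BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/pve-root ro quiet "
           "mpt3sas.max_queue_depth=10000")
print(get_kernel_param(example, "mpt3sas.max_queue_depth"))

# On the host itself, the running command line can be read like this:
# with open("/proc/cmdline") as f:
#     print(get_kernel_param(f.read(), "mpt3sas.max_queue_depth"))
```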

Now that the pool is online and healthy, I'm running a scrub...

Code:
scan: scrub in progress since Sun Oct  8 13:25:37 2023
        4.16T scanned at 287M/s, 3.36T issued at 231M/s, 6.25T total
        0B repaired, 53.70% done, 03:38:32 to go
 
Final result: the scrub finished with 1 CKSUM error:

Code:
ZFS has finished a scrub:

eid: 32
class: scrub_finish
host: prx1
time: 2023-10-08 21:33:43+0200
pool: zfs-sas
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub repaired 0B in 08:08:06 with 0 errors on Sun Oct 8 21:33:43 2023
config:

NAME                        STATE     READ WRITE CKSUM
zfs-sas                     ONLINE       0     0     0
  mirror-0                  ONLINE       0     0     0
    scsi-35000c50062e2f95b  ONLINE       0     0     0
    scsi-35000c50062e2f3f7  ONLINE       0     0     0
  mirror-1                  ONLINE       0     0     0
    scsi-35000c50062e2f18f  ONLINE       0     0     0
    scsi-35000c50062e3329f  ONLINE       0     0     1

errors: No known data errors
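To follow which device collects errors from one night to the next, the config table can be filtered for rows with a nonzero counter. Here is a minimal Python sketch; `devices_with_errors` is a hypothetical helper, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above. Counters are compared as strings, so abbreviated values would also count as nonzero.

```python
def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        parts = line.split()
        if not in_config:
            # Rows start after the "NAME STATE READ WRITE CKSUM" header.
            in_config = parts[:2] == ["NAME", "STATE"]
            continue
        if not parts or parts[0] == "errors:":
            break  # a blank line or the "errors:" footer ends the table
        if len(parts) < 5:
            continue  # not a full counter row
        name, counts = parts[0], tuple(parts[2:5])
        if any(c != "0" for c in counts):
            flagged[name] = counts
    return flagged

# The config table from the scrub report above.
sample_status = """\
NAME STATE READ WRITE CKSUM
zfs-sas ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-35000c50062e2f95b ONLINE 0 0 0
scsi-35000c50062e2f3f7 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
scsi-35000c50062e2f18f ONLINE 0 0 0
scsi-35000c50062e3329f ONLINE 0 0 1

errors: No known data errors
"""
print(devices_with_errors(sample_status))
```

For the report above, the only flagged device is scsi-35000c50062e3329f in mirror-1, with a single checksum error.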
 
