I'm new to ZFS. This is the first pool I've worked with, on a server I set up just recently. I chose ZFS over RAID5 for its protection against data errors, based on what I'd read online, so my knowledge is really basic.
I have a raidz1-0 pool that's been up a month or so with some data flowing to it, and I see a number of READ, WRITE, and CKSUM errors pop up. I've set up a cron job to scrub every week, but I noticed the errors have always been on drives 1-4. Those four are plugged into the SATA ports on my motherboard that face 3 o'clock, while the other two drives (with no read/write errors) are on the SATA plugs that face 6 o'clock. I thought perhaps the cables on drives 1-4 were bad, so I replaced them all, and I even changed the power cable from the PSU for those drives. After that, a scrub checked out clean (the files that zpool status -v had listed as errored seemed to be fine). But after a bit more reading and writing of data, a few more errors popped up.
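For reference, the weekly scrub is just a one-line cron entry along these lines (the day/time shown is an example, not necessarily what I picked):

```shell
# /etc/cron.d/zfs-scrub -- scrub the 'nas' pool every Sunday at 02:00
# (schedule is an example; pool name matches the status output below)
0 2 * * 0  root  /sbin/zpool scrub nas
```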
I also noticed that all four of those drives are the same model (ST14000NM001G), while the other two are a different one (ST14000NM0018). I wondered if there was some incompatibility between that model and my board, so I ordered a new 0018 and did a zpool replace on one of the 001G drives. I'm now running
badblocks -svwb 4096 /dev/sda
on the drive I replaced in the pool as a test. (Note: the drive marked FAULTED only went to that state after I had decided to replace the first drive; otherwise I would have selected it first.) The first drive in this series is now attached to a USB-to-SATA converter while /dev/sda is being badblocks'd. After badblocks is done, if the drive passes, I'll replace the ZL293B4B drive.

zpool status:
Code:
  pool: nas
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Jul 24 06:03:48 2023
        20.0T scanned at 207M/s, 18.7T issued at 192M/s, 39.2T total
        0B repaired, 47.59% done, 1 days 07:06:57 to go
config:

        NAME                                   STATE     READ WRITE CKSUM
        nas                                    DEGRADED     0     0     0
          raidz1-0                             DEGRADED    55     0     0
            ata-ST14000NM0018-2H4101_ZHZ38XLQ  ONLINE       0     0    12
            ata-ST14000NM001G-2KJ103_WL202XJ9  ONLINE      28     6 10.6K
            ata-ST14000NM001G-2KJ103_ZL293B4B  FAULTED     15     8     0  too many errors
            ata-ST14000NM001G-2KJ103_ZTM089LF  ONLINE      48     6 10.6K
            ata-ST14000NM0018-2H4101_ZHZ32TWF  ONLINE       0     0 10.6K
            ata-ST14000NM0018-2H4101_ZHZ3WLKC  ONLINE       0     0 10.6K

errors: 8 data errors, use '-v' for a list
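For completeness, the replace I mentioned earlier was run roughly like this. I don't have the old drive's serial in front of me, so OLD_SERIAL below is a placeholder; the new drive's id is taken from the status output above:

```shell
# Swap a suspect 001G drive for the new 0018 drive
# (OLD_SERIAL is a placeholder for the removed drive's id)
zpool replace nas \
    ata-ST14000NM001G-2KJ103_OLD_SERIAL \
    ata-ST14000NM0018-2H4101_ZHZ38XLQ
```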
I guess my question is: what is normal? I've run SMART checks on all of these drives and they all pass. I have all the data on these drives backed up, so I'm truly treating this as a learning experience, but I've written roughly 40TB to the pool. Should I expect to see some READ/WRITE errors pop up every now and then and get cleaned up by a scrub, or should they always be 0? Thanks!
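For what it's worth, the SMART checks I mentioned were just smartctl runs like this, done per drive (the device node here is an example):

```shell
# Kick off a long self-test, then dump SMART attributes and test results
# (/dev/sda is an example device node -- repeat for each drive)
smartctl -t long /dev/sda
smartctl -a /dev/sda
```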