Hey all, I've got a strange error that seems harmless, but I'd like to find the root cause and I'm hoping someone has some ideas. I occasionally get kernel write errors on two specific 4Kn disks, both Toshiba MG09 SAS SED drives. Each error is accompanied by a kernel log line saying that the request wasn't aligned to the logical block size.
Here's the most recent event; the same thing also happened on 6/25 and 7/01:
Code:
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1705 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710089600 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1706 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710091647 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1708 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710093695 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1709 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710095743 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
Oct 01 00:29:18 Joker kernel: zio pool=tank vdev=/dev/disk/by-id/wwn-0x5000039ce85bf1cd-part1 error=5 type=2 offset=2923557486592 size=4186112 flags=1573046
Oct 01 00:29:18 Joker zed[4147459]: eid=16946 class=io pool='tank' vdev=wwn-0x5000039ce85bf1cd-part1 size=4186112 offset=2923557486592 priority=3 err=5 flags=0x1800b6 bookmark=95645:848:0:1082
Oct 01 00:29:18 Joker zed[4147462]: eid=16947 class=checksum pool='tank' vdev=wwn-0x5000039ce85bf1cd-part1 algorithm=fletcher4 size=4186112 offset=2923557486592 priority=4 err=52 flags=0x1800b0 bookmark=95645:848:0:1082
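For what it's worth, here's a rough Python sketch for checking the logged start sectors against the block size. It assumes the drives report a 4096-byte logical block and relies on the kernel logging sectors in 512-byte units; the `check_alignment` helper and its regex are my own, not part of any tool.

```python
import re

# The kernel logs "sector" in fixed 512-byte units regardless of the drive's
# logical block size, so on a 4Kn disk an aligned request must start on a
# multiple of 4096 / 512 = 8.
KERNEL_SECTOR_BYTES = 512
LOGICAL_BLOCK_BYTES = 4096  # assumption: what the 4Kn drives report

SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def check_alignment(log_lines, logical_block=LOGICAL_BLOCK_BYTES):
    """Return (device, sector, byte_offset, aligned) for each I/O error line."""
    results = []
    for line in log_lines:
        match = SECTOR_RE.search(line)
        if match is None:
            continue  # skip lines that are not block-layer I/O errors
        device = match.group(1)
        sector = int(match.group(2))
        byte_offset = sector * KERNEL_SECTOR_BYTES
        results.append((device, sector, byte_offset, byte_offset % logical_block == 0))
    return results


# I/O error lines copied from the log excerpt above.
LOG = """\
I/O error, dev sdd, sector 5710089600 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
I/O error, dev sdd, sector 5710091647 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
I/O error, dev sdd, sector 5710093695 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
I/O error, dev sdd, sector 5710095743 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
"""

for device, sector, byte_offset, aligned in check_alignment(LOG.splitlines()):
    print(f"{device} sector {sector}: byte offset {byte_offset}, aligned={aligned}")

# The offset and size from the zio line, checked the same way.
ZIO_OFFSET = 2923557486592
ZIO_SIZE = 4186112
print("zio offset aligned:", ZIO_OFFSET % LOGICAL_BLOCK_BYTES == 0)
print("zio size aligned:", ZIO_SIZE % LOGICAL_BLOCK_BYTES == 0)
```

By that check, the first start sector is a multiple of 8 and the other three are not, while the zio offset and size from the ZFS line are both multiples of 4096. That hints the misalignment shows up somewhere below ZFS, but I wouldn't treat it as conclusive.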
This results in output like this from zpool status:
Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 19:25:44 with 0 errors on Tue Oct 1 19:49:47 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500e8aca4f0  ONLINE       0     0     0
            wwn-0x5000039ce85bf1cd  ONLINE       0     4     4
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000c500e8b2e3b3  ONLINE       0     0     0
            wwn-0x5000039ce8626e19  ONLINE       0     2     2
          mirror-2                  ONLINE       0     0     0
            wwn-0x5000039af8d1d138  ONLINE       0     0     0
            wwn-0x5000039af8d1d150  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            wwn-0x5000c500e553ef5d  ONLINE       0     0     0
            wwn-0x5000c500e58a3e06  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            wwn-0x5000c500e5be860b  ONLINE       0     0     0
            wwn-0x5000cca2c2dbcbf4  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x5000c500e5b190de  ONLINE       0     0     0
            wwn-0x5000cca285c940e0  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5000039af8d1d3c6  ONLINE       0     0     0
            wwn-0x5000c500e54f7ef2  ONLINE       0     0     0
        special
          mirror-4                  ONLINE       0     0     0
            wwn-0x500a0751e6bc2335  ONLINE       0     0     0
            wwn-0x500a0751e5c75dfe  ONLINE       0     0     0
            wwn-0x5002538f3392be89  ONLINE       0     0     0
        spares
          wwn-0x5000c500e848371d    AVAIL

errors: No known data errors
The error counts aren't consistent, but they are always low (4 is the largest I've seen). I've attached the full output; it appears that both disks got hit with this error at around the same time, with nothing further after that. This happens extremely rarely and I haven't found any rhyme or reason for it.
The system is a 12900K with 128GB of memory and an LSI 9305-16i connected to a 45homelab HL15 holding 15 drives: seven 2-way mirrors plus a hot spare. Of the disks listed above, the two 4Kn disks showing the error are wwn-0x5000039ce85bf1cd and wwn-0x5000039ce8626e19. The disks that previously occupied those two slots never had issues; they were replaced only to upgrade the vdevs from 8TB -> 18TB.
The array has never had any runtime issues that were noticeable in any application, and clearing the error makes it go away for what appears to be about a month at a time, according to the attached log excerpts. Scrubs run every 4 months rather than monthly, so no scrub was running when the earlier errors occurred.
Any ideas on things to try to narrow down the issue?
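One thing I can check myself is whether any of the block-layer queue limits for these disks are not a multiple of the logical block size. A minimal sketch, assuming the usual sysfs layout under /sys/block/<dev>/queue; `read_queue_limits` and `unaligned_limits` are hypothetical helper names of my own, and sdd is just the device from the log above:

```python
from pathlib import Path

# Queue attributes exposed by the Linux block layer in sysfs.
QUEUE_ATTRS = (
    "logical_block_size",   # bytes
    "physical_block_size",  # bytes
    "max_sectors_kb",       # KiB
    "max_hw_sectors_kb",    # KiB
    "max_segments",         # count
    "max_segment_size",     # bytes
)

# Multiplier that converts each size-like limit to bytes.
BYTE_SCALE = {
    "max_sectors_kb": 1024,
    "max_hw_sectors_kb": 1024,
    "max_segment_size": 1,
}


def read_queue_limits(device, sys_block="/sys/block"):
    """Read whichever of the queue attributes exist for the given device."""
    queue_dir = Path(sys_block) / device / "queue"
    limits = {}
    for attr in QUEUE_ATTRS:
        path = queue_dir / attr
        if path.is_file():
            limits[attr] = int(path.read_text().strip())
    return limits


def unaligned_limits(limits):
    """Names of size limits that are not a multiple of the logical block size."""
    block = limits.get("logical_block_size")
    if not block:
        return []  # nothing to compare against
    return [
        name
        for name, scale in BYTE_SCALE.items()
        if name in limits and (limits[name] * scale) % block != 0
    ]


if __name__ == "__main__":
    limits = read_queue_limits("sdd")  # device name taken from the log above
    print(limits)
    print("not a multiple of the logical block size:", unaligned_limits(limits))
```

If any limit turned up in that second list on the 4Kn disks but not on the others, that would at least be a lead worth following.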