Random tag# request not aligned on 4Kn disks only

Oct 2, 2024
Hey all, I've got a strange error that seems harmless, but I'd like to find the root cause and I'm hoping someone has some ideas. I occasionally get a kernel write error on two specific 4Kn disks: Toshiba MG09 SAS SED drives. The errors coincide with kernel messages saying the request wasn't aligned to the logical block size.
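
In case it's relevant, the sector sizes can be double-checked with something like this (sdd is just an example; LOG-SEC and PHY-SEC should both read 4096 for a true 4Kn drive):

Code:
lsblk -d -o NAME,MODEL,LOG-SEC,PHY-SEC /dev/sdd
cat /sys/block/sdd/queue/logical_block_size /sys/block/sdd/queue/physical_block_size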

Here's the most recent event, though it occurred previously on 6/25 and 7/01:

Code:
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1705 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710089600 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1706 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710091647 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1708 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710093695 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1709 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710095743 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
Oct 01 00:29:18 Joker kernel: zio pool=tank vdev=/dev/disk/by-id/wwn-0x5000039ce85bf1cd-part1 error=5 type=2 offset=2923557486592 size=4186112 flags=1573046
Oct 01 00:29:18 Joker zed[4147459]: eid=16946 class=io pool='tank' vdev=wwn-0x5000039ce85bf1cd-part1 size=4186112 offset=2923557486592 priority=3 err=5 flags=0x1800b6 bookmark=95645:848:0:1082
Oct 01 00:29:18 Joker zed[4147462]: eid=16947 class=checksum pool='tank' vdev=wwn-0x5000039ce85bf1cd-part1 algorithm=fletcher4 size=4186112 offset=2923557486592 priority=4 err=52 flags=0x1800b0 bookmark=95645:848:0:1082
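
One thing I noticed while staring at these: if I'm reading the kernel messages right, the sector numbers are in 512-byte units, so a request on a 4Kn disk should start at a sector divisible by 8. The first failed write starts on a 4K boundary, but the other three don't:

Code:
for s in 5710089600 5710091647 5710093695 5710095743; do echo "$s % 8 = $(( s % 8 ))"; done
# 5710089600 % 8 = 0   <- starts 4K-aligned
# 5710091647 % 8 = 7   <- not 4K-aligned
# 5710093695 % 8 = 7
# 5710095743 % 8 = 7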

Those errors then show up in zpool status like this:

Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 19:25:44 with 0 errors on Tue Oct  1 19:49:47 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500e8aca4f0  ONLINE       0     0     0
            wwn-0x5000039ce85bf1cd  ONLINE       0     4     4
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000c500e8b2e3b3  ONLINE       0     0     0
            wwn-0x5000039ce8626e19  ONLINE       0     2     2
          mirror-2                  ONLINE       0     0     0
            wwn-0x5000039af8d1d138  ONLINE       0     0     0
            wwn-0x5000039af8d1d150  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            wwn-0x5000c500e553ef5d  ONLINE       0     0     0
            wwn-0x5000c500e58a3e06  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            wwn-0x5000c500e5be860b  ONLINE       0     0     0
            wwn-0x5000cca2c2dbcbf4  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x5000c500e5b190de  ONLINE       0     0     0
            wwn-0x5000cca285c940e0  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5000039af8d1d3c6  ONLINE       0     0     0
            wwn-0x5000c500e54f7ef2  ONLINE       0     0     0
        special
          mirror-4                  ONLINE       0     0     0
            wwn-0x500a0751e6bc2335  ONLINE       0     0     0
            wwn-0x500a0751e5c75dfe  ONLINE       0     0     0
            wwn-0x5002538f3392be89  ONLINE       0     0     0
        spares
          wwn-0x5000c500e848371d    AVAIL  

errors: No known data errors

The error counts aren't consistent, but they're always low (4 is the largest I've seen). I've attached the full output -- it looks like both disks were hit at around the same time and then never again. This happens extremely rarely and I haven't found any rhyme or reason to it.

The system is a 12900K with 128GB of memory and an LSI 9305-16i HBA in a 45homelab HL15 chassis holding 15 drives in a 7x2-mirror + hot-spare layout. Of the disks above, the two 4Kn disks hitting the error are wwn-0x5000039ce85bf1cd and wwn-0x5000039ce8626e19. The drives that previously occupied those slots never had issues, but were replaced to upgrade the vdev from 8TB to 18TB.
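
Since the HBA is in the path for every drive, one thing I can look at is whether the mpt3sas driver logged any resets or aborts around those timestamps, and which firmware/driver combination is in play -- roughly along these lines (sas3flash is only there if the Broadcom/LSI tools are installed):

Code:
dmesg -T | grep -i mpt3sas                  # look for resets/task aborts near the error times
modinfo mpt3sas | grep -i '^version'        # driver version
sas3flash -list                             # HBA firmware/BIOS versions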

The array has never had any runtime issues that were noticeable in applications, and clearing the error makes it stay away for what looks like about a month at a time, going by the attached log excerpts. Scrubs run every 4 months rather than monthly, so no scrub was running when the previous errors occurred.
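
Next time it fires I'm planning to grab more detail before clearing anything, roughly along these lines (device name is just an example):

Code:
zpool events -v tank                           # full ZED event details for the failed zios
smartctl -x /dev/sdd                           # SAS error counter logs on the affected drive
zpool clear tank wwn-0x5000039ce85bf1cd        # only after collecting the above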

Any ideas on things to try to narrow down the issue?
 
