Random tag# request not aligned on 4Kn disks only

Oct 2, 2024
Hey all, I've got a strange error that seems harmless, but I'd like to find a root cause and I'm hoping someone has some ideas. I occasionally get a kernel write error on two specific 4Kn disks: Toshiba MG09 SAS SED drives. These errors are correlated with logs saying that a given operation wasn't correctly aligned.

Here's the most recent event, though it occurred previously on 6/25 and 7/01:

Code:
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1705 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710089600 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1706 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710091647 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1708 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710093695 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1709 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710095743 op 0x1:(WRITE) flags 0x0 phys_seg 8 prio class 0
Oct 01 00:29:18 Joker kernel: zio pool=tank vdev=/dev/disk/by-id/wwn-0x5000039ce85bf1cd-part1 error=5 type=2 offset=2923557486592 size=4186112 flags=1573046
Oct 01 00:29:18 Joker zed[4147459]: eid=16946 class=io pool='tank' vdev=wwn-0x5000039ce85bf1cd-part1 size=4186112 offset=2923557486592 priority=3 err=5 flags=0x1800b6 bookmark=95645:848:0:1082
Oct 01 00:29:18 Joker zed[4147462]: eid=16947 class=checksum pool='tank' vdev=wwn-0x5000039ce85bf1cd-part1 algorithm=fletcher4 size=4186112 offset=2923557486592 priority=4 err=52 flags=0x1800b0 bookmark=95645:848:0:1082

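A quick sanity check on the log above, assuming the kernel reports sectors in its usual 512-byte units regardless of the drive's logical block size: on a 4Kn disk, a request can only be 4 KiB-aligned if its start sector is a multiple of 8. The start sector alone isn't conclusive (the request length must also be a multiple of 4 KiB), but the remainders are easy to check with shell arithmetic:

```shell
# Kernel I/O error lines give the sector in 512-byte units; on a 4Kn disk
# a 4 KiB-aligned request must start at a sector divisible by 8.
for s in 5710089600 5710091647 5710093695 5710095743; do
  echo "sector $s -> remainder $((s % 8))"
done
# sector 5710089600 -> remainder 0
# sector 5710091647 -> remainder 7
# sector 5710093695 -> remainder 7
# sector 5710095743 -> remainder 7
```

Three of the four requests start 7 sectors past a 4 KiB boundary, which matches the kernel's complaint; the interesting question is what issued them.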
That produces output like the following from zpool status:

Code:
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 19:25:44 with 0 errors on Tue Oct  1 19:49:47 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c500e8aca4f0  ONLINE       0     0     0
            wwn-0x5000039ce85bf1cd  ONLINE       0     4     4
          mirror-1                  ONLINE       0     0     0
            wwn-0x5000c500e8b2e3b3  ONLINE       0     0     0
            wwn-0x5000039ce8626e19  ONLINE       0     2     2
          mirror-2                  ONLINE       0     0     0
            wwn-0x5000039af8d1d138  ONLINE       0     0     0
            wwn-0x5000039af8d1d150  ONLINE       0     0     0
          mirror-3                  ONLINE       0     0     0
            wwn-0x5000c500e553ef5d  ONLINE       0     0     0
            wwn-0x5000c500e58a3e06  ONLINE       0     0     0
          mirror-5                  ONLINE       0     0     0
            wwn-0x5000c500e5be860b  ONLINE       0     0     0
            wwn-0x5000cca2c2dbcbf4  ONLINE       0     0     0
          mirror-6                  ONLINE       0     0     0
            wwn-0x5000c500e5b190de  ONLINE       0     0     0
            wwn-0x5000cca285c940e0  ONLINE       0     0     0
          mirror-7                  ONLINE       0     0     0
            wwn-0x5000039af8d1d3c6  ONLINE       0     0     0
            wwn-0x5000c500e54f7ef2  ONLINE       0     0     0
        special
          mirror-4                  ONLINE       0     0     0
            wwn-0x500a0751e6bc2335  ONLINE       0     0     0
            wwn-0x500a0751e5c75dfe  ONLINE       0     0     0
            wwn-0x5002538f3392be89  ONLINE       0     0     0
        spares
          wwn-0x5000c500e848371d    AVAIL  

errors: No known data errors

The error counts aren't consistent, but they're always low (4 is the largest I've seen). I've attached the full output -- in each incident it appears that both disks were hit at around the same time, and then not again until the next occurrence. This happens extremely rarely and I haven't found any rhyme or reason to it.

The system is a 12900K with 128 GB of memory and an LSI 9305-16i, connected to a 45homelab HL15 with 15 drives in a 7x2 mirror + hot spare layout. Of the disks above, the two 4Kn disks showing the error are wwn-0x5000039ce85bf1cd and wwn-0x5000039ce8626e19. The disks that previously occupied those slots never had issues, but they were replaced to upgrade the vdev from 8 TB to 18 TB.

The array has never had any runtime issues noticeable in applications, and clearing the error makes it go away for what appears to be about a month at a time, according to the attached log excerpts. Scrubs run every 4 months rather than monthly, so no scrub was running when the errors previously occurred.
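Since the incidents seem roughly monthly, one thing worth doing is pulling the timestamp of every occurrence out of the kernel log and comparing them against scheduled jobs (cron, snapshots, backups). A minimal sketch of the filter, run here against an inline sample copied from the log above -- on the live system you would pipe `journalctl -k` into the same command instead:

```shell
# Print the timestamp of each alignment error. Feed this from
# `journalctl -k` on a live system; an inline sample is used here.
awk '/request not aligned/ { print $1, $2, $3 }' <<'EOF'
Oct 01 00:29:18 Joker kernel: sd 0:0:3:0: [sdd] tag#1705 request not aligned to the logical block size
Oct 01 00:29:18 Joker kernel: I/O error, dev sdd, sector 5710089600 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 0
EOF
```

If the timestamps cluster around the same time of night (00:29 here), that points at a scheduled writer rather than the drives themselves.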

Any ideas on things to try to narrow down the issue?