Hey all. This looks like a known bug (the Debian report is against a different kernel), and from what I can tell it was supposed to be fixed, yet I'm still seeing it. It's very easy to replicate: just start a ZFS scrub. (I'm assuming the people in the bug reports weren't using ZFS and had to create the high-load scenarios documented there.) It's also pretty much game-ending for ZFS, since "scrub" and, more importantly, "resilver" are both high-load operations. IO drops to a pittance because of the controller resets, and you get onesie/twosie read/write checksum errors due to the timeouts.
Is there anything in the PVE kernel about this that you know of? Is it at all related to the Ubuntu report ICO 1810781, which is marked as complete/won't fix depending on the kernel? I know that report is mentioned in 926202, but the actual error code is different.
Code:
21:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 [1000:00c4] (rev 01)
	Subsystem: LSI Logic / Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 [1000:31a0]
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas
Actual issue (multiple repeats):
Code:
[17901.927979] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[17901.927982] mpt3sas_cm0: removing unresponding devices: start
[17901.927982] mpt3sas_cm0: removing unresponding devices: end-devices
[17901.927983] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[17901.927983] mpt3sas_cm0: removing unresponding devices: expanders
[17901.927984] mpt3sas_cm0: removing unresponding devices: complete
[17901.927988] mpt3sas_cm0: scan devices: start
[17901.928900] mpt3sas_cm0: scan devices: expanders start
[17901.928955] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[17901.928956] mpt3sas_cm0: scan devices: expanders complete
[17901.928956] mpt3sas_cm0: scan devices: end devices start
[17901.930722] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[17901.930723] mpt3sas_cm0: scan devices: end devices complete
[17901.930723] mpt3sas_cm0: scan devices: pcie end devices start
[17901.930739] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[17901.930754] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[17901.930756] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[17901.930756] mpt3sas_cm0: pcie devices: pcie end devices complete
[17901.930757] mpt3sas_cm0: scan devices: complete
[17901.934272] sd 1:0:14:0: Power-on or device reset occurred
[17902.937655] mpt3sas_cm0: fault_state(0x5862)!
[17902.938232] mpt3sas_cm0: sending diag reset !!
[17903.902790] mpt3sas_cm0: diag reset: SUCCESS
[17903.918307] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[17904.059127] mpt3sas_cm0: _base_display_fwpkg_version: complete
[17904.059746] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
[17904.060498] mpt3sas_cm0: LSISAS3224: FWVersion(16.00.01.00), ChipRevision(0x01), BiosVersion(18.00.00.00)
[17904.060991] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[17904.062060] mpt3sas_cm0: sending port enable !!
[17912.077673] mpt3sas_cm0: port enable: SUCCESS
[17912.078417] mpt3sas_cm0: search for end-devices: start
[17912.079973] scsi target1:0:0: handle(0x0019), sas_addr(0x5000c500837fbfe5)
[17912.080509] scsi target1:0:0: enclosure logical id(0x500062b202bc0440), slot(3)
[17912.081079] scsi target1:0:8: handle(0x001a), sas_addr(0x5000c5008d6a5dbd)
[17912.081611] scsi target1:0:8: enclosure logical id(0x500062b202bc0440), slot(19)
[17912.082194] scsi target1:0:16: handle(0x001b), sas_addr(0x5000c5008d6a8805)
[17912.082804] scsi target1:0:16: enclosure logical id(0x500062b202bc0440), slot(11)
[17912.083428] scsi target1:0:2: handle(0x001c), sas_addr(0x5000c5008d6a2175)
[17912.083952] scsi target1:0:2: enclosure logical id(0x500062b202bc0440), slot(0)
[17912.084477] handle changed from(0x001d)!!!
[17912.085034] scsi target1:0:3: handle(0x001d), sas_addr(0x5000c5008d6a2519)
[17912.085553] scsi target1:0:3: enclosure logical id(0x500062b202bc0440), slot(1)
[17912.086078] handle changed from(0x001e)!!!
[17912.086657] scsi target1:0:4: handle(0x001e), sas_addr(0x5000c5008d6a3ed1)
[17912.087288] scsi target1:0:4: enclosure logical id(0x500062b202bc0440), slot(7)
[17912.087942] handle changed from(0x001f)!!!
[17912.088542] scsi target1:0:5: handle(0x001f), sas_addr(0x5000c5008d6a3ef1)
[17912.089100] scsi target1:0:5: enclosure logical id(0x500062b202bc0440), slot(6)
[17912.089666] handle changed from(0x0020)!!!
[17912.090278] scsi target1:0:6: handle(0x0020), sas_addr(0x5000c500850fe505)
[17912.090959] scsi target1:0:6: enclosure logical id(0x500062b202bc0440), slot(4)
[17912.091597] handle changed from(0x0021)!!!
[17912.092221] scsi target1:0:7: handle(0x0021), sas_addr(0x5000c5008d6a2515)
[17912.092821] scsi target1:0:7: enclosure logical id(0x500062b202bc0440), slot(5)
[17912.093415] handle changed from(0x0022)!!!
[17912.094053] scsi target1:0:9: handle(0x0022), sas_addr(0x5000c5008d6966f5)
[17912.094765] scsi target1:0:9: enclosure logical id(0x500062b202bc0440), slot(18)
[17912.095414] handle changed from(0x0023)!!!
[17912.096043] scsi target1:0:10: handle(0x0023), sas_addr(0x5000c50083ac4fdd)
[17912.096651] scsi target1:0:10: enclosure logical id(0x500062b202bc0440), slot(16)
[17912.097257] handle changed from(0x0024)!!!
[17912.097914] scsi target1:0:11: handle(0x0024), sas_addr(0x5000c5008d645759)
[17912.098627] scsi target1:0:11: enclosure logical id(0x500062b202bc0440), slot(17)
[17912.099313] handle changed from(0x0025)!!!
[17912.099960] scsi target1:0:12: handle(0x0025), sas_addr(0x5000c5008d6a1d59)
[17912.100581] scsi target1:0:12: enclosure logical id(0x500062b202bc0440), slot(23)
[17912.101203] handle changed from(0x0026)!!!
[17912.101876] scsi target1:0:13: handle(0x0026), sas_addr(0x5000c5008d6a3ee1)
[17912.102627] scsi target1:0:13: enclosure logical id(0x500062b202bc0440), slot(22)
[17912.103341] handle changed from(0x0027)!!!
[17912.104006] scsi target1:0:14: handle(0x0027), sas_addr(0x5000c5008d6a5db9)
[17912.104646] scsi target1:0:14: enclosure logical id(0x500062b202bc0440), slot(20)
[17912.105279] handle changed from(0x0028)!!!
[17912.105937] scsi target1:0:15: handle(0x0028), sas_addr(0x5000c5008d6a3411)
[17912.106645] scsi target1:0:15: enclosure logical id(0x500062b202bc0440), slot(21)
[17912.107324] handle changed from(0x0029)!!!
[17912.107967] scsi target1:0:17: handle(0x0029), sas_addr(0x5000c5008d6a240d)
[17912.108577] scsi target1:0:17: enclosure logical id(0x500062b202bc0440), slot(10)
[17912.109183] handle changed from(0x002a)!!!
[17912.109858] scsi target1:0:18: handle(0x002a), sas_addr(0x5000c5008d6a3ef5)
[17912.110591] scsi target1:0:18: enclosure logical id(0x500062b202bc0440), slot(8)
[17912.111280] handle changed from(0x002b)!!!
[17912.111921] scsi target1:0:19: handle(0x002b), sas_addr(0x5000c5008d6c70a5)
[17912.112535] scsi target1:0:19: enclosure logical id(0x500062b202bc0440), slot(9)
[17912.113150] handle changed from(0x002c)!!!
[17912.113804] scsi target1:0:20: handle(0x002c), sas_addr(0x5000c5008d6b1c79)
[17912.114534] scsi target1:0:20: enclosure logical id(0x500062b202bc0440), slot(15)
[17912.115240] handle changed from(0x002d)!!!
[17912.115886] scsi target1:0:21: handle(0x002d), sas_addr(0x5000c5008d6c8729)
[17912.116506] scsi target1:0:21: enclosure logical id(0x500062b202bc0440), slot(14)
[17912.117123] handle changed from(0x002e)!!!
[17912.117787] scsi target1:0:22: handle(0x002e), sas_addr(0x5000c50093d311c1)
[17912.118519] scsi target1:0:22: enclosure logical id(0x500062b202bc0440), slot(12)
[17912.119227] handle changed from(0x002f)!!!
[17912.119875] scsi target1:0:23: handle(0x002f), sas_addr(0x5000c5008d6c70b1)
[17912.120497] scsi target1:0:23: enclosure logical id(0x500062b202bc0440), slot(13)
[17912.121116] handle changed from(0x0030)!!!
[17912.121773] scsi target1:0:1: handle(0x0030), sas_addr(0x5000c5008d6c7109)
[17912.122506] scsi target1:0:1: enclosure logical id(0x500062b202bc0440), slot(2)
[17912.123219] handle changed from(0x001c)!!!
[17912.123894] mpt3sas_cm0: search for end-devices: complete
[17912.124509] mpt3sas_cm0: search for end-devices: start
[17912.125113] mpt3sas_cm0: search for PCIe end-devices: complete
[17912.125719] mpt3sas_cm0: search for expanders: start
[17912.126410] mpt3sas_cm0: search for expanders: complete
[17912.127106] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
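For anyone comparing these against other reports: the "break from ... scan" lines only print the raw loginfo word, while the log_info lines also print it split into fields. Below is a minimal Python sketch of that split. The bit layout (bus type in the top 4 bits, then originator, code and sub-code) and the originator names other than IOP are my assumptions based on my reading of the mpt3sas driver. Only the 0x3003011d decode is confirmed by the log above; the 0x310f0400 result is unverified. The function name `decode_loginfo` is just something I made up for illustration.

```python
def decode_loginfo(value: int) -> dict:
    """Split a 32-bit mpt3sas loginfo word into its bit fields.

    Assumed layout (not confirmed beyond the one decoded line in the log):
    bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code.
    """
    # Only "IOP" (0) is confirmed by the log; the other names are assumptions.
    originators = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": originators.get(originator, hex(originator)),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    # The two loginfo words that appear in the log above.
    for word in (0x3003011D, 0x310F0400):
        fields = decode_loginfo(word)
        print(
            f"loginfo(0x{word:08x}): originator({fields['originator']}), "
            f"code(0x{fields['code']:02x}), sub_code(0x{fields['sub_code']:04x})"
        )
```

The first word comes out as originator(IOP), code(0x03), sub_code(0x011d), the same as what the driver printed, which is the only check I have on the layout.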
The Ubuntu bug is for the error mpt3sas_cm0: fault_state(0x2100)!, whereas I'm seeing mpt3sas_cm0: fault_state(0x5862)!. The Ubuntu report is also for LSI [1000:00ac], while my card is [1000:00c4] (subsystem [1000:31a0]).
The LSI 3008 chipset on the motherboard itself (flashed to the same firmware revision) is [1000:0097] / [15d9:0808] and does not appear to exhibit the same bug. However, it has only 6 drives attached compared to the 24 on the 3224, and all 6 are SSDs, whereas the 24 are Seagate enterprise drives.
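Since the fault code is what separates this from the Ubuntu report, here is a rough Python sketch for tallying fault_state codes per controller from a saved dmesg dump, so it's easy to see which code (0x5862, 0x2100, or something else) a box is actually hitting and how often. The regex is written against the line format in my log above, and `summarize_faults` is a hypothetical helper name, not part of any tool.

```python
import re
from collections import Counter

# Matches lines like: [17902.937655] mpt3sas_cm0: fault_state(0x5862)!
FAULT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(mpt3sas_cm\d+): fault_state\((0x[0-9a-fA-F]+)\)!"
)


def summarize_faults(dmesg_text: str):
    """Return (Counter keyed by (controller, fault code), list of timestamps)."""
    counts = Counter()
    stamps = []
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group(2), match.group(3).lower())] += 1
            stamps.append(float(match.group(1)))
    return counts, stamps


if __name__ == "__main__":
    # Three lines copied from the log above; only the first is a fault line.
    sample = (
        "[17902.937655] mpt3sas_cm0: fault_state(0x5862)!\n"
        "[17902.938232] mpt3sas_cm0: sending diag reset !!\n"
        "[17903.902790] mpt3sas_cm0: diag reset: SUCCESS\n"
    )
    counts, stamps = summarize_faults(sample)
    for (controller, code), n in counts.items():
        print(f"{controller}: fault_state({code}) seen {n} time(s)")
```

To run it against a real machine, save the kernel log to a file first (for example with `dmesg > dmesg.txt`) and pass the file contents to `summarize_faults`. The timestamps list gives the spacing between resets during a scrub.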