Proxmox 6 mpt3sas Debian Bug ICO Report 926202 with Loaded Controller

armouredking

Member
Jul 3, 2016
Hey all. This bug looks to be known (though the Debian report is against a different kernel), and from what I can tell it was supposed to be fixed, yet I'm still seeing it. It's very easy to replicate by running a ZFS scrub (unlike in the bug reports, where I'm assuming they weren't using ZFS and had to create the high-load scenarios as documented). It's also pretty game-ending for ZFS, since "scrub" and, more importantly, "resilver" are both high-load operations. IO drops to a pittance because of the controller resets, and you get onesie/twosie read/write checksum errors from the timeouts.

Is there anything in the PVE kernel about it that you know of? Is it at all related to the Ubuntu report ICO 1810781, which is marked as complete/won't fix depending on kernel? I know that one is mentioned in 926202, but the actual error code is different.

Code:
21:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 [1000:00c4] (rev 01)
	Subsystem: LSI Logic / Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 [1000:31a0]
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

Actual issue (multiple repeats):
Code:
[17901.927979] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[17901.927982] mpt3sas_cm0: removing unresponding devices: start
[17901.927982] mpt3sas_cm0: removing unresponding devices: end-devices
[17901.927983] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[17901.927983] mpt3sas_cm0: removing unresponding devices: expanders
[17901.927984] mpt3sas_cm0: removing unresponding devices: complete
[17901.927988] mpt3sas_cm0: scan devices: start
[17901.928900] mpt3sas_cm0:     scan devices: expanders start
[17901.928955] mpt3sas_cm0:     break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[17901.928956] mpt3sas_cm0:     scan devices: expanders complete
[17901.928956] mpt3sas_cm0:     scan devices: end devices start
[17901.930722] mpt3sas_cm0:     break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[17901.930723] mpt3sas_cm0:     scan devices: end devices complete
[17901.930723] mpt3sas_cm0:     scan devices: pcie end devices start
[17901.930739] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[17901.930754] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[17901.930756] mpt3sas_cm0:     break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[17901.930756] mpt3sas_cm0:     pcie devices: pcie end devices complete
[17901.930757] mpt3sas_cm0: scan devices: complete
[17901.934272] sd 1:0:14:0: Power-on or device reset occurred
[17902.937655] mpt3sas_cm0: fault_state(0x5862)!
[17902.938232] mpt3sas_cm0: sending diag reset !!
[17903.902790] mpt3sas_cm0: diag reset: SUCCESS
[17903.918307] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[17904.059127] mpt3sas_cm0: _base_display_fwpkg_version: complete
[17904.059746] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
[17904.060498] mpt3sas_cm0: LSISAS3224: FWVersion(16.00.01.00), ChipRevision(0x01), BiosVersion(18.00.00.00)
[17904.060991] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[17904.062060] mpt3sas_cm0: sending port enable !!
[17912.077673] mpt3sas_cm0: port enable: SUCCESS
[17912.078417] mpt3sas_cm0: search for end-devices: start
[17912.079973] scsi target1:0:0: handle(0x0019), sas_addr(0x5000c500837fbfe5)
[17912.080509] scsi target1:0:0: enclosure logical id(0x500062b202bc0440), slot(3)
[17912.081079] scsi target1:0:8: handle(0x001a), sas_addr(0x5000c5008d6a5dbd)
[17912.081611] scsi target1:0:8: enclosure logical id(0x500062b202bc0440), slot(19)
[17912.082194] scsi target1:0:16: handle(0x001b), sas_addr(0x5000c5008d6a8805)
[17912.082804] scsi target1:0:16: enclosure logical id(0x500062b202bc0440), slot(11)
[17912.083428] scsi target1:0:2: handle(0x001c), sas_addr(0x5000c5008d6a2175)
[17912.083952] scsi target1:0:2: enclosure logical id(0x500062b202bc0440), slot(0)
[17912.084477]     handle changed from(0x001d)!!!
[17912.085034] scsi target1:0:3: handle(0x001d), sas_addr(0x5000c5008d6a2519)
[17912.085553] scsi target1:0:3: enclosure logical id(0x500062b202bc0440), slot(1)
[17912.086078]     handle changed from(0x001e)!!!
[17912.086657] scsi target1:0:4: handle(0x001e), sas_addr(0x5000c5008d6a3ed1)
[17912.087288] scsi target1:0:4: enclosure logical id(0x500062b202bc0440), slot(7)
[17912.087942]     handle changed from(0x001f)!!!
[17912.088542] scsi target1:0:5: handle(0x001f), sas_addr(0x5000c5008d6a3ef1)
[17912.089100] scsi target1:0:5: enclosure logical id(0x500062b202bc0440), slot(6)
[17912.089666]     handle changed from(0x0020)!!!
[17912.090278] scsi target1:0:6: handle(0x0020), sas_addr(0x5000c500850fe505)
[17912.090959] scsi target1:0:6: enclosure logical id(0x500062b202bc0440), slot(4)
[17912.091597]     handle changed from(0x0021)!!!
[17912.092221] scsi target1:0:7: handle(0x0021), sas_addr(0x5000c5008d6a2515)
[17912.092821] scsi target1:0:7: enclosure logical id(0x500062b202bc0440), slot(5)
[17912.093415]     handle changed from(0x0022)!!!
[17912.094053] scsi target1:0:9: handle(0x0022), sas_addr(0x5000c5008d6966f5)
[17912.094765] scsi target1:0:9: enclosure logical id(0x500062b202bc0440), slot(18)
[17912.095414]     handle changed from(0x0023)!!!
[17912.096043] scsi target1:0:10: handle(0x0023), sas_addr(0x5000c50083ac4fdd)
[17912.096651] scsi target1:0:10: enclosure logical id(0x500062b202bc0440), slot(16)
[17912.097257]     handle changed from(0x0024)!!!
[17912.097914] scsi target1:0:11: handle(0x0024), sas_addr(0x5000c5008d645759)
[17912.098627] scsi target1:0:11: enclosure logical id(0x500062b202bc0440), slot(17)
[17912.099313]     handle changed from(0x0025)!!!
[17912.099960] scsi target1:0:12: handle(0x0025), sas_addr(0x5000c5008d6a1d59)
[17912.100581] scsi target1:0:12: enclosure logical id(0x500062b202bc0440), slot(23)
[17912.101203]     handle changed from(0x0026)!!!
[17912.101876] scsi target1:0:13: handle(0x0026), sas_addr(0x5000c5008d6a3ee1)
[17912.102627] scsi target1:0:13: enclosure logical id(0x500062b202bc0440), slot(22)
[17912.103341]     handle changed from(0x0027)!!!
[17912.104006] scsi target1:0:14: handle(0x0027), sas_addr(0x5000c5008d6a5db9)
[17912.104646] scsi target1:0:14: enclosure logical id(0x500062b202bc0440), slot(20)
[17912.105279]     handle changed from(0x0028)!!!
[17912.105937] scsi target1:0:15: handle(0x0028), sas_addr(0x5000c5008d6a3411)
[17912.106645] scsi target1:0:15: enclosure logical id(0x500062b202bc0440), slot(21)
[17912.107324]     handle changed from(0x0029)!!!
[17912.107967] scsi target1:0:17: handle(0x0029), sas_addr(0x5000c5008d6a240d)
[17912.108577] scsi target1:0:17: enclosure logical id(0x500062b202bc0440), slot(10)
[17912.109183]     handle changed from(0x002a)!!!
[17912.109858] scsi target1:0:18: handle(0x002a), sas_addr(0x5000c5008d6a3ef5)
[17912.110591] scsi target1:0:18: enclosure logical id(0x500062b202bc0440), slot(8)
[17912.111280]     handle changed from(0x002b)!!!
[17912.111921] scsi target1:0:19: handle(0x002b), sas_addr(0x5000c5008d6c70a5)
[17912.112535] scsi target1:0:19: enclosure logical id(0x500062b202bc0440), slot(9)
[17912.113150]     handle changed from(0x002c)!!!
[17912.113804] scsi target1:0:20: handle(0x002c), sas_addr(0x5000c5008d6b1c79)
[17912.114534] scsi target1:0:20: enclosure logical id(0x500062b202bc0440), slot(15)
[17912.115240]     handle changed from(0x002d)!!!
[17912.115886] scsi target1:0:21: handle(0x002d), sas_addr(0x5000c5008d6c8729)
[17912.116506] scsi target1:0:21: enclosure logical id(0x500062b202bc0440), slot(14)
[17912.117123]     handle changed from(0x002e)!!!
[17912.117787] scsi target1:0:22: handle(0x002e), sas_addr(0x5000c50093d311c1)
[17912.118519] scsi target1:0:22: enclosure logical id(0x500062b202bc0440), slot(12)
[17912.119227]     handle changed from(0x002f)!!!
[17912.119875] scsi target1:0:23: handle(0x002f), sas_addr(0x5000c5008d6c70b1)
[17912.120497] scsi target1:0:23: enclosure logical id(0x500062b202bc0440), slot(13)
[17912.121116]     handle changed from(0x0030)!!!
[17912.121773] scsi target1:0:1: handle(0x0030), sas_addr(0x5000c5008d6c7109)
[17912.122506] scsi target1:0:1: enclosure logical id(0x500062b202bc0440), slot(2)
[17912.123219]     handle changed from(0x001c)!!!
[17912.123894] mpt3sas_cm0: search for end-devices: complete
[17912.124509] mpt3sas_cm0: search for end-devices: start
[17912.125113] mpt3sas_cm0: search for PCIe end-devices: complete
[17912.125719] mpt3sas_cm0: search for expanders: start
[17912.126410] mpt3sas_cm0: search for expanders: complete
[17912.127106] mpt3sas_cm0: _base_fault_reset_work: hard reset: success

The Ubuntu bug is for the error mpt3sas_cm0: fault_state(0x2100)!, whereas I'm seeing mpt3sas_cm0: fault_state(0x5862)!. The Ubuntu report is also for LSI [1000:00ac], compared to my [1000:00c4] (subsystem [1000:31a0]).
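
When comparing a trace against another bug report, it can help to pull out only the fault codes. The following is a minimal sketch, not anything shipped with PVE or the driver: the function name `tally_fault_states` is hypothetical, and the regular expression assumes the message format shown in this thread. The two sample lines are the messages quoted above.

```python
import re
from collections import Counter

# Matches "mpt3sas_cmN: fault_state(0x....)" anywhere in a line, so a dmesg
# timestamp or a syslog prefix in front of the message does not matter.
FAULT_RE = re.compile(r"(mpt3sas_cm\d+): fault_state\((0x[0-9a-fA-F]+)\)")


def tally_fault_states(lines):
    """Count (controller, fault code) pairs found in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2).lower())] += 1
    return counts


# Messages quoted in this thread: the one seen here and the one from the
# Ubuntu report. Any other line is ignored.
sample = [
    "[17902.937655] mpt3sas_cm0: fault_state(0x5862)!",
    "[17902.938232] mpt3sas_cm0: sending diag reset !!",
    "mpt3sas_cm0: fault_state(0x2100)!",
]

for (ioc, code), n in sorted(tally_fault_states(sample).items()):
    print(f"{ioc}: fault_state({code}) seen {n} time(s)")
```

Feeding it the lines of a saved dmesg or syslog file instead of `sample` would show how often each code occurs during a scrub.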

The LSI 3008 chipset on the motherboard itself (which is flashed to the same firmware revision) is a [1000:0097] / [15d9:0808] and does not appear to exhibit the same bug, but it also has only 6 drives attached, all SSDs, compared to the 24 Seagate enterprise drives on the 3224.
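
The vendor:device pairs being compared here can be extracted mechanically from the bracketed IDs that lspci prints. This is only an illustrative sketch: `pci_ids` is a hypothetical helper, and it assumes the `[vvvv:dddd]` notation seen in the controller line quoted earlier in this post.

```python
import re

# "[vvvv:dddd]" pairs as they appear in the lspci line quoted in this post.
# The class code "[0107]" has no colon, so it is not matched.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def pci_ids(lspci_text):
    """Return (vendor, device) tuples in the order they appear."""
    return PCI_ID_RE.findall(lspci_text.lower())


# Controller line from this post, and the ID named in the Ubuntu report.
hba = (
    "21:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic "
    "SAS3224 PCI-Express Fusion-MPT SAS-3 [1000:00c4] (rev 01)"
)
ubuntu_report_id = ("1000", "00ac")

print(pci_ids(hba))
print("same device as Ubuntu report:", ubuntu_report_id in pci_ids(hba))
```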
 
I am having the same issue. I have 4 nodes running PVE 6.0-7, and once in a while a server freezes and then reboots on its own. Looking through the log, I see the same errors you posted above. It's a Dell PowerEdge R7415 with the non-RAID Dell HBA330 Mini (Embedded) on-board controller, firmware 16.17.00.03. According to the boot screen it's actually an Avago Tech MPT SAS3 with MPT3BIOS-8.37.00.00 (2018.04.04).

Sep 9 17:16:01 pve7 systemd[1]: Started Proxmox VE replication runner.
Sep 9 17:17:00 pve7 systemd[1]: Starting Proxmox VE replication runner...
Sep 9 17:17:01 pve7 systemd[1]: pvesr.service: Succeeded.
Sep 9 17:17:01 pve7 systemd[1]: Started Proxmox VE replication runner.
Sep 9 17:17:01 pve7 CRON[671732]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Sep 9 17:17:16 pve7 kernel: [35102.602642] mpt3sas_cm0: fault_state(0x5862)!
Sep 9 17:17:16 pve7 kernel: [35102.602666] mpt3sas_cm0: sending diag reset !!
Sep 9 17:17:17 pve7 kernel: [35103.650419] mpt3sas_cm0: diag reset: SUCCESS
Sep 9 17:17:17 pve7 kernel: [35103.714629] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
Sep 9 17:17:17 pve7 kernel: [35103.886069] mpt3sas_cm0: _base_display_fwpkg_version: complete
Sep 9 17:17:17 pve7 kernel: [35103.886073] mpt3sas_cm0: FW Package Version (16.17.00.03)
Sep 9 17:17:17 pve7 kernel: [35103.886484] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.04.00), ChipRevision(0x02), BiosVersion(18.00.00.00)
Sep 9 17:17:17 pve7 kernel: [35103.886485] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
Sep 9 17:17:17 pve7 kernel: [35103.886536] mpt3sas_cm0: sending port enable !!
Sep 9 17:17:24 pve7 kernel: [35110.934013] mpt3sas_cm0: port enable: SUCCESS
Sep 9 17:17:24 pve7 kernel: [35110.934137] mpt3sas_cm0: search for end-devices: start
Sep 9 17:17:24 pve7 kernel: [35110.934513] scsi target1:0:0: handle(0x000a), sas_addr(0x50000399082245ca)
Sep 9 17:17:24 pve7 kernel: [35110.934516] scsi target1:0:0: enclosure logical id(0x500056b31234abff), slot(0)
Sep 9 17:17:24 pve7 kernel: [35110.934555] scsi target1:0:1: handle(0x000b), sas_addr(0x5000c500c1c595a5)
Sep 9 17:17:24 pve7 kernel: [35110.934556] scsi target1:0:1: enclosure logical id(0x500056b31234abff), slot(1)
Sep 9 17:17:24 pve7 kernel: [35110.934624] scsi target1:0:2: handle(0x000c), sas_addr(0x5000cca09987dda9)
Sep 9 17:17:24 pve7 kernel: [35110.934626] scsi target1:0:2: enclosure logical id(0x500056b31234abff), slot(2)
Sep 9 17:17:24 pve7 kernel: [35110.934663] scsi target1:0:3: handle(0x000d), sas_addr(0x5000cca09987df31)
Sep 9 17:17:24 pve7 kernel: [35110.934664] scsi target1:0:3: enclosure logical id(0x500056b31234abff), slot(3)
Sep 9 17:17:24 pve7 kernel: [35110.934703] scsi target1:0:4: handle(0x000e), sas_addr(0x5000cca09986d681)
Sep 9 17:17:24 pve7 kernel: [35110.934703] scsi target1:0:4: enclosure logical id(0x500056b31234abff), slot(4)
Sep 9 17:17:24 pve7 kernel: [35110.934742] scsi target1:0:5: handle(0x000f), sas_addr(0x5000c500c1caf7ed)
Sep 9 17:17:24 pve7 kernel: [35110.934744] scsi target1:0:5: enclosure logical id(0x500056b31234abff), slot(8)
Sep 9 17:17:24 pve7 kernel: [35110.934782] scsi target1:0:6: handle(0x0010), sas_addr(0x500056b3c8f281fd)
Sep 9 17:17:24 pve7 kernel: [35110.934783] scsi target1:0:6: enclosure logical id(0x500056b31234abff), slot(12)
Sep 9 17:17:24 pve7 kernel: [35110.934825] mpt3sas_cm0: search for end-devices: complete
Sep 9 17:17:24 pve7 kernel: [35110.934825] mpt3sas_cm0: search for end-devices: start
Sep 9 17:17:24 pve7 kernel: [35110.934825] mpt3sas_cm0: search for PCIe end-devices: complete
Sep 9 17:17:24 pve7 kernel: [35110.934826] mpt3sas_cm0: search for expanders: start
Sep 9 17:17:24 pve7 kernel: [35110.934864] expander present: handle(0x0009), sas_addr(0x500056b3c8f281ff)
Sep 9 17:17:24 pve7 kernel: [35110.934902] mpt3sas_cm0: search for expanders: complete
Sep 9 17:17:24 pve7 kernel: [35110.934907] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
Sep 9 17:17:24 pve7 kernel: [35110.934910] mpt3sas_cm0: removing unresponding devices: start
Sep 9 17:17:24 pve7 kernel: [35110.934911] mpt3sas_cm0: removing unresponding devices: end-devices
Sep 9 17:17:24 pve7 kernel: [35110.934911] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
Sep 9 17:17:24 pve7 kernel: [35110.934912] mpt3sas_cm0: removing unresponding devices: expanders
Sep 9 17:17:24 pve7 kernel: [35110.934912] mpt3sas_cm0: removing unresponding devices: complete
Sep 9 17:17:24 pve7 kernel: [35110.934914] mpt3sas_cm0: scan devices: start
Sep 9 17:17:24 pve7 kernel: [35110.935242] mpt3sas_cm0: scan devices: expanders start
Sep 9 17:17:24 pve7 kernel: [35110.936864] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
Sep 9 17:17:24 pve7 kernel: [35110.936865] mpt3sas_cm0: scan devices: expanders complete
Sep 9 17:17:24 pve7 kernel: [35110.936865] mpt3sas_cm0: scan devices: end devices start
Sep 9 17:17:24 pve7 kernel: [35110.937492] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
Sep 9 17:17:24 pve7 kernel: [35110.937493] mpt3sas_cm0: scan devices: end devices complete
Sep 9 17:17:24 pve7 kernel: [35110.937494] mpt3sas_cm0: scan devices: pcie end devices start
Sep 9 17:17:24 pve7 kernel: [35110.937509] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
Sep 9 17:17:24 pve7 kernel: [35110.937524] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
Sep 9 17:17:24 pve7 kernel: [35110.937526] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)

Luckily I have HA running to keep the VMs going, but sooner or later the unexpected reboots may cause VM corruption, so I'm hoping there is a fix, or at least a workaround, for now.
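
Both traces in this thread break out of the device scan with the same loginfo words (0x310f0400 and 0x3003011d). The sketch below splits such a word into fields. The bit layout (4-bit bus type, 4-bit originator, 8-bit code, 16-bit sub-code) and the originator names are assumptions based on how the driver appears to print them; the only check available from this thread is the one line the kernel decoded itself, log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d). `decode_loginfo` is a hypothetical name.

```python
# Assumed originator names; only "IOP" is confirmed by the logs in this thread.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_loginfo(value):
    """Split a 32-bit loginfo word into its assumed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


for word in (0x3003011D, 0x310F0400):
    f = decode_loginfo(word)
    print(
        f"log_info(0x{word:08x}): originator({f['originator']}), "
        f"code(0x{f['code']:02x}), sub_code(0x{f['sub_code']:04x})"
    )
```

For 0x3003011d this reproduces the kernel's own decoding, which is the main reason to trust the layout for the second value.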
 
