Hello everyone,
I'm encountering an issue with two of our nodes that both use the same Host Bus Adapter (HBA). Recently, they have begun to consistently log error messages, which are cluttering the journal. Here are examples of the repeated error messages:
```
Apr 04 02:28:05 pve kernel: mpt3sas_cm0: log_info(0x300301e1): originator(IOP), code(0x03), sub_code(0x01e1)
...
(Repeated errors)
...
Apr 04 02:28:25 pve kernel: mpt3sas_cm0: log_info(0x300301e1): originator(IOP), code(0x03), sub_code(0x01e1)
```
The errors recur at irregular intervals, often only minutes apart or less. In an attempt to address this, I upgraded the drivers for the LSI/Broadcom 9400-8 cards, which did update the mpt3sas driver to version 49. However, the errors persist. My research suggests these could be timeout errors, but that makes little sense given how little activity these systems see.
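For anyone who wants to check the value itself, here is a minimal sketch of how I understand the `log_info` word to break down. The field layout is an assumption based on the bitfield the Linux mpt3sas driver appears to use when it prints these lines, and the function and dictionary names are my own. It only reproduces the fields the kernel already shows (originator, code, sub_code); it does not tell us what sub_code `0x01e1` actually means.

```python
# Assumed layout of the 32-bit log_info word (from the mpt3sas driver's
# bitfield as I understand it): bus_type[31:28], originator[27:24],
# code[23:16], sub_code[15:0]. Names here are illustrative, not official.
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}


def decode_log_info(value: int) -> dict:
    """Split a log_info word into its four fields."""
    originator = (value >> 24) & 0xF
    return {
        "bus_type": (value >> 28) & 0xF,  # 0x3 is the SAS bus type
        "originator": ORIGINATORS.get(originator, f"unknown({originator})"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


# The value from the log lines above.
fields = decode_log_info(0x300301E1)
print({k: hex(v) if isinstance(v, int) else v for k, v in fields.items()})
```

The decoded originator and code match what the kernel prints, so at least the messages are being read correctly.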
Given that this issue surfaced on both nodes simultaneously within the past few weeks, I initially doubted hardware failure. Both nodes are running the latest firmware available for these HBA cards, which, despite being somewhat dated, should still be supported as these cards are currently on the market.
Considering these nodes have operated without such issues for over a year, I'm beginning to suspect a kernel-related problem. I have also updated the server BIOS on both machines following a recent release, yet this step didn't resolve the issue either.
At this point, the errors don't seem to be causing functional problems, but I'm concerned they might be indicative of a deeper issue. I'm at a bit of a loss for next steps and would greatly appreciate any guidance or suggestions from the community.
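To put numbers on how often the messages appear, something like the sketch below could be run over the journal output. It assumes the default short journal line format shown in the excerpt above (no year in the timestamp, so gaps spanning a year boundary would be wrong), and the function name and regular expression are my own. Note that the two sample lines are only the first and last of the excerpt, with repeats elided between them, so the gap printed here is not necessarily the gap between consecutive messages.

```python
import re
from datetime import datetime

# Matches lines like:
# "Apr 04 02:28:05 pve kernel: mpt3sas_cm0: log_info(0x300301e1): ..."
LINE_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"mpt3sas_cm\d+: log_info\(0x[0-9a-fA-F]+\)"
)


def gaps_in_seconds(lines):
    """Return the time gaps between successive matching log lines."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # The year is absent from the log line, so strptime uses its default.
            stamps.append(datetime.strptime(match.group(1), "%b %d %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


sample = [
    "Apr 04 02:28:05 pve kernel: mpt3sas_cm0: log_info(0x300301e1): originator(IOP), code(0x03), sub_code(0x01e1)",
    "Apr 04 02:28:25 pve kernel: mpt3sas_cm0: log_info(0x300301e1): originator(IOP), code(0x03), sub_code(0x01e1)",
]
gaps = gaps_in_seconds(sample)
print(gaps)
```

Replacing `sample` with lines read from `sys.stdin` would let it be fed from `journalctl -k`, which should make it easier to compare the rate on both nodes.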
Thank you in advance for your help!
Best regards,
Keith