[SOLVED] HW raid strange issue: megaraid_sas FW in FAULT state

Galen Chi

New Member
Mar 15, 2024
Greetings,

For a while now I've been experiencing a strange issue that randomly crashes I/O on the hardware RAID controller.
Rebooting the server fixes it temporarily.
I've stress-tested the machine, including the virtual drive on the MegaRAID SAS 9271-4i, but I can't reproduce the issue.

Machine specs:
CPU: AMD 4750G PRO
MB: Asus TUF B450 Plus-ii
RAM: Transcend 16GB ECC-DIMM * 2
PCIe: EVGA RTX 2060
PCIe: LSI MegaRAID 9271-4i
OS: pve-manager/8.1.4/ec5affc9e41f1d79 (running kernel: 6.5.13-1-pve)
PVE guests: Windows 10, TrueNAS, Ubuntu Server.
Additional: GPU passthrough to Windows 10, set up following the guide "PCI/GPU Passthrough on Proxmox VE 8 : Installation and configuration"

Symptoms: the LSI MegaRAID 9271-4i randomly disconnects / crashes.


Here is what I've tried so far, with no luck yet:
1. Upgraded the LSI MegaRAID 9271-4i BIOS/firmware from 5.48.04.0 to 5.50.03.0. The issue is still present.

2. Installed the management packages from https://hwraid.le-vert.net/wiki/DebianPackages

3. I just changed /etc/default/grub. Not sure yet whether this will solve the problem.
From:
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
To:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off iommu=pt"
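
An edit to /etc/default/grub only takes effect once the boot configuration is regenerated and the host is rebooted. The sketch below is one way to confirm afterwards that the running kernel really received the new parameter. `has_kernel_param` is a hypothetical helper name, and the commented commands assume a GRUB-booted host; installs that boot via systemd-boot are configured differently.

```shell
# Return 0 if PARAM appears as a whole word in a kernel command line.
# Usage: has_kernel_param PARAM [CMDLINE]
# CMDLINE defaults to the command line of the running kernel (/proc/cmdline).
# has_kernel_param is a hypothetical helper, not a Proxmox or GRUB tool.
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# Assumed workflow on a GRUB-booted host (run as root, then reboot):
#   update-grub
#   reboot
# After the reboot:
#   has_kernel_param pcie_aspm=off && echo "pcie_aspm=off is active"
```

Matching whole words avoids false positives from parameters that merely contain the same substring.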



Below is the error log:
Code:
Mar 15 09:28:54 proxmox kernel: megaraid_sas 0000:06:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0xffffffff
Mar 15 09:28:54 proxmox kernel: megaraid_sas 0000:06:00.0: FW in FAULT state Fault code:0xfff0000 subcode:0xff00 func:megasas_wait_for_outstanding_fusion
Mar 15 09:28:54 proxmox kernel: megaraid_sas 0000:06:00.0: resetting fusion adapter scsi0.
Mar 15 09:28:54 proxmox kernel: megaraid_sas 0000:06:00.0: Outstanding fastpath IOs: 19
Mar 15 09:30:46 proxmox kernel: megaraid_sas 0000:06:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 4097
Mar 15 09:30:49 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - got timeout
Mar 15 09:30:49 proxmox pvestatd[1194]: status update time (8.236 seconds)
Mar 15 09:30:59 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:30:59 proxmox pvestatd[1194]: status update time (8.221 seconds)
Mar 15 09:31:09 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:31:09 proxmox pvestatd[1194]: status update time (8.248 seconds)
Mar 15 09:31:19 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:31:20 proxmox pvestatd[1194]: status update time (8.252 seconds)
Mar 15 09:31:29 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:31:29 proxmox pvestatd[1194]: status update time (8.239 seconds)
Mar 15 09:31:39 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:31:39 proxmox pvestatd[1194]: status update time (8.232 seconds)
Mar 15 09:31:47 proxmox pvedaemon[932564]: <root@pam> successful auth for user 'root@pam'
Mar 15 09:31:49 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:31:49 proxmox pvestatd[1194]: status update time (8.252 seconds)
Mar 15 09:31:55 proxmox pvedaemon[932371]: <root@pam> successful auth for user 'root@pam'
Mar 15 09:31:59 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:32:00 proxmox pvestatd[1194]: status update time (8.251 seconds)
Mar 15 09:32:09 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:32:09 proxmox pvestatd[1194]: status update time (8.212 seconds)
Mar 15 09:32:19 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:32:19 proxmox pvestatd[1194]: status update time (8.256 seconds)
Mar 15 09:32:29 proxmox pvestatd[1194]: VM 101 qmp command failed - VM 101 qmp command 'query-proxmox-support' failed - unable to connect to VM 101 qmp socket - timeout after 51 retries
Mar 15 09:32:29 proxmox pvestatd[1194]: status update time (8.231 seconds)
Mar 15 09:32:36 proxmox kernel: INFO: task jbd2/sda1-8:702 blocked for more than 120 seconds.
Mar 15 09:32:36 proxmox kernel:       Tainted: P           O       6.5.13-1-pve #1
Mar 15 09:32:36 proxmox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 09:32:36 proxmox kernel: task:jbd2/sda1-8     state:D stack:0     pid:702   ppid:2      flags:0x00004000
Mar 15 09:32:36 proxmox kernel: Call Trace:
Mar 15 09:32:36 proxmox kernel:  <TASK>
Mar 15 09:32:36 proxmox kernel:  __schedule+0x3fc/0x1440
Mar 15 09:32:36 proxmox kernel:  ? update_load_avg+0x82/0x7f0
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  schedule+0x63/0x110
Mar 15 09:32:36 proxmox kernel:  io_schedule+0x46/0x80
Mar 15 09:32:36 proxmox kernel:  bit_wait_io+0x11/0x90
Mar 15 09:32:36 proxmox kernel:  __wait_on_bit+0x4d/0x120
Mar 15 09:32:36 proxmox kernel:  ? __pfx_bit_wait_io+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  out_of_line_wait_on_bit+0x8c/0xb0
Mar 15 09:32:36 proxmox kernel:  ? __pfx_wake_bit_function+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  __wait_on_buffer+0x30/0x50
Mar 15 09:32:36 proxmox kernel:  jbd2_journal_commit_transaction+0x1119/0x19d0
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  kjournald2+0xab/0x280
Mar 15 09:32:36 proxmox kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ? __pfx_kjournald2+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  kthread+0xf2/0x120
Mar 15 09:32:36 proxmox kernel:  ? __pfx_kthread+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ret_from_fork+0x47/0x70
Mar 15 09:32:36 proxmox kernel:  ? __pfx_kthread+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ret_from_fork_asm+0x1b/0x30
Mar 15 09:32:36 proxmox kernel:  </TASK>
Mar 15 09:32:36 proxmox kernel: INFO: task iou-wrk-933562:1033361 blocked for more than 120 seconds.
Mar 15 09:32:36 proxmox kernel:       Tainted: P           O       6.5.13-1-pve #1
Mar 15 09:32:36 proxmox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 09:32:36 proxmox kernel: task:iou-wrk-933562  state:D stack:0     pid:1033361 ppid:1      flags:0x00004000
Mar 15 09:32:36 proxmox kernel: Call Trace:
Mar 15 09:32:36 proxmox kernel:  <TASK>
Mar 15 09:32:36 proxmox kernel:  __schedule+0x3fc/0x1440
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? unlock_page+0x18/0x60
Mar 15 09:32:36 proxmox kernel:  schedule+0x63/0x110
Mar 15 09:32:36 proxmox kernel:  io_schedule+0x46/0x80
Mar 15 09:32:36 proxmox kernel:  folio_wait_bit_common+0x136/0x330
Mar 15 09:32:36 proxmox kernel:  ? __pfx_wake_page_function+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  folio_wait_bit+0x18/0x30
Mar 15 09:32:36 proxmox kernel:  folio_wait_writeback+0x2c/0xa0
Mar 15 09:32:36 proxmox kernel:  __filemap_fdatawait_range+0x90/0x100
Mar 15 09:32:36 proxmox kernel:  file_write_and_wait_range+0x93/0xc0
Mar 15 09:32:36 proxmox kernel:  ext4_sync_file+0x86/0x380
Mar 15 09:32:36 proxmox kernel:  ? raw_spin_rq_unlock+0x10/0x40
Mar 15 09:32:36 proxmox kernel:  vfs_fsync_range+0x4b/0xa0
Mar 15 09:32:36 proxmox kernel:  ? __schedule+0x404/0x1440
Mar 15 09:32:36 proxmox kernel:  io_fsync+0x3d/0x60
Mar 15 09:32:36 proxmox kernel:  io_issue_sqe+0x68/0x3f0
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? lock_timer_base+0x72/0xa0
Mar 15 09:32:36 proxmox kernel:  io_wq_submit_work+0x90/0x2f0
Mar 15 09:32:36 proxmox kernel:  ? __timer_delete_sync+0x8c/0x100
Mar 15 09:32:36 proxmox kernel:  io_worker_handle_work+0x156/0x590
Mar 15 09:32:36 proxmox kernel:  io_wq_worker+0x112/0x3c0
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? raw_spin_rq_unlock+0x10/0x40
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? finish_task_switch.isra.0+0x85/0x2c0
Mar 15 09:32:36 proxmox kernel:  ? __pfx_io_wq_worker+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ret_from_fork+0x47/0x70
Mar 15 09:32:36 proxmox kernel:  ? __pfx_io_wq_worker+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ret_from_fork_asm+0x1b/0x30
Mar 15 09:32:36 proxmox kernel: RIP: 0033:0x0
Mar 15 09:32:36 proxmox kernel: RSP: 002b:0000000000000000 EFLAGS: 00000293 ORIG_RAX: 000000000000010f
Mar 15 09:32:36 proxmox kernel: RAX: 0000000000000000 RBX: 00006135f1746fd0 RCX: 0000784b1471b256
Mar 15 09:32:36 proxmox kernel: RDX: 00007fff7ab6b1c0 RSI: 000000000000004f RDI: 00006135f1af4c00
Mar 15 09:32:36 proxmox kernel: RBP: 00007fff7ab6b22c R08: 0000000000000008 R09: 0000000000000000
Mar 15 09:32:36 proxmox kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007fff7ab6b1c0
Mar 15 09:32:36 proxmox kernel: R13: 00006135f1746fd0 R14: 00006135f08d2c48 R15: 00007fff7ab6b230
Mar 15 09:32:36 proxmox kernel:  </TASK>
Mar 15 09:32:36 proxmox kernel: INFO: task kworker/u64:0:1024601 blocked for more than 120 seconds.
Mar 15 09:32:36 proxmox kernel:       Tainted: P           O       6.5.13-1-pve #1
Mar 15 09:32:36 proxmox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 15 09:32:36 proxmox kernel: task:kworker/u64:0   state:D stack:0     pid:1024601 ppid:2      flags:0x00004000
Mar 15 09:32:36 proxmox kernel: Workqueue: writeback wb_workfn (flush-8:0)
Mar 15 09:32:36 proxmox kernel: Call Trace:
Mar 15 09:32:36 proxmox kernel:  <TASK>
Mar 15 09:32:36 proxmox kernel:  __schedule+0x3fc/0x1440
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  schedule+0x63/0x110
Mar 15 09:32:36 proxmox kernel:  io_schedule+0x46/0x80
Mar 15 09:32:36 proxmox kernel:  bit_wait_io+0x11/0x90
Mar 15 09:32:36 proxmox kernel:  __wait_on_bit+0x4d/0x120
Mar 15 09:32:36 proxmox kernel:  ? __pfx_bit_wait_io+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  out_of_line_wait_on_bit+0x8c/0xb0
Mar 15 09:32:36 proxmox kernel:  ? __pfx_wake_bit_function+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  do_get_write_access+0x284/0x440
Mar 15 09:32:36 proxmox kernel:  jbd2_journal_get_write_access+0x6b/0xa0
Mar 15 09:32:36 proxmox kernel:  __ext4_journal_get_write_access+0x8e/0x1c0
Mar 15 09:32:36 proxmox kernel:  ext4_reserve_inode_write+0x67/0xe0
Mar 15 09:32:36 proxmox kernel:  __ext4_mark_inode_dirty+0x71/0x240
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? __ext4_journal_start_sb+0x157/0x1d0
Mar 15 09:32:36 proxmox kernel:  ext4_dirty_inode+0x5c/0x90
Mar 15 09:32:36 proxmox kernel:  __mark_inode_dirty+0x5e/0x3b0
Mar 15 09:32:36 proxmox kernel:  ext4_da_update_reserve_space+0x184/0x1f0
Mar 15 09:32:36 proxmox kernel:  ext4_ext_map_blocks+0xf41/0x1b40
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? release_pages+0x155/0x4c0
Mar 15 09:32:36 proxmox kernel:  ? filemap_get_folios_tag+0x1c8/0x220
Mar 15 09:32:36 proxmox kernel:  ? __folio_batch_release+0x30/0x70
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? mpage_prepare_extent_to_map+0x50b/0x550
Mar 15 09:32:36 proxmox kernel:  ext4_map_blocks+0x1cb/0x620
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? kmem_cache_alloc+0x1a4/0x380
Mar 15 09:32:36 proxmox kernel:  ext4_do_writepages+0x711/0xdf0
Mar 15 09:32:36 proxmox kernel:  ext4_writepages+0xb5/0x190
Mar 15 09:32:36 proxmox kernel:  do_writepages+0xd0/0x1e0
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? nvme_prep_rq.part.0+0x3b3/0x870 [nvme]
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? fprop_reflect_period_percpu.isra.0+0x87/0x100
Mar 15 09:32:36 proxmox kernel:  __writeback_single_inode+0x44/0x370
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  ? srso_return_thunk+0x5/0x10
Mar 15 09:32:36 proxmox kernel:  writeback_sb_inodes+0x211/0x510
Mar 15 09:32:36 proxmox kernel:  __writeback_inodes_wb+0x54/0x100
Mar 15 09:32:36 proxmox kernel:  ? queue_io+0x115/0x120
Mar 15 09:32:36 proxmox kernel:  wb_writeback+0x2a8/0x320
Mar 15 09:32:36 proxmox kernel:  wb_workfn+0x2c7/0x4d0
Mar 15 09:32:36 proxmox kernel:  ? __schedule+0x404/0x1440
Mar 15 09:32:36 proxmox kernel:  process_one_work+0x23e/0x450
Mar 15 09:32:36 proxmox kernel:  worker_thread+0x50/0x3f0
Mar 15 09:32:36 proxmox kernel:  ? __pfx_worker_thread+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  kthread+0xf2/0x120
Mar 15 09:32:36 proxmox kernel:  ? __pfx_kthread+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ret_from_fork+0x47/0x70
Mar 15 09:32:36 proxmox kernel:  ? __pfx_kthread+0x10/0x10
Mar 15 09:32:36 proxmox kernel:  ret_from_fork_asm+0x1b/0x30
Mar 15 09:32:36 proxmox kernel:  </TASK>
Mar 15 09:32:37 proxmox kernel: megaraid_sas 0000:06:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 4097
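
Most of the log above is fallout (QMP timeouts, hung ext4/jbd2 tasks); the controller fault itself is only a few lines. A minimal sketch for isolating those lines, with the two patterns taken from this log. `megaraid_fault_lines` is a hypothetical helper name, and other firmware or driver versions may word these messages differently.

```shell
# Keep only the megaraid_sas lines that mark a firmware fault or a failed
# adapter reset. Reads log lines on stdin, writes matching lines to stdout.
# The two patterns come from the log in this thread; they are not a complete
# list of megaraid_sas error messages.
megaraid_fault_lines() {
    grep -E 'megaraid_sas .*(FW in FAULT state|Diag reset adapter never cleared)'
}

# Example, assuming systemd-journald: scan kernel messages of the current boot.
#   journalctl -k -b --no-pager | megaraid_fault_lines
```

The timestamp of the first match shows when the controller dropped out, which can then be compared against backups, scheduled consistency checks, or VM activity.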
 
MegaRaid Status :

Code:
root@proxmox:~# megacli -AdpAllInfo -aALL
                                    
Adapter #0

==============================================================================
                    Versions
                ================
Product Name    : LSI MegaRAID SAS 9271-4i
Serial No       : SK61821849
FW Package Build: 23.34.0-0019

                    Mfg. Data
                ================
Mfg. Date       : 05/21/16
Rework Date     : 00/00/00
Revision No     : 001
Battery FRU     : N/A

                Image Versions in Flash:
                ================
BIOS Version       : 5.50.03.0_4.17.08.00_0x06110200
WebBIOS Version    : 6.1-76-e_76-Rel
Preboot CLI Version: 05.07-00:#%00011
FW Version         : 3.460.115-6465
NVDATA Version     : 2.1507.03-0162
Boot Block Version : 2.05.00.00-0010
BOOT Version       : 07.26.26.219

                Pending Images in Flash
                ================
None

                PCI Info
                ================
Controller Id    : 0000
Vendor Id       : 1000
Device Id       : 005b
SubVendorId     : 1000
SubDeviceId     : 9276

Host Interface  : PCIE

ChipRevision    : D1

Link Speed          : 0
Number of Frontend Port: 0
Device Interface  : PCIE

Number of Backend Port: 8
Port  :  Address
0        4433221102000000
1        4433221103000000
2        0000000000000000
3        0000000000000000
4        0000000000000000
5        0000000000000000
6        0000000000000000
7        0000000000000000

                HW Configuration
                ================
SAS Address      : 500605b00bdf9a60
BBU              : Absent
Alarm            : Present
NVRAM            : Present
Serial Debugger  : Present
Memory           : Present
Flash            : Present
Memory Size      : 1024MB
TPM              : Absent
On board Expander: Absent
Upgrade Key      : Absent
Temperature sensor for ROC    : Present
Temperature sensor for controller    : Absent

ROC temperature : 53  degree Celsius

                Settings
                ================
Current Time                     : 9:3:43 3/15, 2024
Predictive Fail Poll Interval    : 300sec
Interrupt Throttle Active Count  : 16
Interrupt Throttle Completion    : 50us
Rebuild Rate                     : 30%
PR Rate                          : 30%
BGI Rate                         : 30%
Check Consistency Rate           : 30%
Reconstruction Rate              : 30%
Cache Flush Interval             : 4s
Max Drives to Spinup at One Time : 2
Delay Among Spinup Groups        : 12s
Physical Drive Coercion Mode     : 1GB
Cluster Mode                     : Disabled
Alarm                            : Enabled
Auto Rebuild                     : Enabled
Battery Warning                  : Enabled
Ecc Bucket Size                  : 15
Ecc Bucket Leak Rate             : 1440 Minutes
Restore HotSpare on Insertion    : Disabled
Expose Enclosure Devices         : Enabled
Maintain PD Fail History         : Enabled
Host Request Reordering          : Enabled
Auto Detect BackPlane Enabled    : SGPIO/i2c SEP
Load Balance Mode                : Auto
Use FDE Only                     : No
Security Key Assigned            : No
Security Key Failed              : No
Security Key Not Backedup        : No
Default LD PowerSave Policy      : Controller Defined
Maximum number of direct attached drives to spin up in 1 min : 10
Auto Enhanced Import             : Yes
Any Offline VD Cache Preserved   : No
Allow Boot with Preserved Cache  : No
Disable Online Controller Reset  : No
PFK in NVRAM                     : Yes
Use disk activity for locate     : No
POST delay             : 90 seconds
BIOS Error Handling               : Ignore Errors
Current Boot Mode           :Normal
                Capabilities
                ================
RAID Level Supported             : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
Supported Drives                 : SAS, SATA

Allowed Mixing:

Mix in Enclosure Allowed
Mix of SAS/SATA of HDD type in VD Allowed
Mix of SAS/SATA of SSD type in VD Allowed
Mix of SSD/HDD in VD Allowed

                Status
                ================
ECC Bucket Count                 : 0

                Limitations
                ================
Max Arms Per VD          : 32
Max Spans Per VD         : 8
Max Arrays               : 128
Max Number of VDs        : 64
Max Parallel Commands    : 1008
Max SGE Count            : 60
Max Data Transfer Size   : 8192 sectors
Max Strips PerIO         : 42
Max LD per array         : 16
Min Strip Size           : 8 KB
Max Strip Size           : 1.0 MB
Max Configurable CacheCade Size: 0 GB
Current Size of CacheCade      : 0 GB
Current Size of FW Cache       : 875 MB

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 3
  Disks           : 2
  Critical Disks  : 0
  Failed Disks    : 0

                Supported Adapter Operations
                ================
Rebuild Rate                    : Yes
CC Rate                         : Yes
BGI Rate                        : Yes
Reconstruct Rate                : Yes
Patrol Read Rate                : Yes
Alarm Control                   : Yes
Cluster Support                 : No
BBU                             : Yes
Spanning                        : Yes
Dedicated Hot Spare             : Yes
Revertible Hot Spares           : Yes
Foreign Config Import           : Yes
Self Diagnostic                 : Yes
Allow Mixed Redundancy on Array : No
Global Hot Spares               : Yes
Deny SCSI Passthrough           : No
Deny SMP Passthrough            : No
Deny STP Passthrough            : No
Support Security                : No
Snapshot Enabled                : No
Support the OCE without adding drives : Yes
Support PFK                     : Yes
Support PI                      : Yes
Support Boot Time PFK Change    : No
Disable Online PFK Change       : No
Support LDPI Type1                      : No
Support LDPI Type2                      : No
Support LDPI Type3                      : No
PFK TrailTime Remaining         : 0 days 0 hours
Support Shield State            : Yes
Block SSD Write Disk Cache Change: Yes
Support Online FW Update    : Yes

                Supported VD Operations
                ================
Read Policy          : Yes
Write Policy         : Yes
IO Policy            : Yes
Access Policy        : Yes
Disk Cache Policy    : Yes
Reconstruction       : Yes
Deny Locate          : No
Deny CC              : No
Allow Ctrl Encryption: No
Enable LDBBM         : Yes
Support Breakmirror  : No
Power Savings        : No

                Supported PD Operations
                ================
Force Online                            : Yes
Force Offline                           : Yes
Force Rebuild                           : Yes
Deny Force Failed                       : No
Deny Force Good/Bad                     : No
Deny Missing Replace                    : No
Deny Clear                              : No
Deny Locate                             : No
Support Temperature                     : Yes
NCQ                                     : Yes
Disable Copyback                        : No
Enable JBOD                             : No
Enable Copyback on SMART                : No
Enable Copyback to SSD on SMART Error   : Yes
Enable SSD Patrol Read                  : No
PR Correct Unconfigured Areas           : Yes
Enable Spin Down of UnConfigured Drives : Yes
Disable Spin Down of hot spares         : No
Spin Down time                          : 30
T10 Power State                         : No
                Error Counters
                ================
Memory Correctable Errors   : 0
Memory Uncorrectable Errors : 0

                Cluster Information
                ================
Cluster Permitted     : No
Cluster Active        : No

                Default Settings
                ================
Phy Polarity                     : 0
Phy PolaritySplit                : 0
Background Rate                  : 30
Strip Size                       : 256kB
Flush Time                       : 4 seconds
Write Policy                     : WB
Read Policy                      : RA
Cache When BBU Bad               : Disabled
Cached IO                        : No
SMART Mode                       : Mode 6
Alarm Disable                    : Yes
Coercion Mode                    : None
ZCR Config                       : Unknown
Dirty LED Shows Drive Activity   : No
BIOS Continue on Error           : 2
Spin Down Mode                   : None
Allowed Device Type              : SAS/SATA Mix
Allow Mix in Enclosure           : Yes
Allow HDD SAS/SATA Mix in VD     : Yes
Allow SSD SAS/SATA Mix in VD     : Yes
Allow HDD/SSD Mix in VD          : Yes
Allow SATA in Cluster            : No
Max Chained Enclosures           : 16
Disable Ctrl-R                   : Yes
Enable Web BIOS                  : Yes
Direct PD Mapping                : No
BIOS Enumerate VDs               : Yes
Restore Hot Spare on Insertion   : No
Expose Enclosure Devices         : Yes
Maintain PD Fail History         : Yes
Disable Puncturing               : No
Zero Based Enclosure Enumeration : No
PreBoot CLI Enabled              : Yes
LED Show Drive Activity          : Yes
Cluster Disable                  : Yes
SAS Disable                      : No
Auto Detect BackPlane Enable     : SGPIO/i2c SEP
Use FDE Only                     : No
Enable Led Header                : No
Delay during POST                : 0
EnableCrashDump                  : No
Disable Online Controller Reset  : No
EnableLDBBM                      : Yes
Un-Certified Hard Disk Drives    : Allow
Treat Single span R1E as R10     : No
Max LD per array                 : 16
Power Saving option              : Don't Auto spin down Configured Drives
Max power savings option is  not allowed for LDs. Only T10 power conditions are to be used.
Default spin down time in minutes: 30
Enable JBOD                      : No
TTY Log In Flash                 : No
Auto Enhanced Import             : Yes
BreakMirror RAID Support         : No
Disable Join Mirror              : No
Enable Shield State              : Yes
Time taken to detect CME         : 60s

Exit Code: 0x00
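
To check a single value, such as the BIOS version after the upgrade, without reading the whole dump, one field can be extracted from the output above. A minimal sketch, assuming the `Key : Value` layout shown here; `megacli_field` is a hypothetical helper name and not part of MegaCLI.

```shell
# Print the value of one "Key : Value" field from megacli -AdpAllInfo output.
# Reads the output on stdin; the key is split off at the first colon, so
# values that contain colons themselves (e.g. "Current Time") stay intact.
# megacli_field is a hypothetical helper, not a MegaCLI command.
megacli_field() {
    awk -v key="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }
    '
}

# Example, using the command from this post:
#   megacli -AdpAllInfo -aALL | megacli_field 'BIOS Version'
```

Only the first matching key is printed, so keys that appear in more than one section (for example `Alarm`) return the value from the earliest section.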
 
Thanks for your reply, RJ45.
It took me some time to set up StorCLI;
only a 32-bit build is available on the Broadcom support list for the 9271-4i.
Here is the output:

Code:
root@proxmox:/# storcli /c0 show cc
Controller = 0
Status = Success
Description = None


Controller Properties :
=====================

----------------------------------------------
Ctrl_Prop                 Value             
----------------------------------------------
CC Operation Mode         Concurrent       
CC Execution Delay        168               
CC Next Starttime         03/16/2024, 3:00:00
CC Current State          Stopped           
CC Number of iterations   13               
CC Number of VD completed 0                 
CC Excluded VDs           None             
----------------------------------------------

Code:
root@proxmox:/# storcli /c0 show all
Controller = 0
Status = Success
Description = None


Basics :
======
Controller = 0
Model = LSI MegaRAID SAS 9271-4i
Serial Number = SK61821849
Current Controller Date/Time = 03/15/2024, 10:58:59
Current System Date/time = 03/15/2024, 18:58:58
SAS Address = 500605b00bdf9a60
Mfg Date = 05/21/16
Rework Date = 00/00/00
Revision No = 001


Version :
=======
Firmware Package Build = 23.34.0-0019
Firmware Version = 3.460.115-6465
Bios Version = 5.50.03.0_4.17.08.00_0x06110200
NVDATA Version = 2.1507.03-0162
Boot Block Version = 2.05.00.00-0010
Bootloader Version = 07.26.26.219
Driver Name = megaraid_sas
Driver Version = 07.725.01.00-rc1


Bus :
===
Vendor Id = 0x1000
Device Id = 0x5B
SubVendor Id = 0x1000
SubDevice Id = 0x9276
Host Interface = PCIE
Device Interface = SAS-6G
Bus Number = 40455
Device Number = 0
Function Number = 0


Status :
======
Controller Status = OK
Memory Correctable Errors = 0
Memory UnCorrectable Errors = 0
ECC Bucket Count = 0
Any Offline VD Cache Preserved = No
BBU Status = NA


Advanced Software Option :
========================

-------------------------------------------------
Adv S/W Opt        Time Remaining           Mode
-------------------------------------------------
MegaRAID FastPath  Unlimited                - 
MegaRAID RAID6     Unlimited                - 
MegaRAID RAID5     Unlimited                - 
Cache Offload      Unlimited (Unsupported)  - 
-------------------------------------------------

Safe ID =  

HwCfg :
=====
ChipRevision =  D1
BatteryFRU = N/A
Front End Port Count = 0
Backend Port Count = 8
BBU = Absent
Alarm = On
Serial Debugger = Present
NVRAM Size = 32KB
Flash Size = 16MB
On Board Memory Size = 1024MB
On Board Expander = Absent
Temperature Sensor for ROC = Present
Temperature Sensor for Controller = Absent
Current Size of CacheCade (GB) = 0
Current Size of FW Cache (MB) = 875


Policies :
========

------------------------------------------------
Policy                          Current Default
------------------------------------------------
Predictive Fail Poll Interval   300 sec       
Interrupt Throttle Active Count 16           
Interrupt Throttle Completion   50 us         
Rebuild Rate                    30 %    30%   
PR Rate                         30 %    30%   
BGI Rate                        30 %    30%   
Check Consistency Rate          30 %    30%   
Reconstruction Rate             30 %    30%   
Cache Flush Interval            4s           
------------------------------------------------

Flush Time(Default) = 4s
Drive Coercion Mode = 1GB
Auto Rebuild = On
Battery Warning = On
ECC Bucket Size = 15
ECC Bucket Leak Rate (hrs) = 24
Restore HotSpare on Insertion = Off
Expose Enclosure Devices = On
Maintain PD Fail History = On
Reorder Host Requests = On
Auto detect BackPlane = logic of SGPIO,i2c SEP using h/w mechanism like GPIO pins
Load Balance Mode = Auto
Security Key Assigned = Off
Disable Online Controller Reset = Off
Use drive activity for locate = Off

Boot :
====
BIOS Enumerate VDs = 1
Stop BIOS on Error = Off
Delay during POST = 0
Spin Down Mode = None
Enable Ctrl-R = Yes
Enable Web BIOS = Yes
Enable PreBoot CLI = Yes
Enable BIOS = Yes
Max Drives to Spinup at One Time = 2
Maximum number of direct attached drives to spin up in 1 min = 10
Delay Among Spinup Groups (sec) = 12
Allow Boot with Preserved Cache = On


Defaults :
========
Phy Polarity = 0
Phy PolaritySplit = 0
Strip Size = 256kB
Write Policy = WB
Read Policy = RA
Cache When BBU Bad = Off
Cached IO = Off
VD PowerSave Policy = Controller Defined
Default spin down time (mins) = 30
Coercion Mode = 1 GB
ZCR Config = Unknown
Max Chained Enclosures = 16
Direct PD Mapping = No
Restore Hot Spare on Insertion = No
Expose Enclosure Devices = Yes
Maintain PD Fail History = Yes
Zero Based Enclosure Enumeration = No
Disable Puncturing = No
EnableLDBBM = Yes
Un-Certified Hard Disk Drives = Allow
SMART Mode = Mode 6
Enable LED Header = No
LED Show Drive Activity = Yes
Dirty LED Shows Drive Activity = No
EnableCrashDump = No
Disable Online Controller Reset = No
Treat Single span R1E as R10 = No
Power Saving option = Enable
TTY Log In Flash = No
Auto Enhanced Import = Yes
BreakMirror RAID Support = No
Disable Join Mirror = No
Enable Shield State = Yes
Time taken to detect CME = 60 sec


Capabilities :
============
Supported Drives = SAS, SATA
RAID Level Supported = RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
Enable JBOD = No
Mix in Enclosure = Allowed
Mix of SAS/SATA of HDD type in VD = Allowed
Mix of SAS/SATA of SSD type in VD = Allowed
Mix of SSD/HDD in VD = Allowed
SAS Disable = No
Max Arms Per VD = 32
Max Spans Per VD = 8
Max Arrays = 128
Max VD per array = 16
Max Number of VDs = 64
Max Parallel Commands = 1008
Max SGE Count = 60
Max Data Transfer Size = 8192 sectors
Max Strips PerIO = 42
Max Configurable CacheCade Size = 0
Min Strip Size = 8 KB
Max Strip Size = 1.0 MB


Scheduled Tasks :
===============
Consistency Check Reoccurrence = 168 hrs
Next Consistency check launch = 03/16/2024, 3:00:00
Patrol Read Reoccurrence = 168 hrs
Next Patrol Read launch = 03/16/2024, 3:00:00
Battery learn Reoccurrence = NA
Next Battery Learn = NA
OEMID = LSI

Drive Groups = 1

TOPOLOGY :
========

------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT     Size PDC  PI SED DS3  FSpace
------------------------------------------------------------------------
 0 -   -   -        -   RAID1 Optl  N  3.637 TB dsbl N  N   none N   
 0 0   -   -        -   RAID1 Optl  N  3.637 TB dsbl N  N   none N   
 0 0   0   252:0    4   DRIVE Onln  N  3.637 TB dsbl N  N   none -   
 0 0   1   252:1    5   DRIVE Onln  N  3.637 TB dsbl N  N   none -   
------------------------------------------------------------------------

UID=Unique Identification Number | DG=Disk Group Index
Arr=Array Index|Row=Row Index|EID=Enclosure Device ID
DID=Device ID|Type=Drive Type|BT=Background Task Active|PDC=PD Cache
PI=Protection Info|SED=Self Encrypting Drive|DS3=Dimmer Switch 3
FSpace=Free Space Present|dflt=Default|Msng=Missing|Frgn=Foreign

Virtual Drives = 1

VD LIST :
=======

---------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC     Size Name
---------------------------------------------------------
0/0   RAID1 Optl  RW     Yes     RAWBC -   3.637 TB   
---------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|Dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|B=Blocked|Consist=Consistent|
R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|AWB=Always WriteBack|
WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled Check Consistency

Physical Drives = 2

PD LIST :
=======

---------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model                Sp
---------------------------------------------------------------------------
252:0     4 Onln   0 3.637 TB SATA HDD N   N  512B WDC WD40EFPX-68C6CN0 U
252:1     5 Onln   0 3.637 TB SATA HDD N   N  512B WDC WD40EFPX-68C6CN0 U
---------------------------------------------------------------------------

EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-Drive Group|
Intf-Interface|Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info|
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|UGood-Unconfigured Good|UBad-Unconfigured Bad|Offln-Offline|Onln-Online|Rbld-Rebuild|Cpybck-Copyback|GHS-Global Hot Spare|DHS-Dedicated Hot Spare
 
Quick update:

My Proxmox host has been up for almost 5 days; normally megaraid_sas crashes within 1 to 4 days.
As mentioned in my first post, I changed /etc/default/grub.
From:
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
To:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off iommu=pt"
I forgot to mention that I also changed the boot mode to UEFI only on the motherboard.

I don't think it's related to the management packages I installed from https://hwraid.le-vert.net/wiki/DebianPackages.
Correct me if I'm wrong, but the issue I have seems more like a kernel and driver matter.


So I reckon one of these two changes fixed the issue:
1: Adding pcie_aspm=off to the kernel command line.
2: Disabling legacy boot and using UEFI boot only.
I'm not sure which one in particular; I'll do more testing to find out.

After hours of research on the internet, I found these two posts:

Megaraid SAS 9341-8i on Linux - Cooling and initialization issues

Fix for UEFI + Hardware RAID + Linux = megaraid_sas io_page_fault

 
I'm pretty sure the issue was solved by using UEFI boot only.
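
For anyone comparing their own setup: Linux exposes the directory /sys/firmware/efi only when the running kernel was booted through UEFI, so the boot mode can be checked from a shell. A minimal sketch; `boot_mode` and the `SYSFS_ROOT` override are hypothetical names introduced here, the latter only so the check can be exercised against a test directory.

```shell
# Print "UEFI" or "Legacy BIOS" depending on how the running kernel was booted.
# /sys/firmware/efi exists only on UEFI boots.
# SYSFS_ROOT is a hypothetical override for testing; leave it unset normally.
boot_mode() {
    root=${SYSFS_ROOT:-/sys}
    if [ -d "$root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "Legacy BIOS"
    fi
}

# Example:
#   boot_mode
```

This reports how the current boot happened, not what the firmware setup allows, so it is worth re-running after changing the motherboard setting and rebooting.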
 
