No SAS2008 after upgrade

Richard Isted · Nov 22, 2023

kriansa said:
@Richard Isted I'm not sure if you need both lines, but regardless, could you attach your dmesg again?

Thanks, I just left the config in GRUB_CMDLINE_LINUX_DEFAULT for now.

Attached current dmesg. Really appreciate the help, thanks.

kriansa · Nov 22, 2023

@Richard Isted are you sure you ran update-grub after that change and then rebooted the host?

The second line of your dmesg shows that the boot params aren't present at all.
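
One quick way to check this after a reboot is to look at /proc/cmdline, which holds the command line the running kernel was booted with. A minimal sketch, assuming a POSIX shell; has_boot_param is just an illustrative helper name, not an existing tool:

```shell
# has_boot_param PARAM [FILE]
# Succeeds if PARAM appears as a whole word in FILE (default: /proc/cmdline).
has_boot_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Pad with spaces so the parameter only matches as a complete word.
    case " $(cat "$file") " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `has_boot_param pci=realloc=off && echo present || echo missing` shows whether the option made it onto the running kernel's command line.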

Richard Isted · Nov 22, 2023

Yes sir, I am indeed, but I appreciate what you are saying. I wonder if I can enter them manually as a test. This is my update-grub output after editing /etc/default/grub, which I presume is what I should have been doing.

Code:

update-grub
Generating grub configuration file ...
W: This system is booted via proxmox-boot-tool:
W: Executing 'update-grub' directly does not update the correct configs!
W: Running: 'proxmox-boot-tool refresh'


Copying and configuring kernels on /dev/disk/by-uuid/83FF-019E
        Copying kernel and creating boot-entry for 6.2.16-19-pve
        Copying kernel and creating boot-entry for 6.2.16-3-pve
Found linux image: /boot/vmlinuz-6.2.16-19-pve
Found initrd image: /boot/initrd.img-6.2.16-19-pve
Found linux image: /boot/vmlinuz-6.2.16-3-pve
Found initrd image: /boot/initrd.img-6.2.16-3-pve
Adding boot menu entry for UEFI Firmware Settings ...
done

Thanks.

Richard Isted · Nov 22, 2023

I think I might need to add the option to /etc/kernel/cmdline; will try that.

Richard Isted · Nov 22, 2023

Splendid, that did the trick!

Adding either pci=realloc=off or reserve=0x80000000,0xfffffff to /etc/kernel/cmdline and running proxmox-boot-tool refresh resolved the issue for me. As I'm booting from a ZFS disk it seems that /etc/default/grub is ignored.
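
For anyone else on a proxmox-boot-tool system, the change can be sketched roughly like this. It assumes /etc/kernel/cmdline holds the whole kernel command line on a single line; add_cmdline_param is a hypothetical helper written for illustration, and the commands that touch the real system are left as comments:

```shell
# add_cmdline_param FILE PARAM
# Appends PARAM to the single-line kernel command line in FILE,
# unless it is already present as a whole word.
add_cmdline_param() {
    file=$1
    param=$2
    current=$(head -n 1 "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already there, nothing to do
    esac
    # Write the line back with the new parameter appended.
    printf '%s%s\n' "${current:+$current }" "$param" > "$file"
}

# Intended use as root (commented out so nothing changes by accident):
#   cp /etc/kernel/cmdline /etc/kernel/cmdline.bak
#   add_cmdline_param /etc/kernel/cmdline pci=realloc=off
#   proxmox-boot-tool refresh
#   # reboot, then confirm with: cat /proc/cmdline
```

The helper is idempotent, so running it twice does not duplicate the option.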

Thanks @kriansa for the assistance. For now I've left pci=realloc=off as my boot option; I can always change to reserve=0x80000000,0xfffffff if that is a better option.

kriansa · Nov 22, 2023

@Richard Isted Thanks for confirming my hypothesis. In fact, neither option is a proper fix; both are workarounds. I was hoping someone would confirm that reserve works before I jumped to conclusions.

My findings

A patch introduced in kernel 6.2 tries to reclaim memory from the BIOS early in the boot process, after the BIOS has reported it as reserved. Later during boot, that memory gets used for MMIO (to communicate with devices such as PCIe cards). The problem is that on some platforms this isn't a safe operation: part of the memory previously reported as reserved isn't actually usable, so any device that happens to be assigned that block fails (in our case the SAS controller, but it could really be anything else).

You can see that behavior in the dmesg logs on lines starting with BIOS-e820, where it initially identifies and reserves memory blocks:

Code:

[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007398dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000007398e000-0x000000007458dfff] reserved
[    0.000000] BIOS-e820: [mem 0x000000007458e000-0x000000007dcd1fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007dcd2000-0x000000007ddd9fff] ACPI NVS

Then later, on kernel >= 6.2, you will notice the following:

Code:

[    0.000000] efi: Remove mem79: MMIO range=[0x80000000-0x8fffffff] (256MB) from e820 map
[    0.000000] efi: Not removing mem80: MMIO range=[0xfed1c000-0xfed1ffff] (16KB) from e820 map

Whenever it removes a range, it tells the kernel that this block is usable, when in fact, as I found out, that's not always the case.

You will notice that the issue with the driver starts precisely in that memory block: the driver tries to claim memory from it while loading, and fails:

Code:

[    1.552235] mpt3sas 0000:0b:00.0: BAR 1: can't reserve [mem 0x809c0000-0x809c3fff 64bit]
[    1.552243] mpt2sas_cm1: pci_request_selected_regions: failed
[    1.552288] mpt2sas_cm1: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:12348/_scsih_probe()!

And that's where it goes boom.

Workarounds

There are two possible workarounds: telling the kernel to reserve the blocks it's reclaiming early on, OR preventing mmconfig from happening so it doesn't go looking for "unused" blocks.

You can do the first one by reading your dmesg and looking for the can't reserve failure logged while the mpt3sas driver loads, then checking whether that range falls within a previously reclaimed block. If so, simply add this block to a reserve kernel parameter.
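
That check can be scripted. The sketch below reads dmesg text on stdin, collects the ranges from the efi: Remove lines and from the can't reserve lines, and prints a candidate reserve= parameter for every failing BAR that sits inside a removed range. It assumes the message formats shown above and that reserve= takes <base>,<size> pairs, as I read the kernel parameter documentation; find_reclaimed_conflicts is a made-up name:

```shell
# find_reclaimed_conflicts: reads dmesg text on stdin and prints one
# "reserve=<base>,<size>" candidate per failing BAR inside a removed range.
find_reclaimed_conflicts() {
    log=$(cat)
    # Ranges removed from the e820 map: "efi: Remove memNN: MMIO range=[start-end]"
    removed=$(printf '%s\n' "$log" |
        sed -n 's/.*efi: Remove mem[0-9]*: MMIO range=\[\(0x[0-9a-f]*\)-\(0x[0-9a-f]*\)\].*/\1 \2/p')
    # BARs a driver could not claim: "can't reserve [mem start-end ...]"
    failed=$(printf '%s\n' "$log" |
        sed -n "s/.*can't reserve \[mem \(0x[0-9a-f]*\)-\(0x[0-9a-f]*\).*/\1 \2/p")
    printf '%s\n' "$failed" | while read -r bar_start bar_end; do
        [ -n "$bar_start" ] || continue
        printf '%s\n' "$removed" | while read -r rng_start rng_end; do
            [ -n "$rng_start" ] || continue
            # The BAR conflicts if it lies entirely inside the removed range.
            if [ $(($bar_start)) -ge $(($rng_start)) ] &&
               [ $(($bar_end)) -le $(($rng_end)) ]; then
                # Size is inclusive of both ends (assumed <base>,<size> format).
                printf 'reserve=%s,0x%x\n' "$rng_start" $(($rng_end - $rng_start + 1))
            fi
        done
    done
}
```

Run it as `dmesg | find_reclaimed_conflicts`. For the log lines quoted above it prints reserve=0x80000000,0x10000000, i.e. the whole 256MB block; treat the output as a starting point and double-check it against your own dmesg before adding it to the kernel command line.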

The second workaround is simply adding either pci=nommconf OR pci=realloc=off, but to be honest I would stay away from those two, as I'm not sure whether they could affect other devices you may have.

Proper fix

This issue came from a patch introduced in 6.2.

It seems to have bitten other users as well, not just us owners of this specific hardware. After it got merged, it quickly broke some laptop hardware, and the developers patched it by skipping the reclaim when the memory chunk is too small.

To avoid any of these workarounds at all, I would love some help bringing this up with the upstream kernel, and maybe the Ubuntu kernel. Does that seem reasonable @t.lamprecht ?

t.lamprecht · Nov 22, 2023

kriansa said:
A patch introduced in kernel 6.2 tries to reclaim memory from the BIOS early in the boot process, after the BIOS has reported it as reserved. Later during boot, that memory gets used for MMIO (to communicate with devices such as PCIe cards). The problem is that on some platforms this isn't a safe operation: part of the memory previously reported as reserved isn't actually usable, so any device that happens to be assigned that block fails (in our case the SAS controller, but it could really be anything else).

Nice work here and big thanks for sharing your findings!

kriansa said:
This issue came from a patch introduced in 6.2.

It seems to have bitten other users as well, not just us owners of this specific hardware. After it got merged, it quickly broke some laptop hardware, and the developers patched it by skipping the reclaim when the memory chunk is too small.

To avoid any of these workarounds at all, I would love some help bringing this up with the upstream kernel, and maybe the Ubuntu kernel. Does that seem reasonable @t.lamprecht ?

IMO it seems reasonable. After all the work you put in to dissect this issue, it might be best if you reply to the aforementioned patch that introduced this with all the details you gathered, though. We certainly can jump in too, and definitely would cherry-pick any resulting patches.

Richard Isted · Nov 23, 2023

@kriansa Thank you for sharing those details, that is certainly an interesting find! Cracking detective work there and thanks for the assistance.

I will go back to the reserve option in that case. Thanks for helping me work round the issue so I have a usable system. Your work will also help others to identify issues with their PCI devices.

luckyluk83 · Dec 8, 2023

I can confirm that after adding pci=realloc=off to the default grub file, all of my disks attached to the SAS2008 are recognised and I can boot the newest kernel.

I've had to mitigate a few previous issues there as well, as I'm using an X99 platform and booting from BTRFS RAID1. At the moment my grub file looks like this:

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=25 intel_iommu=on pcie_aspm=off acpi_enforce_resources=lax pci=realloc=off"
GRUB_CMDLINE_LINUX=""

Daxcor · Feb 7, 2024

Sorry, I am new to grub and kernel parameters. I am trying to add pci=realloc=off but I don't know where to do that. Is it here:

Code:

menuentry 'Install Proxmox VE (Graphical)' --class debian --class gnu-linux --class gnu --class os {
    echo    'Loading Proxmox VE Installer ...'
    linux   /boot/linux26 ro ramdisk_size=16777216 rw quiet splash=silent
    echo    'Loading initial ramdisk ...'
    initrd  /boot/initrd.img
}

I don't see any entries that start with GRUB, so I am not sure what to do. Any help would be appreciated.

LnxBil · Feb 7, 2024

Daxcor said:
Sorry, I am new to grub and kernel parameters. I am trying to add pci=realloc=off but I don't know where to do that. Is it here:

Code:

menuentry 'Install Proxmox VE (Graphical)' --class debian --class gnu-linux --class gnu --class os {
    echo    'Loading Proxmox VE Installer ...'
    linux   /boot/linux26 ro ramdisk_size=16777216 rw quiet splash=silent
    echo    'Loading initial ramdisk ...'
    initrd  /boot/initrd.img
}

I don't see any entries that start with GRUB, so I am not sure what to do. Any help would be appreciated.

It's the line that starts with linux. There are already settings there, e.g. splash=silent. Just add it after that.

budy · Feb 12, 2024

You could always create your own config file in

/etc/default/grub.d/custom.cfg

and simply put it there. Just remember to run

Code:

update-grub

afterwards, which will update the grub config. This way, you will be safe from any distro updates messing with the default config. Once you don't need this workaround anymore, simply remove your custom.cfg or empty it and run

Code:

update-grub

again.
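
As a sketch, such a file could contain a single line like the one below. This assumes a Debian-style grub-mkconfig that sources /etc/default/grub.d/*.cfg as shell after /etc/default/grub; pci=realloc=off is only the example option from earlier in this thread:

```shell
# Possible contents of /etc/default/grub.d/custom.cfg (illustrative).
# Append to whatever /etc/default/grub already set instead of overwriting it.
GRUB_CMDLINE_LINUX_DEFAULT="${GRUB_CMDLINE_LINUX_DEFAULT:+$GRUB_CMDLINE_LINUX_DEFAULT }pci=realloc=off"
```

Keep in mind that, as reported earlier in the thread, a host booted via proxmox-boot-tool may ignore the GRUB defaults entirely and need the option in /etc/kernel/cmdline instead.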

zash1958 · Feb 13, 2024

We have a Dell SAS adapter (SAS2008 chipset) here which also fails with the newest kernels.

I want to replace it with a new and officially supported adapter. We have an LTO drive attached to this SAS adapter, which should be supported. Does anyone have a suggestion for a SAS adapter that works reliably with an LTO device attached to it?

Romainp · May 15, 2024

Hi!
I will join the thread because I ran into a similar situation on an X9DRE-TF+ SuperMicro MB with some LSI controllers and/or Perc H310 using megaraid_sas.
I will try to run the same tests you have already done to see whether my issue is related, and obviously to see if we can resolve it.
Thanks for all your efforts!

frank68 · May 15, 2024

megaraid-sas is broken with kernel 6.8. There are some other threads....

donhwyo · May 16, 2024

My old dell r715 with two of these cards is working fine and updated to current. See descriptions above in this thread. If any further output would be interesting I can provide it. Just ask.

Code:

proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libqb0: 1.0.5-1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
openvswitch-switch: 3.1.0-2+deb12u1
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

donhwyo · May 16, 2024

Code:

root@pve:~# lsmod |grep mpt
mptctl                 40960  1
mptbase               110592  1 mptctl
mpt3sas               364544  14
raid_class             12288  1 mpt3sas
scsi_transport_sas     53248  2 ses,mpt3sas
root@pve:~# dpkg -l |grep mpt
ii  libopenmpt0:amd64                    0.6.9-1                              amd64        module music library based on OpenMPT -- shared library
ii  mpt-status                           1.2.0-8+hwraid1+Debian.stretch.9.9   amd64        get RAID status out of mpt (and other) HW RAID controllers

frank68 · May 16, 2024

@donhwyo : You're using mpt3sas, that works. The problem is only with megaraid-sas.

JensF · May 16, 2024

No problems here. Updated everything yesterday.
On our PVE Dell R720 with H710P mini in RAID mode:

Code:

03:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID SAS 2208 [Thunderbolt] [1000:005b] (rev 05)
        DeviceName: Integrated RAID
        Subsystem: Dell PERC H710P Mini (for monolithics) [1028:1f34]
        Kernel driver in use: megaraid_sas
        Kernel modules: megaraid_sas

and on our PBS, an old IBM server with a SAS2008 in IT mode:

Code:

01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)
        DeviceName: LSI SAS 1068
        Subsystem: Broadcom / LSI 9211-8i [1000:3020]
        Kernel driver in use: mpt3sas
        Kernel modules: mpt3sas
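
To compare setups like these quickly, the driver bound to each controller can be pulled out of lspci -k style output. A small sketch, assuming the layout shown above (an unindented device line followed by indented detail lines); drivers_in_use is an illustrative name:

```shell
# drivers_in_use: reads "lspci -k" style output on stdin and prints
# "<slot> <driver>" for every device that has a kernel driver bound.
drivers_in_use() {
    awk '
        /^[0-9a-f]/             { slot = $1 }        # device line, e.g. "03:00.0 RAID bus controller ..."
        /Kernel driver in use:/ { print slot, $NF }  # indented detail line
    '
}

# Example: lspci -nnk | drivers_in_use | grep -E "mpt3sas|megaraid_sas"
```

That makes it easy to tell whether a given box is on mpt3sas or megaraid_sas before comparing results.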

donhwyo · May 17, 2024

frank68 said:
@donhwyo : You're using mpt3sas, that works. The problem is only with megaraid-sas.

The thread title is "No SAS2008 after upgrade". Is that not what you have?
