No SAS2008 after upgrade

@Richard Isted are you sure you ran update-grub after that change and then rebooted the host?

The second line of your dmesg shows that the boot params aren't present at all.
 
Yes sir, I am indeed, but I appreciate what you are saying. I wonder if I can enter them manually as a test. This is my update-grub output after editing /etc/default/grub, which I presume is what I should have been doing.

Code:
update-grub
Generating grub configuration file ...
W: This system is booted via proxmox-boot-tool:
W: Executing 'update-grub' directly does not update the correct configs!
W: Running: 'proxmox-boot-tool refresh'


Copying and configuring kernels on /dev/disk/by-uuid/83FF-019E
        Copying kernel and creating boot-entry for 6.2.16-19-pve
        Copying kernel and creating boot-entry for 6.2.16-3-pve
Found linux image: /boot/vmlinuz-6.2.16-19-pve
Found initrd image: /boot/initrd.img-6.2.16-19-pve
Found linux image: /boot/vmlinuz-6.2.16-3-pve
Found initrd image: /boot/initrd.img-6.2.16-3-pve
Adding boot menu entry for UEFI Firmware Settings ...
done

Thanks.
 
Splendid, that did the trick!

Adding either pci=realloc=off or reserve=0x80000000,0xfffffff to /etc/kernel/cmdline and running proxmox-boot-tool refresh resolved the issue for me. As I'm booting from a ZFS disk it seems that /etc/default/grub is ignored.
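For anyone scripting the edit described above, here is a minimal sketch of adding a parameter to the single line held in /etc/kernel/cmdline without duplicating it. The helper name add_kernel_param and the sample cmdline contents are assumptions for illustration only; your file will differ, so check it before writing anything back.

```python
def add_kernel_param(cmdline, param):
    """Return cmdline with param appended, unless it is already present."""
    tokens = cmdline.split()
    if param not in tokens:
        tokens.append(param)
    return " ".join(tokens)

# Hypothetical existing contents of /etc/kernel/cmdline on a ZFS-root install.
current = "root=ZFS=rpool/ROOT/pve-1 boot=zfs"
updated = add_kernel_param(current, "pci=realloc=off")
print(updated)
```

The result would still have to be written back to the file as one line, followed by proxmox-boot-tool refresh and a reboot, as described in the post.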

Thanks @kriansa for the assistance. For now I've left pci=realloc=off as my boot option; I can always change to reserve=0x80000000,0xfffffff if that is a better option.
 

Attachments

  • dmesg4.txt
@Richard Isted Thanks for confirming my hypothesis. In fact, both options are workarounds rather than a proper fix. I was hoping someone would confirm that reserve works before I jumped to conclusions.

My findings


A patch introduced in kernel 6.2 tries to reclaim memory from the BIOS early in the boot process, after the BIOS has reported it as reserved. Later during boot, that memory is used for MMIO (to communicate with devices such as PCIe cards). The problem is that on some platforms this isn't a safe operation, because parts of the blocks previously reported as reserved aren't actually usable. Any device that happens to be assigned to such a block then fails (in our case the SAS controller, but it really could be anything else).

You can see this behavior in the dmesg logs, on the lines starting with BIOS-e820, where the kernel initially identifies and reserves memory blocks:

Code:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007398dfff] usable
[    0.000000] BIOS-e820: [mem 0x000000007398e000-0x000000007458dfff] reserved
[    0.000000] BIOS-e820: [mem 0x000000007458e000-0x000000007dcd1fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007dcd2000-0x000000007ddd9fff] ACPI NVS

Then later, on kernel >= 6.2, you will notice the following:

Code:
[    0.000000] efi: Remove mem79: MMIO range=[0x80000000-0x8fffffff] (256MB) from e820 map
[    0.000000] efi: Not removing mem80: MMIO range=[0xfed1c000-0xfed1ffff] (16KB) from e820 map

When it removes the range, it tells the kernel that this block is usable when in fact, as I found out, that's not always the case.

You will notice that the issue with the driver starts precisely in that memory block: the driver tries to allocate memory from it while loading, and fails:

Code:
[    1.552235] mpt3sas 0000:0b:00.0: BAR 1: can't reserve [mem 0x809c0000-0x809c3fff 64bit]
[    1.552243] mpt2sas_cm1: pci_request_selected_regions: failed
[    1.552288] mpt2sas_cm1: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:12348/_scsih_probe()!

And that's where it goes boom.
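To check the same thing on your own logs, a rough sketch along these lines could match the failed BAR against the removed MMIO ranges. The regular expressions are assumptions modelled only on the dmesg lines quoted above, so other kernels or drivers may word these messages differently, and the function names are made up for this example.

```python
import re

# Patterns modelled on the dmesg lines quoted in this post (an assumption).
REMOVED = re.compile(r"efi: Remove mem\d+: MMIO range=\[(0x[0-9a-f]+)-(0x[0-9a-f]+)\]")
FAILED = re.compile(r"BAR \d+: can't reserve \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)")

def parse_ranges(pattern, text):
    """Return (start, end) integer pairs for every match of pattern in text."""
    return [(int(a, 16), int(b, 16)) for a, b in pattern.findall(text)]

def failed_bars_in_removed_ranges(dmesg_text):
    """Pair each failed BAR with the removed MMIO range that contains it."""
    removed = parse_ranges(REMOVED, dmesg_text)
    hits = []
    for bar_start, bar_end in parse_ranges(FAILED, dmesg_text):
        for rng_start, rng_end in removed:
            if rng_start <= bar_start and bar_end <= rng_end:
                hits.append(((bar_start, bar_end), (rng_start, rng_end)))
    return hits

sample = """\
[    0.000000] efi: Remove mem79: MMIO range=[0x80000000-0x8fffffff] (256MB) from e820 map
[    0.000000] efi: Not removing mem80: MMIO range=[0xfed1c000-0xfed1ffff] (16KB) from e820 map
[    1.552235] mpt3sas 0000:0b:00.0: BAR 1: can't reserve [mem 0x809c0000-0x809c3fff 64bit]
"""

for bar, rng in failed_bars_in_removed_ranges(sample):
    print(f"BAR {bar[0]:#x}-{bar[1]:#x} lies inside removed range {rng[0]:#x}-{rng[1]:#x}")
```

To run it against a real log, pass in the text of a saved dmesg output instead of the sample string.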

Workarounds


There are two workarounds: telling the kernel to reserve the blocks it's reclaiming early on, OR preventing mmconfig from happening so it doesn't go looking for "unused" blocks.

You can do the first one by reading your dmesg and looking for the can't reserve message that appears while the mpt3sas driver loads, then checking whether the address is within a previously reclaimed block. If so, simply add this block to a reserve kernel parameter.
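As a small sketch of turning such a block into a parameter: the kernel's reserve= option takes a base and a size, so an inclusive start-end range from dmesg needs its size computed as end - start + 1. The function name reserve_param is made up for this example. The value posted earlier in the thread (0xfffffff) is one byte smaller than what this computes, and it evidently worked as well.

```python
def reserve_param(start, end):
    """Build a reserve= kernel parameter covering the inclusive range start..end."""
    if end < start:
        raise ValueError("end must not be below start")
    size = end - start + 1  # inclusive range, so add one byte
    return f"reserve={start:#x},{size:#x}"

# Range taken from the "efi: Remove mem79" line quoted in this thread.
print(reserve_param(0x80000000, 0x8fffffff))
```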

The second workaround is simply adding either pci=nommconf OR pci=realloc=off, but to be honest I would stay away from those two, as I'm not sure how they might impact other devices you may have.

Proper fix


This issue came from a patch introduced in 6.2.

It also seems to have bitten other users, not just us owners of this specific hardware. After it got merged, it quickly broke some laptop hardware, and the developers patched that by skipping the reclaim when the memory chunk is too small.

To avoid needing any of these workarounds, I would love some help bringing this up with the upstream kernel, and maybe the Ubuntu kernel. Does that seem reasonable, @t.lamprecht?
 
Nice work here and big thanks for sharing your findings!
IMO it seems reasonable. After all the work you put in to dissect this issue, though, it might be best if you reply to the aforementioned patch that introduced this, with all the details you gathered. We can certainly jump in too, and would definitely cherry-pick any resulting patches.
 
@kriansa Thank you for sharing those details, that is certainly an interesting find! Cracking detective work there and thanks for the assistance.

I will go back to the reserve option in that case. Thanks for helping me work round the issue so I have a usable system. Your work will also help others to identify issues with their PCI devices.
 
I can confirm that after adding pci=realloc=off to the default grub file, all of my disks attached to the SAS2008 are recognised and I can boot the newest kernel.

I've had to mitigate a few previous issues there as well, as I'm using an X99 platform and booting from BTRFS RAID1. At the moment my grub file looks like this:

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="rootdelay=25 intel_iommu=on pcie_aspm=off acpi_enforce_resources=lax pci=realloc=off"
GRUB_CMDLINE_LINUX=""
 
Sorry, I am new to GRUB and kernel parameters. I am trying to add pci=realloc=off but I don't know where to do that. Is it here:


Code:
menuentry 'Install Proxmox VE (Graphical)' --class debian --class gnu-linux --class gnu --class os {
    echo    'Loading Proxmox VE Installer ...'
    linux   /boot/linux26 ro ramdisk_size=16777216 rw quiet splash=silent
    echo    'Loading initial ramdisk ...'
    initrd  /boot/initrd.img
}


I don't see any entries that start with GRUB, so I am not sure what to do. Any help would be appreciated.
 
It's the linux entry. There are already settings there, e.g. splash=silent. Just add it after that.
 
You could always create your own config file in

/etc/default/grub.d/custom.cfg

and simply put it there. Just remember to run
Code:
update-grub
afterwards, which will update the grub config. This way, you will be safe from any distro updates messing with the default config. Once you don't need this workaround anymore, simply remove your custom.cfg or empty it and run
Code:
update-grub
again.
 
Here we have a Dell SAS adapter (SAS2008 chipset) which also fails with the newest kernels.

I want to replace it with a new, officially supported adapter. An LTO drive is attached to this SAS adapter and needs to remain supported. Does anyone have a suggestion for a SAS adapter that works reliably with an attached LTO device?
 
