PVE 9.1 running on a BOSS-S1 causing I/O errors and filesystem remounts as R/O

LerryV2

New Member
Jan 3, 2025
I have 3 Dell R640s; they all have BOSS-S1 cards with Intel SSDSCKKB240G8 M.2 SATA drives. I have verified the M.2 drives are in good health by reading the SMART data, and you could not ask for better drives unless you were handed brand new ones.

I have verified that all 3 machines have the latest iDRAC/BIOS/Lifecycle Controller/BOSS firmware. I've looked on Dell's website and tried to do an update through the machines themselves, and everything says they have the latest firmware.

I get the following errors on a fresh install of PVE on the BOSS card, on first login:

exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
failed command: READ SECTOR(S) EXT and READ FPDMA QUEUED
  • I/O errors on /dev/sda (BOSS virtual disk)
  • ext4 journal aborts
  • Filesystem remounting read-only
  • System becomes unusable

I've tried adding
libata.force=noncq
and
libata.dma=0
to the kernel command line in GRUB and still have the same issue.
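For anyone following along, these parameters go into GRUB_CMDLINE_LINUX_DEFAULT. A minimal sketch of the edit — it works on a temp copy here purely for illustration; on a real PVE host the file is /etc/default/grub and you must run update-grub (or proxmox-boot-tool refresh on ZFS/UEFI installs) afterwards:

```shell
# Illustrative only: operate on a temp copy of the config file.
grub_file=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet"' > "$grub_file"

# Append the libata workarounds just before the closing quote.
sed -i '/^GRUB_CMDLINE_LINUX_DEFAULT=/ s/"$/ libata.force=noncq libata.dma=0"/' "$grub_file"

grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file"
# GRUB_CMDLINE_LINUX_DEFAULT="quiet libata.force=noncq libata.dma=0"
rm -f "$grub_file"
```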

dmesg gives me these errors:
ata15.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata15.00: failed command: READ SECTOR(S) EXT
ata15.00: cmd 24/00:00:20:50:35/00:20:01:00:00/e0 tag 22 pio 4194304 in
I/O error, dev sda, sector 2027136 op 0x0:(READ) flags 0x84700 phys_seg 66 prio class 2
EXT4-fs error: ext4_journal_check_start:87: Detected aborted journal
EXT4-fs (dm-1): Remounting filesystem read-only

What are my options outside of not using the BOSS card and installing PVE on either one of the U.2 drives or an external drive?
 
I should note that the same issue happens with Proxmox Backup Server as well. Same errors, so I assume it's a kernel/OS/hardware issue, but I'm not sure where the breakdown is.
 
What are my options outside of not using the BOSS card and installing PVE on either one of the U.2 drives or an external drive?
The problem is with those Intel drives, not the BOSS card itself. I exchanged my Intel SSDSCKKB480G8 drives for a pair of Micron MTFDDAV480TDS drives and those work well.

You can also run kernel 6.14, as the issue was introduced with 6.17.
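Staying on the older kernel series can be done with proxmox-boot-tool. A sketch, assuming a stock PVE host — the package name and version string below are examples only; check what is actually installed with the list subcommand:

```shell
apt install proxmox-kernel-6.14            # install/keep the older kernel series
proxmox-boot-tool kernel list              # see which versions are on disk
proxmox-boot-tool kernel pin 6.14.8-2-pve  # hypothetical version string - use one from the list
reboot
```

Unpinning later is `proxmox-boot-tool kernel unpin`.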
 
I ordered those same Micron drives you got, so let's hope that fixes it. I downgraded to kernel 6.12 and all the I/O errors went away, and the system runs much faster too, so I'm not sure what the major differences are, but fingers crossed that fixes them. I'll post an update when they come in. Thanks for the help.
 
I ran into the exact same issue today when upgrading two 4-node clusters that use BOSS-S1 adapters for the OS.
However, only the second cluster gets the R/O errors :oops: The servers are PowerEdge R740.

I'm still investigating, but so far the only difference I've found is:
Cluster 1: BOSS-S1 firmware 2.5.13.3022 (Works fine)
Cluster 2: BOSS-S1 firmware 2.5.13.3024 (IO errors)

edit:
Firmware seems unrelated, as two servers in cluster 1 are running 2.5.13.3024.
 
And 2.5.13.3024 is the version that I'm running. Interesting.
So maybe it's not an M.2 SATA issue but a BOSS firmware issue.
Not sure how feasible it is to downgrade, but my first try will be the Micron M.2 SATA drives.
If those don't work, I might try dumping the BOSS cards completely and using a PCIe card with 2 M.2 drives on it and just use software RAID.
From what I can tell, the PCIe ports support bifurcation.

I should also note I tried updating the firmware on the Intel drives but had zero luck. I tried several methods, including Intel and Solidigm/SK hynix software. No luck. It could have been the StarTech card I was using or other issues, but after dealing with it for 2 days I was over it.

I'm curious: when you look in iDRAC on both the 3022 and 3024 firmware, do they show a drive health of 0%? All mine are 3024 and they show 0% drive health left, but when connecting the drives to a machine to read the SMART data, they are in perfect shape. Zero issues.
 
I should have looked closer at the SSDs in the clusters. I just noticed what Cyberishf mentioned above.

The working hosts are using MICRON - MTFDDAV240TCB drives.
While the ones with the issue are using INTEL - SSDSCKJB240G7R drives.

Guess the only option at the moment is downgrading the kernel...

Regarding the drive health, in my iDRAC version it is listed as "Remaining Rated Write Endurance", and it is between 94% and 100%.
 
The Intel drives all show 0% life left. It's a new-to-us server install, so when the drives come I'll swap them out and see what happens. Thanks.
 
Alright, I have some progress on one of the failing nodes.

BOSS-S1 adapter card:
Drive  Model           Firmware
0      SSDSCKJB240G7R  DL43
1      SSDSCKKB240G8R  DL6P

I upgraded the firmware on drive 1 from DL6P to DL6R and the errors are gone :D
The firmware upgrade was performed using the iDRAC.
 
All mine are running firmware XC311151; not sure what that translates to from a Dell part number, but that's what the drives displayed when I read the SMART data on them. Mine are the SSDSCKKB240G8, non-R versions, so maybe that's why they show a different revision name.
 
The Intel drives all show 0% life left. It's a new-to-us server install, so when the drives come I'll swap them out and see what happens. Thanks.
For what it's worth, on my machines the Intel drives report accurately and the Micron drives show 0% life left; however, they are all perfectly cromulent.
It must be the BOSS card's or iDRAC's inability to correctly parse the SMART readouts from the drives.
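One plausible mechanism for that parsing failure (purely illustrative, not confirmed against any firmware): some SSDs report a wear counter as "percent lifetime used" while others report "percent remaining", and a management layer that assumes the wrong convention will show a brand-new drive of the other kind as 0% health. A minimal sketch:

```python
def percent_remaining(raw_value: int, firmware_reports_used: bool) -> int:
    """Normalize a drive's wear counter to 'percent life remaining'.

    raw_value:             the percentage the drive's firmware reports
    firmware_reports_used: True if the firmware counts wear *used*,
                           False if it already reports life *remaining*
    """
    return 100 - raw_value if firmware_reports_used else raw_value

# A healthy drive whose firmware reports 100 (percent *remaining*):
print(percent_remaining(100, firmware_reports_used=False))  # 100 -> correct
# The same value misinterpreted as percent *used* by the management layer:
print(percent_remaining(100, firmware_reports_used=True))   # 0 -> "0% life left"
```

The drive itself is fine in both cases; only the interpretation of the counter differs, which would match SMART looking perfect when the drives are read directly.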