[SOLVED] HBA Card/JBOD issues

Tezza

New Member
Jun 4, 2024
So after reinstalling Proxmox on my previously working server, whenever I have my JBOD plugged in I have to restart the networking service to get vmbr0 to start correctly.

Whenever the JBOD is unplugged from the server, everything works normally (meaning vmbr0 starts correctly).


I'm looking for any possible solutions someone might have. This issue has been driving me nuts, and I have been dealing with it for almost a month now.
 
As a first thought, I'd probably be looking for error messages (or other output that seems relevant) in the kernel logs.

Is this weird behaviour something you can reliably trigger when the server is already booted and running?

Asking because if it is, you could open a terminal window and view the live running kernel logs using this command:

Bash:
# journalctl -b 0 -n 20 -f

Though, I'm guessing it's the kind of thing that only shows up during the boot.

For that, you can capture all of the kernel logs of a particular session like this:

Bash:
# journalctl -b 0 > somefile.txt

So if you do a boot with it going wrong, then capture all of the kernel logs to a file (before rebooting), that would let you open the file and look for error messages.
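As a rough first pass over the file, grepping for the common error words often surfaces the interesting bits quickly:

Bash:
# grep -iE 'error|fail|warn' somefile.txt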

That has a decent chance of showing up whatever is causing issues.



I don't suppose your previous server had custom grub options or kernel module settings? Any chance you have a backup of its config you could check?
 
whenever i have my JBOD plugged in i have to restart my networking service
You don't provide much detail. How is the JBOD "plugged into" the server? What controller card are you using? NICs? Etc....

This looks like some type of PCI issue - but without the details/logs, I don't think you're going to get much help.

From your post it would appear that after restarting the networking, both your network & your JBOD are fully operational. If this is in fact the case, it's probably caused by a PCI bus and slot information change from the additional JBOD connection. However, this is only a guess.
 
Good point. The network card may have changed its name (i.e. enp5s0 to enp6s0 or similar) if there's a new PCIe card added.

That would mean the bridge would need updating to use the new network card name.
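
If you want to quickly rule that out, compare the interface names the kernel currently sees against the one the bridge is configured to use. A minimal check (assuming the default PVE location for the bridge definition):

Bash:
# ip -br link show
# grep -A 3 'iface vmbr0' /etc/network/interfaces

If the bridge-ports line names an interface that doesn't appear in the ip output, that's the mismatch.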
 
So the JBOD is plugged in via external SAS. I have delved into the forums, and I did find an issue people are experiencing with their network cards changing names due to PCIe IDs changing. That is, however, not the case here. I took out every PCIe card except for the fiber NIC, and it has the same name as it did when every PCIe card I normally have is in the system.

From your post it would appear that after restarting the networking, both your network & your JBOD are fully operational.
Whenever I restart the networking service, just my network is fully operational. That is my bad for not making that clear.


As the system is initializing the hardware, I do see all the drives in my JBOD show up when it initializes the SAS card.

I don't suppose your previous server had custom grub options or kernel module settings?
Nope.

And as for the logs, I will have to send them later since I'm currently at work, but I will post them whenever I get home.
 
To summarize (as I'm still not 100% clear): either your network fully functions OR your JBOD fully functions, BUT not BOTH together?
 
To summarize (as I'm still not 100% clear): either your network fully functions OR your JBOD fully functions, BUT not BOTH together?
My JBOD never fully functions. My network does not work normally when the JBOD is plugged into the SAS card (meaning I have to restart the networking service once Proxmox is fully booted for the network to work). However, when the JBOD is unplugged from the external SAS card, my network works normally with no extra steps.

Also, the external SAS card is never removed from the server, so the PCI IDs don't change; the placement and number of the cards is unchanged.
 
OK, now I've got the picture.

1. Does the JBOD have some Ethernet ports that may be causing a conflict?
2. Does the server have enough power? (I don't know your server HW.)
3. Maybe one of the disks in the JBOD is malfunctioning? Maybe try attaching the JBOD without any disks (& then add the disks one at a time)?

So after reinstalling Proxmox on my previously working server
Was that a PVE system? If yes, then maybe try pinning the older 6.5 kernel? There have been enough reported problems on the 6.8 version. What happens if you live-boot with another distro now - do you have a fully functional NW & JBOD?
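
If you go down the pinning route, it's worth first checking which kernels are actually still installed. On a standard PVE 8.x setup (assuming you boot via proxmox-boot-tool), this should list them:

Bash:
# proxmox-boot-tool kernel list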
 
1. Does the JBOD have some Ethernet ports that may be causing a conflict?
It does not.

2. Does the server have enough power? (I don't know your server HW.)
I believe it does.

3. Maybe one of the disks in the JBOD is malfunctioning? Maybe try attaching the JBOD without any disks (& then add the disks one at a time)?
I have tried that, and I was having the same issue with no drives in.

Was that a PVE system? If yes, then maybe try pinning the older 6.5 kernel?
It is a PVE system. I will try an older kernel when I get home today.
 
After reading this back, there's something else to check when you have time.

When the system has the JBOD plugged in (and before you restart the network), it'd be useful to know what systemd thinks the state of things is.

Running this should give the summary:

Bash:
# systemctl status

In the first few lines of the output it should say if there are any queued or failed jobs.

If there are, that might point towards the problem. From the behaviour you're describing, I'm wondering if there's something taking a super long time to start up (likely JBOD related), or if something is just outright failing.
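
If it does turn out to be the "taking a super long time" case, systemd-analyze (a generic systemd tool, nothing Proxmox specific) can rank units by how long they took to start:

Bash:
# systemd-analyze blame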

If it says there's anything queued, then check what those are:

Bash:
# systemctl list-jobs

Likewise, if it says anything failed, get that list as well:

Bash:
# systemctl list-units --failed
 
it'd be useful to know what systemd thinks the state of things is
So I just got home, and I was most curious about this, so I checked it first.

There are two failed units; I'll type the two I got:

Code:
ifupdown2-pre.service        loaded failed failed Helper to synchronize boot up for ifupdown
systemd-udev-settle.service  loaded failed failed Wait for udev To Complete Device Initialization



I'm now going to get the other logs that you mentioned.
 
ifupdown2-pre.service

Cool, that one seems to match the behaviour you're seeing.

Proxmox uses a package called ifupdown2 for managing network interfaces, and that ifupdown2-pre service seems to be a part of it.
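
If you're curious what that unit actually runs, systemd can print its definition for you:

Bash:
# systemctl cat ifupdown2-pre.service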

You can check if there are errors showing up for a given systemd service by using journalctl with the name of the service (passed with -u):

Bash:
# journalctl -u ifupdown2-pre -n 50

That'll show the last 50 lines of the log output from that service. Generally 50 lines should be enough to contain some kind of hint as to what's going wrong, but feel free to choose a bigger or smaller number as needed.

You can do the same thing with the systemd-udev-settle service too, to try and get useful info out of that:

Bash:
# journalctl -u systemd-udev-settle -n 50



Looking through that logs.txt, some stuff jumps out. :)

This looks like your network cards?
Code:
Jun 04 18:34:17 clotho kernel: igb 0000:05:00.1: Intel(R) Gigabit Ethernet Network Connection
...
Jun 04 18:34:17 clotho kernel: scsi host10: Emulex OneConnect OCe10100, FCoE Initiator  on PCI bus 82 device 02 irq 86
Jun 04 18:34:17 clotho kernel: be2net 0000:82:00.0 enp130s0f0: renamed from eth2

And I'm guessing this is your HBA?

Code:
Jun 04 18:34:17 clotho kernel: mpt2sas_cm0: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05)
Jun 04 18:34:17 clotho kernel: mpt2sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)

There's a kernel problem showing up at around line 1520 of that log file too:

Code:
Jun 04 18:34:17 clotho kernel: scsi: waiting for bus probes to complete ...
Jun 04 18:34:17 clotho kernel: scsi host9: ioc0: LSISAS1068E B3, FwRev=011a0000h, Ports=1, MaxQ=478, IRQ=24
Jun 04 18:34:17 clotho kernel: detected buffer overflow in strnlen
Jun 04 18:34:17 clotho kernel: ------------[ cut here ]------------
Jun 04 18:34:17 clotho kernel: kernel BUG at lib/string_helpers.c:1048!
Jun 04 18:34:17 clotho kernel: invalid opcode: 0000 [#1] PREEMPT SMP PTI
Jun 04 18:34:17 clotho kernel: CPU: 0 PID: 8 Comm: kworker/0:0 Not tainted 6.8.4-3-pve #1
Jun 04 18:34:17 clotho kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.3 08/23/2018
Jun 04 18:34:17 clotho kernel: Workqueue: events work_for_cpu_fn
Jun 04 18:34:17 clotho kernel: RIP: 0010:fortify_panic+0x13/0x20
Jun 04 18:34:17 clotho kernel: Code: cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 fe 48 c7 c7 88 ef fe a8 48 89 e5 e8 4d 40 9b ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Jun 04 18:34:17 clotho kernel: RSP: 0018:ffff9c6800093c38 EFLAGS: 00010246
Jun 04 18:34:17 clotho kernel: RAX: 0000000000000023 RBX: ffff8c95db5d8038 RCX: 0000000000000000
Jun 04 18:34:17 clotho kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 04 18:34:17 clotho kernel: RBP: ffff9c6800093c38 R08: 0000000000000000 R09: 0000000000000000
Jun 04 18:34:17 clotho kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c95c4f29800
Jun 04 18:34:17 clotho kernel: R13: ffff8c95da04a000 R14: ffff8c95da04a028 R15: ffff8c95c65d8000
Jun 04 18:34:17 clotho kernel: FS:  0000000000000000(0000) GS:ffff8ca4ff600000(0000) knlGS:0000000000000000
Jun 04 18:34:17 clotho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 04 18:34:17 clotho kernel: CR2: 00007fff11bbef70 CR3: 0000000781636004 CR4: 00000000001706f0
Jun 04 18:34:17 clotho kernel: Call Trace:
Jun 04 18:34:17 clotho kernel:  <TASK>
Jun 04 18:34:17 clotho kernel:  ? show_regs+0x6d/0x80
Jun 04 18:34:17 clotho kernel:  ? die+0x37/0xa0
Jun 04 18:34:17 clotho kernel:  ? do_trap+0xd4/0xf0
Jun 04 18:34:17 clotho kernel:  ? do_error_trap+0x71/0xb0
Jun 04 18:34:17 clotho kernel:  ? fortify_panic+0x13/0x20
Jun 04 18:34:17 clotho kernel:  ? exc_invalid_op+0x52/0x80
Jun 04 18:34:17 clotho kernel:  ? fortify_panic+0x13/0x20
Jun 04 18:34:17 clotho kernel:  ? asm_exc_invalid_op+0x1b/0x20
Jun 04 18:34:17 clotho kernel:  ? fortify_panic+0x13/0x20
Jun 04 18:34:17 clotho kernel:  ? fortify_panic+0x13/0x20
Jun 04 18:34:17 clotho kernel:  mptsas_probe_one_phy.constprop.0.isra.0+0xaad/0xac0 [mptsas]
Jun 04 18:34:17 clotho kernel:  mptsas_probe_hba_phys.isra.0+0x795/0x910 [mptsas]
Jun 04 18:34:17 clotho kernel:  mptsas_scan_sas_topology+0x42/0x380 [mptsas]
Jun 04 18:34:17 clotho kernel:  ? scsi_autopm_put_host+0x1a/0x30
Jun 04 18:34:17 clotho kernel:  mptsas_probe+0x3f8/0x570 [mptsas]
Jun 04 18:34:17 clotho kernel:  local_pci_probe+0x47/0xb0
Jun 04 18:34:17 clotho kernel:  work_for_cpu_fn+0x1a/0x30
Jun 04 18:34:17 clotho kernel:  process_one_work+0x16d/0x350
Jun 04 18:34:17 clotho kernel:  worker_thread+0x306/0x440
Jun 04 18:34:17 clotho kernel:  ? __pfx_worker_thread+0x10/0x10
Jun 04 18:34:17 clotho kernel:  kthread+0xf2/0x120
Jun 04 18:34:17 clotho kernel:  ? __pfx_kthread+0x10/0x10
Jun 04 18:34:17 clotho kernel:  ret_from_fork+0x47/0x70
Jun 04 18:34:17 clotho kernel:  ? __pfx_kthread+0x10/0x10
Jun 04 18:34:17 clotho kernel:  ret_from_fork_asm+0x1b/0x30
Jun 04 18:34:17 clotho kernel:  </TASK>
Jun 04 18:34:17 clotho kernel: Modules linked in: usbhid hid nvme_core mptsas(+) nvme_auth crc32_pclmul mpt3sas be2net mptscsih igb scsi_transport_fc ehci_pci ahci i2c_i801 mptbase raid_class i2c_algo_bit ehci_hcd libahci i2c_smbus lpc_ich scsi_transport_sas dca wmi
Jun 04 18:34:17 clotho kernel: ---[ end trace 0000000000000000 ]---
Jun 04 18:34:17 clotho kernel: mpt2sas_cm1: port enable: SUCCESS
Jun 04 18:34:17 clotho kernel: RIP: 0010:fortify_panic+0x13/0x20
Jun 04 18:34:17 clotho kernel: Code: cc 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 fe 48 c7 c7 88 ef fe a8 48 89 e5 e8 4d 40 9b ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
Jun 04 18:34:17 clotho kernel: RSP: 0018:ffff9c6800093c38 EFLAGS: 00010246
Jun 04 18:34:17 clotho kernel: RAX: 0000000000000023 RBX: ffff8c95db5d8038 RCX: 0000000000000000
Jun 04 18:34:17 clotho kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Jun 04 18:34:17 clotho kernel: RBP: ffff9c6800093c38 R08: 0000000000000000 R09: 0000000000000000
Jun 04 18:34:17 clotho kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c95c4f29800
Jun 04 18:34:17 clotho kernel: R13: ffff8c95da04a000 R14: ffff8c95da04a028 R15: ffff8c95c65d8000
Jun 04 18:34:17 clotho kernel: FS:  0000000000000000(0000) GS:ffff8ca4ff600000(0000) knlGS:0000000000000000
Jun 04 18:34:17 clotho kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 04 18:34:17 clotho kernel: CR2: 00007fff11bbef70 CR3: 0000000781636004 CR4: 00000000001706f0

That's not good at all. My suspicion is that this kernel error is what's upsetting the systemd-udev-settle service.

I'd definitely try dropping back to the older 6.5 kernel as suggested by @gfngfn256 above to see if anything changes.

The instructions for doing that are in the Proxmox Roadmap's 8.2 release notes, under the "Kernel 6.8" heading.
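
In short, it should boil down to something like this - though the exact 6.5 version string below is just an example, so use whatever proxmox-boot-tool kernel list shows on your box:

Bash:
# proxmox-boot-tool kernel pin 6.5.13-5-pve

...then reboot so the pinned kernel takes effect.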
 
This might be worth looking at when you have time too:

Code:
Jun 04 18:36:21 clotho smartd[5892]: Device: /dev/sdp [SAT], found in smartd database 7.3/5319: Seagate Barracuda 7200.14 (AF)
Jun 04 18:36:21 clotho smartd[5892]: Device: /dev/sdp [SAT], WARNING: A firmware update for this drive may be available,

...with useful URLs just after that:

Code:
Jun 04 18:36:21 clotho smartd[5892]: see the following Seagate web pages:
Jun 04 18:36:21 clotho smartd[5892]: http://knowledge.seagate.com/articles/en_US/FAQ/207931en
Jun 04 18:36:21 clotho smartd[5892]: http://knowledge.seagate.com/articles/en_US/FAQ/223651en
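
If you want to check what firmware revision that drive is currently on (assuming smartmontools is installed, which it must be for smartd to be logging), smartctl prints it in the identity info:

Bash:
# smartctl -i /dev/sdp

The "Firmware Version" line in that output is the one to compare against Seagate's pages.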
 
Oh, this is probably relevant too, a few hundred lines after the kernel error above:

Code:
Jun 04 18:34:17 clotho kernel: lpfc 0000:82:00.2: 0:1421 Failed to set up hba
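
If you want to double-check which card that address actually belongs to, lspci can show what's sitting there:

Bash:
# lspci -s 82:00.2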
 
This looks like your network cards?
Correct, that is my network card.


I'd definitely try dropping back to the the older 6.5 kernel as suggested by @gfngfn256 above to see if anything changes.
I will try that now. I hope I can get it to boot into the previous kernel and have it work. I have one shot at this right now since I'm at work again. But later, if this doesn't work, I can try something else.

It's just that if I tell it to reboot remotely and that doesn't fix it, I'll have to wait until later to continue this cry for help, haha.
 
I set the server to reboot to the 6.2 kernel; that is what I remember working. If it doesn't come up in 30 minutes, I'll update here.
 
Oh, this is probably relevant too, a few hundred lines after the kernel error above:

Code:
Jun 04 18:34:17 clotho kernel: lpfc 0000:82:00.2: 0:1421 Failed to set up hba
I was re-reading what you have sent. This is kinda strange: that PCI device is for my server's drives, not for the JBOD, but when Proxmox boots I see all the drives plugged into that HBA card.
 
