Mellanox CX312B in PVE host: host networking crashes when connected via Horaco RJ45/SFP+ transceiver to an Intel X550-AT2

munkiemagik

New Member
Jun 14, 2024
Setup:

PVE 8.4 host - HP Prodesk 600 G4 with Mellanox CX312B ConnectX-3 Pro dual SFP+ NIC with both ports attached to vmbr0.
Horaco switch with 2x SFP+ 10Gb and 4x RJ45 2.5GbE
Threadripper LLM server (motherboard - Gigabyte MC62-G40 with Intel X550-AT2 10Gbps dual RJ45)
[unrelated machines also on the switch - M720Q (another PVE node) with a 4-port Intel i226 and an AM5 build (personal machine) with a CX312B]

For two years I have been running this PVE Prodesk node without any complaint, but in the last few months it has started to suddenly crash/go offline every few days. In the previous two years it never displayed any errant behaviour (aside from an initial issue with the onboard Intel i219 RJ45 port, which was resolved by disabling tso, gso and gro with ethtool).
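For reference, the i219 workaround mentioned above would look roughly like the sketch below. The interface name `eno1` and the helper names are assumptions for illustration, not something taken from this node's config. By default it only prints the command, so it is safe to paste.

```shell
#!/bin/sh
# Sketch only: build the ethtool command that turns off TSO, GSO and GRO.
# "eno1" is an assumed name for the onboard i219 port; check `ip -br link`.
offload_off_cmd() {
    iface="${1:-eno1}"
    printf 'ethtool -K %s tso off gso off gro off\n' "$iface"
}

# Print the command by default; actually run it only when APPLY=1 (needs root).
disable_offloads() {
    cmd="$(offload_off_cmd "$1")"
    if [ "${APPLY:-0}" = "1" ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

# Example: disable_offloads eno1
```

Running `APPLY=1 disable_offloads eno1` as root would apply the change until the next reboot or interface reset; `ethtool -k eno1` shows the current offload state.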

The PVE Prodesk node crash, it appears, isn't a total system hang. On previous occasions when I had direct HDMI output to a monitor, I could see that everything in PVE was still running and only networking had failed. However, back then I wasn't smart enough to at least try restarting networking to see if it would come back online. This time I didn't have an HDMI connection, so I couldn't attempt a networking restart to see if that would remedy the situation. I will try to reproduce the crash later tonight with a direct HDMI connection.
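A minimal sketch of what could be captured from the local console the next time networking drops, before trying a restart. The interface names passed in (for example `enp1s0` and `enp1s0d1` for the two CX312B ports) are assumptions, as are the function names and the log path.

```shell
#!/bin/sh
# Sketch only: list and run diagnostics while the network is down.
# Pass the real port names as arguments; check them with `ip -br link`.

# Print one diagnostic command per line for the given interfaces.
diag_cmds() {
    for iface in "$@"; do
        echo "ethtool $iface"       # link state and negotiated speed
        echo "ethtool -S $iface"    # NIC/driver counters (errors, drops)
        echo "ethtool -m $iface"    # SFP+ module EEPROM, if the module exposes it
    done
    echo "ip -br link"                  # brief state of every interface
    echo "bridge link show"             # ports currently attached to bridges
    echo "journalctl -k -b --no-pager"  # kernel log for the current boot
}

# Run every command and append its output to a log file on local disk.
collect() {
    log="$1"
    shift
    diag_cmds "$@" | while IFS= read -r cmd; do
        printf '\n### %s\n' "$cmd" >>"$log"
        sh -c "$cmd" >>"$log" 2>&1 </dev/null || true
    done
}

# Example: collect /root/net-crash.log enp1s0 enp1s0d1
```

Once the log is saved, the restart to try would be `ifreload -a` (assuming the default ifupdown2 setup on PVE) or `systemctl restart networking`, and then checking whether the node becomes reachable again.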

There have been no changes to this node in terms of hardware or which LXCs/VMs are running over the last two years (other than what is listed below). There are a lot of AppArmor error messages and networking-related messages in journalctl, but I need to redact some details before I can post them.

The only noteworthy changes that coincide with the appearances of this instability are:

1 - Change of DIMMs in the PVE Prodesk node: 2 of the 4 DIMMs were swapped out. I was running 2x16 and 2x8, but I needed the 2x16 for the Threadripper build, so I replaced them with another set of 2x8. System RAM peak usage is approx 70% (but I don't believe this to be the cause, as PVE seems to stay active, just with no networking and so not reachable).
2 - Introduced a Threadripper Ubuntu 24.04 machine (Gigabyte MC62-G40 motherboard) into the network with its onboard Intel X550-AT2 NIC. As this is a dual RJ45 NIC, I am using a Horaco RJ45 to SFP+ 10Gb transceiver to connect it to the free SFP+ port of the CX312B in the PVE Prodesk node, as I don't have enough SFP+ ports on my switch. The Horaco switch has 2x SFP+: one for the PVE Prodesk node and one for another, unrelated CX312B-equipped machine.

The Threadripper machine is an Ubuntu 24.04 LLM server and is the only machine that is NOT operational 24/7.

I believe the PVE Prodesk node only crashes/goes offline when the LLM server is running. For the last week and a half the X550-AT2 machine has been off (no crashing of the PVE node), but last night I left the LLM server on overnight and this afternoon the PVE Prodesk node was unreachable again. Only this time I didn't have an HDMI connection to be able to dig around in PVE directly with a keyboard.

I have no idea how to find what specifically triggers the crash on this X550-AT2 > RJ45/SFP+ transceiver > ConnectX-3 Pro path, so that I can resolve it within Proxmox or maybe even in Ubuntu on the X550-AT2 side. This is a little beyond my knowledge and skillset.

I can confirm that when I had the AM5 machine with its CX312B connected directly into the PVE Prodesk node's CX312B (vmbr0) via DAC, I never experienced this issue.

It only started with the introduction of the Horaco RJ45 to SFP+ transceiver. Is it likely a faulty transceiver? But then why would it work 'most' of the time?

Any guidance would be greatly appreciated.