iSCSI Multipath: added a new storage, and an error is raised

dominique.fournier

Hi,
We are on Proxmox 6.4 with an iSCSI Dell Compellent storage. This storage uses multipath. It has worked really well with Proxmox for years. We have added a new SCv3020 storage and, since then, its log fills up with: "CTL:856522 SUB:CHELSIOT4 FNC:ActivateObjectCallback FNM:chelsioT4Connection.cxx FLN:555 MID:0 MSG:CHELSIOT4Connection CA Activate Failed: ControllerId=856522 (0x000D11CA) lp=2147549190 (0x80010006) ObjId=4544219 (0x004556db)"

This message appears 300,000 times a day; we never saw it on the old storage.

We have tried checking iSCSI, multipath, routing... without luck (roughly the checks sketched at the end of this post).

Do you have any idea? We are stuck.

The cluster consists of 4 Dell R710 and 2 Dell R640 servers on 10 Gb/s networking. The firmware of the network cards is up to date.
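
For reference, this is the kind of check we mean (a sketch only; the portal address is a placeholder):

Code:
iscsiadm -m session -P 1      # active iSCSI sessions and their state
multipath -ll                 # multipath topology; all paths should be active/ready
ip route get <portal-ip>      # which interface/route is used to reach the portal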
 
I think your best bet is to work with Dell support. From the limited information you provided, this appears to be a message from the NIC. While the firmware might be up to date, is the Kernel module/driver the one recommended by Dell?

Proxmox is just an application from the NIC/storage point of view. While PVE comes with a kernel that they feel is good for most people, it doesn't mean it's a perfect fit for all custom hardware.

Reach out to Dell and tell them the OS/kernel version you are using. I wouldn't volunteer the app (Proxmox) because it's irrelevant.

Good luck


Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
We have gone over all these parameters with Dell support, and they would like to know whether someone has already seen this problem with Proxmox, because they never have.
 
Sorry, sounds like you caught support on a bad day... It's almost as if you were running an Apache web server on top instead of Proxmox, and they told you to ask the Apache group whether anyone had seen this before.

It's hard to say based on that single line, but I would definitely investigate the kernel modules on the host. Also, check the network stats: is there a flow control/MTU/speed/etc. mismatch?
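
A few quick things to look at on the PVE host (the interface name is only an example; use the NIC carrying the iSCSI traffic):

Code:
ethtool eth5              # negotiated speed and duplex
ethtool -a eth5           # pause (flow control) settings
ip -d link show eth5      # configured MTU
ip -s link show eth5      # error/drop counters
# verify jumbo frames end to end (MTU 9000 => 8972-byte ICMP payload):
ping -M do -s 8972 -c 3 <portal-ip>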


 
We didn't see anything like that. MTU 9000 is set, 10 Gb/s everywhere with DAC cables, flow control on the switch (we don't know how to check that in Proxmox, but we have set "/sbin/ethtool -A eth5 autoneg off rx on tx on" in /etc/network/interfaces).
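
For reference, a sketch of how such a setting is usually hooked into /etc/network/interfaces (the stanza is simplified and the interface name is just an example):

Code:
auto eth5
iface eth5 inet manual
        mtu 9000
        # apply flow control once the link is up
        post-up /sbin/ethtool -A eth5 autoneg off rx on tx on

The resulting pause settings can then be read back with "/sbin/ethtool -a eth5".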
 
You are right, the log is from the storage. But Dell has already worked very hard on this topic, spent a lot of time with level 2, 3 and 4 engineers, and they are asking for help from the Proxmox side. Do you think Proxmox version 7 could change something?
 
Proxmox 7 comes with a new kernel, and there may be some relevant fixes that might help. It's also possible it may make things worse... It's impossible to say without getting to the root cause of the issue.
You don't need to upgrade to PVE 7 to get a newer kernel; you can just try the kernel itself. There are guides available on how to do it. If you only upgrade the kernel, there is always the option to go back to the prior version. You won't be able to downgrade if you upgrade to PVE 7.
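
If you want to try that route, a minimal sketch of what it looks like on PVE 6.4 (assuming the 5.11 opt-in kernel packages are still available in your configured repositories):

Code:
apt update
apt install pve-kernel-5.11   # opt-in kernel series for PVE 6.x (assumption)
reboot
# to fall back, pick the previous kernel in the GRUB menu,
# or remove the package again:
apt remove pve-kernel-5.11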

What I would do is find the compatibility guide for your storage system and make sure that you are on a supported OS and kernel release. Follow the configuration guide to set all the required parameters.
What you need to do is configure Debian/Ubuntu to work with your storage. Put Proxmox aside - it's just an application in this situation.
Good luck.


 
Hi guys
The error coming from the storage is due to Proxmox code: it is generated each time pvestatd tries to connect to the portal to test connectivity. It was not due to the iSCSI stack.

As a workaround, I removed the test in /usr/share/perl5/PVE/Storage/ISCSIPlugin.pm by adding "return 1;" before the line "return PVE::Network::tcp_ping($server, $port || 3260, 2);".
Since then, there are no more log entries on the storage.

We will see how Dell solves that, since an aborted TCP connection should not generate a log entry.
 
I rebooted the servers after applying the hack...
It's possible that while the symptom/error is the same, the cause is different. You'd need to track down whether something else is now causing the issue. You may need to run pvedaemon in debug mode, or collect a network trace to analyze.
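
A minimal sketch of such a capture, assuming the default iSCSI port 3260 (interface name and portal address are placeholders):

Code:
tcpdump -i eth5 -s 0 -w /tmp/iscsi-portal.pcap host <portal-ip> and port 3260
# then correlate the timestamps of short-lived TCP connections in the capture
# with the "CA Activate Failed" entries in the SAN log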


 
The only thing that changed is that we've updated the Proxmox servers.
I realize that; however, other things could have changed in the health checks and other parts of the code.
What I am saying is that a generic connectivity error could be caused by other things that have changed in the code.

Since you are using storage that is not widely deployed with Proxmox, and the error is generated on the storage side, it's impossible for PVE developers to determine what change could be causing your commercial storage heartburn. Your best course of action is to either troubleshoot the situation on the wire, correlating the network traffic with the errors in the SAN log, or open a case with the storage vendor.
Once the culprit is identified, PVE staff or the community may recommend a solution.



 
I am on PVE 7.3, and the patch is applied and working for me. I need to restart pvestatd as usual for it to take effect.
In my version, the file to modify is /usr/share/perl5/PVE/Storage/ISCSIPlugin.pm on line 68, adding a "return 1;" before the line "return PVE::Network::tcp_ping($server, $port || 3260, 2);"
 
Hi,
We are on Proxmox 7.4-3 with an iSCSI Dell Compellent SC4020 storage.
This storage uses multipath too.

There are many lines like these in the storage log:
CHELSIOConnection CA Activate Failed: ControllerId=81254 (0x00013D66) lp=1 (0x00000001) ObjId=478 (0x000001de)
CHELSIOConnection CA Activate Failed: ControllerId=81254 (0x00013D66) lp=2147614725 (0x80020005) ObjId=477 (0x000001dd)

About the hack in /usr/share/perl5/PVE/Storage/ISCSIPlugin.pm of adding "return 1;" before the line "return PVE::Network::tcp_ping($server, $port || 3260, 2);":

Which of the following is the correct way to apply it?

Case 1:
Code:
sub iscsi_test_portal {
    my ($portal) = @_;

    my ($server, $port) = PVE::Tools::parse_host_and_port($portal);
    return 0 if !$server;
    return 1;
    return PVE::Network::tcp_ping($server, $port || 3260, 2);
}

Or (case 2):
Code:
sub iscsi_test_portal {
    my ($portal) = @_;

    my ($server, $port) = PVE::Tools::parse_host_and_port($portal);
    return 0 if !$server;
    return 1 if return PVE::Network::tcp_ping($server, $port || 3260, 2);
}


Or (case 3):
Code:
sub iscsi_test_portal {
    my ($portal) = @_;

    my ($server, $port) = PVE::Tools::parse_host_and_port($portal);
    return 0 if !$server;
    return 1; return PVE::Network::tcp_ping($server, $port || 3260, 2);
}


Thanks for any help!
 

Attachments

  • Captura de Tela 2023-07-02 às 15.03.37.png (318.6 KB)
It is case 1:
Code:
    return 1;
    return PVE::Network::tcp_ping($server, $port || 3260, 2);

And restart the pvestatd service after the modification.
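
For example (pvestatd is a standard systemd unit on PVE):

Code:
systemctl restart pvestatd
systemctl status pvestatd   # confirm it came back up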

Dom
 
