Proxmox v.5.4 vs QLogic FC HBA

coppola_f

Hi all,

Today I did a bit of maintenance on a 6-node cluster that has been running for years at a customer's datacenter.

I noticed that after upgrading one node to the latest kernel available for Proxmox 5.4 (v4.15.18-26), we lost access to all the LUNs located on an HP MSA 2040 (FC-connected in full mesh mode using QLogic QLA2xxx HBAs and 8Gb Brocade fibre switches).

Reverting the default boot kernel to 4.13.13-5 (by changing the default entry in the GRUB configuration) brought the LUNs back.
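For reference, pinning the older kernel as the default boot entry is just a GRUB_DEFAULT change; a minimal sketch of what we mean (the exact menu entry title here is an assumption and must be copied from /boot/grub/grub.cfg on the node):

    # /etc/default/grub -- point the default entry at the 4.13 kernel
    # (submenu and entry titles below are examples; copy the real ones from /boot/grub/grub.cfg)
    GRUB_DEFAULT="Advanced options for Proxmox Virtual Environment GNU/Linux>Proxmox Virtual Environment GNU/Linux, with Linux 4.13.13-5-pve"

    # regenerate the bootloader config, then reboot the node
    update-grub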

Looking around on the web, I found this Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1863044

Could this be related to our issue?

Comparing the logs before and after the upgrade, it seems that on kernel 4.15.xx no SCSI bus scan is performed after the qla2xxx driver starts, so no LUNs are detected.
(see attached logs!)
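For anyone who wants to compare on their own hardware: a generic way to check what the driver saw and to trigger a rescan manually is something like the following (host numbers and device names are only examples):

    # what did the qla2xxx driver report at boot?
    dmesg | grep -i qla2xxx

    # FC link state of each HBA port
    grep . /sys/class/fc_host/host*/port_state

    # force a SCSI bus rescan on one host (host1 used here as an example)
    echo "- - -" > /sys/class/scsi_host/host1/scan

    # list the LUNs that are now visible
    lsscsi    # or: lsblk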

Is anyone able to help us?

many thanks,

Francesco
 

Attachments

  • syslog-4.13.log (546.7 KB)
  • syslog-4.15.log (181.5 KB)
Given that PVE 5.x will be EOL in a few months (see https://pve.proxmox.com/pve-docs/chapter-pve-faq.html), you could consider testing whether the problem persists with PVE 6.x (both with the 5.3 kernel and the 5.4 kernel).
Maybe you can use a different machine to test with the FC storage.

I hope this helps!
 

Stoiko,
thanks for your answer.

As you suggested, moving this production environment from 5.4 to the current 6.x is one of the alternatives we're evaluating. The cluster is currently built on 6 nodes, all HPE servers, with MSA 2040 FC storage and a full mesh connection through 2x Brocade 300 FC switches.

As you can surely imagine, we need to evaluate all scenarios carefully before choosing an action plan, but in this specific case we may have a good solution to apply.

Well, as I said, 6 nodes. They were installed in pairs, so we currently have:
2x ProLiant DL380 G6
2x ProLiant DL380p Gen8
2x ProLiant DL360 Gen10

So, if we remove one node of each server type from the current cluster, we can build 2 distinct clusters with identical hardware (1x G6 + 1x Gen8 + 1x Gen10 per cluster). This would let us fully test compatibility with all the hardware platforms, but it would also halve our hardware resources during the test phase. Please note that we have only one MSA unit, so the storage space would be shared between the two clusters, and I don't think this situation has ever been tested before!

Waiting for your suggestions.

regards,
Francesco
 
Guys,
still searching Google for similar issues.

It seems more than one Linux distro is having problems with qla2xxx adapters: searching for 'kernel qla2xxx lun issue' and limiting the results to the last month returns a lot of hits.

Before we take any new step, two big questions:

Has anyone already tested the new kernels of the Proxmox 6.x family (the 5.3 / 5.4 kernel series) with qla2xxx family cards? (This specific one is an HP-branded dual-port 8Gb card, P/N: AJ764A.)

Has anyone already tested an environment where 2 clusters share the same 'shared' storage?

I've presented my idea for a migration plan to the customer's IT manager, and we're currently evaluating the next steps. We may take advantage of the reduced workload caused by the global COVID-19 situation, and the related operational limitations we're experiencing, to do the work now.

I'm unable to read the 'German' section of this forum (without using Google Translate); maybe someone has posted something related there. Could someone please check that area for similar posts/issues?

Still waiting for suggestions of any kind!

regards,
Francesco
 
Has anyone already tested an environment where 2 clusters share the same 'shared' storage?
This is something which we don't support in general, for good reason!
Depending on how you expose the storage to the 2 clusters, this can be a good way to corrupt your data:
* PVE expects that only one cluster writes to a ('shared') storage - e.g. it assumes that if a cluster creates the VM with ID 20000, all disks named vm-20000-disk-X belong to that VM, and it will delete them (or access them in parallel and thus corrupt the data).

You should be able to separate access on a per-LUN basis - i.e. create 2 LUNs for the 2 clusters, with separate backing stores, and limit access to each LUN to a single cluster.
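Purely as an illustration of that separation (storage IDs and VG names below are invented): if each LUN carries its own LVM volume group, each cluster would only define its own entry in /etc/pve/storage.cfg and never reference the other cluster's VG, e.g.:

    # /etc/pve/storage.cfg on cluster A -- references only the VG living on LUN 1
    lvm: san-cluster-a
        vgname vg_cluster_a
        content images,rootdir
        shared 1

    # cluster B gets an analogous entry for its own VG on LUN 2;
    # neither cluster ever defines a storage on the other cluster's LUN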

I hope this helps!
 
@Stoiko Ivanov

We've already used all the storage space we have for this cluster, so we're not able to split the storage into more LUNs.

I certainly agree that, by design, only one cluster may access the shared storage filesystem!

In the scenario I have in mind, once the cluster structure is split into 2 entities (3 nodes on the old configuration + 3 nodes on the new one), we could try to move the VMs one by one.

I.e., roughly (see the command sketch below):
disable HA on VM 100
gently shut down VM 100
back up VM 100 to the external backup storage
restore it on the new cluster using a DIFFERENT VM ID (one that does not exist on the old cluster!), e.g. ID 1000
bring VM 1000 back online and put it in HA mode
if everything runs fine, remove VM 100 from the old cluster to regain free space
step over to the next VM

And so on, one VM at a time!
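In CLI terms, one iteration of that loop would look roughly like the sketch below (VM IDs, storage names and the backup path are only examples, not our actual configuration):

    # on the old cluster
    ha-manager remove vm:100                      # take VM 100 out of HA
    qm shutdown 100                               # clean shutdown
    vzdump 100 --storage backup-nfs --mode stop   # full backup to the external backup storage

    # on the new cluster, restore under a different, unused VM ID
    qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-100-<timestamp>.vma.lzo 1000 --storage san-lvm
    qm start 1000
    ha-manager add vm:1000                        # put the restored VM under HA again

    # once the new VM is verified, reclaim the space on the old cluster
    qm destroy 100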

If all actions take place step by step, I think everything will run fine. It seems impossible to me that nobody has ever tested a similar situation before!

waiting,
regards,

Francesco
 
@Stoiko Ivanov

Going on with the tests: today I moved all VMs away from one of the nodes (a ProLiant DL380 G6). The node was removed from the current cluster, fully wiped, and freshly installed from the latest Proxmox 6.1 ISO. With this kernel the QLogic qla2xxx cards are correctly initialized and LUN discovery is successful.

We had some minor issues because the multipath configuration was not working (it looks similar to this bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=932307). The wwids file wasn't automatically updated by the multipath daemon; once I manually added the volume WWIDs to that file, the multipath daemon started correctly.
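For the record, the manual workaround boiled down to something like this (the device name and WWID are placeholders; the real WWIDs come from the LUNs themselves):

    # read the WWID of one of the MSA LUN block devices
    /lib/udev/scsi_id -g -u -d /dev/sdb

    # whitelist it, i.e. have multipath append it to /etc/multipath/wwids
    multipath -a <wwid-of-the-lun>

    # reload the maps and verify the paths
    multipath -r
    multipath -ll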

Currently all existing VMs are still on the old 5-node cluster. I'm refining the network configuration on the node I've freshly installed with v6.1, keeping in mind to absolutely avoid any name collision between VMs.

Tomorrow morning I'll take one of the old VMs offline (a test-only VM), back it up, and try to restore it with a new VM ID, then test.

I'll keep you up to date,
regards,

Francesco
 
Guys,

we've moved some of the less relevant VMs from the old to the new installation (currently 1 node on v6.1, 5 nodes on v5.4). Everything seems to be running fine, no issues. Obviously we're operating on the VMs with extreme caution, transferring them from the less to the more relevant ones.

If there are no issues over the weekend, we plan to move another node from the old to the new configuration; it will be the turn of a DL380p Gen8 to be reinstalled with v6.1.

regards,
Francesco
 
Well, we've moved one unit of each type to the new configuration (currently 3 nodes on v5.4 / 3 nodes on v6.1).

I can confirm that the QLogic HBAs are working fine with Proxmox v6.1.

The new kernel seems to run fine; we're now moving all VMs, one by one, from the old to the new environment, just as I described before, always keeping in mind to avoid name conflicts!

Thanks again to Stoiko Ivanov for his suggestions.

But all of you, please keep in mind that I only did this because no other canonical way was available!

So, as they usually say on TV shows where someone does something crazy:

"please, don't try this at home!!"

regards,

Francesco
 
Hi @coppola_f
Sorry Francesco, did you try to update from version 6.1 to 6.2? Unfortunately, with version 6.2 and kernel 5.4 the QLogic HBA cards do not see any storage or the related LVM volumes.


Sorry,
I haven't tested that yet, but frankly, given what you report, I'll avoid any kind of update!

We've already recovered from a risky situation by moving across a minefield with extreme caution: doing something considered crazy (not officially supported!) very carefully and finally reaching a stable environment.

I'll wait until someone reports a successfully working QLogic HBA adapter with this class of kernel!

regards,
Francesco
 
