Hi all,
I have the following setup:
6 x physical nodes, all set up in a Proxmox PVE cluster (2 x E5-2667 v2 CPUs and 256 GB RAM per box)
2 x SAN controllers in my redundant Dell Compellent SAN, each with its own portal IP.
Everything is up and running, and since Proxmox won't show all the LUNs from one primary portal IP, I had to set up 2 x iSCSI storages - one for each controller.
So I have:
SAN (35 LUNs)
SAN2 (35 LUNs)
And on top of those I set up 70 LVM storages, as per the recommendations.
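To make the layout concrete, here is roughly what the relevant part of /etc/pve/storage.cfg ends up looking like with that approach - the portal IPs, target IQNs, storage names and the base volume ID below are just placeholders, not my real values, and the lvm block repeats for each of the 70 LUNs:
Code:
iscsi: SAN
        portal 10.10.10.1
        target iqn.2002-03.com.compellent:5000d31000aaaa01
        content none

iscsi: SAN2
        portal 10.10.10.2
        target iqn.2002-03.com.compellent:5000d31000aaaa02
        content none

lvm: lun01
        vgname lun01
        base SAN:0.0.1.scsi-36000d31000aaaa0100000000000000a1
        shared 1
        content images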
In the CLI the LUNs are reported 100% correctly, and multipathing etc. is all set up.
Is there no way to get Proxmox to see all the LUNs and both controllers from just one iSCSI entry?
I basically set up some VMs and all was fine and dandy - until they started randomly corrupting and ended up damaged beyond repair.
They were set up by picking one of the LVM storages as the disk storage, using VirtIO for disk + network, and the Ubuntu installation itself was done with guided use of the entire disk on LVM.
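For reference, the CLI equivalent of how those VMs were created looks roughly like this (VM ID, name, sizes and the ISO path are example values, and lun01 stands for whichever LVM storage was picked):
Code:
qm create 101 --name ubuntu-test --memory 8192 --cores 4 --ostype l26 \
    --net0 virtio,bridge=vmbr0 \
    --virtio0 lun01:100 \
    --ide2 local:iso/ubuntu-server-amd64.iso,media=cdrom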
One VM had more than 2,500 inode errors when I tried to repair it, and ultimately it was so ruined that the only option was to delete it and start over.
So there seems to be some form of data misalignment happening with that model.
Right now I have started over entirely: I set up the iSCSI storages to allow using the LUNs directly, and I am installing with guided use of the entire disk without LVM - so compared to before I removed two layers of LVM - and so far the system hasn't corrupted yet.
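In storage.cfg terms, the change for this test basically boils down to serving the LUNs directly from the iSCSI storages instead of defining LVM on top of them - something like this (placeholder values again):
Code:
iscsi: SAN
        portal 10.10.10.1
        target iqn.2002-03.com.compellent:5000d31000aaaa01
        content images

iscsi: SAN2
        portal 10.10.10.2
        target iqn.2002-03.com.compellent:5000d31000aaaa02
        content images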
So there are multiple questions here:
1) Is it possible to get Proxmox to show only one SAN with all the actual LUNs, no matter which controller they are on?
2) Has anyone experienced the corruption issues we are having, or have ideas on why it happens and what to do about it?
Just pasting my multipath conf here for reference (local disks sda and sdb are blacklisted).
multipath.conf
Code:
defaults {
    polling_interval     2
    path_selector        "round-robin 0"
    path_grouping_policy multibus
    getuid_callout       "/lib/udev/scsi_id -g -u -d /dev/%n"
    rr_min_io            100
    failback             immediate
    no_path_retry        queue
}

devices {
    # Compellent-specific settings
    device {
        vendor               "COMPELNT"
        product              "Compellent Vol"
        path_grouping_policy multibus
        getuid_callout       "/lib/udev/scsi_id --whitelisted --replace-whitespace --device=/dev/%n"
        path_selector        "round-robin 0"
        path_checker         tur
        features             "0"
        hardware_handler     "0"
        prio                 const
        failback             immediate
        rr_weight            uniform
        no_path_retry        queue
        rr_min_io            1000
        rr_min_io_rq         1
    }
}

blacklist {
    # local disks (sda and sdb, as mentioned above)
    wwid 36c81f660e05508001a6cf31a13d4e399
    wwid 1Dell_Internal_Dual_SD_0123456789AB
}
From a CLI point of view, with multipath -v3, multipath -f, multipath -l etc., everything is looking good and correct.
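For anyone who wants to double-check the same things, this is roughly the set of commands I go through on a node (map names will of course differ):
Code:
# list the active iSCSI sessions to both controllers
iscsiadm -m session

# show the current multipath maps and all their paths
multipath -ll

# verbose dry run, shows why paths are (or are not) grouped
multipath -v3

# flush a single stale map and let multipath re-create it
multipath -f <mapname>
multipath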
But something is still behind the corruption that's happening, so good ideas or experiences are welcome.
PS: I'll update later on how the testing with direct LUNs and no LVM involved at all goes.