iSCSI Issues - Dell Compellent SAN - Corruption of superblocks and controllers

My particular problem was that the QLogic cards (showing up as bnx2, which I believe is the Broadcom driver) were being recognised and the iSCSI connections worked fine, but as soon as I put any kind of load on the connection, restoring a snapshot for instance, I would get loads of IO errors and the host would crash to the point where I had to do a manual hard reset to bring it back up.
As soon as I installed the Intel cards, everything worked as expected.
 
Rule of thumb when dealing with Linux and NICs: stick to Intel. Their Linux driver support is unbeaten, and new drivers land on Linux at the same time as, and sometimes before, they do on Windows.
 
Chris, I am running into the same errors you posted earlier (see below) on Dell FX2 servers connected to a Dell Compellent system. The Proxmox version is the latest 4.2. Each server has four 'Intel Corporation Ethernet Controller X710 for 10GbE backplane' NICs.

Your first post said you got everything working with a large number of LUNs. Did it really withstand the load? Can you please share the config information from the Compellent side?

We used the Proxmox GUI to configure iSCSI and LVM. We tried multipath and singlepath; both give the same error, and the system becomes unresponsive when trying to restore a VM, with a bunch of these errors in the logs. We ended up cold-booting the server a few times.

Code:
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.783874] sd 13:0:0:1: [sdd] tag#16 Add. Sense: Synchronous data transfer error
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.783875] sd 13:0:0:1: [sdd] tag#16 CDB: Write(16) 8a 00 00 00 00 00 07 73 e8 c0 00 00 40 00 00 00
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.783876] blk_update_request: I/O error, dev sdd, sector 125036736
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.786139] sd 13:0:0:1: [sdd] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.786141] sd 13:0:0:1: [sdd] tag#19 Sense Key : Aborted Command [current]
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.786143] sd 13:0:0:1: [sdd] tag#19 Add. Sense: Synchronous data transfer error
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.786144] sd 13:0:0:1: [sdd] tag#19 CDB: Write(16) 8a 00 00 00 00 00 07 73 a8 c0 00 00 40 00 00 00
Jun 28 16:06:02 ch3rdprox01 kernel: [  885.786145] blk_update_request: I/O error, dev sdd, sector 125020352

thanks
 
Hi Bravo,

My original problem was that the drivers for the QLogic cards were not working properly, so I swapped them all out for Intel X520s. These worked fine under 4.1 (kernel 4.2.6-1), but I recently tried upgrading to 4.2 (kernel 4.4.6) and got the same IO errors even with the X520s. I have temporarily pinned my kernel to 4.2.6-1 until the IO errors are resolved in a newer kernel.
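
In case it helps, pinning came down to holding the old kernel package so apt leaves it alone (the exact package name here is an assumption; check yours with 'dpkg -l | grep pve-kernel'):

Code:
# Keep the known-good 4.2.6-1 kernel installed and hold it
apt-mark hold pve-kernel-4.2.6-1-pve
# Verify the hold took effect
apt-mark showhold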

Thanks,
Chris.
 
Thanks Chris. I tried 4.2, 4.1, and 3.4, and all of them give the same error. I also tried swapping the Intel card for a Broadcom one; same error again. I am trying 4.0 now and will post the results later today.
 
Hi Bravo,

We also have the same kind of problem with our Compellent SAN.


According to Compellent Support, the problem seems to be related to a change merged in version 3.19 of the kernel:

https://git.kernel.org/cgit/linux/k.../?id=34b48db66e08ca1c1bc07cf305d672ac940268dc


Each block device on a Linux system is allocated a queue directory (located at /sys/block/xxx/queue/). This directory contains two parameters that are relevant to this problem:


max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.

max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.



Kernel 3.19 and later changed the way max_sectors_kb is calculated. In previous kernel releases the value was always set to 512KB; after this change, max_sectors_kb is set to the value of max_hw_sectors_kb. This essentially increased the maximum size of a single IO transfer to a given block device from 512KB to 32MB.
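
You can check what your kernel picked for a given device by reading both values back (sdd here is just an example device name):

Code:
# Hardware limit vs. the current limit the block layer will use
cat /sys/block/sdd/queue/max_hw_sectors_kb
cat /sys/block/sdd/queue/max_sectors_kb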


This change also exposed a problem in the Compellent SAN 10Gb iSCSI driver with IO sizes greater than 8MB: the server process producing the large IO may hang, or the volume may become unavailable.

In order to prevent this problem, the max_sectors_kb for each block device must be changed dynamically on the Linux server. For example:


Code:
echo 512 > /sys/block/sdc/queue/max_sectors_kb
echo 512 > /sys/block/sdd/queue/max_sectors_kb
...



These changes do not persist across reboots. An init script run at boot (or a udev rule) must set this parameter for every iSCSI block device; a rough sketch of such a script is shown below.
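
As a minimal sketch (the sd* glob is an assumption and will also match local disks, so narrow it to just your iSCSI LUNs as appropriate):

Code:
#!/bin/sh
# Cap the filesystem-request size to 512KB on every sd* block device
for q in /sys/block/sd*/queue/max_sectors_kb; do
    echo 512 > "$q"
done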


You can probably refer to this link to create a udev script:

http://longwhiteclouds.com/2016/03/06/default-io-size-change-in-linux-kernel/
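
For reference, a udev rule along the lines of what that article describes might look like this (the file name and matching pattern are assumptions; I have not tested this exact rule against Compellent):

Code:
# /etc/udev/rules.d/99-max-sectors.rules
# Set max_sectors_kb to 512 whenever a sd* block device appears or changes
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/max_sectors_kb}="512"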


Yanick
 
Thank you Yanick!!! It worked. Appreciate the response and the detailed explanation. It is a great relief after toiling for several days to get this working.

I used the udev script from the link you provided above to set the size to 1MB and it is working.
 
