Cluster 2.1: iSCSI detected conn error (1021)

May 16, 2013
hi,

we have an Open-E DSS7 iSCSI cluster with two virtual IPs for the target, plus a three-node Proxmox 2.1 cluster, but I have massive problems getting stable iSCSI connections on all nodes:

Code:
[...]
May 16 18:28:01 node-01 iscsid: Kernel reported iSCSI connection 1:0 error (1021) state (3)
May 16 18:28:01 node-01 iscsid: connection1:0 is operational after recovery (1 attempts)
May 16 18:28:01 node-01 kernel: scsi 6:0:0:0: Device offlined - not ready after error recovery
May 16 18:28:43 node-01 kernel: connection2:0: detected conn error (1021)
May 16 18:28:44 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
May 16 18:28:45 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
May 16 18:28:54 node-01 kernel: connection2:0: detected conn error (1021)
May 16 18:28:55 node-01 kernel: scsi 7:0:0:0: Device offlined - not ready after error recovery
May 16 18:28:55 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
May 16 18:28:55 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
May 16 18:28:55 node-01 pvestatd[2201]: status update time (411.585 seconds)
May 16 18:29:57 node-01 kernel: connection3:0: detected conn error (1021)
May 16 18:29:58 node-01 iscsid: Kernel reported iSCSI connection 3:0 error (1021) state (3)
May 16 18:29:59 node-01 iscsid: connection3:0 is operational after recovery (1 attempts)
[...]
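A quick way to see which connections are flapping is to count the 1021 events per connection. A minimal sketch (the message format is taken from the log excerpt above; sample lines are inlined here instead of piping in /var/log/syslog):

```shell
#!/bin/sh
# Count "detected conn error (1021)" events per iSCSI connection.
# In real use you would pipe /var/log/syslog into this function.
count_conn_errors() {
  grep 'detected conn error (1021)' |
    sed 's/.*kernel: \(connection[0-9]*:[0-9]*\):.*/\1/' |
    sort | uniq -c | sort -rn
}

# Sample lines copied from the syslog excerpt above:
count_conn_errors <<'EOF'
May 16 18:28:43 node-01 kernel: connection2:0: detected conn error (1021)
May 16 18:28:54 node-01 kernel: connection2:0: detected conn error (1021)
May 16 18:29:57 node-01 kernel: connection3:0: detected conn error (1021)
EOF
```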

The nodes and the iSCSI storage are on their own LACP channel bond.
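For reference, a typical Debian LACP bond in /etc/network/interfaces looks roughly like this (interface names, address, and hash policy here are illustrative placeholders, not our exact config):

```
auto bond0
iface bond0 inet static
    address 192.168.200.4
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4
```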

Code:
Current active iSCSI sessions:
tcp: [1] 192.168.220.20:3260,1 iqn.2013-03:san.backuphost
tcp: [10] 130.xx.xx.xx:3260,1 iqn.2013-04:san.ldap2host
tcp: [2] 130.xx.xx.xx:3260,1 iqn.2013-03:san.backuphost
tcp: [3] 192.168.220.20:3260,1 iqn.2013-05:san.supporthost
tcp: [4] 130.xx.xx.xx:3260,1 iqn.2013-05:san.supporthost
tcp: [5] 192.168.220.20:3260,1 iqn.2013-04:san.icingahost
tcp: [6] 130.xx.xx.xx:3260,1 iqn.2013-04:san.icingahost
tcp: [7] 192.168.220.20:3260,1 iqn.2013-05:san.nypdhost
tcp: [8] 130.xx.xx.xx:3260,1 iqn.2013-05:san.nypdhost
tcp: [9] 192.168.220.20:3260,1 iqn.2013-04:san.ldap2host

The nodes have three IPs:

1. 130.xx.xx.xx -> IP for external connections and vmbr0 -> also the second path to the cluster IP for iSCSI
2. 192.168.200.x -> cluster communication -> bond0 -> LACP
3. 192.168.220.x -> main iSCSI initiator -> iSCSI target is the Open-E cluster IP 192.168.220.20

Adding the target is done only via 192.168.220.20.
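For completeness, adding a target only via that portal with open-iscsi looks like this (the IQN is one from the session list above; a sketch, not our exact command history):

```
iscsiadm -m discovery -t sendtargets -p 192.168.220.20
iscsiadm -m node -T iqn.2013-03:san.backuphost -p 192.168.220.20:3260 --login
iscsiadm -m session    # verify the session came up
```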

I also tried multipath, but that changed nothing; both paths are lost. Sometimes it works for only minutes, sometimes it works for hours.
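For anyone reproducing the multipath attempt, a minimal /etc/multipath.conf along these lines is a reasonable starting point (values are common defaults, not our verified config):

```
defaults {
    polling_interval     5
    path_grouping_policy multibus
    path_selector        "round-robin 0"
    failback             immediate
    no_path_retry        12
}
```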

We also have one initiator (Ubuntu) that works without ever losing the connection.

Any suggestions?
 
Have you tried without LACP, e.g. using a single connection?
What kind of switch are you using?
Anything found in the switch log (provided the switch has a log)?
 
hi,

Have you tried without LACP, e.g. using a single connection?
What kind of switch are you using?
Anything found in the switch log (provided the switch has a log)?

It is a Cisco WS-C2960S-48TS-L. All non-Proxmox servers (also Debian Squeeze) run with LACP as well. It was working for a few hours, but now the errors are back. Yesterday I also removed a virtual cluster IP from the Open-E, to make sure the connections are handled over only one switch:

Code:
....
May 17 10:56:28 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
May 17 10:56:58 node-01 kernel: connection2:0: detected conn error (1021)
May 17 10:56:59 node-01 kernel: sd 7:0:0:0: Device offlined - not ready after error recovery
May 17 10:56:59 node-01 kernel: sd 7:0:0:0: rejecting I/O to offline device
May 17 10:56:59 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
May 17 10:56:59 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
May 17 10:58:20 node-01 kernel: connection2:0: detected conn error (1021)
May 17 10:58:21 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
May 17 10:58:21 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
May 17 10:58:51 node-01 kernel: connection2:0: detected conn error (1021)
May 17 10:58:52 node-01 kernel: sd 7:0:0:0: Device offlined - not ready after error recovery
May 17 10:58:52 node-01 kernel: sd 7:0:0:0: [sdd] READ CAPACITY failed
May 17 10:58:52 node-01 kernel: sd 7:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
May 17 10:58:52 node-01 kernel: sd 7:0:0:0: [sdd] Sense not available.
May 17 10:58:52 node-01 kernel: sd 7:0:0:0: rejecting I/O to offline device
May 17 10:58:52 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
May 17 10:58:52 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
May 17 11:00:13 node-01 kernel: connection2:0: detected conn error (1021)
May 17 11:00:14 node-01 iscsid: Kernel reported iSCSI connection 2:0 error (1021) state (3)
May 17 11:00:14 node-01 iscsid: connection2:0 is operational after recovery (1 attempts)
...

All three nodes have now this problem.

cu denny

PS: next working day we will buy a Standard Support pack for all three nodes.
 
hi,

uhh bad:

Code:
May 17 11:23:54 node-01 iscsid: Kernel reported iSCSI connection 6:0 error (1021) state (3)
May 17 11:23:55 node-01 iscsid: connection6:0 is operational after recovery (1 attempts)
May 17 11:24:11 node-01 kernel: INFO: task iscsiadm:30111 blocked for more than 120 seconds.
May 17 11:24:11 node-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 11:24:11 node-01 kernel: iscsiadm      D ffff880478a453e0     0 30111   2281    0 0x00000000
May 17 11:24:11 node-01 kernel: ffff880463f79958 0000000000000082 0000000000000000 0000000000000082
May 17 11:24:11 node-01 kernel: ffff880463f798e8 ffffffff81090f81 ffff880476030000 ffff880478809580
May 17 11:24:11 node-01 kernel: 0000000000000000 00000001036684e8 ffff880478a459a8 000000000001e9c0
May 17 11:24:11 node-01 kernel: Call Trace:
May 17 11:24:11 node-01 kernel: [<ffffffff81090f81>] ? __queue_work+0x41/0x50
May 17 11:24:11 node-01 kernel: [<ffffffff8151f185>] schedule_timeout+0x215/0x2e0
May 17 11:24:11 node-01 kernel: [<ffffffff81272f27>] ? kobject_put+0x27/0x60
May 17 11:24:11 node-01 kernel: [<ffffffff81012a49>] ? read_tsc+0x9/0x20
May 17 11:24:11 node-01 kernel: [<ffffffff810a1e09>] ? ktime_get_ts+0xa9/0xe0
May 17 11:24:11 node-01 kernel: [<ffffffff8151d79f>] io_schedule_timeout+0x7f/0xd0
May 17 11:24:11 node-01 kernel: [<ffffffff8151e8a4>] wait_for_completion_io+0xe4/0x120
May 17 11:24:11 node-01 kernel: [<ffffffff8105a520>] ? default_wake_function+0x0/0x20
May 17 11:24:11 node-01 kernel: [<ffffffff8125afcf>] ? blk_execute_rq_nowait+0x7f/0x100
May 17 11:24:11 node-01 kernel: [<ffffffff8125b0dc>] blk_execute_rq+0x8c/0xf0
May 17 11:24:11 node-01 kernel: [<ffffffff812546e0>] ? blk_rq_bio_prep+0x30/0xb0
May 17 11:24:11 node-01 kernel: [<ffffffff8125ac16>] ? blk_rq_map_kern+0xd6/0x150
May 17 11:24:11 node-01 kernel: [<ffffffff8136f758>] scsi_execute+0xe8/0x180
May 17 11:24:11 node-01 kernel: [<ffffffff8136f88b>] scsi_execute_req+0x9b/0x110
May 17 11:24:11 node-01 kernel: [<ffffffff81379109>] read_capacity_10+0x99/0x250
May 17 11:24:11 node-01 kernel: [<ffffffff8137b677>] sd_revalidate_disk+0x1127/0x17b0
May 17 11:24:11 node-01 kernel: [<ffffffff8127327a>] ? kobject_get+0x1a/0x30
May 17 11:24:11 node-01 kernel: [<ffffffff81270000>] ? compat_blkdev_ioctl+0x12b0/0x1570
May 17 11:24:11 node-01 kernel: [<ffffffff811d4b08>] revalidate_disk+0x38/0x90
May 17 11:24:11 node-01 kernel: [<ffffffff81378727>] sd_rescan+0x27/0x40
May 17 11:24:11 node-01 kernel: [<ffffffff813703bd>] scsi_rescan_device+0x8d/0xe0
May 17 11:24:11 node-01 kernel: [<ffffffff81373786>] store_rescan_field+0x16/0x20
May 17 11:24:11 node-01 kernel: [<ffffffff8134cb80>] dev_attr_store+0x20/0x30
May 17 11:24:11 node-01 kernel: [<ffffffff81213455>] sysfs_write_file+0xe5/0x170
May 17 11:24:11 node-01 kernel: [<ffffffff81198ba8>] vfs_write+0xb8/0x1a0
May 17 11:24:11 node-01 kernel: [<ffffffff811994a1>] sys_write+0x51/0x90
May 17 11:24:11 node-01 kernel: [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
May 17 11:24:25 node-01 kernel: connection6:0: detected conn error (1021)
May 17 11:24:25 node-01 iscsid: Kernel reported iSCSI connection 6:0 error (1021) state (3)
May 17 11:24:26 node-01 kernel: sd 11:0:0:0: Device offlined - not ready after error recovery
May 17 11:24:26 node-01 kernel: sd 11:0:0:0: rejecting I/O to offline device
May 17 11:24:26 node-01 iscsid: connection6:0 is operational after recovery (1 attempts)
 
hi,

we have now moved completely from LACP to single-port connections, on all three Proxmox servers and on the Open-E DSS7. No change.

Code:
eth2      Link encap:Ethernet  HWaddr 00:0a:f7:10:01:dc
          inet addr:192.168.200.4  Bcast:192.168.200.255  Mask:255.255.255.0
          inet6 addr: fe80::20a:f7ff:fe10:1dc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:66812 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34805 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:70842118 (67.5 MiB)  TX bytes:6593238 (6.2 MiB)

Code:
 connection7:0: detected conn error (1021)
sd 11:0:0:0: Device offlined - not ready after error recovery
sd 11:0:0:0: rejecting I/O to offline device
 connection11:0: detected conn error (1021)
 connection11:0: detected conn error (1021)
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: Device offlined - not ready after error recovery
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 dc b9 e6 00 00 10 00
sd 15:0:0:0: rejecting I/O to offline device
sd 15:0:0:0: [sdd] killing request
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 be 28 fe 00 00 10 00
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 b9 3b 26 00 00 10 00
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 b8 3d 4e 00 00 10 00
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 ae 61 9e 00 00 10 00
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 a9 73 c6 00 00 10 00
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
sd 15:0:0:0: [sdd] CDB: Write(10): 2a 00 06 a8 75 ee 00 00 10 00
sd 15:0:0:0: [sdd] Unhandled error code
sd 15:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVE

Any suggestions? I also pinged the cluster IP for about 10 minutes; no packet loss.
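In case it helps anyone debugging similar errors, these open-iscsi keepalive and recovery settings in /etc/iscsi/iscsid.conf influence how quickly dead connections are detected and retried (the values shown are common defaults, not a verified fix for this problem):

```
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.session.initial_login_retry_max = 8
```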
 
Hello rbg, did you solve this problem? I have a similar problem (conn error) and I cannot find the reason for it.
 
hi,

Hello rbg, did you solve this problem? I have a similar problem (conn error) and I cannot find the reason for it.

yes, I solved the problem, but not on the Proxmox side. The main problem was a broken iSCSI/DRBD setup on the Open-E DSS7. DSS7 is a black box, so I don't know what exactly was wrong, but their support service fixed the issue.

Sorry that I can't help you more.
 
