HA problem

bemar · Apr 25, 2012

Hello,

I've created an HA Cluster with 3 nodes.
Fencing is working with the APC7921 fencing device.

I've tried to start a VM and got the following error:

PHP:

Executing HA start for CT 109
Member vmhost2 trying to enable pvevm:109...Could not connect to resource group manager
TASK ERROR: command 'clusvcadm -e pvevm:109 -m vmhost2' failed: exit code 1

Starting the rgmanager failed with:

PHP:

root@vmhost2:~# /etc/init.d/rgmanager start
Starting Cluster Service Manager: [FAILED]

What could be the problem?

Thank you an best regards

Ben

bemar · Apr 25, 2012

Thats what I've got from "cat /var/log/syslog | grep dlm"

PHP:

Apr 25 15:13:05 vmhost2 kernel: dlm: closing connection to node 3
Apr 25 15:48:07 vmhost2 dlm_controld[2186]: dlm_controld 1324544458 started
Apr 25 15:48:18 vmhost2 kernel: dlm: Using TCP for communications
Apr 25 15:48:19 vmhost2 dlm_controld[2186]: dlm_join_lockspace no fence domain
Apr 25 15:48:19 vmhost2 dlm_controld[2186]: process_uevent online@ error -1 errno 2
Apr 25 15:48:19 vmhost2 kernel: dlm: rgmanager: group join failed -1 -1
Apr 25 16:02:12 vmhost2 kernel: dlm: Using TCP for communications
Apr 25 16:02:12 vmhost2 dlm_controld[2186]: dlm_join_lockspace no fence domain
Apr 25 16:02:12 vmhost2 dlm_controld[2186]: process_uevent online@ error -1 errno 11
Apr 25 16:02:12 vmhost2 kernel: dlm: rgmanager: group join failed -1 -1

Thats my cluster.conf

PHP:

<?xml version="1.0"?>
<cluster config_version="10" name="FinawareCluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.61.14" login="apc" name="apc" passwd="apc"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="vmhost1" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <device name="apc" port="1" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vmhost2" nodeid="2" votes="1">
      <fence>
        <method name="power">
          <device name="apc" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="vmhost3" nodeid="3" votes="1">
      <fence>
        <method name="power">
          <device name="apc" port="3" secure="on"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="102"/>
    <pvevm autostart="1" vmid="107"/>
    <pvevm autostart="1" vmid="109"/>
  </rm>
</cluster>

bemar · Apr 25, 2012

Got it.

the rgmanager must run on the master node (vmhost1). I've tried to start it first on vmhost2 and that failed.
After starting rgmanager on master node vmhost1 all other startups on the other nodes succeeded.

Little hint: execute "update-rc.d rgmanager defaults" to make sure it will be started on node reboot.

Now I've earned a cigarette ;-)

Best regards

Bena

bemar · Apr 25, 2012

Hi,

I guess that was to early:

The rgmanager is running on vmhost1 (master).

On the 2 other nodes I get the error in syslog:

PHP:

dlm: Using TCP for communications
dlm: rgmanager: group join failed -1 -1

The rgmanager is starting but is stopping right after the start.

I have no clue what the problem is because there is no more info.

Any ideas?

Best regards

Ben

mir · Apr 25, 2012

bemar said:
Hi,
On the 2 other nodes I get the error in syslog:

PHP:

dlm: Using TCP for communications dlm: rgmanager: group join failed -1 -1

The rgmanager is starting but is stopping right after the start.

I have no clue what the problem is because there is no more info.

Any ideas?

Firewall issues or managed switch blocking communication between ports involved (vlan tagging)?

bemar · Apr 25, 2012

Nothing of them. The PVE Communication is running through the same route and cards and that works.

Thats my network config:

PHP:

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.61.10
        netmask 255.255.255.0
        gateway 192.168.61.1
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0

auto bond0
iface bond0 inet static
        slaves eth0 eth2
        address 172.60.23.6
        netmask 255.255.255.240
        network 172.60.23.0
        broadcast 172.60.23.15
        bond miimon 100
        bond_mode balance-rr

mir · Apr 25, 2012

bemar said:

Nothing of them. The PVE Communication is running through the same route and cards and that works.

Thats my network config:

PHP:

auto lo
iface lo inet loopback

iface eth0 inet manual

iface eth1 inet manual

iface eth2 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.61.10
        netmask 255.255.255.0
        gateway 192.168.61.1
        bridge_ports eth1
        bridge_stp off
        bridge_fd 0

auto bond0
iface bond0 inet static
        slaves eth0 eth2
        address 172.60.23.6
        netmask 255.255.255.240
        network 172.60.23.0
        broadcast 172.60.23.15
        bond miimon 100
        bond_mode balance-rr

Are the nodes assigned IP on 172.60.23.0 or 192.168.61.0? routing does not automatically take place between vmbr0 and bond0

bemar · Apr 25, 2012

They are assigned to each other to the 172.60.23.0 network

mir · Apr 25, 2012

bemar said:
They are assigned to each other to the 172.60.23.0 network

But your fence agent:

<fencedevices>
<fencedevice agent="fence_apc" ipaddr="192.168.61.14" login="apc" name="apc" passwd="apc"/>
</fencedevices>

How are the fence agent suppose to communicate with the nodes and the node manager?

webstaff · Apr 25, 2012

bemar said:
They are assigned to each other to the 172.60.23.0 network

sounds like a little issue i'm having. Out of interest try stopping CRON CMAN and restarting PVECluster manager then start CRON CMAN and then RGManager sounds very similar to part of the issue i'm having with HA not quite playing ball.

Dave

bemar · Apr 25, 2012

Through the other network.

When I execute the "fence_apc" command on the nodes it works

PHP:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.60.23.0     *               255.255.255.240 U     0      0        0 bond0
192.168.61.0    *               255.255.255.0   U     0      0        0 vmbr0
default         192.168.61.1    0.0.0.0         UG    0      0        0 vmbr0

bemar · Apr 25, 2012

No luck. Got

PHP:

Apr 25 17:48:41 vmhost2 dlm_controld[2626]: dlm_controld 1324544458 started
Apr 25 17:48:49 vmhost2 kernel: dlm: Using TCP for communications
Apr 25 17:48:49 vmhost2 dlm_controld[2626]: dlm_join_lockspace no fence domain
Apr 25 17:48:49 vmhost2 dlm_controld[2626]: process_uevent online@ error -1 errno 2
Apr 25 17:48:49 vmhost2 kernel: dlm: rgmanager: group join failed -1 -1

bemar · Apr 25, 2012

That solved my problem:

http://forum.proxmox.com/threads/7872-Problems-after-updating-cluster

rgmanagers are running on all thre nodes until now

mir · Apr 25, 2012

bemar said:
Through the other network.

When I execute the "fence_apc" command on the nodes it works

And your fence can communicate to the nodes?

dietmar · Apr 25, 2012

Make sure you enable fencing in /etc/default/redhat-cluster-pve

see: http://pve.proxmox.com/wiki/Fencing

bemar · Apr 26, 2012

Today at 5:01 a.m. a very serious error occured on vmhost1 (master). As I came to the office today the machine was marked red in the GUI and all VMs and Containers off.

PHP:

Apr 26 05:01:10 vmhost1 kernel: connection1:0: detected conn error (1011)
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: Device offlined - not ready after error recovery
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: Device offlined - not ready after error recovery
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: Device offlined - not ready after error recovery
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: Device offlined - not ready after error recovery
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: Device offlined - not ready after error recovery
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Unhandled error code
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] CDB: Write(10): 2a 00 0f 88 32 08 00 00 08 00
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Unhandled error code
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] CDB: Write(10): 2a 00 0f 88 44 d0 00 00 08 00
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Unhandled error code
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] CDB: Write(10): 2a 00 0f 55 5f e8 00 00 40 00
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Unhandled error code
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] CDB: Write(10): 2a 00 0f 55 60 a0 00 00 08 00
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Unhandled error code
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Apr 26 05:03:17 vmhost1 kernel: sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 10 80 e7 80 00 00 08 00
Apr 26 05:04:30 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:05:22 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:06:14 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:07:06 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:07:58 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:08:50 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:09:42 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:10:34 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:11:26 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:12:39 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:13:31 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:14:23 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:15:15 vmhost1 kernel: scsi 5:0:0:1: Device offlined - not ready after error recovery
Apr 26 05:17:56 vmhost1 kernel: iscsiadm      D ffff880f3a47c580     0 188738   3071    0 0x00000000
Apr 26 05:17:56 vmhost1 kernel: ffff880f03fd77f8 0000000000000082 0000000000000000 ffffffff8100984c
Apr 26 05:17:56 vmhost1 kernel: ffff8807bcfe1278 0000000000000000 0000000000fd77b8 ffff88002825bd80
Apr 26 05:17:56 vmhost1 kernel: ffff88002825e268 ffff880f3a47cb20 ffff880f03fd7fd8 ffff880f03fd7fd8
Apr 26 05:17:56 vmhost1 kernel: Call Trace:
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8150b4a5>] schedule_timeout+0x215/0x2e0
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8150b113>] wait_for_common+0x123/0x190
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81059b50>] ? default_wake_function+0x0/0x20
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81240643>] ? __generic_unplug_device+0x33/0x40
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8150b23d>] wait_for_completion+0x1d/0x20
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81247eec>] blk_execute_rq+0x8c/0xf0
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81241970>] ? blk_rq_bio_prep+0x30/0xb0
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81247a66>] ? blk_rq_map_kern+0xd6/0x150
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8135d80c>] scsi_execute+0xfc/0x160
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8135da88>] scsi_execute_req+0xb8/0x190
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8135f20c>] scsi_probe_and_add_lun+0x2dc/0xef0
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81242659>] ? blk_put_request+0x49/0x60
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81260127>] ? kobject_put+0x27/0x60
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8136023c>] __scsi_scan_target+0x41c/0x750
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81360ca5>] scsi_scan_target+0xd5/0xf0
Apr 26 05:17:56 vmhost1 kernel: [<ffffffffa029bd69>] iscsi_user_scan_session+0x159/0x190 [scsi_transport_iscsi]
Apr 26 05:17:56 vmhost1 kernel: [<ffffffffa029bc10>] ? iscsi_user_scan_session+0x0/0x190 [scsi_transport_iscsi]
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8133ba8c>] device_for_each_child+0x4c/0x80
Apr 26 05:17:56 vmhost1 kernel: [<ffffffffa029a69d>] iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff81361854>] store_scan+0xe4/0x120
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8133a960>] dev_attr_store+0x20/0x30
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff812032e5>] sysfs_write_file+0xe5/0x170
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8118ba61>] sys_write+0x51/0x90
Apr 26 05:17:56 vmhost1 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 26 05:19:56 vmhost1 kernel: iscsiadm      D ffff880f3a47c580     0 188738   3071    0 0x00000000
Apr 26 05:19:56 vmhost1 kernel: ffff880f03fd77f8 0000000000000082 0000000000000000 ffffffff8100984c
Apr 26 05:19:56 vmhost1 kernel: ffff8807bcfe1278 0000000000000000 0000000000fd77b8 ffff88002825bd80
Apr 26 05:19:56 vmhost1 kernel: ffff88002825e268 ffff880f3a47cb20 ffff880f03fd7fd8 ffff880f03fd7fd8
Apr 26 05:19:56 vmhost1 kernel: Call Trace:
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8150b4a5>] schedule_timeout+0x215/0x2e0
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8150b113>] wait_for_common+0x123/0x190
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81059b50>] ? default_wake_function+0x0/0x20
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81240643>] ? __generic_unplug_device+0x33/0x40
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8150b23d>] wait_for_completion+0x1d/0x20
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81247eec>] blk_execute_rq+0x8c/0xf0
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81241970>] ? blk_rq_bio_prep+0x30/0xb0
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81247a66>] ? blk_rq_map_kern+0xd6/0x150
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8135d80c>] scsi_execute+0xfc/0x160
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8135da88>] scsi_execute_req+0xb8/0x190
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8135f20c>] scsi_probe_and_add_lun+0x2dc/0xef0
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81242659>] ? blk_put_request+0x49/0x60
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81260127>] ? kobject_put+0x27/0x60
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8136023c>] __scsi_scan_target+0x41c/0x750
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81360ca5>] scsi_scan_target+0xd5/0xf0
Apr 26 05:19:56 vmhost1 kernel: [<ffffffffa029bd69>] iscsi_user_scan_session+0x159/0x190 [scsi_transport_iscsi]
Apr 26 05:19:56 vmhost1 kernel: [<ffffffffa029bc10>] ? iscsi_user_scan_session+0x0/0x190 [scsi_transport_iscsi]
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8133ba8c>] device_for_each_child+0x4c/0x80
Apr 26 05:19:56 vmhost1 kernel: [<ffffffffa029a69d>] iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff81361854>] store_scan+0xe4/0x120
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8133a960>] dev_attr_store+0x20/0x30
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff812032e5>] sysfs_write_file+0xe5/0x170
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8118ba61>] sys_write+0x51/0x90
Apr 26 05:19:56 vmhost1 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 26 05:21:56 vmhost1 kernel: iscsiadm      D ffff880f3a47c580     0 188738   3071    0 0x00000000
Apr 26 05:21:56 vmhost1 kernel: ffff880f03fd77f8 0000000000000082 0000000000000000 ffffffff8100984c
Apr 26 05:21:56 vmhost1 kernel: ffff8807bcfe1278 0000000000000000 0000000000fd77b8 ffff88002825bd80
Apr 26 05:21:56 vmhost1 kernel: ffff88002825e268 ffff880f3a47cb20 ffff880f03fd7fd8 ffff880f03fd7fd8
Apr 26 05:21:56 vmhost1 kernel: Call Trace:
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8150b4a5>] schedule_timeout+0x215/0x2e0
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8150b113>] wait_for_common+0x123/0x190
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81059b50>] ? default_wake_function+0x0/0x20
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81240643>] ? __generic_unplug_device+0x33/0x40
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8150b23d>] wait_for_completion+0x1d/0x20
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81247eec>] blk_execute_rq+0x8c/0xf0
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81241970>] ? blk_rq_bio_prep+0x30/0xb0
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81247a66>] ? blk_rq_map_kern+0xd6/0x150
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8135d80c>] scsi_execute+0xfc/0x160
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8135da88>] scsi_execute_req+0xb8/0x190
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8135f20c>] scsi_probe_and_add_lun+0x2dc/0xef0
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81242659>] ? blk_put_request+0x49/0x60
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81260127>] ? kobject_put+0x27/0x60
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8136023c>] __scsi_scan_target+0x41c/0x750
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81360ca5>] scsi_scan_target+0xd5/0xf0
Apr 26 05:21:56 vmhost1 kernel: [<ffffffffa029bd69>] iscsi_user_scan_session+0x159/0x190 [scsi_transport_iscsi]
Apr 26 05:21:56 vmhost1 kernel: [<ffffffffa029bc10>] ? iscsi_user_scan_session+0x0/0x190 [scsi_transport_iscsi]
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8133ba8c>] device_for_each_child+0x4c/0x80
Apr 26 05:21:56 vmhost1 kernel: [<ffffffffa029a69d>] iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff81361854>] store_scan+0xe4/0x120
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8133a960>] dev_attr_store+0x20/0x30
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff812032e5>] sysfs_write_file+0xe5/0x170
Apr 26 05:21:56 vmhost1 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 26 05:23:56 vmhost1 kernel: iscsiadm      D ffff880f3a47c580     0 188738   3071    0 0x00000000
Apr 26 05:23:56 vmhost1 kernel: ffff880f03fd77f8 0000000000000082 0000000000000000 ffffffff8100984c
Apr 26 05:23:56 vmhost1 kernel: ffff8807bcfe1278 0000000000000000 0000000000fd77b8 ffff88002825bd80
Apr 26 05:23:56 vmhost1 kernel: ffff88002825e268 ffff880f3a47cb20 ffff880f03fd7fd8 ffff880f03fd7fd8
Apr 26 05:23:56 vmhost1 kernel: Call Trace:
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8150b4a5>] schedule_timeout+0x215/0x2e0
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8150b113>] wait_for_common+0x123/0x190
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81059b50>] ? default_wake_function+0x0/0x20
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81240643>] ? __generic_unplug_device+0x33/0x40
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8150b23d>] wait_for_completion+0x1d/0x20
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81247eec>] blk_execute_rq+0x8c/0xf0
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81241970>] ? blk_rq_bio_prep+0x30/0xb0
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81247a66>] ? blk_rq_map_kern+0xd6/0x150
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8135d80c>] scsi_execute+0xfc/0x160
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8135da88>] scsi_execute_req+0xb8/0x190
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8135f20c>] scsi_probe_and_add_lun+0x2dc/0xef0
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81242659>] ? blk_put_request+0x49/0x60
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81260127>] ? kobject_put+0x27/0x60
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8136023c>] __scsi_scan_target+0x41c/0x750
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81360ca5>] scsi_scan_target+0xd5/0xf0
Apr 26 05:23:56 vmhost1 kernel: [<ffffffffa029bd69>] iscsi_user_scan_session+0x159/0x190 [scsi_transport_iscsi]
Apr 26 05:23:56 vmhost1 kernel: [<ffffffffa029bc10>] ? iscsi_user_scan_session+0x0/0x190 [scsi_transport_iscsi]
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8133ba8c>] device_for_each_child+0x4c/0x80
Apr 26 05:23:56 vmhost1 kernel: [<ffffffffa029a69d>] iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff81361854>] store_scan+0xe4/0x120
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8133a960>] dev_attr_store+0x20/0x30
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff812032e5>] sysfs_write_file+0xe5/0x170
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8118ba61>] sys_write+0x51/0x90
Apr 26 05:23:56 vmhost1 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b
Apr 26 05:25:56 vmhost1 kernel: iscsiadm      D ffff880f3a47c580     0 188738   3071    0 0x00000000
Apr 26 05:25:56 vmhost1 kernel: ffff880f03fd77f8 0000000000000082 0000000000000000 ffffffff8100984c
Apr 26 05:25:56 vmhost1 kernel: ffff8807bcfe1278 0000000000000000 0000000000fd77b8 ffff88002825bd80
Apr 26 05:25:56 vmhost1 kernel: ffff88002825e268 ffff880f3a47cb20 ffff880f03fd7fd8 ffff880f03fd7fd8
Apr 26 05:25:56 vmhost1 kernel: Call Trace:
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8100984c>] ? __switch_to+0x1ac/0x320
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8150b4a5>] schedule_timeout+0x215/0x2e0
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8150b113>] wait_for_common+0x123/0x190
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81059b50>] ? default_wake_function+0x0/0x20
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81240643>] ? __generic_unplug_device+0x33/0x40
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8150b23d>] wait_for_completion+0x1d/0x20
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81247eec>] blk_execute_rq+0x8c/0xf0
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81241970>] ? blk_rq_bio_prep+0x30/0xb0
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81247a66>] ? blk_rq_map_kern+0xd6/0x150
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8135d80c>] scsi_execute+0xfc/0x160
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8135da88>] scsi_execute_req+0xb8/0x190
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8135f20c>] scsi_probe_and_add_lun+0x2dc/0xef0
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81242659>] ? blk_put_request+0x49/0x60
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81260127>] ? kobject_put+0x27/0x60
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8136023c>] __scsi_scan_target+0x41c/0x750
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81360ca5>] scsi_scan_target+0xd5/0xf0
Apr 26 05:25:56 vmhost1 kernel: [<ffffffffa029bd69>] iscsi_user_scan_session+0x159/0x190 [scsi_transport_iscsi]
Apr 26 05:25:56 vmhost1 kernel: [<ffffffffa029bc10>] ? iscsi_user_scan_session+0x0/0x190 [scsi_transport_iscsi]
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8133ba8c>] device_for_each_child+0x4c/0x80
Apr 26 05:25:56 vmhost1 kernel: [<ffffffffa029a69d>] iscsi_user_scan+0x2d/0x30 [scsi_transport_iscsi]
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff81361854>] store_scan+0xe4/0x120
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8133a960>] dev_attr_store+0x20/0x30
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff812032e5>] sysfs_write_file+0xe5/0x170
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8118b058>] vfs_write+0xb8/0x1a0
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8118ba61>] sys_write+0x51/0x90
Apr 26 05:25:56 vmhost1 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b

Do anybody see what the problem was?

The strange thing is that the node wasn't fenced although of that errors in message.log. All VMs and Containers (several managed by HA) weren't migrated to the other 2 nodes which are running well.
When do the fencing machanism do something?

Thank you

Ben

bread-baker · Apr 26, 2012

that looks like a hardware disk issue.

bemar · Apr 26, 2012

hmm. /dev/sdb/ is the iscsi device

Search

Search

HA problem

bemar

Member

bemar

Member

bemar

Member

bemar

Member

mir

Famous Member

bemar

Member

mir

Famous Member

bemar

Member

mir

Famous Member

webstaff

New Member

bemar

Member

bemar

Member

bemar

Member

mir

Famous Member

dietmar

Proxmox Staff Member

bemar

Member

bread-baker

Member

bemar

Member