HA + GlusterFS for shared storage of containers

fknorn · Jun 5, 2014

Hi there,

Please forgive my ignorance, but I'm rather new to Proxmox. After much trial and error I would like to check something with you; maybe I am hoping for too much.

I'm tasked with setting up an HA cluster, where nodes can just fail and containers running on said nodes get automatically fired up on the remaining nodes. So, High Availability.

Now, I've set up a 3 node cluster with the latest proxmox version (pve-manager/3.2-4/e24a91c1 (running kernel: 2.6.32-29-pve)), got fencing and HA working, etc.

To test the clustering, I can online migrate (locally hosted) CTs around with zero downtime, migration taking about 40 seconds or so total.

Next I've created a 4th machine which runs a GlusterFS server (v 3.4.2), and added one of its volumes as shared storage to my Proxmox cluster.

I've then created another container, hosted on this shared storage. I can also online migrate this container around, although it takes longer (and the downtime in particular is much longer). But that's another thing to sort out.

More crucially --- if I pull the power cord on the node running that second container (living on the Gluster share...), the container does not get fired up somewhere else by the resource group manager. Worse, I can't even manually "migrate" --- or just fire up at all --- that container. If I try to do that, I get something along the lines of "no route to host".

So what am I doing wrong? I of course would like to see the proxmox cluster recognising the node failure and fire up that container on one of the two remaining nodes ASAP!

Thanks for helping me out!

fknorn · Jun 7, 2014

Is it possible that I'll have to look into failover domains for things to work?

cesarpk · Jun 8, 2014

fknorn said:
Is it possible that I'll have to look into failover domains for things to work?

There may be several reasons:
1- Your brick isn't replicated, due to your configuration
2- Not all PVE nodes have the same GlusterFS version for communicating with the server (aptitude update && aptitude full-upgrade)
3- The PVE node needs more RAM to start your VMs/CTs

In any case, you should look at your logs. They are your friends: in most cases they help to determine the root of the problem. So check the logs and post them here so that we can help you.

As for "failover domains", this configuration is needed to define an order of preference for your VMs when HA has to kick in.

fknorn · Jun 9, 2014

Hi cesarpk,

Thanks for those pointers. However, I'm just building a proof of concept testbed...

I did some more testing. I've narrowed the problem down somewhat. Two scenarios:

A) I remove the network cable (on the interface where the Proxmox cluster communication takes place). In that case, the node gets fenced and the CT previously running on it gets fired up again somewhere else. All automatic, all good!

B) However, if I remove the power cord from that node, I get the syslog below. The problem is that the other nodes keep trying to fence the node (the repeated "fence dcmgm-proxmox2 failed" lines in the log), but won't go on to the next step of firing up the CT somewhere else! In fact, rgmanager seems to get locked up that way... This only gets resolved if I put the power back onto the killed node. In that case, fencing eventually succeeds (the "fence dcmgm-proxmox2 success" line at 11:40:39), and only THEN the CT gets moved.

Now my question: why this locking up of rgmanager, and why doesn't the CT get fired up somewhere else in the meantime? Or: How can I achieve this?!

Thanks for your help, gentlemen!!

Code:

Jun  9 11:36:29 dcmgm-proxmox1 corosync[2949]:   [TOTEM ] A processor failed, forming new configuration.
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] CLM CONFIGURATION CHANGE
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] New Configuration:
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] #011r(0) ip(10.144.0.11)
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] #011r(0) ip(10.144.0.13)
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] Members Left:
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] #011r(0) ip(10.144.0.12)
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] Members Joined:
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [QUORUM] Members[2]: 1 3
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] CLM CONFIGURATION CHANGE
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] New Configuration:
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] #011r(0) ip(10.144.0.11)
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] #011r(0) ip(10.144.0.13)
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] Members Left:
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CLM   ] Members Joined:
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun  9 11:36:41 dcmgm-proxmox1 kernel: dlm: closing connection to node 2
Jun  9 11:36:41 dcmgm-proxmox1 rgmanager[3275]: State change: dcmgm-proxmox2 DOWN
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [CPG   ] chosen downlist: sender r(0) ip(10.144.0.11) ; members(old:3 left:1)
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: members: 1/2780, 3/2778
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: starting data syncronisation
Jun  9 11:36:41 dcmgm-proxmox1 corosync[2949]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun  9 11:36:41 dcmgm-proxmox1 fenced[3005]: fencing node dcmgm-proxmox2
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: cpg_send_message retried 1 times
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: members: 1/2780, 3/2778
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: starting data syncronisation
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: received sync request (epoch 1/2780/00000004)
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: received sync request (epoch 1/2780/00000004)
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: received all states
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: leader is 1/2780
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: synced members: 1/2780, 3/2778
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: start sending inode updates
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: sent all (0) updates
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: all data is up to date
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: received all states
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [dcdb] notice: all data is up to date
Jun  9 11:36:41 dcmgm-proxmox1 pmxcfs[2780]: [status] notice: dfsm_deliver_queue: queue length 18
Jun  9 11:36:51 dcmgm-proxmox1 pveproxy[43162]: WARNING: proxy detected vanished client connection
Jun  9 11:36:55 dcmgm-proxmox1 pveproxy[3806]: worker 43162 finished
Jun  9 11:36:55 dcmgm-proxmox1 pveproxy[3806]: starting 1 worker(s)
Jun  9 11:36:55 dcmgm-proxmox1 pveproxy[3806]: worker 44530 started
Jun  9 11:37:01 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 dev 0.0 agent fence_ilo3 result: error from agent
Jun  9 11:37:01 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 failed
Jun  9 11:37:04 dcmgm-proxmox1 fenced[3005]: fencing node dcmgm-proxmox2
Jun  9 11:37:24 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 dev 0.0 agent fence_ilo3 result: error from agent
Jun  9 11:37:24 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 failed
Jun  9 11:37:27 dcmgm-proxmox1 fenced[3005]: fencing node dcmgm-proxmox2
Jun  9 11:37:47 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 dev 0.0 agent fence_ilo3 result: error from agent
Jun  9 11:37:47 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 failed
Jun  9 11:38:40 dcmgm-proxmox1 pvedaemon[41453]: <root@pam> successful auth for user 'root@pam'
Jun  9 11:39:14 dcmgm-proxmox1 kernel: INFO: task rgmanager:44491 blocked for more than 120 seconds.
Jun  9 11:39:14 dcmgm-proxmox1 kernel:      Not tainted 2.6.32-29-pve #1
Jun  9 11:39:14 dcmgm-proxmox1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  9 11:39:14 dcmgm-proxmox1 kernel: rgmanager     D ffff880c1586cbb0     0 44491   3273    0 0x00000000
Jun  9 11:39:14 dcmgm-proxmox1 kernel: ffff880c15d19c90 0000000000000086 ffff880c15894df0 0000000000000003
Jun  9 11:39:14 dcmgm-proxmox1 kernel: 0000000000000000 ffff880c15d19c18 ffffffff8128abc9 ffff880c15d19e34
Jun  9 11:39:14 dcmgm-proxmox1 kernel: 0000000000000000 ffff880c15d19da8 ffff880c1586d178 000000000001ec80
Jun  9 11:39:14 dcmgm-proxmox1 kernel: Call Trace:
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8128abc9>] ? cpumask_next_and+0x29/0x50
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8106bd31>] ? update_curr+0xe1/0x1f0
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8155ea55>] rwsem_down_failed_common+0x95/0x1d0
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8155ebe6>] rwsem_down_read_failed+0x26/0x30
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff812982d4>] call_rwsem_down_read_failed+0x14/0x30
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8155e0d4>] ? down_read+0x24/0x30
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffffa053efb6>] dlm_user_convert+0x46/0x1d0 [dlm]
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8155bd80>] ? thread_return+0xbe/0x89e
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff810a7635>] ? __hrtimer_start_range_ns+0x1a5/0x480
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff810a6cc1>] ? lock_hrtimer_base+0x31/0x60
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8119595f>] ? kmem_cache_alloc_trace+0x1df/0x200
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffffa054a15a>] device_write+0x42a/0x730 [dlm]
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8155df68>] ? do_nanosleep+0x48/0xc0
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff810a7b14>] ? hrtimer_nanosleep+0xc4/0x180
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff811ac258>] vfs_write+0xb8/0x1a0
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff811acb51>] sys_write+0x51/0x90
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8155f60e>] ? do_device_not_available+0xe/0x10
Jun  9 11:39:14 dcmgm-proxmox1 kernel: [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
Jun  9 11:40:23 dcmgm-proxmox1 pvedaemon[3657]: worker 40643 finished
Jun  9 11:40:23 dcmgm-proxmox1 pvedaemon[3657]: starting 1 worker(s)
Jun  9 11:40:23 dcmgm-proxmox1 pvedaemon[3657]: worker 44937 started
Jun  9 11:40:39 dcmgm-proxmox1 fenced[3005]: fence dcmgm-proxmox2 success
Jun  9 11:40:40 dcmgm-proxmox1 rgmanager[44960]: [pvevm] CT 151 is not running

kobuki · Jun 9, 2014

This is interesting. On a complete power-down or PSU failure the BMC is powered off too, so the iLO fence agent will always fail, since it cannot reach the BMC on the target board. While using an integrated HW agent such as HP's iLO might be very convenient, it seems it has these drawbacks too... Even if you add some checks in the fence agent, like pinging the BMC's IP, you cannot tell whether it's a faulty BMC, a complete power-off or a broken network path somewhere. For testing, though, you could do a simple ping check and, if it fails, return zero from the fence agent (assuming that a ping failure means the target server is offline/unavailable).
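
A minimal sketch of that ping-check idea, for a testbed only. It is not a real fence agent: the function names, the `run_real_agent` callback and the BMC address are all hypothetical, and treating "BMC unreachable" as "node is off" is exactly the unsafe assumption described above.

```python
import subprocess
from typing import Callable


def bmc_reachable(ip: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if the BMC answered at least one ICMP echo request.

    Uses the Linux `ping` command (-c packet count, -W per-reply timeout
    in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def fence_exit_code(
    bmc_ip: str,
    run_real_agent: Callable[[], int],
    is_reachable: Callable[[str], bool] = bmc_reachable,
) -> int:
    """Decide the exit code a wrapper around the real agent would return.

    0 means "fencing succeeded" to the caller.
    """
    if not is_reachable(bmc_ip):
        # ASSUMPTION (testbed only): an unreachable BMC means the node has
        # no power. A faulty BMC or a broken network path looks identical.
        return 0
    # The BMC answers, so let the real agent (hypothetical callback here)
    # do the actual power-off and pass its exit code through.
    return run_real_agent()
```

With the reachability check injected as a parameter, the decision logic can be exercised without sending any packets, for example `fence_exit_code("192.0.2.10", lambda: 1, is_reachable=lambda ip: False)`.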

cesarpk · Jun 9, 2014

The problem with iLO, or any other management card integrated into the server, is that if the server has no electric power, fencing will not work, because the fence system can't connect to the integrated card to give the power-off order.

About this, I see several alternatives:

1- Configure two fence methods: "ILO" and "Manual fence". If "ILO" doesn't work, "Manual fence" can be the alternative method (it requires a bit of human intervention). "Manual fence" is officially supported by RHEL. (I have always configured this option and it works well: I first cut the electric power to the PVE node with problems, and only after that do I run, from the CLI of a PVE node that is alive, a command to apply fencing to the node with problems.)

2- Move to a more robust system: PDUs (also known as power switches), for example from the brand APC. These devices have a power inlet for their own supply, an RJ-45 port, and several power outlets for the computers, so if your PVE node fails for any reason, the PDU will still do its job, regardless of what failed in the server.

But what happens if the PDU also fails (for example, the PDU is broken)?
Although it is unlikely, you can have two PDUs, and the fence system can have two fence methods, where the second is the alternative if the first fails.

3- Combinations of fence methods (maybe for economic reasons): you can combine the fence methods in several ways, for example:
3A- IPMI (or ILO) + Manual fence
3B- PDU + Manual fence
3C- IPMI (or ILO) + PDU
3D- PDU + other PDU
3E- PDU + other PDU + manual fence
3F- IPMI (or ILO) + PDU + manual fence
etc.
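
The idea behind these combinations (try the first method, fall back to the next one if it fails) can be sketched roughly as below. This is only an illustrative model, not how the fence daemon is implemented; `fence_node`, the method names and the callbacks are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fence method is a (name, attempt) pair; attempt(node) returns True
# when that method managed to fence the node.
FenceMethod = Tuple[str, Callable[[str], bool]]


def fence_node(node: str, methods: Sequence[FenceMethod]) -> Optional[str]:
    """Try each fence method in order.

    Returns the name of the first method that succeeds, or None if every
    method failed (in which case the node is still not fenced).
    """
    for name, attempt in methods:
        if attempt(node):
            return name
    return None


# Combination 3C from the list: iLO first, PDU as the fallback.
# The lambdas stand in for real agents: the iLO has no power, the PDU works.
methods = [
    ("ilo", lambda node: False),
    ("pdu", lambda node: True),
]
used = fence_node("dcmgm-proxmox2", methods)
```

Here `used` ends up as `"pdu"`: the second method covers the case where the first cannot reach its device.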

See this link:
https://pve.proxmox.com/wiki/Fencing

Best regards
Cesar

cesarpk · Jun 9, 2014

Just as a comment:

Why use GlusterFS and not DRBD? With only a few storage nodes, DRBD will be much quicker.

In this link I talk about the advantages of using DRBD instead of iSCSI, but with few storage nodes the same applies to GlusterFS:
http://forum.proxmox.com/threads/18699-ISCSI-Storage-with-LVM-Partition?p=95661#post95661

Best regards
Cesar

mir · Jun 9, 2014

An alternative to manual fencing could be to use a managed switch as a last resort. This one works with ping, and in case of ping failure it will deactivate the port(s) to the failed PVE host and start the running VMs and CTs elsewhere. Perfectly safe and works always.

cesarpk · Jun 9, 2014

mir said:
An alternative to manual fencing could be to use a managed switch as a last resort. This one works with ping, and in case of ping failure it will deactivate the port(s) to the failed PVE host and start the running VMs and CTs elsewhere. Perfectly safe and works always.

Excellent idea, mir!!!

Can you post an example of the cluster.conf file here (or any other files, if necessary) to show us the complete configuration?

Best regards
Cesar

Edit: and how do you re-enable the deactivated port?

mir · Jun 9, 2014

Everything can be read here: https://pve.proxmox.com/wiki/Fencing#Fencing_using_a_managed_switch

fknorn · Jun 10, 2014

Gentlemen, thank you for your interesting suggestions. I'm really grateful for those and will look into both of them later today.

However, and I guess that's a more general question --- why doesn't the cluster just "carry on" (i.e. relocate the CTs) even if the fencing returns an error? Or is there a way of configuring that?

Ideally, why can't the system just accept, say after three fencing attempts, that this is going nowhere, and as last resort fire up the dead CTs somewhere else?

tom · Jun 10, 2014

fknorn said:
...
Ideally, why can't the system just accept, say after three fencing attempts, that this is going nowhere, and as last resort fire up the dead CTs somewhere else?

If you do that you put your data at risk, as there is no guarantee that only one node runs the container. So a successful fencing is the essential part.
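
A toy model of that risk, with a hypothetical function name and no relation to rgmanager's actual code: it counts how many nodes end up running the same container once it is started on a surviving node.

```python
def nodes_running_ct(lost_node_still_running: bool, fencing_succeeded: bool) -> int:
    """Count the nodes running the CT after it is started on a survivor."""
    running = 1  # the surviving node that now starts the CT
    # A successful fence guarantees the lost node is powered off. Without
    # it, a node that is merely unreachable may still be running the CT
    # and writing to the shared storage.
    if lost_node_still_running and not fencing_succeeded:
        running += 1
    return running


# Survivors cannot observe lost_node_still_running; all they see is a node
# that stopped answering. Only a successful fence rules out the 2-node case.
power_cord_pulled = nodes_running_ct(False, fencing_succeeded=False)    # 1
network_split_unfenced = nodes_running_ct(True, fencing_succeeded=False)  # 2
network_split_fenced = nodes_running_ct(True, fencing_succeeded=True)     # 1
```

From the surviving nodes' point of view the first two cases look the same, which is why giving up after a few failed fencing attempts and starting the CT anyway is not safe.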
