If you do not have redundant switches and network ports for your cluster communications, this is an issue you will want to be aware of.
My cluster communicates over bonded interfaces through redundant switches, each with redundant power supplies connected to redundant power feeds.
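For reference, a bonded interface on a RHEL-style system looks roughly like this. This is a hypothetical sketch, not my actual config; the address and miimon values are assumptions.

```shell
# Hypothetical /etc/sysconfig/network-scripts/ifcfg-bond0 (RHEL-style).
# active-backup mode means either switch can fail without losing the link.
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.10.5
NETMASK=255.255.255.0
BONDING_OPTS="mode=active-backup miimon=100"
```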
So hopefully this issue never happens to me in real life. But we all know that sometimes things go wrong, so I wanted to know what would happen if both of my switches failed at the same time.
So I disconnected all the network cables.
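If you want to reproduce this without physically pulling cables, downing the bonded interface should have the same effect. A sketch, assuming the cluster traffic runs over bond0 (the interface name is an assumption); requires root.

```shell
# Simulate a dual-switch failure in software: take down the bond,
# so all cluster communication disappears at once.
ip link set bond0 down
# Leave the node isolated long enough for the cluster to lose quorum.
sleep 120
# The software equivalent of reconnecting the cables.
ip link set bond0 up
```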
The first obvious thing is that quorum is lost; that was expected.
But then some unexpected things happened when I reconnected all the cables.
Quorum came back, but rgmanager is missing:
Code:
# clustat
Timed out waiting for a response from Resource Group Manager
Cluster Status for kmitestcluster @ Mon Feb 27 12:28:35 2012
Member Status: Quorate
Member Name     ID   Status
------ ----     ---- ------
vm5             1    Online
vm6             2    Online, Local
disaster        3    Offline
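A quick way to confirm rgmanager is wedged rather than merely slow is to check whether its process is stuck in uninterruptible sleep (STAT "D"), which no signal can interrupt. A sketch; exact output will vary.

```shell
# Ask init for the daemon's status, then look at the process state.
service rgmanager status
# A "D" in the STAT column means uninterruptible sleep: kill -9 won't help.
ps axo pid,stat,comm | grep '[r]gmanager'
```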
Then, after a few minutes, I get kernel messages:
Code:
INFO: task rgmanager:3378 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rgmanager D ffff8806310f2c80 0 3378 1 0 0x00000000
ffff88062d7db9d0 0000000000000046 0000000000000000 ffff88062d7db948
000000018109b3dd ffff880000000000 ffff88063fc0c300 ffff88004179e200
0000000000000069 ffff8806310f3220 ffff88062d7dbfd8 ffff88062d7dbfd8
Call Trace:
[<ffffffff8104d31d>] ? check_preempt_curr+0x6d/0x90
[<ffffffff814ff235>] rwsem_down_failed_common+0x95/0x1d0
[<ffffffff814ff3c6>] rwsem_down_read_failed+0x26/0x30
[<ffffffff8126b8c4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff814fe8b4>] ? down_read+0x24/0x30
[<ffffffffa052872d>] dlm_clear_proc_locks+0x3d/0x2a0 [dlm]
[<ffffffff811a63ef>] ? destroy_inode+0x4f/0x60
[<ffffffff811a26a5>] ? __d_free+0x45/0x60
[<ffffffffa0533c66>] device_close+0x66/0xc0 [dlm]
[<ffffffff8118cea5>] __fput+0xf5/0x280
[<ffffffff8118d055>] fput+0x25/0x30
[<ffffffff811885dd>] filp_close+0x5d/0x90
[<ffffffff8106dbbf>] put_files_struct+0x7f/0xf0
[<ffffffff8106dc83>] exit_files+0x53/0x70
[<ffffffff8106f86d>] do_exit+0x1ad/0x920
[<ffffffff81070038>] do_group_exit+0x58/0xd0
[<ffffffff81086606>] get_signal_to_deliver+0x1f6/0x470
[<ffffffff8100a335>] do_signal+0x75/0x800
[<ffffffff8125e131>] ? cpumask_any_but+0x31/0x50
[<ffffffff810b2ddb>] ? sys_futex+0x7b/0x170
[<ffffffff8100ab50>] do_notify_resume+0x90/0xc0
[<ffffffff8100b451>] int_signal+0x12/0x17
I attempted all sorts of things to recover from this, but the bottom line is that rgmanager cannot be stopped, so there is no way to recover other than rebooting every single node one by one.
Turn off as many services as possible. Then run reboot twice; the first run will not complete, since it hangs trying to stop rgmanager.