Temporarily disable out-of-the-box self-fencing

What is the preferred and safe method to prevent a node from rebooting due to (observed) loss of quorum? Or in other words: how to temporarily disable fencing on a node?

If you use HA, you need at least 3 nodes.

That means if you reboot one node, you still have 2 votes and therefore quorum. So your case cannot happen.
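For reference, the current vote count and quorum state can be checked at any time, e.g. with:
Code:
pvecm status

With one of three nodes rebooting, `Total votes` should drop to 2 while the required `Quorum: 2` is still met, so the remaining nodes stay quorate.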
 
Oh please, don't. I can detail how it can f*ck itself over big time, apart from the bug you fixed around 4.4, but really, the question wasn't why I need it, it was how it should be done when I do need it.

[Guess what, today another node rebooted because that rotten systemd didn't start corosync (due to other problems which really aren't relevant since they're my problem, not yours; I won't bore you with the details of a corosync bug you guys seem to be clueless about, so I have to resolve it myself, but I really need the machines not to reboot while I'm typing), and that's what triggered the question. I could obviously go out, check the fence code, or write a daemon pinging softdog, but I had hoped you could come up with something better. :-(]
 
But the easiest example for you: upgrading the hosts. If you upgrade them all at once, the pve-cluster (and related packages) upgrade almost always triggers a reboot. Today's wasn't that case, but we have had full cluster reboots due to upgrades way too many times in the past.
 
1) If I dist-upgrade and the disk is not magically fast, there may be more than 60 seconds between upgrade start (stopping the daemon) and setup finish (starting the daemon), which causes a reboot. When I upgrade all 3 nodes at once, corosync may lose quorum for many minutes.

Why? I have asked the same:

2) https://forum.proxmox.com/threads/all-functions-became-slooow-corosync-problem.30332/

No resolution though. Do you believe that opening a bug instead of asking the very same people here would result in better answers?

That depends on the speed and utilisation of the hosts, the number of packages to upgrade, and possibly the phase of the moon. We have had plenty of reboots in the past year due to softdog fencing, as I've mentioned. Also, when there are 2 active nodes but corosync decides to lose quorum, the nodes start playing the "who can reboot faster than corosync notices" game ad infinitum. All of them.

Granted, there may be a network problem, or a kernel problem, or whatever problem. I don't believe so, omping doesn't see anything, and I don't see anything (yet), but without knowing the problem and without being able to tell fencing "for the next 30 minutes, stop rebooting the machine for f*ck's sake", it's indescribably annoying to touch any part related to the clustering code, especially on the prod machines.
 
I won't bore you with the details of a corosync bug you guys seem to be clueless about, so I have to resolve it myself

Do you mean the "slooow corosync" problem from this post:
https://forum.proxmox.com/threads/all-functions-became-slooow-corosync-problem.30332/
The output:
Code:
Nov 15 16:36:20 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1

does not mean there was an error; the "error" number 1 corresponds to CS_OK, which means success.

Code:
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 4
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1

Nov 15 16:36:30 node01 pvestatd[41380]: status update time (16.181 seconds)

Nov 15 16:36:47 node01 pvestatd[41380]: status update time (16.192 seconds)

Nov 15 16:37:03 node01 pvestatd[41380]: status update time (16.188 seconds)

Nov 15 16:37:14 node01 corosync[2941]: [QB ] HUP conn (2941-19005-25)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] qb_ipcs_disconnect(2941-19005-25) state:2

Here corosync either a) hangs or b) does not get scheduled for quite some time.
As pvestatd needs quite some time too, b) could be plausible, but pvestatd could also be hanging together with corosync.
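A rough way to tell the two cases apart would be to watch whether corosync gets any CPU time while the hang happens, for example with pidstat (from the sysstat package; just a sketch, not something we normally ask for):
Code:
# sample corosync's CPU usage once per second for 10 seconds
pidstat -p $(pidof corosync) 1 10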

Can you send me the output of
Code:
corosync-cmapctl

I'd like to give it a look. If it contains sensitive information, just send it as a private message here or to my email, which is this username [at] proxmox.com.

No resolution though. Do you believe that opening a bug instead of asking the very same people here would result in better answers?

Normally it does; forum posts can get lost more easily than bug reports, even more so if it's a problem reported only once by a single person. There's a reason projects use bug trackers and forums side by side :)
But as your case is really vague, has no indicators for reproducing the problem, and we do not know your setup, maybe it isn't yet time for such a report.


Granted, there may be a network problem, or a kernel problem, or whatever problem.

Do you share the cluster network with storage traffic or another heavily used network?

and without being able to tell fencing "for the next 30 minutes, stop rebooting the machine for f*ck's sake", it's indescribably annoying to touch any part related to the clustering code, especially on the prod machines.

Use:
Code:
systemctl stop pve-ha-lrm.service pve-ha-crm.service

With this, the LRM freezes its services and closes the watchdog gracefully. If the CRM held the manager lock, it releases it and closes the watchdog as well.
After that you may even stop the watchdog-mux service, as no other PVE service opens connections to it.
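Put together, a maintenance window could then look roughly like this (just a sketch using the service names mentioned above):
Code:
# freeze HA services and close the watchdog gracefully
systemctl stop pve-ha-lrm.service pve-ha-crm.service
# optional, only if nothing else on the node still uses the watchdog multiplexer
systemctl stop watchdog-mux.service

# ... do the corosync/cluster maintenance or upgrade ...

systemctl start watchdog-mux.service
systemctl start pve-ha-crm.service pve-ha-lrm.service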
 
Here corosync either a) hangs or b) does not get scheduled for quite some time.

Last time I tried to follow where it hangs; here's an strace fragment:

Code:
10:23:06.155779 socket(PF_LOCAL, SOCK_STREAM, 0) = 3
10:23:06.155798 fcntl(3, F_GETFD)       = 0
10:23:06.155811 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
10:23:06.155824 fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
10:23:06.155842 connect(3, {sa_family=AF_LOCAL, sun_path=@"cmap"}, 110) = 0
10:23:06.155874 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
10:23:06.155890 sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
10:23:06.155908 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
10:23:06.155923 recvfrom(3, 0x7ffc7b6c8cc0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
10:23:06.155940 poll([{fd=3, events=POLLIN}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
10:23:06.157842 recvfrom(3, "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\20\0@\3665\v]U\0\0cmap-request-3259-29841-22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
10:23:06.157868 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3b3c05000
10:23:06.157896 open("/dev/shm/qb-cmap-request-3259-29841-22-header", O_RDWR) = 4
10:23:06.157915 ftruncate(4, 8252)      = 0
10:23:06.157931 mmap(NULL, 8252, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7fd3b3d12000
10:23:06.157952 open("/dev/shm/qb-cmap-request-3259-29841-22-data", O_RDWR) = 5
10:23:06.157967 ftruncate(5, 1052672)   = 0

[...]

10:23:08.156786 sendto(3, "\30", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156804 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156822 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156840 write(1, "internal_configuration.service.4.name (str) = corosync_pload\n", 61) = 61
10:23:08.156857 sendto(3, "\30", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156874 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156892 futex(0x7fd3b3d11010, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1487150590, 155911684}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

10:23:10.156078 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
10:23:10.156114 write(1, "internal_configuration.service.4.ver (u32) = 0\n", 47) = 47

So the last point I was able to track down was the *cmap* service answering slowly, with various bits of corosync code waiting out its repeated timeouts.
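(For completeness: a timestamped trace like the above can be captured with something along the lines of the following; the output path is just a placeholder.)
Code:
# trace corosync-cmapctl with microsecond timestamps, writing to a file
strace -tt -o /tmp/cmapctl.trace corosync-cmapctl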

Can you send me the output of
Code:
corosync-cmapctl

Here it comes (a bit obfuscated): http://pastebin.com/PD2dSFx2

On a problem node this basically times out on every single line. :)

As a side note, it seems to be timing related: when stracing it, many runs show no timeouts at all (or just 2-3) and the results just pop out, while running it on the same node without strace it times out on every single line.

But as your case is really vague, has no indicators for reproducing the problem, and we do not know your setup, maybe it isn't yet time for such a report.
That's exactly the reason I didn't want to open a bug report.

Do you share the cluster network with storage traffic or another heavily used network?
They are actively used in the absolute sense, but the traffic is below 1 Gbps on 10 Gbps links, so no, relatively speaking it's almost empty. There are multiple links in an 802.1ad (LAG/bond) setup, but I know of no problems related to that, as someone suggested there might be. At least both ping and omping are happy, without limits.
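For reference, the omping run was nothing fancy, essentially something like this (hostnames are placeholders for the cluster nodes):
Code:
# ~10 minutes of multicast pings between all cluster nodes, quiet summary output
omping -c 600 -i 1 -q node1 node2 node3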


Use:
Code:
systemctl stop pve-ha-lrm.service pve-ha-crm.service

With this, the LRM freezes its services and closes the watchdog gracefully. If the CRM held the manager lock, it releases it and closes the watchdog as well.
After that you may even stop the watchdog-mux service, as no other PVE service opens connections to it.

This sounds reassuring, I will try it the next time I have to poke at corosync. Thank you.
 
10:23:08.156874 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156892 futex(0x7fd3b3d11010, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1487150590, 155911684}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
So the last point I was able to track down was the *cmap* service answering slowly, with various bits of corosync code waiting out its repeated timeouts.

Looks like a problem with IPC. I saw that a library corosync uses got a stable update release a little while ago; I'm testing it here currently and can make you a package if you want.
I'm poking around further in the IPC connection between corosync and cmap, but without a reproducible test it's hard, although your strace helps.
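(The library in question is presumably libqb, which corosync uses for its IPC; the version currently installed can be checked with, for example:)
Code:
# show the installed and candidate versions of corosync's IPC library
apt-cache policy libqb0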

Here it comes (a bit obfuscated): http://pastebin.com/PD2dSFx2

Looks OK, which certainly does not help in understanding what's happening at all... :/

As a side note, it seems to be timing related: when stracing it, many runs show no timeouts at all (or just 2-3) and the results just pop out, while running it on the same node without strace it times out on every single line.

Do you have time synchronization up and running? This could be a time-drift-related issue.
*Edit: scratch that, it should not be an NTP issue as it hangs on local IPC...

They are actively used in the absolute sense, but the traffic is below 1 Gbps on 10 Gbps links, so no, relatively speaking it's almost empty. There are multiple links in an 802.1ad (LAG/bond) setup, but I know of no problems related to that, as someone suggested there might be. At least both ping and omping are happy, without limits.

Should be no problem, as long as the (multicast) latency is below 2 ms.
 
I have exactly the same problem on two different clusters: one of them has separate cluster network interfaces and a fast system drive, the second has shared cluster network interfaces (but usage is very low) and a slow SATA DOM - it fences during dist-upgrade.

I also found another strange problem on 4.4: lost quorum and all nodes restarted while migrating all VMs from one host to another.

All interfaces are aggregated via 802.3ad, with hashing at layer 3+4.

Code:
root@px-c1n3:~# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.4.21-1-pve: 4.4.21-71
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.9-1~bpo80+1
 
We also faced this issue a few times in the past.
A good solution would be to have a "disable HA" (and fencing) button or CLI function for upgrade purposes.

You can migrate all VMs managed by HA away from that node to avoid fencing before upgrading, if possible.
 
I have exactly the same problem on two different clusters: one of them has separate cluster network interfaces and a fast system drive, the second has shared cluster network interfaces (but usage is very low) and a slow SATA DOM - it fences during dist-upgrade.

We already freeze the LRM and its services during an upgrade of the HA manager, so that no fence/recovery action can happen.
But what can happen during an upgrade, once the HA manager itself is already upgraded, is that the services get thawed again and then another (heavy) upgrade, maybe combined with some load from virtual guests, renders the node slow to unusable. As the LRM's watchdog is already active at that point, this can result in a self-fence action.
But that would need some heavy (I/O?) overload; otherwise it should not happen that a process does not get scheduled for over a minute.
Do you have some logs from around that time? `journalctl` accepts since/until timespans, which could help confirm my theory, for example:
Code:
journalctl --since="2017-02-22 18:17" --until="2017-02-28 15:00"
# or
journalctl --since "-4days" --until "-1day"

We also faced this issue a few times in the past.
A good solution would be to have a "disable HA" (and fencing) button or CLI function for upgrade purposes.

You can migrate all VMs managed by HA away from that node to avoid fencing before upgrading, if possible.

You can already do something similar with the CLI, as mentioned earlier in this thread:
Use:
Code:
systemctl stop pve-ha-lrm.service pve-ha-crm.service
With this, the LRM freezes its services and closes the watchdog gracefully. If the CRM held the manager lock, it releases it and closes the watchdog as well.
After that you may even stop the watchdog-mux service, as no other PVE service opens connections to it.

And the same can be done via the web UI: go to the node's tree entry and select the `System` tab; there you can stop both `pve-ha-crm` and `pve-ha-lrm`.
Do the upgrade and then start them again.
 
We already freeze the LRM and its services during an upgrade of the HA manager, so that no fence/recovery action can happen.
But what can happen during an upgrade, once the HA manager itself is already upgraded, is that the services get thawed again and then another (heavy) upgrade, maybe combined with some load from virtual guests, renders the node slow to unusable. As the LRM's watchdog is already active at that point, this can result in a self-fence action.
But that would need some heavy (I/O?) overload; otherwise it should not happen that a process does not get scheduled for over a minute.
Do you have some logs from around that time? `journalctl` accepts since/until timespans, which could help confirm my theory, for example:
Code:
journalctl --since="2017-02-22 18:17" --until="2017-02-28 15:00"
# or
journalctl --since "-4days" --until "-1day"

I always migrate all VMs to another node before an upgrade, so the HA manager has nothing to do and the load is nearly 0.

Unfortunately, by default Proxmox/Debian has no persistent log via journalctl, so I have no logs from before the current boot (I have now turned it on according to the Debian manual).
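For anyone else hitting this: one way to enable a persistent journal on Debian is simply creating the journal directory (a sketch, assuming the stock journald configuration with Storage=auto):
Code:
# journald starts storing logs persistently once /var/log/journal exists
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
systemctl restart systemd-journald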

Last log on the console:
Code:
Setting up qemu-server (4.0-109) ...
Setting up pve-ha-manager (1.0-40) ...
Job for pve-ha-lrm.service canceled.
Connection to 192.168.41.24 closed by remote host.
Connection to 192.168.41.24 closed.

Anyway, I tried to manually run systemctl stop pve-ha-lrm.service pve-ha-crm.service: on one node there were no problems, but on two others, after a moment:
Code:
Feb 28 08:48:25 px-c1n1 systemd-logind[1276]: Failed to abandon session scope: Connection reset by peer
Feb 28 08:48:25 px-c1n1 systemd-logind[1276]: Failed to abandon session scope: Transport endpoint is not connected
Feb 28 08:48:25 px-c1n1 kernel: IPMI Watchdog: Unexpected close, not stopping watchdog!

Code:
Feb 28 09:01:25 px-c1n4 systemd[1]: Stopping PVE Local HA Ressource Manager Daemon...
Feb 28 09:01:25 px-c1n4 pve-ha-lrm[1867]: received signal TERM
Feb 28 09:01:25 px-c1n4 pve-ha-lrm[1867]: restart LRM, freeze all services
Feb 28 09:03:00 px-c1n4 systemd[1]: pve-ha-lrm.service stopping timed out. Terminating.
Feb 28 09:03:00 px-c1n4 pve-ha-lrm[1867]: received signal TERM
Feb 28 09:04:35 px-c1n4 systemd[1]: pve-ha-lrm.service stop-sigterm timed out. Killing.
Feb 28 09:04:35 px-c1n4 watchdog-mux[1247]: client did not stop watchdog - disable watchdog updates
Feb 28 09:04:35 px-c1n4 systemd[1]: pve-ha-lrm.service: main process exited, code=killed, status=9/KILL
Feb 28 09:04:35 px-c1n4 systemd[1]: Stopped PVE Local HA Ressource Manager Daemon.
Feb 28 09:04:35 px-c1n4 systemd[1]: Unit pve-ha-lrm.service entered failed state.
Feb 28 09:04:35 px-c1n4 systemd[1]: Stopping PVE Cluster Ressource Manager Daemon...
Feb 28 09:04:36 px-c1n4 pve-ha-crm[1864]: received signal TERM
Feb 28 09:04:36 px-c1n4 pve-ha-crm[1864]: server received shutdown request
Feb 28 09:04:36 px-c1n4 pve-ha-crm[1864]: server stopped
Feb 28 09:04:37 px-c1n4 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon.
Feb 28 09:04:48 px-c1n4 systemd-logind[1275]: Power key pressed.
Feb 28 09:04:48 px-c1n4 systemd-logind[1275]: Powering Off...
Feb 28 09:04:48 px-c1n4 systemd-logind[1275]: System is powering down.
Connection to 192.168.41.24 closed by remote host.
Connection to 192.168.41.24 closed.

The watchdog used is ipmi_watchdog (Supermicro).
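For reference, which watchdog driver a node actually uses can be checked roughly like this (the /etc/default/pve-ha-manager entry only exists if a hardware watchdog module was configured explicitly):
Code:
# loaded watchdog kernel modules
lsmod | grep -Ei 'ipmi_watchdog|softdog|wdt'
# Proxmox reads the watchdog module to load from here, if set
cat /etc/default/pve-ha-manager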
 