What is the preferred and safe method to prevent a node from rebooting due to (observed) loss of quorum? Or in other words: how to temporarily disable fencing on a node?
I won't bore you with the details: it's about a bug related to corosync which you guys seem to be clueless about, so I have to resolve it myself.
Nov 15 16:36:20 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] got getinfo request on 0x55fd73af0580 for node 4
Nov 15 16:36:24 node01 corosync[2941]: [VOTEQ ] getinfo response error: 1
Nov 15 16:36:30 node01 pvestatd[41380]: status update time (16.181 seconds)
Nov 15 16:36:47 node01 pvestatd[41380]: status update time (16.192 seconds)
Nov 15 16:37:03 node01 pvestatd[41380]: status update time (16.188 seconds)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] HUP conn (2941-19005-25)
Nov 15 16:37:14 node01 corosync[2941]: [QB ] qb_ipcs_disconnect(2941-19005-25) state:2
corosync-cmapctl
No resolution though. Do you believe that opening a bug instead of asking the very same persons here would result in better answers?
Granted, it may be a network problem, or a kernel problem, or whatever problem.
and without being able to tell fencing "for the next 30 minutes, stop rebooting the machine, for fuck's sake", it's indescribably annoying to touch any part related to the clustering code, especially on the production machines.
systemctl stop pve-ha-lrm.service pve-ha-crm.service
Here corosync either a) hangs or b) does not get scheduled for quite some time.
10:23:06.155779 socket(PF_LOCAL, SOCK_STREAM, 0) = 3
10:23:06.155798 fcntl(3, F_GETFD) = 0
10:23:06.155811 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
10:23:06.155824 fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
10:23:06.155842 connect(3, {sa_family=AF_LOCAL, sun_path=@"cmap"}, 110) = 0
10:23:06.155874 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
10:23:06.155890 sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
10:23:06.155908 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
10:23:06.155923 recvfrom(3, 0x7ffc7b6c8cc0, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
10:23:06.155940 poll([{fd=3, events=POLLIN}], 1, 4294967295) = 1 ([{fd=3, revents=POLLIN}])
10:23:06.157842 recvfrom(3, "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\20\0@\3665\v]U\0\0cmap-request-3259-29841-22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
10:23:06.157868 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd3b3c05000
10:23:06.157896 open("/dev/shm/qb-cmap-request-3259-29841-22-header", O_RDWR) = 4
10:23:06.157915 ftruncate(4, 8252) = 0
10:23:06.157931 mmap(NULL, 8252, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7fd3b3d12000
10:23:06.157952 open("/dev/shm/qb-cmap-request-3259-29841-22-data", O_RDWR) = 5
10:23:06.157967 ftruncate(5, 1052672) = 0
[...]
10:23:08.156786 sendto(3, "\30", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156804 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156822 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156840 write(1, "internal_configuration.service.4.name (str) = corosync_pload\n", 61) = 61
10:23:08.156857 sendto(3, "\30", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156874 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156892 futex(0x7fd3b3d11010, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1487150590, 155911684}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
10:23:10.156078 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
10:23:10.156114 write(1, "internal_configuration.service.4.ver (u32) = 0\n", 47) = 47
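(A trace like the one above can be captured with something along these lines; the exact strace invocation is only an example, not necessarily the one used here:)
Code:
# trace corosync-cmapctl with microsecond timestamps, saving the trace to a file
strace -tt -o /tmp/corosync-cmapctl.trace corosync-cmapctl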
Can you send me the output of
Code:corosync-cmapctl
That's exactly the reason I didn't want to open a bug report.
But as your case is really vague, has no indicators to reproduce the problem, and we do not know your setup, it may not yet be the time for such a report.
Do you share the cluster network with a storage network, or another heavily used network?
Use:
Code:systemctl stop pve-ha-lrm.service pve-ha-crm.service
With this, the LRM freezes its services and closes the watchdog gracefully. If the CRM held the manager lock, it releases it and closes the watchdog.
After that you may even stop the watchdog-mux service, as no other PVE service opens connections to it.
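Put together, a minimal sketch of the whole maintenance window could look like the following; the optional watchdog-mux stop and the start commands at the end are just the obvious reverse steps, not an official procedure:
Code:
# freeze HA services and close the watchdog connections gracefully
systemctl stop pve-ha-lrm.service pve-ha-crm.service
# optional: stop the watchdog multiplexer as well
systemctl stop watchdog-mux.service
# ... perform the risky maintenance ...
# bring HA back afterwards
systemctl start watchdog-mux.service pve-ha-crm.service pve-ha-lrm.service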
10:23:08.156874 sendto(3, " ", 1, MSG_NOSIGNAL, NULL, 0) = 1
10:23:08.156892 futex(0x7fd3b3d11010, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1487150590, 155911684}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
So the last point I was able to track down was the *cmap* service answering slowly, while various corosync code waits through its repeated timeouts.
Here it comes (a bit obfuscated): http://pastebin.com/PD2dSFx2
As a side note, it seems to be timing related: when stracing it, many runs show no timeouts at all, or just 2-3, and the results just pop out, while running it on the same node without strace it times out on every single line.
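(A crude way to compare the two cases, given only as an illustration, is to time the command with and without strace:)
Code:
# wall-clock time without tracing
time corosync-cmapctl > /dev/null
# wall-clock time under strace (summary mode)
time strace -c corosync-cmapctl > /dev/null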
They are actively used in the absolute sense, but the traffic is below 1 Gbps on 10 Gbps links, so no, relatively they are almost empty. There are multiple 802.3ad (LAG/bond) links, but I know of no problems related to that, despite what one has suggested. At least both ping and omping are happy, without limits.
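(For reference, the kind of omping check meant here looks roughly like this; node names and packet counts are only placeholders:)
Code:
# multicast/latency test between the cluster nodes, run on each node in parallel
omping -c 10000 -i 0.001 -F -q node01 node02 node03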
root@px-c1n3:~# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
pve-kernel-4.4.21-1-pve: 4.4.21-71
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.0-10
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 0.94.9-1~bpo80+1
I have exactly the same problem on two different clusters: one of them has separate cluster network interfaces and a fast system drive, the second has shared cluster network interfaces (but usage is very low) and a slow SATA DOM - it fenced during a dist-upgrade.
journalctl --since="2017-02-22 18:17" --until="2017-02-28 15:00"
# or
journalctl --since "-4days" --until "-1day"
We also faced this issue a few times in the past.
A good solution would be to have a "disable HA" (and fencing) button or CLI function for upgrade purposes.
You can migrate all VMs managed by HA away from that node to avoid fencing before upgrading, if possible.
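(For example, an HA-managed VM can be moved off the node beforehand via the HA stack; the service ID and target node below are placeholders:)
Code:
# ask the HA manager to migrate the HA-managed VM 101 to another node
ha-manager migrate vm:101 px-c1n2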
Use:
Code:systemctl stop pve-ha-lrm.service pve-ha-crm.service
With this, the LRM freezes its services and closes the watchdog gracefully. If the CRM held the manager lock, it releases it and closes the watchdog.
After that you may even stop the watchdog-mux service, as no other PVE service opens connections to it.
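(If in doubt whether anything still holds the watchdog open before stopping watchdog-mux, a quick check like this may help; this is my own habit, not part of the advice above:)
Code:
# normally only watchdog-mux itself should hold the hardware watchdog device
lsof /dev/watchdog
systemctl stop watchdog-mux.service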
We already freeze the LRM and its services during an upgrade of the HA Manager, so that no fence recovery action can happen.
But what can happen during an upgrade, once the HA Manager itself is already upgraded, is that the services get thawed again and then another (heavy) package upgrade, maybe combined with some load from virtual guests, renders the node slow to unusable. As the LRM's watchdog is already active at that point, this can result in a self-fence action.
But that would require some heavy (I/O?) overload; otherwise it should not happen that a process does not get scheduled for over a minute.
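(Whether the local LRM is currently active or idle can be checked before such a heavy upgrade, e.g. with the HA manager status command; just a general hint, not something required by the above:)
Code:
# shows quorum, the current master and the per-node LRM state
ha-manager status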
Do you have some logs from around that time? `journalctl` accepts since/until timespans, which could help confirm my theory; use for example:
Code:
journalctl --since="2017-02-22 18:17" --until="2017-02-28 15:00"
# or
journalctl --since "-4days" --until "-1day"
Setting up qemu-server (4.0-109) ...
Setting up pve-ha-manager (1.0-40) ...
Job for pve-ha-lrm.service canceled.
Connection to 192.168.41.24 closed by remote host.
Connection to 192.168.41.24 closed.
Feb 28 08:48:25 px-c1n1 systemd-logind[1276]: Failed to abandon session scope: Connection reset by peer
Feb 28 08:48:25 px-c1n1 systemd-logind[1276]: Failed to abandon session scope: Transport endpoint is not connected
Feb 28 08:48:25 px-c1n1 kernel: IPMI Watchdog: Unexpected close, not stopping watchdog!
Feb 28 09:01:25 px-c1n4 systemd[1]: Stopping PVE Local HA Ressource Manager Daemon...
Feb 28 09:01:25 px-c1n4 pve-ha-lrm[1867]: received signal TERM
Feb 28 09:01:25 px-c1n4 pve-ha-lrm[1867]: restart LRM, freeze all services
Feb 28 09:03:00 px-c1n4 systemd[1]: pve-ha-lrm.service stopping timed out. Terminating.
Feb 28 09:03:00 px-c1n4 pve-ha-lrm[1867]: received signal TERM
Feb 28 09:04:35 px-c1n4 systemd[1]: pve-ha-lrm.service stop-sigterm timed out. Killing.
Feb 28 09:04:35 px-c1n4 watchdog-mux[1247]: client did not stop watchdog - disable watchdog updates
Feb 28 09:04:35 px-c1n4 systemd[1]: pve-ha-lrm.service: main process exited, code=killed, status=9/KILL
Feb 28 09:04:35 px-c1n4 systemd[1]: Stopped PVE Local HA Ressource Manager Daemon.
Feb 28 09:04:35 px-c1n4 systemd[1]: Unit pve-ha-lrm.service entered failed state.
Feb 28 09:04:35 px-c1n4 systemd[1]: Stopping PVE Cluster Ressource Manager Daemon...
Feb 28 09:04:36 px-c1n4 pve-ha-crm[1864]: received signal TERM
Feb 28 09:04:36 px-c1n4 pve-ha-crm[1864]: server received shutdown request
Feb 28 09:04:36 px-c1n4 pve-ha-crm[1864]: server stopped
Feb 28 09:04:37 px-c1n4 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon.
Feb 28 09:04:48 px-c1n4 systemd-logind[1275]: Power key pressed.
Feb 28 09:04:48 px-c1n4 systemd-logind[1275]: Powering Off...
Feb 28 09:04:48 px-c1n4 systemd-logind[1275]: System is powering down.
Connection to 192.168.41.24 closed by remote host.
Connection to 192.168.41.24 closed.