[SOLVED] Constant crashing on one node in a two-node cluster

Deleted member 55253 · Nov 26, 2018

Hello,
I have two Dell servers, both identical hardware.
pve1 at 192.168.1.11, pve2 at 192.168.1.12

One keeps crashing every 30 minutes to 6 hours, so no recurring task is causing this as far as I can tell.
syslog and kern.log don't tell me anything useful, though I do see two common items before each crash:
Syslog, just before the crash:

One crash:

Code:

Nov 26 09:17:19 pve2 corosync[1793]: notice  [TOTEM ] A new membership (192.168.1.12:1520) was formed. Members
Nov 26 09:17:19 pve2 corosync[1793]: warning [CPG   ] downlist left_list: 0 received
Nov 26 09:17:19 pve2 corosync[1793]: notice  [QUORUM] Members[1]: 2
Nov 26 09:17:19 pve2 corosync[1793]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:17:19 pve2 corosync[1793]:  [TOTEM ] A new membership (192.168.1.12:1520) was formed. Members
Nov 26 09:17:19 pve2 corosync[1793]:  [CPG   ] downlist left_list: 0 received
Nov 26 09:17:19 pve2 corosync[1793]:  [QUORUM] Members[1]: 2
Nov 26 09:17:19 pve2 corosync[1793]:  [MAIN  ] Completed service synchronization, ready to provide service.

An earlier crash:

Code:

Nov 26 09:24:05 pve2 corosync[1813]: notice  [TOTEM ] A new membership (192.168.1.12:1620) was formed. Members
Nov 26 09:24:05 pve2 corosync[1813]:  [TOTEM ] A new membership (192.168.1.12:1620) was formed. Members
Nov 26 09:24:05 pve2 corosync[1813]: warning [CPG   ] downlist left_list: 0 received
Nov 26 09:24:05 pve2 corosync[1813]: notice  [QUORUM] Members[1]: 2
Nov 26 09:24:05 pve2 corosync[1813]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:24:05 pve2 corosync[1813]:  [CPG   ] downlist left_list: 0 received
Nov 26 09:24:05 pve2 corosync[1813]:  [QUORUM] Members[1]: 2
Nov 26 09:24:05 pve2 corosync[1813]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:24:06 pve2 pvesr[6181]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 26 09:24:06 pve2 corosync[1813]: notice  [TOTEM ] A new membership (192.168.1.12:1624) was formed. Members
Nov 26 09:24:06 pve2 corosync[1813]:  [TOTEM ] A new membership (192.168.1.12:1624) was formed. Members
Nov 26 09:24:06 pve2 corosync[1813]: warning [CPG   ] downlist left_list: 0 received
Nov 26 09:24:06 pve2 corosync[1813]: notice  [QUORUM] Members[1]: 2
Nov 26 09:24:06 pve2 corosync[1813]: notice  [MAIN  ] Completed service synchrNov 26 09:28:17 pve2 systemd-modules-load[453]: Inserted module 'iscsi_tcp'

The two repeating items i see are corosync and items dealing with file-replication_cfg, though I don't know what to make of it or where to begin fixing it.

pveversion reports:

Code:

root@pve2:~# pveversion  -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-41
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-9
pve-firewall: 3.0-14
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve2~bpo1

pvecm status:

Code:

root@pve2:/etc/pve/ha# pvecm status
Quorum information
------------------
Date:             Mon Nov 26 10:12:31 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1/2020
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.11
0x00000002          1 192.168.1.12 (local)

Any help would be appreciated.
Thanks!

Deleted member 55253 · Nov 27, 2018

Further troubleshooting:
Read that this may be caused by multicast issues
Multicast is enabled on my network and the command

Code:

omping -c 10000 -i 0.001 -F -q pve1 pve2

returns 0 errors through multiple retries. Will continue to update in case any one else ever has this problem

Also both machines are timesynced to a domain controller and are identical as far as I can tell.

Stoiko Ivanov · Nov 27, 2018

* do you have HA activated for a ressource in the cluster?
* please post the complete output of omping (the latency is also interesting)

Deleted member 55253 · Nov 28, 2018

I did have HA activated for one container, though I disabled it shortly after

omping output:
pve1:

Code:

omping -c 10000 -i 0.001 -F -q pve1 pve2
pve2 : waiting for response msg
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : waiting for response msg
pve2 : server told us to stop

pve2 :   unicast, xmt/rcv/%loss = 8766/8766/0%, min/avg/max/std-dev = 0.038/0.174/3.782/0.100
pve2 : multicast, xmt/rcv/%loss = 8766/8766/0%, min/avg/max/std-dev = 0.063/0.203/0.703/0.085

pve2:

Code:

omping -c 10000 -i 0.001 -F -q pve1 pve2
pve1 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.038/0.124/3.198/0.067
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.058/0.171/3.224/0.076

I did notice late yesterday -- Restarting corosync on pve1 does fix the problem, even if temporarily.

Stoiko Ivanov · Nov 28, 2018

* make sure you've disabled all ha-resources on the cluster (HA _needs_ 3 nodes at least) - check out our documentation - a 2-node cluster with HA, will hard-reset the node when quorum is lost (as it seems happens in your case)! - https://pve.proxmox.com/pve-docs/chapter-ha-manager.html.
* please also run the longer omping test, to rule out some problems with a missing igmp querier - https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network

Deleted member 55253 · Nov 28, 2018

^ This is exactly what I'm seeing -- Quorom would get lost for 30-ish seconds, then pve2 would crash and reboot. I know back in years prior it was a hard requirement, though now it will allow HA but display a warning. Seems like I need to buy a third R710

Results of the longer omping test:

Code:

root@pve1:~# omping -c 600 -i 1 -q pve1 pve2
pve2 : waiting for response msg
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : given amount of query messages was sent

pve2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.127/0.273/0.582/0.046
pve2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.165/0.324/0.677/0.051

root@pve2:~# omping -c 600 -i 1 -q pve1 pve2
pve1 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.123/0.296/0.669/0.055
pve1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.221/0.373/0.739/0.062

Stoiko Ivanov · Nov 29, 2018

the omping test looks ok - however if you have your corosync traffic on the same interface as your other traffic (VM-network, storage traffic), it still could very well happen, that load on the interface causes packet loss.
We recommend to keep the corosync-traffic on a separate physical interface - see our documentation https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network

Deleted member 55253 · Nov 29, 2018

Ah alright. I did change the architecture a bit recently -- I used to have a 1G connection to the LAN and a 10G connections direct to storage -- Now it's a single 10G for everything -- the old 1G really wasn't terribly busy since backups and the actual VM's were over the other network connection. I'll set this separate corosync network up and test things out.

On the bright side, restarting corosync on both nodes has seemingly stopped this from happening, at least in the mean time.

Thanks for your assistance!

Stoiko Ivanov · Nov 29, 2018

Glad to hear its working currently!
If you like please mark the thread as solved - that way others know what to expect!

Search

Search

[SOLVED] Constant crashing on one node in a two-node cluster

Deleted member 55253

Guest

Deleted member 55253

Guest

Stoiko Ivanov

Proxmox Staff Member

Deleted member 55253

Guest

Stoiko Ivanov

Proxmox Staff Member

Deleted member 55253

Guest

Stoiko Ivanov

Proxmox Staff Member

Deleted member 55253

Guest

Stoiko Ivanov

Proxmox Staff Member

We value your privacy