[SOLVED] Constant crashing on one node in a two-node cluster

  • Thread starter: Deleted member 55253 (Guest)
Hello,
I have two Dell servers, both identical hardware.
pve1 at 192.168.1.11, pve2 at 192.168.1.12

One of them keeps crashing anywhere from every 30 minutes to every 6 hours, so as far as I can tell no recurring task is causing this.
syslog and kern.log don't tell me anything useful, though I do see two common items in the syslog just before each crash.

One crash:
Code:
Nov 26 09:17:19 pve2 corosync[1793]: notice  [TOTEM ] A new membership (192.168.1.12:1520) was formed. Members
Nov 26 09:17:19 pve2 corosync[1793]: warning [CPG   ] downlist left_list: 0 received
Nov 26 09:17:19 pve2 corosync[1793]: notice  [QUORUM] Members[1]: 2
Nov 26 09:17:19 pve2 corosync[1793]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:17:19 pve2 corosync[1793]:  [TOTEM ] A new membership (192.168.1.12:1520) was formed. Members
Nov 26 09:17:19 pve2 corosync[1793]:  [CPG   ] downlist left_list: 0 received
Nov 26 09:17:19 pve2 corosync[1793]:  [QUORUM] Members[1]: 2
Nov 26 09:17:19 pve2 corosync[1793]:  [MAIN  ] Completed service synchronization, ready to provide service.

An earlier crash:
Code:
Nov 26 09:24:05 pve2 corosync[1813]: notice  [TOTEM ] A new membership (192.168.1.12:1620) was formed. Members
Nov 26 09:24:05 pve2 corosync[1813]:  [TOTEM ] A new membership (192.168.1.12:1620) was formed. Members
Nov 26 09:24:05 pve2 corosync[1813]: warning [CPG   ] downlist left_list: 0 received
Nov 26 09:24:05 pve2 corosync[1813]: notice  [QUORUM] Members[1]: 2
Nov 26 09:24:05 pve2 corosync[1813]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:24:05 pve2 corosync[1813]:  [CPG   ] downlist left_list: 0 received
Nov 26 09:24:05 pve2 corosync[1813]:  [QUORUM] Members[1]: 2
Nov 26 09:24:05 pve2 corosync[1813]:  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:24:06 pve2 pvesr[6181]: trying to acquire cfs lock 'file-replication_cfg' ...
Nov 26 09:24:06 pve2 corosync[1813]: notice  [TOTEM ] A new membership (192.168.1.12:1624) was formed. Members
Nov 26 09:24:06 pve2 corosync[1813]:  [TOTEM ] A new membership (192.168.1.12:1624) was formed. Members
Nov 26 09:24:06 pve2 corosync[1813]: warning [CPG   ] downlist left_list: 0 received
Nov 26 09:24:06 pve2 corosync[1813]: notice  [QUORUM] Members[1]: 2
Nov 26 09:24:06 pve2 corosync[1813]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 09:28:17 pve2 systemd-modules-load[453]: Inserted module 'iscsi_tcp'

The two repeating items I see are corosync messages and entries dealing with file-replication_cfg, though I don't know what to make of them or where to begin fixing this.
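
If anyone wants more context, something like this should pull the corosync and pve-cluster messages around a crash window (the timestamps below are just an example around one of the crashes):
Code:
# Sketch: collect corosync and pve-cluster log entries around a crash
# (adjust --since/--until to your own crash times)
journalctl -u corosync -u pve-cluster --since "2018-11-26 09:00" --until "2018-11-26 09:30"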


pveversion reports:
Code:
root@pve2:~# pveversion  -v
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-41
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-30
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-3
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-29
pve-docs: 5.2-9
pve-firewall: 3.0-14
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-38
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve2~bpo1

pvecm status:
Code:
root@pve2:/etc/pve/ha# pvecm status
Quorum information
------------------
Date:             Mon Nov 26 10:12:31 2018
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          0x00000002
Ring ID:          1/2020
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           2  
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.1.11
0x00000002          1 192.168.1.12 (local)


Any help would be appreciated.
Thanks!
 
Further troubleshooting:
I read that this may be caused by multicast issues.
Multicast is enabled on my network, and the command
Code:
omping -c 10000 -i 0.001 -F -q pve1 pve2
returns 0 errors through multiple retries. I'll continue to update in case anyone else ever has this problem :)

Also, both machines are time-synced to a domain controller and are identical as far as I can tell.
 
* Do you have HA activated for a resource in the cluster?
* Please post the complete output of omping (the latency is also interesting).
 
I did have HA activated for one container, though I disabled it shortly after.
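
For completeness, something like this should show whether anything is still under HA management (ct:100 below is just a placeholder ID, not my actual container):
Code:
# List resources currently managed by HA
ha-manager status
# Remove a leftover resource from HA management (placeholder ID)
ha-manager remove ct:100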

omping output:
pve1:
Code:
omping -c 10000 -i 0.001 -F -q pve1 pve2
pve2 : waiting for response msg
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : waiting for response msg
pve2 : server told us to stop

pve2 :   unicast, xmt/rcv/%loss = 8766/8766/0%, min/avg/max/std-dev = 0.038/0.174/3.782/0.100
pve2 : multicast, xmt/rcv/%loss = 8766/8766/0%, min/avg/max/std-dev = 0.063/0.203/0.703/0.085


pve2:
Code:
omping -c 10000 -i 0.001 -F -q pve1 pve2
pve1 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.038/0.124/3.198/0.067
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.058/0.171/3.224/0.076

I did notice late yesterday that restarting corosync on pve1 does fix the problem, even if only temporarily.
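
For anyone else trying this, the restart I mean is just the service restart; if /etc/pve hangs afterwards, restarting pve-cluster as well may be needed:
Code:
systemctl restart corosync
# only if the cluster filesystem is still stuck afterwards:
systemctl restart pve-cluster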
 
^ This is exactly what I'm seeing: quorum gets lost for 30-ish seconds, then pve2 crashes and reboots. I know that in years prior a third node was a hard requirement, though now a two-node cluster will allow HA but display a warning. Seems like I need to buy a third R710 :)

Results of the longer omping test:
Code:
root@pve1:~# omping -c 600 -i 1 -q pve1 pve2
pve2 : waiting for response msg
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : given amount of query messages was sent

pve2 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.127/0.273/0.582/0.046
pve2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.165/0.324/0.677/0.051

root@pve2:~# omping -c 600 -i 1 -q pve1 pve2
pve1 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent

pve1 :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.123/0.296/0.669/0.055
pve1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.221/0.373/0.739/0.062
 
The omping test looks OK - however, if you have your corosync traffic on the same interface as your other traffic (VM network, storage traffic), it could still very well happen that load on the interface causes packet loss.
We recommend keeping the corosync traffic on a separate physical interface - see our documentation: https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
 
Ah, alright. I did change the architecture a bit recently: I used to have a 1G connection to the LAN and a 10G connection direct to storage, but now it's a single 10G for everything. The old 1G really wasn't terribly busy, since backups and the actual VMs went over the other network connection. I'll set this separate corosync network up and test things out.
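
For anyone else doing the same, my understanding from the docs is that this means pointing ring0 in /etc/pve/corosync.conf at the dedicated network, roughly like the excerpt below (10.10.10.x is a made-up example subnet here, and config_version has to be bumped on every edit):
Code:
# /etc/pve/corosync.conf (excerpt) - example only, 10.10.10.x is a placeholder subnet
totem {
  ...
  config_version: 3      # increment on every change
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
  }
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.11
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.10.10.12
  }
}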

On the bright side, restarting corosync on both nodes has seemingly stopped this from happening, at least in the meantime.

Thanks for your assistance!
 
Glad to hear it's working currently!
If you like, please mark the thread as solved - that way others know what to expect!
 
