Restarting a server crashes the ethernet adapter on other nodes

Petr.114
Hello,

we are facing a rather curious problem in our Proxmox cluster.

When we restart a node, or restart corosync on some host (systemctl restart corosync), the ethernet adapter on some other nodes crashes and does not recover by itself.

Yesterday we performed a test: we restarted the server "backup", and within a few seconds we lost the connection to the servers "pve", "prox3" and "prox2-brno". I went physically to the "prox3" server, thinking I could do something about the situation via the terminal. Unfortunately the server was in an unusable state and was not responding to keystrokes; after 15 minutes of waiting, only a few Enter presses came through. In this state a hard restart is always necessary.

I do not understand how one restarting server can affect the ethernet adapter on other nodes of the cluster.

I have tried to collect some logs (attached below); just tell me if I forgot any that you need.

I am not entirely sure whether it could be related to a corosync setting, but we have added knet_transport: sctp and token: 10000, because the cluster was constantly falling apart. After adding those two lines that problem was fixed and the cluster sticks together.

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: backup
    nodeid: 5
    quorum_votes: 1
    ring0_addr: 192.168.0.14
  }
  node {
    name: havirov-prox1
    nodeid: 8
    quorum_votes: 1
    ring0_addr: 192.168.6.2
  }
  node {
    name: prox1
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.0.11
  }
  node {
    name: prox1-brno
    nodeid: 9
    quorum_votes: 1
    ring0_addr: 192.168.7.2
  }
  node {
    name: prox2
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 192.168.0.12
  }
  node {
    name: prox2-brno
    nodeid: 7
    quorum_votes: 1
    ring0_addr: 192.168.7.10
  }
  node {
    name: prox3
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 192.168.0.13
  }
  node {
    name: prox4
    nodeid: 6
    quorum_votes: 1
    ring0_addr: 192.168.0.15
  }
  node {
    name: pve
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.0.19
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: cutter-pv
  config_version: 46
  interface {
    knet_transport: sctp
    linknumber: 0
  }
  ip_version: ipv4
  secauth: on
  token: 10000
  version: 2
}

root@prox3:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.4.27-1-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-5.4: 6.1-8
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.27-1-pve: 5.4.27-1
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.13-pve1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

The time of the test was about 17:03.

Thanks for any help!
 

Attachments

  • syslog_backup-restarting this server.txt
  • syslog_prox1-without problems.txt
  • syslog_prox3.txt
  • syslog_pve.txt
  • syslog_prox2-brno.txt
Hi,

why did you switch from UDP to SCTP?
Changing the default token setting is not recommended.
Is the Corosync network on an isolated network, and is it all in the same subnet?
 
Corosync cannot run across multiple subnets.
You will always have problems with such a setup.
Create a dedicated network for Corosync on a single subnet, without a router in between.
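
A minimal sketch of how a node entry could look once corosync has its own dedicated link (the 10.10.10.0/24 addresses are only placeholders, not from this cluster):

node {
  name: prox1
  nodeid: 2
  quorum_votes: 1
  # dedicated corosync NIC, same subnet on every node, no router in between
  ring0_addr: 10.10.10.11
}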
 
Why can we not have nodes in multiple subnets? Could you please point us to some relevant documentation?

We understand that corosync problems (the cluster falling apart) can be caused by multiple subnets, but is it possible that our present problem (the ethernet adapter failing) is caused by multiple subnets, or by the changed corosync configuration (knet transport and token)?
It seems a little strange to us.

Thank you for your time, wolfgang.
 
The main problem is latency.
There is no hard limit, therefore there is no reference doc for this.
If you use multiple subnets, routers must be involved, and routers increase the latency of the packets.
The next problem is if you are on a shared network: the other traffic in that network can also affect the latency.
For most services latency is no problem, so most users are more concerned about bandwidth.
Bandwidth is normally no problem for corosync, because even in large clusters you only need about 10 Mbit/s.
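
As a quick sanity check (standard tooling, not something from this thread), you can measure the round-trip latency between nodes on the corosync network and inspect the knet link state:

# round-trip latency to another node, using an address from the config above
ping -c 10 192.168.0.12

# local node id and status of each knet link
corosync-cfgtool -s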

It is true that you can increase the token timeout, but this can cause other problems.
For more information about token, see "man corosync.conf".
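
If you want to check which token timeout corosync is actually running with, the effective runtime value can be read with the standard cmap tool:

# show the effective token timeout (in milliseconds)
corosync-cmapctl | grep totem.token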
 
We are probably going to wait until the knet update is released in the Proxmox repositories and give it a try. Maybe it will help the situation.
For info: we have nodes in 3 different cities, connected via VPN.
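
For anyone in the same situation: the installed and candidate versions of the knet library can be checked with plain apt before upgrading (standard Debian tooling, nothing Proxmox-specific):

apt update
apt policy libknet1    # compare installed vs. candidate version
apt install libknet1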
 
knet 1.16 slightly improved the stability of the cluster.

New info: we restarted 1 node, but another 3 nodes went down and restarted. On all 3 nodes I found some watchdog messages around the time of the restart (12:57).
root@prox2:~# cat /var/log/syslog | grep watchdog
Jul 2 11:24:06 prox2 corosync[3643]: [SERV ] Service engine unloaded: corosync watchdog service
Jul 2 11:24:07 prox2 corosync[24628]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 2 11:24:08 prox2 corosync[24628]: [SERV ] Service engine loaded: corosync watchdog service [7]
Jul 2 12:23:25 prox2 pve-ha-lrm[2416]: watchdog active
Jul 2 12:57:02 prox2 watchdog-mux[1962]: client watchdog expired - disable watchdog updates
Jul 2 12:57:38 prox2 kernel: [ 0.199389] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Jul 2 12:57:38 prox2 systemd[1]: Started Proxmox VE watchdog multiplexer.
Jul 2 12:57:38 prox2 watchdog-mux[1832]: Watchdog driver 'Software Watchdog', version 0
Jul 2 12:57:39 prox2 corosync[2225]: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 2 12:57:40 prox2 corosync[2225]: [SERV ] Service engine loaded: corosync watchdog service [7]
Is it possible that the watchdog is causing some panic and restarting the nodes?
Is it safe to turn the watchdog off?
I found that the watchdog can be disabled via sysctl kernel.nmi_watchdog=0, and permanently by adding kernel.nmi_watchdog=0 to /etc/sysctl.conf.
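
For completeness, the commands described above would be as follows (note that kernel.nmi_watchdog controls the kernel's NMI watchdog, which is a different mechanism than the watchdog-mux/softdog entries in the log):

# disable the kernel NMI watchdog at runtime
sysctl kernel.nmi_watchdog=0

# make the setting persistent across reboots
echo "kernel.nmi_watchdog=0" >> /etc/sysctl.conf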
 
