Restarting server crashes ethernet adapter on other nodes

Petr.114 · May 28, 2020

Hello,

we are facing pretty curios troubles in Proxmox cluster.

When we restart node or corosing on some host (systemctl restart corosync), then on some other nodes ethernet adapter crashes and does not recover himself.

Yesterday we performed a test - we restarted server "backup" and in few seconds we lost the connection with servers "pve, prox3, prox2-brno". I went physically to "prox3" server, thinking that i can do something about the situation via terminal. But unfortunatelly the server was in unusable state, was not responding to keystrokes, after 15 minutes of waiting just some enter keys came through. In this state "hard restart" is always necessary.

I am not sure how this can happen that one restarting server affect ethernet adapter on different nodes of the cluster.

I was trying to collect some log, just tell me if i forgot some necessary logs.

Iam not entirely sure if it can be related to some corosync setting, but we have added knet_transport: sctp and token: 10000, because the cluster was all the time falling apart. After adding those two lines the problem was fixed and cluster sticks together.

logging {
debug: off
to_syslog: yes
}

nodelist {
node {
name: backup
nodeid: 5
quorum_votes: 1
ring0_addr: 192.168.0.14
}
node {
name: havirov-prox1
nodeid: 8
quorum_votes: 1
ring0_addr: 192.168.6.2
}
node {
name: prox1
nodeid: 2
quorum_votes: 1
ring0_addr: 192.168.0.11
}
node {
name: prox1-brno
nodeid: 9
quorum_votes: 1
ring0_addr: 192.168.7.2
}
node {
name: prox2
nodeid: 3
quorum_votes: 1
ring0_addr: 192.168.0.12
}
node {
name: prox2-brno
nodeid: 7
quorum_votes: 1
ring0_addr: 192.168.7.10
}
node {
name: prox3
nodeid: 4
quorum_votes: 1
ring0_addr: 192.168.0.13
}
node {
name: prox4
nodeid: 6
quorum_votes: 1
ring0_addr: 192.168.0.15
}
node {
name: pve
nodeid: 1
quorum_votes: 1
ring0_addr: 192.168.0.19
}
}

quorum {
provider: corosync_votequorum
}

totem {
cluster_name: cutter-pv
config_version: 46
interface {
knet_transport: sctp
linknumber: 0
}
ip_version: ipv4
secauth: on
token: 10000
version: 2
}

root@prox3:~$ pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.4.27-1-pve)
pve-manager: 6.1-8 (running version: 6.1-8/806edfe1)
pve-kernel-5.4: 6.1-8
pve-kernel-helper: 6.1-8
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.27-1-pve: 5.4.27-1
pve-kernel-5.4.24-1-pve: 5.4.24-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-4.13.13-5-pve: 4.13.13-38
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.13-pve1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.3-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.15-pve1
libpve-access-control: 6.0-6
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.0-17
libpve-guest-common-perl: 3.0-5
libpve-http-server-perl: 3.0-5
libpve-storage-perl: 6.1-5
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 3.2.1-1
lxcfs: 4.0.1-pve1
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-3
pve-cluster: 6.1-4
pve-container: 3.0-23
pve-docs: 6.1-6
pve-edk2-firmware: 2.20200229-1
pve-firewall: 4.0-10
pve-firmware: 3.0-7
pve-ha-manager: 3.0-9
pve-i18n: 2.0-4
pve-qemu-kvm: 4.1.1-4
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-7
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1

Time of the test was about 17:03

Thanks for any help!

wolfgang · Jun 2, 2020

Hi,

why you switch from udp to sctp?
catching the default token setting is not recommended.
Is the Corosync network on an isolated network and is it all the same subnet?

Petr.114 · Jun 2, 2020

Hello, we have switched to sctp and set token to 10000, because the cluster was all the time falling apart...few times a day. We found this fix on forum and after applying to our cluster, it fixed our problem and cluster now keeps together.
https://forum.proxmox.com/threads/a...-after-upgrade-5-4-to-6-0-4.56425/post-260570
Corosync network is not on isolated network. The nodes are not all in the same subnet.

wolfgang · Jun 2, 2020

Corosync cannot run on multiple subnets.
You will always have a problem with such a setup.
Create a dedicated network for Corosync on a single subnet without a router.

Petr.114 · Jun 2, 2020

Why we can not have nodes in multiple subnets, could you please refer us to some relevant documentation?

We understand, that corosync problems (cluster falling apart) can be caused by multiple subnets, but is it possible, that our present problem (ethernet adapter fails) is caused by multiple subnets or the changed corosync configuration - knet transport and token?
It seems a little bit strange to us.

Thank you for your time wolfgang.

wolfgang · Jun 2, 2020

The main problem is latency.
There is no hard limit, therefore, there is no reference doc for this.
If you use multiple subnets you must have involved routers.
Routers will increase the latency of the packages.
And the next problem is if you are in a shared network.
The traffic in this network can also affect the latency in it.
Latency is for the most services no problem, so most users are more concerned about bandwidth.
The bandwidth is for corosync normally no problem, because in large clusters you need about 10MBit/s.

It is true that you can increase the token timeout but this can make other problems.
For more information about token see "man corosync.conf"

spirit · Jun 2, 2020

I don't recommand sctp for production.
It's less tested by corosync/knet team , and still have bugs.

Some of them have been fixed recently in knet 1.16 (not yet available in proxmox repo)
for exemple:
https://github.com/kronosnet/kronosnet/commit/afd043343d9a56289353dd7f02f3d3c18e5833e6

Petr.114 · Jun 4, 2020

We are probably going to wait until knet update is released in proxmox repositories and give it a try. Maybe it can help the situation.
For info - we have nodes in 3 different cities and they are connected via VPN.

Petr.114 · Jul 10, 2020

knet 1.16 slightly improved stability of the cluster

New info - we have restarted 1 node, but another 3 nodes fall down and restarted...on all 3 nodes, i found (around time of the restart - 12:57) some watchdog messages.

root@prox2:~# cat /var/log/syslog | grep watchdog
Jul 2 11:24:06 prox2 corosync3643: [SERV ] Service engine unloaded: corosync watchdog service
Jul 2 11:24:07 prox2 corosync24628: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 2 11:24:08 prox2 corosync24628: [SERV ] Service engine loaded: corosync watchdog service [7]
Jul 2 12:23:25 prox2 pve-ha-lrm2416: watchdog active
Jul 2 12:57:02 prox2 watchdog-mux1962: client watchdog expired - disable watchdog updates
Jul 2 12:57:38 prox2 kernel: [ 0.199389] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Jul 2 12:57:38 prox2 systemd1: Started Proxmox VE watchdog multiplexer.
Jul 2 12:57:38 prox2 watchdog-mux1832: Watchdog driver 'Software Watchdog', version 0
Jul 2 12:57:39 prox2 corosync2225: [MAIN ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Jul 2 12:57:40 prox2 corosync2225: [SERV ] Service engine loaded: corosync watchdog service [7]

Is it possible, that watchdog is causing some panic and restarting the nodes?
Is it safe to turn the watchdog off?
I found that watchdog can be disable via sysctl kernel.nmi_watchdog=0 and permanently by adding kernel.nmi_watchdog=0 into /etc/sysctl.conf

Search

Search

Restarting server crashes ethernet adapter on other nodes

Petr.114

Well-Known Member

Attachments

wolfgang

Proxmox Retired Staff

Petr.114

Well-Known Member

wolfgang

Proxmox Retired Staff

Petr.114

Well-Known Member

wolfgang

Proxmox Retired Staff

spirit

Distinguished Member

Petr.114

Well-Known Member

Petr.114

Well-Known Member