I have a 5-node cluster with LXC only. From time to time, some LXC loses connectivity. It is sporadic, seemingly random even, and when it happens I am at a loss as to how to debug it properly and find the specific cause.
This weekend it happened again. I took the chance to upgrade the node from version 7.3 to 7.4 and rebooted, so the problem disappeared. The alternative would have been to wait, I am not sure for how long, but definitely more than the 30000 ms you see on the node via
cat /proc/sys/net/ipv4/neigh/eno1/base_reachable_time_ms
until it "fixes itself". I am not sure it always does, though.

LXC on the same node can ping each other, but not other nodes, which might point at a hardware or network problem outside the server (it is hosted at Hetzner). However, other LXC on the "affected" node can ping, say, the internal DNS server, which is on a different node, and you can see their ARP tables being updated (via cat /proc/net/arp), so it is not a hardware problem. The ARP table of the LXC without connectivity looks like this:
Code:
# cat /proc/net/arp
IP address       HW type     Flags       HW address            Mask     Device
192.168.0.254    0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.0.181    0x1         0x2         26:ec:c6:9a:9e:f2     *        eth0
192.168.0.253    0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.0.113    0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.0.102    0x1         0x2         72:8d:c2:7e:cd:e8     *        eth0
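
That is the view from inside the container. To compare it with the host side I would run something along these lines (the container ID 101 is just a placeholder for the example, and the device may need adjusting if the containers hang off a bridge):

Code:
# neighbour cache on the host (same interface as above)
ip -s neigh show dev eno1

# neighbour cache inside a container, run from the host (101 is an example CTID)
pct exec 101 -- ip -s neigh show

With -s you also get how long ago each entry was used and confirmed, which helps tell a stale entry from a fresh one.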
In that table, 192.168.0.181 is on the same node, but 192.168.0.102 is not. I can ping 192.168.0.181 from the affected LXC, but not 192.168.0.102, even though the ARP table has the right MAC address for it (192.168.0.102 is most probably the NGINX reverse proxy trying to reach the application server on the affected LXC).

You can see the who-has ARP requests via tcpdump on the node where the affected LXC is:
Code:
# tcpdump host 192.168.0.180
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:28:11.599900 ARP, Request who-has 192.168.0.113 tell 192.168.0.180, length 28
15:28:12.624192 ARP, Request who-has 192.168.0.113 tell 192.168.0.180, length 28
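
So far I have only used that basic filter. I guess a more targeted capture of just the ARP traffic would be something like this (same interface and addresses as above, purely as a sketch):

Code:
# only ARP traffic involving the affected container and the address it cannot resolve
tcpdump -n -e -i eno1 arp and \( host 192.168.0.180 or host 192.168.0.113 \)

The -e flag also prints the MAC addresses, which I assume would show whether any replies come back with an unexpected source.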
Other LXC on the node of the affected LXC can ping the IP address 192.168.0.180, so I am really lost here.

After a while, if I check the ARP table of the host, the IP 192.168.0.113 (or any other entry with HW address 00:00:00:00:00:00) is gone from it, but it is still present in the ARP table of the affected LXC. If I flush the table with ip -s -s neigh flush all in the affected LXC and try to ping again, the problem does not go away.

At the moment I am running version 7.4, with three nodes still on 7.3 (I will most probably update them tonight), but, as I said, this has been happening to me randomly for at least the last year. Not always, and I cannot say whether one version had the problem and another did not; all of them were 7.x, though (I created the cluster with version 7.0).
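
One more thing I have not ruled out is the forwarding table on the host side. If the containers hang off a Linux bridge (I am using vmbr0 below only as an example name), I assume I could check whether the bridge has learned the MAC that the container has cached for 192.168.0.102:

Code:
# does the bridge know the MAC cached for 192.168.0.102?
# (vmbr0 is an example bridge name; adjust to the actual setup)
bridge fdb show br vmbr0 | grep -i '72:8d:c2:7e:cd:e8'

I do not know whether that entry would actually be missing while the problem is happening.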
I am not sure how to debug this problem further. Could someone please provide instructions? I know how to use tcpdump, but only the basics. All servers have a single NIC, and the private network goes through Hetzner's vSwitch service.

Please let me know if you need further details.