I have a 5-node cluster with LXC only. From time to time, some LXC loses connectivity. It is sporadic, seemingly random even, and when it happens I am at a loss as to how to debug it properly and find the specific cause.
This weekend it happened again. I took the chance to upgrade the node from version 7.3 to 7.4 and rebooted, so the problem disappeared. The alternative would have been to wait, I am not sure for how long, but definitely more than the 30000 ms you see on the node via
cat /proc/sys/net/ipv4/neigh/eno1/base_reachable_time_ms
until it "fixes itself". I am not sure it always does, though.

LXC on the same node can ping each other, but not other nodes, which might point at a hardware or network problem outside the server (it is hosted at Hetzner). However, other LXC on the "affected" node can ping, say, the internal DNS server, which is on a different node, and you can see their ARP tables being updated (via cat /proc/net/arp), so it is not a hardware problem. The ARP table of the LXC without connectivity looks like this:
Code:
# cat /proc/net/arp
IP address       HW type     Flags       HW address            Mask     Device
192.168.0.254    0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.0.181    0x1         0x2         26:ec:c6:9a:9e:f2     *        eth0
192.168.0.253    0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.0.113    0x1         0x0         00:00:00:00:00:00     *        eth0
192.168.0.102    0x1         0x2         72:8d:c2:7e:cd:e8     *        eth0
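
That is the view from inside the container. To compare it with the host side I would run something along these lines (the container ID 101 is just a placeholder for the example, and the device may need adjusting if the containers hang off a bridge):

Code:
# neighbour cache on the host (same interface as above)
ip -s neigh show dev eno1

# neighbour cache inside a container, run from the host (101 is an example CTID)
pct exec 101 -- ip -s neigh show

With -s you also get how long ago each entry was used and confirmed, which helps tell a stale entry from a fresh one.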
In that table, 192.168.0.181 is on the same node, but 192.168.0.102 is not. I can ping 192.168.0.181 from the affected LXC, but not 192.168.0.102, even though the ARP table has the right MAC address for it (192.168.0.102 is most probably the NGINX reverse proxy trying to reach the application server on the affected LXC).

You can see the who-has ARP requests via tcpdump on the node where the affected LXC is:
Code:
# tcpdump host 192.168.0.180
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:28:11.599900 ARP, Request who-has 192.168.0.113 tell 192.168.0.180, length 28
15:28:12.624192 ARP, Request who-has 192.168.0.113 tell 192.168.0.180, length 28
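
So far I have only used that basic filter. I guess a more targeted capture of just the ARP traffic would be something like this (same interface and addresses as above, purely as a sketch):

Code:
# only ARP traffic involving the affected container and the address it cannot resolve
tcpdump -n -e -i eno1 arp and \( host 192.168.0.180 or host 192.168.0.113 \)

The -e flag also prints the MAC addresses, which I assume would show whether any replies come back with an unexpected source.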
Other LXC on the node of the affected LXC can ping the IP address 192.168.0.180, so I am really lost here.

After a while, if I check the ARP table of the host, the IP 192.168.0.113 (or any other entry with HW address 00:00:00:00:00:00) is gone from it, but it is still present in the ARP table of the affected LXC. If I flush the table with ip -s -s neigh flush all in the affected LXC and try to ping again, the problem does not go away.

At the moment I am running version 7.4, with three nodes still on 7.3 (I will most probably update them tonight), but, as I said, this has been happening to me randomly for at least the last year. Not always, and I cannot say whether one version had the problem and another did not; all of them were 7.x, though (I created the cluster with version 7.0).
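
One more thing I have not ruled out is the forwarding table on the host side. If the containers hang off a Linux bridge (I am using vmbr0 below only as an example name), I assume I could check whether the bridge has learned the MAC that the container has cached for 192.168.0.102:

Code:
# does the bridge know the MAC cached for 192.168.0.102?
# (vmbr0 is an example bridge name; adjust to the actual setup)
bridge fdb show br vmbr0 | grep -i '72:8d:c2:7e:cd:e8'

I do not know whether that entry would actually be missing while the problem is happening.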
I am not sure how to debug this problem further. Could someone please provide instructions? I know how to use tcpdump, but only the basics. All servers have a single NIC, and the private network goes through Hetzner's vSwitch service.

Please let me know if you need further details.