Migrating an LXC leaves it with no network access beyond other LXCs on the same node

jsabater

Hello.

As of late (a week or two; it started on 7.3 and the problem persists after upgrading to 7.4), when I migrate an LXC from one node (e.g. proxmox1) to another (e.g. proxmox4), the LXC cannot ping, or be pinged by, any LXC except those on the same, new node.

When I migrate it back to its original node, it instantly works again and can connect to LXCs on both the same and different nodes. I am very concerned, to be honest.
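For reference, this is roughly how I test it after each migration, from the node currently hosting the container (the ID and addresses are examples from my setup):

Code:
# from the node currently hosting the container
pct exec 102 -- ping -c 3 192.168.0.180   # LXC on the same node: works
pct exec 102 -- ping -c 3 192.168.0.110   # LXC on another node: no reply after migrating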

A week ago I had problems with the filesystem on proxmox3, which came out of nowhere, and I had to delete and recreate a number of LXCs (app servers and databases) because their storage was corrupt. Rebooting the node took about 20 minutes, but eventually both the ext4-based local pool and the ZFS-based zfspool pool came back to a good status. Any LXC I tried to move out of proxmox3 suffered the same problem, but I thought it was due to the corrupt storage (which is why I decided to re-create them on other nodes).

More and more, I think proxmox3 was not properly recovered to a good state and there is garbage in that node affecting the cluster. If I try to reuse the IDs of the LXCs originally affected on proxmox3, there are always problems. If I use new IDs on any node other than proxmox3, the new containers work fine.

I've tried enabling Firewall: Options: log_level_in and log_level_out (setting them to info and debug), but I see no entries in the Firewall: Log screen when I try to ping the LXC. If I do the same on any other LXC (no need to restart it), I see traffic.
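For completeness, these are the settings I mean, as they end up in the container's firewall config, and where I expected to see matching entries on the node (assuming the standard log location):

Code:
# /etc/pve/firewall/102.fw
[OPTIONS]
log_level_in: info
log_level_out: debug

# on the node hosting the container
tail -f /var/log/pve-firewall.log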

I just don't know where to look. Any hints, ideas, anything to look for? It's as if the LXCs that existed before proxmox3 suffered its issues are "stuck in the past".

Thanks. Any ideas will be greatly appreciated.

Code:
# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.104-1-pve)
pve-manager: 7.4-3 (running version: 7.4-3/9002ab8a)
pve-kernel-5.15: 7.4-1
pve-kernel-5.15.104-1-pve: 5.15.104-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
ceph-fuse: 14.2.21-1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4-2
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-4
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-1
libpve-rs-perl: 0.7.5
libpve-storage-perl: 7.4-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.1-1
proxmox-backup-file-restore: 2.4.1-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.6.5
pve-cluster: 7.3-3
pve-container: 4.4-3
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-1
pve-firewall: 4.3-1
pve-firmware: 3.6-4
pve-ha-manager: 3.6.0
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-1
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

This is the last LXC I tried to migrate. It was originally on proxmox1 and was being moved to proxmox4, so it is not one of those "affected" by the issues on proxmox3:

Code:
# pct config 102
arch: amd64
cores: 4
features: nesting=1
hostname: nginx1
memory: 2048
nameserver: 192.168.0.253 192.168.0.254
net0: name=eth0,bridge=vmbr4002,firewall=1,hwaddr=4E:48:75:31:5E:4A,ip=192.168.0.102/24,mtu=1400,type=veth
net1: name=eth1,bridge=vmbr4001,firewall=1,gw=116.X.Y.T,hwaddr=4A:75:45:2E:B9:68,ip=116.X.Y.Z/28,mtu=1400,type=veth
onboot: 1
ostype: debian
rootfs: local:102/vm-102-disk-0.raw,size=8G
searchdomain: internaldomain.com
swap: 512
tags: sysadmin
unprivileged: 1
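This is also the sort of check I can run on the target node to rule out a missing bridge or an MTU mismatch (bridge names taken from the config above):

Code:
# on the target node (proxmox4)
ip link show vmbr4002
ip link show vmbr4001
ip -d link show vmbr4002 | grep mtu   # should be compatible with the mtu=1400 on net0/net1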
 
I started tcpdump host 192.168.0.145 on the node I migrated another LXC to (from proxmox1 to proxmox4 again, but an LXC I can live without), and I see lots of ARP traffic:

Code:
# tcpdump host 192.168.0.145
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:17:35.131332 ARP, Request who-has 192.168.0.145 tell 192.168.0.114, length 42
12:17:35.131528 ARP, Reply 192.168.0.145 is-at de:ea:be:53:47:09 (oui Unknown), length 28
12:17:36.155180 ARP, Request who-has 192.168.0.145 tell 192.168.0.114, length 42
12:17:36.155387 ARP, Reply 192.168.0.145 is-at de:ea:be:53:47:09 (oui Unknown), length 28
12:17:37.179222 ARP, Request who-has 192.168.0.145 tell 192.168.0.114, length 42
12:17:37.179436 ARP, Reply 192.168.0.145 is-at de:ea:be:53:47:09 (oui Unknown), length 28
12:17:38.203275 ARP, Request who-has 192.168.0.145 tell 192.168.0.114, length 42
12:17:38.203510 ARP, Reply 192.168.0.145 is-at de:ea:be:53:47:09 (oui Unknown), length 28
12:17:39.227562 ARP, Request who-has 192.168.0.145 tell 192.168.0.114, length 42
12:17:39.227820 ARP, Reply 192.168.0.145 is-at de:ea:be:53:47:09 (oui Unknown), length 28

I was trying to ping from 192.168.0.145 on proxmox4 to 192.168.0.102 on proxmox1. It did not work.
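Since ARP resolves but the ICMP echoes never come back, I suppose the next step is to see how far the unicast packets actually get. Something like this on both ends (the interface name follows the usual veth<VMID>i<index> convention; the VMID here is a placeholder, replace it with the migrated container's real ID):

Code:
# on proxmox4, on the migrated container's veth
tcpdump -ni veth145i0 icmp
# on proxmox1, on the bridge the target container hangs off
tcpdump -ni vmbr4002 icmp and host 192.168.0.145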

I also pinged IP address 192.168.0.180, which is on the same node, and the ping worked fine. The ARP table was updated and also includes 192.168.0.114, which is the Prometheus server that tries to reach every host in the cluster.

Code:
# arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
192.168.0.253                    (incomplete)                              eth0
192.168.0.114            ether   5e:7f:b5:07:45:d2   C                     eth0
192.168.0.110                    (incomplete)                              eth0
192.168.0.102                    (incomplete)                              eth0
192.168.0.180            ether   be:84:d1:ac:61:e1   C                     eth0
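Those (incomplete) entries look stale to me, so it may also be worth flushing the neighbour table inside the container and retrying the ping (plain iproute2, nothing Proxmox-specific):

Code:
# inside the container (or via pct exec <vmid> -- ... from the node)
ip neigh flush dev eth0
ip neigh show dev eth0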

Restarting pve-firewall did not make any difference.
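In case it helps, the ruleset the firewall actually generates can be inspected on the node, which should at least confirm that rules for the migrated container exist there (VMID 102 used as an example):

Code:
# on the node now hosting the container
pve-firewall status
pve-firewall compile | grep -B 2 -A 8 102   # look for the veth102i0 / fwbr102i0 rules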
 
