Perplexing: When a node is turned off, the whole cluster loses its network

lifeboy

On a 4-node Proxmox cluster, when one node is shut down, the other nodes somehow lose access to the network. Here is the configuration:

Untitled.png

Each node is set up similarly, but the LAN, corosync and other addresses differ per node.
enlan2.25 and enlan2.35 are legacy setups that will be removed in time, but the other VLANs are configured with Proxmox's SDN. vmbr1 is a bridge set up for the internet gateway.

The critical component is this: pfSense1a and pfSense1b run in VMs too.

1731315874658.png

vmbr1 is the gateway on pfSense.
vmbr0 is the bridge to the "LAN", that is, all the different VMs and LXCs on the VLANs shown and on the default VLAN.
The 2 pfSense VMs are connected with CARP, so they check on each other all the time to see which one should be active.

Now, if I migrate pfSense1a to a different node, there is no service interruption. However, if pfSense1a is running on node A and pfSense1b on node C and I shut down node D, the connection to the cluster is lost. We use OpenVPN to connect to pfSense, and the link drops because the gateway is no longer reachable.

This doesn't make sense to me. The firewalls are not running on the node that gets shut down, yet they lose internet.

vmbr1 via enlan0 is connected to a Netgear switch, as is each other node's enlan0.

Code:
NodeA:~# ip a show dev enlan0
3: enlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether ac:1f:6b:c5:95:20 brd ff:ff:ff:ff:ff:ff

Code:
NodeB:~# ip a show dev enlan0
3: enlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr1 state UP group default qlen 1000
    link/ether ac:1f:6b:c5:95:44 brd ff:ff:ff:ff:ff:ff

etc.
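
When comparing this output across all four nodes, a small parser can make the check repeatable. The following is a minimal sketch, assuming the first line of `ip a show dev enlan0` has the flat "key value" layout seen above; the helper names `parse_link_line` and `uplink_ok` are made up for illustration and are not part of any Proxmox tooling.

```python
import re

# First line of `ip a show dev enlan0` as posted for NodeA above.
SAMPLE = ("3: enlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq "
          "master vmbr1 state UP group default qlen 1000")

LINK_RE = re.compile(
    r"^\d+:\s+(?P<ifname>[^:@\s]+)(?:@\S+)?:\s+<(?P<flags>[^>]*)>\s+(?P<rest>.*)$"
)

def parse_link_line(line):
    """Extract interface name, flags, bridge master and state from one line."""
    m = LINK_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an `ip` link line: {line!r}")
    # Assumption: the remainder is a flat list of "key value" pairs
    # (mtu 1500, master vmbr1, state UP, ...), as in the sample.
    tokens = m.group("rest").split()
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    return {
        "ifname": m.group("ifname"),
        "flags": set(m.group("flags").split(",")),
        "master": pairs.get("master"),
        "state": pairs.get("state"),
    }

def uplink_ok(info, expected_master="vmbr1"):
    """True if the NIC is up with carrier and enslaved to the expected bridge."""
    return (
        {"UP", "LOWER_UP"} <= info["flags"]
        and info["state"] == "UP"
        and info["master"] == expected_master
    )

info = parse_link_line(SAMPLE)
print(info["ifname"], info["master"], info["state"], uplink_ok(info))
```

Running the same check on each node while another node is being shut down would show whether the uplink or its bridge membership changes on the surviving nodes.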

I realise this may be a pfSense issue, but I'm not sure. So I'm starting here on the Proxmox forum.
 
I think I have finally found a possible source of the problem: When I shut down a node, although I have disabled most of the Ceph rebalancing and checking functions, the kernel crashes due to lack of memory. We have now doubled the amount of RAM, so I don't believe it will happen again.

Code:
Nov 10 13:26:48 FT1-NodeA kernel: [15951824.810347] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-1,oom_memcg=/lxc/146,task_memcg=/lxc/146/ns/system.slice/mysql.service,task=mysqld,pid=1107450,uid=100110
Nov 10 13:26:48 FT1-NodeA kernel: [15951824.810911] Memory cgroup out of memory: Killed process 1107450 (mysqld) total-vm:1905156kB, anon-rss:389996kB, file-rss:0kB, shmem-rss:0kB, UID:100110 pgtables:1256kB oom_score_adj:0
Nov 10 13:30:00 FT1-NodeA kernel: [15952016.125040] libceph: osd40 (1)192.168.131.1:6845 socket closed (con state OPEN)
Nov 10 13:30:00 FT1-NodeA kernel: [15952016.148231] libceph: osd40 (1)192.168.131.1:6845 socket closed (con state OPEN)
Nov 10 13:30:01 FT1-NodeA kernel: [15952017.103559] libceph: osd41 (1)192.168.131.2:6841 socket closed (con state OPEN)
Nov 10 13:30:01 FT1-NodeA kernel: [15952017.185012] libceph: osd41 (1)192.168.131.2:6841 socket closed (con state OPEN)
Nov 10 13:30:05 FT1-NodeA kernel: [15952021.763383] libceph: osd42 (1)192.168.131.4:6843 socket closed (con state OPEN)
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.695079] apache2 invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.696170] Hardware name: Supermicro SYS-2029BT-HNC0R/X11DPT-B, BIOS 3.1 04/30/2019
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.696784]  <TASK>
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.697347]  dump_stack+0x10/0x16
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.697933]  oom_kill_process.cold+0xb/0x10
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.698518]  mem_cgroup_out_of_memory+0x145/0x160
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.699104]  charge_memcg+0x45/0xb0
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.699647]  __add_to_page_cache_locked+0x2e1/0x360
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.700205]  add_to_page_cache_lru+0x4d/0xd0
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.700794]  filemap_fault+0x488/0xb10
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.701359]  __do_fault+0x39/0x120
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.701923]  handle_mm_fault+0xd8/0x2c0
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.702480]  ? exit_to_user_mode_prepare+0x90/0x1b0
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.703048]  asm_exc_page_fault+0x27/0x30
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.703618] Code: Unable to access opcode bytes at RIP 0x7f6b6f6989d6.
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.704207] RAX: 000000000000063f RBX: 00007fff2b1cd6c0 RCX: 0000000009000201
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.704797] RBP: 00007f6b6e9d162c R08: 00007fff2b1cd890 R09: 0000000000000001
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.705397] R13: 00007fff2b1cd6c0 R14: 00007f6b6e920390 R15: 0000000000000000
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.706011] memory: usage 4190208kB, limit 4190208kB, failcnt 114023815
Nov 10 13:30:43 FT1-NodeA kernel: [15952059.706618] kmem: usage 64404kB, limit 9007199254740988kB, failcnt 0
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] anon 4090122240
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] file 134680576
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] kernel_stack 2834432
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] pagetables 32231424
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] percpu 3593760
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] sock 0
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] shmem 134225920
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] file_mapped 134135808
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] file_dirty 0
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] file_writeback 0
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] swapcached 0
Nov 10 13:30:44 FT1-NodeA kernel: [15952059.961035] anon_thp 0

However, this seems to be an LXC in which mysql has run out of memory, which lets the OS kernel crash! Is that normal?
 
the log you posted isn't a crash, it's the kernel telling you it stepped in and killed a process to free up memory.
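
To illustrate how such a log can be read, here is a minimal sketch that pulls the relevant fields out of two lines copied from the excerpt above. The `constraint=CONSTRAINT_MEMCG` field indicates that a memory cgroup (here `/lxc/146`) hit its own limit, as opposed to the host as a whole running out of memory. The helper names `field`, `oom_events` and `memcg_usage` are hypothetical and only serve this example.

```python
import re

# Two lines copied from the kernel log above (syslog prefix trimmed). They come
# from two separate events in that excerpt: the mysqld kill at 13:26:48 and the
# memcg statistics dumped at 13:30:43.
LOG = """\
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0-1,oom_memcg=/lxc/146,task_memcg=/lxc/146/ns/system.slice/mysql.service,task=mysqld,pid=1107450,uid=100110
memory: usage 4190208kB, limit 4190208kB, failcnt 114023815
"""

def field(pattern, line):
    """Return the first capture group of pattern in line, or None."""
    m = re.search(pattern, line)
    return m.group(1) if m else None

def oom_events(log_text):
    """List one dict per oom-kill line, noting whether a cgroup limit triggered it."""
    events = []
    for line in log_text.splitlines():
        if "oom-kill:" not in line:
            continue
        constraint = field(r"constraint=(\w+)", line)
        events.append({
            "constraint": constraint,
            # CONSTRAINT_MEMCG means a memory cgroup hit its own limit.
            "cgroup_limited": constraint == "CONSTRAINT_MEMCG",
            "memcg": field(r"oom_memcg=([^,]+)", line),
            "task": field(r",task=([^,]+)", line),
            "pid": field(r"pid=(\d+)", line),
        })
    return events

def memcg_usage(log_text):
    """List (usage_MiB, limit_MiB) for each 'memory: usage ..., limit ...' line."""
    pairs = re.findall(r"memory: usage (\d+)kB, limit (\d+)kB", log_text)
    return [(int(u) / 1024, int(l) / 1024) for u, l in pairs]

for ev in oom_events(LOG):
    print(ev)
for usage, limit in memcg_usage(LOG):
    print(f"usage {usage:.0f} MiB of limit {limit:.0f} MiB")
```

In the excerpt, usage equals the limit (4190208 kB, i.e. 4092 MiB), which is consistent with a container reaching its configured memory cap.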
 
maybe your network went down because the pfsense VM got OOM-killed?
 
That's why I have two instances of pfSense that poll each other with CARP. If I shut one down, the other takes over within seconds. So it's not that.

The VMs on the nodes stay on, but they don't communicate with the control plane anymore as far as I can tell. So if I check the logs on a machine that appeared to be off during another node's downtime, the logs show the machine was running all the time. However, it was unreachable during the other node's downtime.
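
One way to separate "unreachable" from "stopped" is to record a periodic reachability probe from outside the cluster and collapse the results into outage windows, which can then be compared with the time a node was shut down. The following is only a sketch: the function name `outage_windows`, the timestamps and the 10-second probe interval are assumptions made up for this example, not measurements from this cluster.

```python
from datetime import datetime, timedelta

def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (first_fail, first_ok_after) windows."""
    windows = []
    start = None
    for ts, ok in sorted(samples):
        if not ok and start is None:
            start = ts                      # first failed probe opens a window
        elif ok and start is not None:
            windows.append((start, ts))     # first good probe closes it
            start = None
    if start is not None:
        windows.append((start, None))       # still down at the end of the data
    return windows

# Hypothetical probe results: one sample every 10 s, failing for samples 3..8.
t0 = datetime(2024, 11, 10, 13, 25, 0)
samples = [(t0 + timedelta(seconds=10 * i), not (3 <= i < 9)) for i in range(12)]

for begin, end in outage_windows(samples):
    print("unreachable from", begin, "until", end)
```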
 
could you maybe provide logs of all the nodes starting slightly before and ending a bit after such an outage started?
 
I don't really see much out of the ordinary there, except for the OOM kills and this line:

Code:
Nov 10 12:46:38 FT1-NodeB QEMU[1228209]: kvm: ../block/block-backend.c:1780: blk_drain: Assertion `qemu_in_main_thread()' failed.

that would indicate that VM 108 crashed there (similar lines also exist for nodes D and C).

could you maybe provide more details about your network setup?
 
I've seen similar misbehavior caused by misconfigured VLANs in the switch(es), where some VLANs were missing on the ports of some host(s), so moving the VMs that used them to such a host made them unreachable. I suggest that you triple-check the switch configuration and use a test VM to verify that every VLAN / network works on every host.
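
The check suggested here can be sketched as a simple set comparison between the VLANs the guests need and the VLANs each switch port carries. This is a minimal illustration only: the port names and VLAN 100 are hypothetical (25 and 35 are taken from the enlan2.25 and enlan2.35 interfaces mentioned earlier), and the port-to-VLAN mapping would have to be filled in by hand from the switch configuration.

```python
def missing_vlans(required, port_vlans):
    """Return {port: sorted missing VLAN IDs} for ports lacking any required VLAN."""
    gaps = {}
    for port, vlans in port_vlans.items():
        missing = sorted(set(required) - set(vlans))
        if missing:
            gaps[port] = missing
    return gaps

# Hypothetical data: port names and VLAN membership are illustrative only.
required = {25, 35, 100}
port_vlans = {
    "NodeA-port": {25, 35, 100},
    "NodeB-port": {25, 35, 100},
    "NodeC-port": {25, 35, 100},
    "NodeD-port": {25, 100},      # VLAN 35 deliberately left out for the demo
}

print(missing_vlans(required, port_vlans))
```

An empty result means every listed port carries every required VLAN; any entry names a port where guests on the missing VLAN would become unreachable.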
 
Indeed, some VMs crashed. However, the 2 pfSense VMs are 100 and 101 and neither crashed. That was the first thing I checked for in the logs.

The reason for taking the nodes down was exactly that: We doubled the RAM in each node.
 
I've seen similar misbehavior caused by misconfigured VLANs in the switch(es), where some VLANs were missing on the ports of some host(s), so moving the VMs that used them to such a host made them unreachable. I suggest that you triple-check the switch configuration and use a test VM to verify that every VLAN / network works on every host.

Yes, we have checked that in great detail. On the Mellanox switch, all the active ports are joined to every VLAN, so no matter where the VM runs, the VLANs are active there.