VM network interruptions and Conntrack weirdness

Mar 20, 2024
Hi!

Since I upgraded to PVE 9, I have had several complaints from our OS Admin customers saying that they get periodic network disconnects on their VMs.
This happens to isolated VMs, not necessarily to all the VMs on a bridge.

I correlated these events to these messages:

Code:
Sep 25 05:52:56 pxmx-host kernel: net_ratelimit: 534 callbacks suppressed
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet
Sep 25 05:52:56 pxmx-host kernel: nf_conntrack: nf_conntrack: table full, dropping packet


It is during these periods that they say they cannot ping some of their machines.

Anyone else seen similar? Any way to fix it?
Is it a good idea (or even possible) to increase the table size or will I just be shooting myself in the foot?

Thanks
 
Hi @ManFriday ,

Out of curiosity - how large is your environment? Nodes, VMs, etc.?

What is your current max?

Code:
sysctl net.netfilter.nf_conntrack_max

Write a basic loop to log the number of connections every 1–10 seconds, along with the date. That should tell you average use, as well as peak times:

Code:
while true; do
    # Append a timestamp plus current/max conntrack entries to a log file
    echo "$(date '+%F %T') Count: $(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max)" >> /var/log/conntrack_usage.log
    sleep 5
done

You can raise the max somewhat safely; approximate memory cost:
  • 262144 entries - 80–100 MB RAM.
  • 524288 entries - 160–200 MB RAM.
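
For example, to try a larger value at runtime first (not persistent; it is lost on reboot or module reload):

Code:
# bump the limit immediately
sysctl -w net.netfilter.nf_conntrack_max=524288

# verify
sysctl -n net.netfilter.nf_conntrack_max

Worth a glance at net.netfilter.nf_conntrack_buckets too, since the hash bucket count does not scale automatically with the max.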

Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I have a TB of RAM in each host, so I imagine I can try bumping up that max.
Don't forget to ensure that you make the change persistent, or you will be in for a surprise sometime after next reboot.
I'd even recommend testing it by rebooting the host and checking that it was set again.
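
Something like this, for example (the filename is arbitrary):

Code:
# persist the setting across reboots
echo 'net.netfilter.nf_conntrack_max = 524288' > /etc/sysctl.d/99-conntrack.conf

# apply everything in sysctl.d now
sysctl --system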


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I created a file at /etc/sysctl.d/99-conntrack.conf:

Code:
net.netfilter.nf_conntrack_max = 524288

restarted the service, and ran:

Code:
sysctl -n net.netfilter.nf_conntrack_max

It returns the correct 524288, but then a little while later it reverts to the original 262144.

is there somewhere else I need to modify this?
 
it returns the correct 524288, but then a little while later it reverts to the original 262144.
Curious.
Perhaps try:
echo "options nf_conntrack nf_conntrack_max=524288" > /etc/modprobe.d/nf_conntrack.conf

You'll need to reboot, or unload/reload the module, which will cause connection resets.

Not sure, off the top of my head, what could be causing the reversion.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Ahh, look at that! I didn't even know that setting existed.
It is currently set to 'default', so I'm guessing that's the issue.

Thanks so much Victor!
The manual [1] says the default is 262144, although it should not apply unless you have the firewall enabled both for the host and at the Datacenter level.
I would also have thought that "default" meant "use whatever is in the system", but it does in fact apply PVE's default for the value. Which, OTOH, is high enough for many use cases. I would take a look at what is consuming that many connections; maybe you have a misbehaving application / VM / customer, or some kind of DoS attack.

[1] https://pve.proxmox.com/pve-docs/chapter-pve-firewall.html#pve_firewall_host_specific_configuration
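
If the host firewall is indeed what keeps resetting it, pinning the value in the host firewall options should stick; a minimal sketch based on [1] (replace <nodename> with your node's name):

Code:
# /etc/pve/nodes/<nodename>/host.fw
[OPTIONS]
nf_conntrack_max: 524288

pve-firewall should pick the change up on its own; checking sysctl -n net.netfilter.nf_conntrack_max afterwards confirms it stuck.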
 
It was occurring during Veeam backups.
The VLAN the Veeam workers use is on the same 10G uplink as the VM traffic.
Not ideal, I realize.
We are working on separating the Veeam backup traffic onto its own 10G uplink.
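
For anyone else chasing something similar: assuming the conntrack(8) CLI (Debian package "conntrack") is installed, a rough way to see which source IPs dominate the table is something like:

Code:
# list all tracked flows and count the original-direction source IP of each
conntrack -L 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^src=/) { print $i; next } }' \
  | sort | uniq -c | sort -rn | head -n 10

Each flow is listed once, so backup-heavy sources should stand out right away.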