Random inter-VM network connection issues

stefws

Got an application spread over multiple VMs across multiple hypervisor nodes (HNs) using PVE firewalling. Some central NW VMs (load balancers) have multiqueue NICs to be able to handle more packets. The HNs are HP ProLiant 360 Gen9, each with two bonds: bond0 over two separate 1 Gb/s switches, and bond1 over two separate 10 Gb/s switches with a 2x40 Gb/s inter-switch connect.

bond0 is used for public access to the HNs and corosync ring 1, connected to vmbr0 (standard Linux bridge)
bond1 is used for public access to the application/VMs and corosync ring 0, connected to vmbr1 (OVS 2.5 bridge)

Randomly we see, for example:
- 1. Inter-VM connection attempts (after a while of silence) getting 'connection refused', only to have no problems split seconds later on the following attempts (like some sort of cache state initialization/warm-up).

- 2. Two VMs are internal name servers, and other name client VMs randomly see timeouts on name resolution.

The above seems to make health checks from the load balancers and application traffic fluctuate, more so under heavier load.

Hints appreciated on debugging this!

TIA

Info snippets:

root@n1:~# pveversion --verbose
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.8-1-pve: 4.4.8-51
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
root@n1:~# netstat -in
Kernel Interface table
Iface       MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0      9000   0   64682282      0  23808      0   56139474      0      0      0 BMmRU
bond1      9000   0          0      0      0      0          0      0      0      0 BMRU
eth0       9000   0   58223122      0      1      0   55738773      0      0      0 BMsRU
eth4       9000   0    6459160      0      0      0     400701      0      0      0 BMsRU
eth8       9000   0 2656781604      0      0      0 3822817813      0      0      0 BMRU
eth9       9000   0 4914314019      0      0      0 3832775928      0      0      0 BMRU
fwbr400i1  9000   0     519629      0      0      0          0      0      0      0 BMRU
fwln400o1  9000   0 2831738077      0      0      0 2309932822      0      0      0 BMRU
lo        65536   0     445243      0      0      0     445243      0      0      0 LRU
tap400i0   9000   0     403787      0      0      0     221355      0      0      0 BMPRU
tap400i1   9000   0 2309661198      0      0      0 2815793048      0    107      0 BMPRU
tap400i2   9000   0 2496532145      0      0      0 2284651197      0      0      0 BMPRU
tap400i3   9000   0     426370      0      0      0     401890      0      0      0 BMPRU
tap400i4   9000   0     286914      0      0      0     136134      0      0      0 BMPRU
tap400i5   9000   0     564233      0      0      0     563157      0      0      0 BMPRU
vlan11     9000   0     244909      0     20      0     362005      0      0      0 BMRU
vlan12     9000   0     266096      0     20      0     422677      0      0      0 BMRU
vlan13     9000   0   64248249      0      0      0   59963570      0      0      0 BMRU
vlan20     9000   0    8971555      0      0      0   13944236      0      0      0 BMRU
vlan21     9000   0    9154950      0      0      0   14024124      0      0      0 BMRU
vmbr0      9000   0   64506473      0 660530      0   55353343      0      0      0 BMRU
 
A few drops on bond0, but even more on vmbr0, though this bridge isn't used for inter-VM traffic:

root@n1:~# ifconfig bond0
bond0 Link encap:Ethernet HWaddr 28:80:23:a7:e6:b4
UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
RX packets:64916827 errors:0 dropped:23899 overruns:0 frame:0
TX packets:56335617 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12061027105 (11.2 GiB) TX bytes:7513331047 (6.9 GiB)

root@n1:~# ifconfig vmbr0
vmbr0 Link encap:Ethernet HWaddr 28:80:23:a7:e6:b4
inet addr:<redacted> Bcast:<redacted> Mask:255.255.255.240
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:64741652 errors:0 dropped:663050 overruns:0 frame:0
TX packets:55547627 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:10868276081 (10.1 GiB) TX bytes:7193452012 (6.6 GiB)

A few packets later:

root@n1:~# ifconfig bond0
bond0 Link encap:Ethernet HWaddr 28:80:23:a7:e6:b4
UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1
RX packets:64919285 errors:0 dropped:23900 overruns:0 frame:0
TX packets:56337677 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:12061489699 (11.2 GiB) TX bytes:7513610711 (6.9 GiB)

root@n1:~# ifconfig vmbr0
vmbr0 Link encap:Ethernet HWaddr 28:80:23:a7:e6:b4
inet addr:<redacted> Bcast:<redacted> Mask:255.255.255.240
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:64744159 errors:0 dropped:663110 overruns:0 frame:0
TX packets:55549679 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:10868707945 (10.1 GiB) TX bytes:7193715867 (6.6 GiB)

root@n1:~# bc -l
bc 1.06.95
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
# % drops on vmbr0
(663110-663050)*100/(64744159-64741652)
2.39329876346230554447
# % drops on bond0
(23900-23899)*100/(64919285-64916827)
.04068348250610252237
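
For what it's worth, the same percentage can be scripted so that both deltas always come from the same interface. A minimal sketch in POSIX shell, assuming awk is available; the function names drop_pct and sample_drop_pct are made up for illustration. It uses the standard rx_dropped and rx_packets counters under /sys/class/net/<iface>/statistics and divides the drop delta by the RX packet delta of that same interface:

```shell
# drop_pct D1 D2 P1 P2
# Percentage of received packets dropped between two samples, where
# D1/D2 are the rx_dropped values and P1/P2 the rx_packets values.
drop_pct() {
    awk -v d1="$1" -v d2="$2" -v p1="$3" -v p2="$4" 'BEGIN {
        if (p2 - p1 <= 0) { print "n/a"; exit }
        printf "%.4f\n", (d2 - d1) * 100 / (p2 - p1)
    }'
}

# sample_drop_pct IFACE SECONDS
# Takes two samples of the kernel counters for IFACE, SECONDS apart.
sample_drop_pct() {
    s="/sys/class/net/$1/statistics"
    d1=$(cat "$s/rx_dropped"); p1=$(cat "$s/rx_packets")
    sleep "$2"
    d2=$(cat "$s/rx_dropped"); p2=$(cat "$s/rx_packets")
    drop_pct "$d1" "$d2" "$p1" "$p2"
}

# Counter values taken from the two ifconfig samples in this post
drop_pct 663050 663110 64741652 64744159   # vmbr0
drop_pct 23899 23900 64916827 64919285     # bond0
```

On a live node something like `sample_drop_pct vmbr0 10` would give the rate over a 10 second window instead of relying on hand-copied numbers.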

No errors/drops etc. on bond1 nor its slaves:

root@n1:~# ifconfig eth8
eth8 Link encap:Ethernet HWaddr 00:11:0a:66:4e:e4
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:2679360088 errors:0 dropped:0 overruns:0 frame:0
TX packets:3857199521 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2232821260467 (2.0 TiB) TX bytes:3005200856075 (2.7 TiB)

root@n1:~# ifconfig eth9
eth9 Link encap:Ethernet HWaddr 00:11:0a:66:4e:e5
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:4958820458 errors:0 dropped:0 overruns:0 frame:0
TX packets:3866674386 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:3864793842978 (3.5 TiB) TX bytes:2984789643489 (2.7 TiB)

root@n1:~# ifconfig bond1
bond1 Link encap:Ethernet HWaddr 6a:53:2e:33:06:94
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

But quite a few drops (one being too many) on vmbr1 (used for inter-VM traffic), yet no packet counts. Why do I see dropped packets?

root@n1:~# ifconfig vmbr1
vmbr1 Link encap:Ethernet HWaddr 00:11:0a:66:4e:e4
BROADCAST MULTICAST MTU:9000 Metric:1
RX packets:0 errors:0 dropped:9697344 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

root@n1:~# ifconfig vmbr1
vmbr1 Link encap:Ethernet HWaddr 00:11:0a:66:4e:e4
BROADCAST MULTICAST MTU:9000 Metric:1
RX packets:0 errors:0 dropped:9697516 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
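
To cross-check what ifconfig reports, the counters that OVS itself keeps for the interface can be read from the OVS database. A rough sketch, assuming ovs-vsctl is in PATH and that the statistics column prints as a `{key=value, ...}` map; ovs_stat is a hypothetical helper, not part of OVS:

```shell
# ovs_stat KEY MAP
# Extracts one counter from the map printed by
# "ovs-vsctl get Interface <name> statistics".
ovs_stat() {
    printf '%s\n' "$2" | tr -d '{} ' | tr ',' '\n' |
        awk -F= -v k="$1" '$1 == k { print $2 }'
}

# Only query the database when OVS is actually installed and reachable
if command -v ovs-vsctl >/dev/null 2>&1; then
    if stats=$(ovs-vsctl get Interface vmbr1 statistics 2>/dev/null); then
        echo "vmbr1 rx_dropped: $(ovs_stat rx_dropped "$stats")"
        echo "vmbr1 rx_packets: $(ovs_stat rx_packets "$stats")"
    fi
fi
```

`ovs-ofctl dump-ports vmbr1` should show the same kind of per-port rx/tx/drop counters for every port on the bridge, which may help tell whether the drops belong to the bridge's own internal port or to one of the VM ports.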

Wondering if standard bridges play well when connected to OVS bridges, as the PVE FW ends up doing:

root@n1:~# brctl show
bridge name     bridge id               STP enabled     interfaces
fwbr400i1       8000.52c09022faa1       no              fwln400o1
                                                        tap400i1
vmbr0           8000.288023a7e6b4       no              bond0
root@n1:~# ovs-vsctl show
953aa71d-6533-4571-9b04-15f7f18573cb
    Bridge "vmbr1"
        Port "vlan20"
            tag: 20
            Interface "vlan20"
                type: internal
        Port "vlan21"
            tag: 21
            Interface "vlan21"
                type: internal
        Port "vlan13"
            tag: 13
            Interface "vlan13"
                type: internal
        Port "vlan12"
            tag: 12
            Interface "vlan12"
                type: internal
        Port "fwln400o1"
            tag: 41
            Interface "fwln400o1"
                type: internal
        Port "tap400i0"
            tag: 40
            Interface "tap400i0"
        Port "tap400i3"
            tag: 43
            Interface "tap400i3"
        Port "tap400i4"
            tag: 44
            Interface "tap400i4"
        Port "tap400i5"
            tag: 45
            Interface "tap400i5"
        Port "vlan11"
            tag: 11
            Interface "vlan11"
                type: internal
        Port "bond1"
            Interface "eth9"
            Interface "eth8"
        Port "tap400i2"
            tag: 42
            Interface "tap400i2"
        Port "vmbr1"
            Interface "vmbr1"
                type: internal
    ovs_version: "2.5.0"

Only these drops don't seem to add up from the VM ifaces in the vmbr1 switch, they only show on the switch itself... wondering why.
 
VMs use virtio_net on NICs @vmbr1 and usually e1000 on NICs @vmbr0, except the few that have multiqueue NICs on this bridge as well.

No HN vhost threads (the HN-side workers for virtio_net's vring, a shared-memory buffer) are running maxed out on the HNs. So I don't understand the presumed packet drops in the OVS switch and the poor network connectivity under relatively heavy NW traffic.

See the attached view of top from the HNs (ovs_perf_issue_top-on-HNs.png); neither the HNs nor the VMs are more than lightly loaded IMHO.

BTW, the last apt-get update this weekend, which came with OVS 2.5.0, seems to ease the CPU load quite well, most probably due to improvements in virtio_net and/or KVM. See the attached image of our HAProxy VM after it was live-migrated back to an updated HN, after the green mark around 18:15. Makes you happy... at least. It goes from 15-20+% CPU load down to <5% for the same or even more NW traffic :)
hapAcpu_vs_netio_on_latest_pve+ovs.png

Hints appreciated on vswitch/kvm performance analysis/tuning, PVE NW BCP...
 
This NW connectivity issue causes our HAProxy to see real servers as flapping up/down, as health checks randomly fail to connect. So our customers are seeing the LB service latency flapping as well :confused:

See these samples of the HAProxy status page:
HA-flaping_realsrv.png HA-flaping_realsrv2.png
 
Stupid me :oops:

I missed events like these in our HAProxy VM at peak traffic time:

May 31 12:10:00 hapA kernel: nf_conntrack: table full, dropping packet
May 31 12:10:00 hapA kernel: nf_conntrack: table full, dropping packet
May 31 12:10:00 hapA kernel: nf_conntrack: table full, dropping packet
May 31 12:10:00 hapA kernel: nf_conntrack: table full, dropping packet
May 31 12:10:00 hapA kernel: nf_conntrack: table full, dropping packet
...
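
A quick way to see how close the table is to its limit is to compare the kernel's current entry count with the maximum. A small sketch, assuming the nf_conntrack module is loaded so that the two files under /proc/sys/net/netfilter exist; conntrack_usage is just an illustrative helper name:

```shell
# conntrack_usage COUNT MAX
# Prints the fill level of the conntrack table as an integer percentage.
conntrack_usage() {
    echo $(( $1 * 100 / $2 ))
}

ct=/proc/sys/net/netfilter
# Only read the counters when conntrack is actually loaded
if [ -r "$ct/nf_conntrack_count" ] && [ -r "$ct/nf_conntrack_max" ]; then
    count=$(cat "$ct/nf_conntrack_count")
    max=$(cat "$ct/nf_conntrack_max")
    echo "conntrack: $count of $max entries ($(conntrack_usage "$count" "$max")%)"
fi
```

Running this from cron or a monitoring check during peak hours would show whether the table is filling up before the kernel starts logging drops.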

Will try to better tune the VM's netfilter, e.g. initially with these settings:

Add to /etc/sysctl.conf:
# tune net filter to track more connections than default
# nf_conntrack_max => also raise hashsize in rc.local
net.netfilter.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_generic_timeout = 180
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 60

Add to /etc/rc.local:
# increase the conntrack hash table size to go with the nf_conntrack_max raised in sysctl.conf
echo 24576 > /sys/module/nf_conntrack/parameters/hashsize
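
As a sanity check on the two numbers above: the ratio of nf_conntrack_max to hashsize is the average number of entries per hash bucket when the table is full, and a lower ratio means shorter chains to walk per lookup. As far as I recall the kernel documentation for kernels of this era, the default is max = 4 x buckets, so treat that figure as an assumption. A tiny sketch (entries_per_bucket is a made-up helper):

```shell
# entries_per_bucket MAX HASHSIZE
# Average hash chain length when the conntrack table is completely full.
entries_per_bucket() {
    awk -v m="$1" -v h="$2" 'BEGIN { printf "%.1f\n", m / h }'
}

# Values from the settings above
entries_per_bucket 262144 24576

# To apply both settings without a reboot (as root, nf_conntrack loaded):
#   sysctl -p /etc/sysctl.conf
#   echo 24576 > /sys/module/nf_conntrack/parameters/hashsize
# and to verify what the kernel is using afterwards:
#   sysctl net.netfilter.nf_conntrack_max
#   cat /sys/module/nf_conntrack/parameters/hashsize
```

If the ratio comes out much higher than wanted, raising hashsize rather than lowering nf_conntrack_max keeps the larger table.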

Will see Monday if this isn't much better :)

Would I possibly also need to increase nf_conntrack_max at HN level, as we run iptables on the HNs?

Or do these settings disable connection tracking in iptables for bridges:
root@n1:~# more /etc/sysctl.d/pve.conf
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
fs.aio-max-nr = 1048576
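
Note that the file only shows what gets set at boot. The running values could differ if something changes them later (whether the firewall service does so is an assumption on my side, not verified here), so it may be worth reading them back from /proc on an HN. A small sketch; show_bridge_nf is a hypothetical helper and its optional directory argument exists only to make it easy to test:

```shell
# show_bridge_nf [DIR]
# Prints the running bridge netfilter settings, read from DIR
# (default: /proc/sys/net/bridge).
show_bridge_nf() {
    dir="${1:-/proc/sys/net/bridge}"
    for key in bridge-nf-call-iptables bridge-nf-call-ip6tables \
               bridge-nf-call-arptables bridge-nf-filter-vlan-tagged; do
        if [ -r "$dir/$key" ]; then
            echo "$key = $(cat "$dir/$key")"
        else
            # Entry missing, e.g. bridge netfilter support not loaded
            echo "$key = unavailable"
        fi
    done
}

show_bridge_nf
```

If bridge-nf-call-iptables reads 1 at runtime, bridged traffic does pass through iptables on the HN, and the HN's own nf_conntrack_count and nf_conntrack_max would be worth checking too.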