We have an application spread over multiple VMs across multiple Hypervisor Nodes (HNs), utilizing PVE firewalling. Some central network VMs (load balancers) have multiqueue NICs so they can handle more packets (a quick check for this is sketched below). The HNs are HP ProLiant 360 Gen9 servers, each with two NIC bonds: bond0 over two separate 1 Gbps switches, bond1 over two separate 10 Gbps switches with a 2x40 Gbps inter-switch connect.
bond0 is used for public access to the HNs and corosync ring 1, connected to vmbr0 (standard Linux bridge)
bond1 is used for public access to the application VMs and corosync ring 0, connected to vmbr1 (OVS 2.5 bridge)
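In case it matters, a rough sketch of how multiqueue can be verified end to end (VM ID 400 is taken from the interface table below; the guest NIC name eth0 and the queue count 4 are only examples):

# on the HN: the queues= setting on the VM's virtio NICs
qm config 400 | grep ^net
# inside the LB VM: channels currently offered vs. in use on the virtio NIC
ethtool -l eth0
# enable all offered queues in the guest if only one is in use
ethtool -L eth0 combined 4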
Randomly we see, for example:
- 1. Inter-VM connection attempts (after a quiet period) getting 'connection refused', only for follow-up attempts split seconds later to succeed without problems (as if some cache state needed initialization/warm-up).
- 2. Two VMs act as internal name servers, and other name-client VMs randomly see timeouts on name resolution.
The above seems to make both the health checking from the load balancers and the application traffic fluctuate, more so under heavier load (see the capture sketch below).
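A minimal sketch of what could be run on the HN carrying the destination VM while this happens (tap400i1 is taken from the interface table below; the destination/name-server IPs and the test hostname are placeholders):

# watch ARP plus TCP SYN/RST towards the destination VM's tap device
tcpdump -ni tap400i1 'arp or (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'
# conntrack entries the PVE firewall holds for the destination (conntrack-tools)
conntrack -L -d <dest-VM-IP>
# MAC learning state on both bridges
brctl showmacs vmbr0
ovs-appctl fdb/show vmbr1
# for the name-resolution timeouts: a probe loop from a client VM against one name server
while true; do date; dig +time=2 +tries=1 @<ns-IP> <internal-name> | tail -3; sleep 5; done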
Hints appreciated on debugging this!
TIA
Info snippets:
root@n1:~# pveversion --verbose
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.8-1-pve: 4.4.8-51
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
root@n1:~# netstat -in
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0 9000 0 64682282 0 23808 0 56139474 0 0 0 BMmRU
bond1 9000 0 0 0 0 0 0 0 0 0 BMRU
eth0 9000 0 58223122 0 1 0 55738773 0 0 0 BMsRU
eth4 9000 0 6459160 0 0 0 400701 0 0 0 BMsRU
eth8 9000 0 2656781604 0 0 0 3822817813 0 0 0 BMRU
eth9 9000 0 4914314019 0 0 0 3832775928 0 0 0 BMRU
fwbr400i1 9000 0 519629 0 0 0 0 0 0 0 BMRU
fwln400o1 9000 0 2831738077 0 0 0 2309932822 0 0 0 BMRU
lo 65536 0 445243 0 0 0 445243 0 0 0 LRU
tap400i0 9000 0 403787 0 0 0 221355 0 0 0 BMPRU
tap400i1 9000 0 2309661198 0 0 0 2815793048 0 107 0 BMPRU
tap400i2 9000 0 2496532145 0 0 0 2284651197 0 0 0 BMPRU
tap400i3 9000 0 426370 0 0 0 401890 0 0 0 BMPRU
tap400i4 9000 0 286914 0 0 0 136134 0 0 0 BMPRU
tap400i5 9000 0 564233 0 0 0 563157 0 0 0 BMPRU
vlan11 9000 0 244909 0 20 0 362005 0 0 0 BMRU
vlan12 9000 0 266096 0 20 0 422677 0 0 0 BMRU
vlan13 9000 0 64248249 0 0 0 59963570 0 0 0 BMRU
vlan20 9000 0 8971555 0 0 0 13944236 0 0 0 BMRU
vlan21 9000 0 9154950 0 0 0 14024124 0 0 0 BMRU
vmbr0 9000 0 64506473 0 660530 0 55353343 0 0 0 BMRU
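bond0, vmbr0 and tap400i1 show non-zero drop counters above; a rough sketch for watching whether those counters keep climbing while the problem occurs (eth8/eth9 are just the high-traffic physical NICs from the table above; adjust names as needed):

# repeat the interface table and highlight changing counters
watch -d -n 1 'netstat -in'
# per-NIC driver statistics on the 10 Gbps NICs
ethtool -S eth8 | grep -i -E 'drop|miss|err'
ethtool -S eth9 | grep -i -E 'drop|miss|err'
# RX/TX ring sizes, in case the drops point at too-small rings
ethtool -g eth8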