Initial cnx refused

stefws

Member
Jan 29, 2015
Denmark
siimnet.dk
Wondering whether initial connection attempts between two VMs on the same PVE 4.2 cluster being refused is due to using the PVE firewall, and if so, whether this can be avoided. It seems like some kind of connection [state] cache needs to be populated before traffic is allowed as the iptables rules dictate. It happens again after idling for a while (as if a cache TTL expires). Any hints & clues appreciated, TIA!
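
Since the symptom smells like missing connection-tracking state, this is roughly how I would inspect the conntrack table on the PVE host while reproducing the test (the conntrack CLI comes from the Debian "conntrack" package; the destination IP is a placeholder):

Code:
apt-get install conntrack
# list current tracking entries involving the destination VM:
conntrack -L -p tcp | grep <dest-vm-ip>
# or watch entries being created/destroyed live while running the telnet test:
conntrack -E -p tcp | grep 389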

Below, the connection attempts follow one another rapidly:

#n2:/> telnet dcs3.<redacted> 389
Trying <redacted>.155...
telnet: connect to address <redacted>.155: Connection refused
#n2:/> telnet dcs3.<redacted> 389
Trying <redacted>.155...
Connected to dcs3.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

#n2:/> telnet dcs4.<redacted> 389
Trying <redacted>.156...
telnet: connect to address <redacted>.156: Connection refused
#n2:/> telnet dcs4.<redacted> 389
Trying <redacted>.156...
Connected to dcs4.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.


PVE 4.2 versions:
root@n1:~# pveversion -verbose
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.8-1-pve: 4.4.8-51
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
 
Only the tested destinations are not routed via the default GW but via a directly attached NIC, very strange. Other VMs are not seeing the same issue, though they are set up similarly... got a feeling that I'm overlooking something...
 
Anyone know what might trigger an RST reply to an initial (first in a while) SYN request between two VMs' firewalled NICs attached to the same VLAN, when only split seconds later it doesn't?
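
For reference, the captures below can be reproduced along these lines; the interface name inside the VM and the tap device name on the host are examples from my setup, check "ip link" / "brctl show" for the real ones:

Code:
# inside the client VM, while running the failing telnet:
tcpdump -nn -i eth1 'tcp port 389 and host <dest-vm-ip>'
# on the PVE host, on the destination VM's tap device, to see whether the
# RST already shows up on the host side of the firewall bridge:
tcpdump -nn -e -i tap101i0 'tcp port 389'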

tcpdump of the initial connection attempt getting refused:

Code:
15:04:56.786105 IP <redacted>.185.60362 > <redacted>.154.ldap: Flags [S], seq 3402134510, win 26880, options [mss 8960,sackOK,TS val 429607411 ecr 0,nop,wscale 7], length 0
E..<:.@.@./V>.).>.).......s.......i..g....#....
..I.........
15:04:56.786377 IP <redacted>.154.ldap > <redacted>.185.60362: Flags [R.], seq 0, ack 3402134511, win 0, length 0
E..(..@.@.i.>.).>.)...........s.P.......

tcpdump of the following successful connection attempt:

Code:
15:05:11.363088 IP <redacted>.154.ldap > <redacted>.185.60378: Flags [S.], seq 3102507846, ack 3252771331, win 26844, options [mss 8960,sackOK,TS val 538726974 ecr 429621988,nop,wscale 7], length 0
E..<..@.@.i.>.).>.)........F..Z...h..<....#....
.R>........
15:05:11.363120 IP <redacted>.185.60378 > <redacted>.154.ldap: Flags [.], ack 1, win 210, options [nop,nop,TS val 429621988 ecr 538726974], length 0
E..4..@.@...>.).>.).......Z....G....._.....
.... .R>
15:05:15.480669 IP <redacted>.185.60378 > <redacted>.154.ldap: Flags [F.], seq 1, ack 1, win 210, options [nop,nop,TS val 429626106 ecr 538726974], length 0
E..4..@.@...>.).>.).......Z....G....._.....
.... .R>
15:05:15.480981 IP <redacted>.154.ldap > <redacted>.185.60378: Flags [F.], seq 1, ack 2, win 210, options [nop,nop,TS val 538731092 ecr 429626106], length 0
E..4.y@.@...>.).>.)........G..Z.....01.....
.bT....
15:05:15.481004 IP <redacted>.185.60378 > <redacted>.154.ldap: Flags [.], ack 2, win 210, options [nop,nop,TS val 429626106 ecr 538731092], length 0
E..4..@.@...>.).>.).......Z....H....._.....
.... .bT
 
According to this, it should mean the remote side isn't listening, only that's not the case here. It seems more like some sort of cache first needs to take note of the peers wanting to connect...
 
Could it be the ARP cache, maybe?

On the destination VM I'm seeing relatively frequent ARP requests like these whenever there's communication, otherwise not, as if the server wants to ensure the client peer is still there (on the same HN maybe; though the MAC address stays the same during live migration, switches might need to learn its new location quickly):

18:56:08.164227 ARP, Request who-has <redacted>.184 tell <redacted>.154, length 28
18:56:08.164687 ARP, Reply <redacted>.184 is-at 62:38:31:33:39:39, length 46

As expected, it appears that I've got no duplicate IP addresses assigned, so why would peers keep sending relatively many ARP requests while exchanging communication packets? (See also the neighbour-cache check sketched after the arping output below.)

[root@dcs2 ~]# arping -DI eth1 <redacted>.184
ARPING <redacted>.184 from 0.0.0.0 eth1
Unicast reply from <redacted>.184 [62:38:31:33:39:39] 1.122ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

#n1:/> arping -DI eth1 <redacted>.154
ARPING <redacted>.154 from 0.0.0.0 eth1
Unicast reply from <redacted>.154 [6E:FF:D6:F0:78:C6] 1.107ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
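
This is how I would look at the guests' neighbour cache and the kernel timers that decide how long an entry stays reachable before it goes stale (interface name is from my setup, adjust as needed):

Code:
# inside a VM: neighbour entries incl. state (REACHABLE/STALE) and usage counters
ip -s neigh show dev eth1
# kernel timers governing reachability / stale garbage collection
sysctl net.ipv4.neigh.default.base_reachable_time_ms
sysctl net.ipv4.neigh.default.gc_stale_time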
 
It does look like an ARP cache warm-up issue: if I do an arping first, there's no initial refusal:

Code:
#n2:/> arping -I eth1 -f dcs4.<redacted>; telnet dcs4.<redacted> 389
ARPING <redacted>.156 from <redacted>.185 eth1
Unicast reply from <redacted>.156 [92:B9:56:CE:03:E6]  1.150ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
Trying <redacted>.156...
Connected to dcs4.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

But how can the initial issue be avoided in general, as it affects our applications?
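
As a possible stop-gap (not a fix for the underlying fwbr behaviour) I'm considering keeping the neighbour entries warm from the client side, or pinning them outright; IP/MAC below are placeholders:

Code:
# periodic warm-up, e.g. run from cron on the client VM:
arping -c 1 -I eth1 <dest-vm-ip> >/dev/null 2>&1
# or a permanent neighbour entry (must be maintained if the MAC ever changes):
ip neigh replace <dest-vm-ip> lladdr <dest-vm-mac> dev eth1 nud permanent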
 
I appear not to have the issue between two VMs with non-firewalled interfaces, so I'm assuming it has to do with plugging the extra firewall bridges (fwbr) into the communication path somehow.

Code:
[root@speB ~]# arp | grep dcs
[root@speB ~]# telnet dcs1 389
Trying 10.45.69.16...
Connected to dcs1.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@speB ~]# arp | grep dcs
dcs1.mxi.tdcfoo          ether   5e:15:b6:7e:c7:7f   C                     eth0
[root@speB ~]#

It even works fine with just one fwbr, on the destination VM only:

Code:
[root@speA ~]# arp | grep dcs
dcs1.<redacted>        ether   ca:ee:05:fb:87:d3   C                     eth1
dcs2.mxi.tdcfoo          ether   76:b8:cb:a4:b6:33   C                     eth0
[root@speA ~]# telnet dcs3.<redacted> 389
Trying <redacted>.155...
Connected to dcs3.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@speA ~]# arp | grep dcs
dcs1.<redacted>        ether   ca:ee:05:fb:87:d3   C                     eth1
dcs2.mxi.tdcfoo          ether   76:b8:cb:a4:b6:33   C                     eth0
dcs3.<redacted>        ether   8e:1c:50:7e:aa:9d   C                     eth1
[root@speA ~]#

So either two fwbr bridges in combination, or just the outbound VM's fwbr, seem to introduce this. The default outbound policy is allow, i.e. no outbound firewall rules are in place.
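
For anyone wanting to dig into the same, this is roughly how I would dump what the PVE firewall actually installs on the host side (commands as available on PVE 4.x):

Code:
pve-firewall status
# generated rule set:
pve-firewall compile | less
iptables-save | grep -i pvefw | less
# whether bridged traffic is being passed to iptables at all:
sysctl net.bridge.bridge-nf-call-iptables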
 
