Initial connection refused

stefws

Renowned Member
Jan 29, 2015
Denmark
siimnet.dk
Wondering if initial connection attempts between two VMs on the same PVE 4.2 cluster being refused is caused by using the PVE firewall, and if so, whether this can be avoided. It seems like some kind of connection-state cache needs to be seeded before the connection is allowed through by the iptables rules. It happens again after idling for a while (as if a cache TTL expires). Any hints & clues appreciated, TIA!
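One way to check whether conntrack state really is the missing piece: count the conntrack entries for the peer on the PVE host before and after a refused attempt. A minimal sketch, assuming the conntrack-tools package is installed and run as root; the IP address is a placeholder:

```shell
#!/bin/sh
# Count conntrack entries involving a given peer IP on the PVE host.
# Requires conntrack-tools and root privileges.
conntrack_entries() {
    peer="$1"
    conntrack -L 2>/dev/null | grep -c "$peer"
}
# Example (placeholder address):
# conntrack_entries 10.45.69.155
```

If the count is 0 right before a refused SYN and non-zero once connections succeed, that would support the state-cache theory.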

The connection attempts below follow each other in rapid succession:

Code:
#n2:/> telnet dcs3.<redacted> 389
Trying <redacted>.155...
telnet: connect to address <redacted>.155: Connection refused
#n2:/> telnet dcs3.<redacted> 389
Trying <redacted>.155...
Connected to dcs3.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

#n2:/> telnet dcs4.<redacted> 389
Trying <redacted>.156...
telnet: connect to address <redacted>.156: Connection refused
#n2:/> telnet dcs4.<redacted> 389
Trying <redacted>.156...
Connected to dcs4.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
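Since the second attempt always succeeds, a short retry loop around the first connection hides the symptom at the application side. A generic sketch, not PVE-specific:

```shell
#!/bin/sh
# Retry a command a few times with a short pause, to paper over a
# first-connection refusal while firewall/ARP state warms up.
retry() {
    tries="$1"; shift
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && return 0        # success: stop retrying
        i=$((i + 1))
        sleep 1                 # brief pause before the next attempt
    done
    return 1                    # all attempts failed
}
# Example (hypothetical host): retry 3 nc -z dcs3.example 389
```

This is only a workaround, of course; it doesn't explain the RST.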


PVE 4.2 package versions:
root@n1:~# pveversion -verbose
proxmox-ve: 4.2-51 (running kernel: 4.4.8-1-pve)
pve-manager: 4.2-5 (running version: 4.2-5/7cf09667)
pve-kernel-4.4.8-1-pve: 4.4.8-51
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-75
pve-firmware: 1.1-8
libpve-common-perl: 4.0-62
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-17
pve-container: 1.0-64
pve-firewall: 2.0-27
pve-ha-manager: 1.0-31
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
openvswitch-switch: 2.5.0-1
 
Seems it's probably only VMs that have their default gateway via a floating IP on an HAProxy LB cluster... will investigate further
 
Only, the tested destinations are not routed via the default GW but via a directly attached NIC, which is very strange. Other VMs are not seeing the same issue, though they are configured similarly... I've got a feeling that I'm overlooking something...
 
Does anyone know what might trigger a RST reply to an initial (first-in-a-while) SYN request between two VMs' firewalled NICs attached to the same VLAN, when the same SYN succeeds only split seconds later?
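For what it's worth, a RST (rather than a silent drop) to a fresh SYN is what an iptables REJECT rule with tcp-reset produces, and pve-firewall drops packets that conntrack classifies as INVALID. An illustrative iptables-save style excerpt (chain names modelled on pve-firewall's; verify the real rules with `iptables-save | grep PVEFW`):

Code:
# Illustrative only -- check your host's actual PVEFW chains.
-A PVEFW-FORWARD -m conntrack --ctstate INVALID -j DROP
-A PVEFW-reject -p tcp -j REJECT --reject-with tcp-reset

If the first SYN is classified INVALID (or hits a reject path) while the fwbr/conntrack state is still being set up, that would match the observed refuse-then-succeed pattern.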

tcpdump of the initial connection attempt getting refused:

Code:
15:04:56.786105 IP <redacted>.185.60362 > <redacted>.154.ldap: Flags [S], seq 3402134510, win 26880, options [mss 8960,sackOK,TS val 429607411 ecr 0,nop,wscale 7], length 0
15:04:56.786377 IP <redacted>.154.ldap > <redacted>.185.60362: Flags [R.], seq 0, ack 3402134511, win 0, length 0

tcpdump of the following successful connection attempt:

Code:
15:05:11.363088 IP <redacted>.154.ldap > <redacted>.185.60378: Flags [S.], seq 3102507846, ack 3252771331, win 26844, options [mss 8960,sackOK,TS val 538726974 ecr 429621988,nop,wscale 7], length 0
15:05:11.363120 IP <redacted>.185.60378 > <redacted>.154.ldap: Flags [.], ack 1, win 210, options [nop,nop,TS val 429621988 ecr 538726974], length 0
15:05:15.480669 IP <redacted>.185.60378 > <redacted>.154.ldap: Flags [F.], seq 1, ack 1, win 210, options [nop,nop,TS val 429626106 ecr 538726974], length 0
15:05:15.480981 IP <redacted>.154.ldap > <redacted>.185.60378: Flags [F.], seq 1, ack 2, win 210, options [nop,nop,TS val 538731092 ecr 429626106], length 0
15:05:15.481004 IP <redacted>.185.60378 > <redacted>.154.ldap: Flags [.], ack 2, win 210, options [nop,nop,TS val 429626106 ecr 538731092], length 0
 
According to this it should be the remote side not listening, only that is not the case here. It seems more like some sort of cache needs to make a note of peers wanting to connect...
 
Could it be the ARP cache, maybe?

On the destination VM I'm seeing relatively frequent ARP requests like these whenever there's communication, and none otherwise, as if the server wants to ensure client peers are still there (on the same host, maybe, though the MAC address would change during live migration, and switches might need to learn this quickly):

18:56:08.164227 ARP, Request who-has <redacted>.184 tell <redacted>.154, length 28
18:56:08.164687 ARP, Reply <redacted>.184 is-at 62:38:31:33:39:39, length 46

It appears, as expected, that I've got no duplicate IP addresses assigned, so why would the peers keep sending relatively many ARP requests while exchanging communication packets?

[root@dcs2 ~]# arping -DI eth1 <redacted>.184
ARPING <redacted>.184 from 0.0.0.0 eth1
Unicast reply from <redacted>.184 [62:38:31:33:39:39] 1.122ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

#n1:/> arping -DI eth1 <redacted>.154
ARPING <redacted>.154 from 0.0.0.0 eth1
Unicast reply from <redacted>.154 [6E:FF:D6:F0:78:C6] 1.107ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
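Frequent ARP requests during active traffic are normal, by the way: the kernel re-validates a neighbour entry after roughly base_reachable_time and then probes again. A small sketch to print the relevant timers (eth1 is assumed from the output above):

```shell
#!/bin/sh
# Print the neighbour (ARP) cache timers that control how often the
# kernel re-probes a peer on a given interface.
neigh_timers() {
    iface="$1"
    # base_reachable_time_ms: entry counts as reachable for a random
    # interval around this value before the kernel probes again
    sysctl -n "net.ipv4.neigh.${iface}.base_reachable_time_ms"
    # gc_stale_time: how long a stale entry survives before collection
    sysctl -n "net.ipv4.neigh.${iface}.gc_stale_time"
}
# Example: neigh_timers eth1
```

So the periodic who-has requests by themselves don't indicate a problem.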
 
It could be an ARP-cache warm-up issue: if I do an arping first, there's no initial problem:

Code:
#n2:/> arping -I eth1 -f dcs4.<redacted>; telnet dcs4.<redacted> 389
ARPING <redacted>.156 from <redacted>.185 eth1
Unicast reply from <redacted>.156 [92:B9:56:CE:03:E6]  1.150ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
Trying <redacted>.156...
Connected to dcs4.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

But how do we avoid the initial issue in general, as it affects our applications?
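Until the root cause is found, one option consistent with the arping observation above is to pre-warm the ARP cache for the critical peers periodically (e.g. from cron). A sketch, assuming iputils arping; interface and host names are placeholders:

```shell
#!/bin/sh
# Keep ARP entries for critical peers warm so the first real connection
# does not race neighbour resolution. Hostnames below are placeholders.
prewarm_arp() {
    iface="$1"; shift
    for host in "$@"; do
        # -c 1: send a single probe; -f: finish on first reply
        arping -I "$iface" -c 1 -f "$host" >/dev/null 2>&1
    done
}
# Example cron usage: prewarm_arp eth1 dcs3.example dcs4.example
```

This only papers over the symptom, but it keeps the applications from seeing the first-connection refusal.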
 
I appear not to have the issue between two VMs with non-firewalled interfaces, so I assume it has to do with plugging the extra firewall bridges (fwbr) into the communication path somehow.

Code:
[root@speB ~]# arp | grep dcs
[root@speB ~]# telnet dcs1 389
Trying 10.45.69.16...
Connected to dcs1.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@speB ~]# arp | grep dcs
dcs1.mxi.tdcfoo          ether   5e:15:b6:7e:c7:7f   C                     eth0
[root@speB ~]#

Even works fine with just one fwbr on destination VM:

Code:
[root@speA ~]# arp | grep dcs
dcs1.<redacted>        ether   ca:ee:05:fb:87:d3   C                     eth1
dcs2.mxi.tdcfoo          ether   76:b8:cb:a4:b6:33   C                     eth0
[root@speA ~]# telnet dcs3.<redacted> 389
Trying <redacted>.155...
Connected to dcs3.<redacted>.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
[root@speA ~]# arp | grep dcs
dcs1.<redacted>        ether   ca:ee:05:fb:87:d3   C                     eth1
dcs2.mxi.tdcfoo          ether   76:b8:cb:a4:b6:33   C                     eth0
dcs3.<redacted>        ether   8e:1c:50:7e:aa:9d   C                     eth1
[root@speA ~]#

So either two fwbr devices in the path, or just the outbound VM's fwbr, seem to introduce this. The default outbound policy is allow, i.e. no outbound firewall rules are in place.
 