[SOLVED] Create cluster problem - possibly SSL related

not sure my networking skills are good enough
lol, I had help and direction. I've only seen this twice in my career. The first was in 2000, when an Adaptec network driver was corrupting a packet, but only for one machine, and only when the machine name was set to a very specific name in a backup packet (literally, if the machine was called something like win01 it worked, but if it was called win02 that one packet was corrupted and thrown away, stopping just the backup app from working). The issue was a bug in their TCP offload engine; they never fixed it, which shows why they went out of business.

The second was in the last few weeks, when it turned out TCP over IPv6 was totally broken in thunderbolt-net.

On the sender machine and receiving machine you run tcpdump -i <interfacename> ip6. Note that a broad filter like ip6 captures all IPv6 traffic, so you will collect a lot; either use a tighter filter than I did or only run the captures for the short time you do the test.
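Roughly what I ran, from memory, so treat it as a sketch (en06 / en05 are my interface names - substitute yours, and use -w instead of the redirect if you want a pcap file for Wireshark):

Code:
# on pve1 (the sender) - en06 is my thunderbolt interface, use your own name
tcpdump -i en06 ip6 > pve1.tcpdump
# on pve2 (the receiver)
tcpdump -i en05 ip6 > pve2.tcpdump
# alternative: write a binary capture you can open in Wireshark later
# tcpdump -i en06 ip6 -w pve1.pcap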

Then you get something like this:

Code:
root@pve1:~# cat pve1.tcpdump
<noise removed>
11:56:04.963688 IP6 fe80::8e:95ff:fef1:621a > fe80::1a:44ff:fe65:dbe0: ICMP6, neighbor solicitation, who has fe80::1a:44ff:fe65:dbe0, length 32
11:56:04.963726 IP6 fe80::1a:44ff:fe65:dbe0 > fe80::8e:95ff:fef1:621a: ICMP6, neighbor advertisement, tgt is fe80::1a:44ff:fe65:dbe0, length 24
11:56:05.259237 IP6 xxxx:xxxx:830:81::81.60670 > xxxx:xxxx:830:81::82.ssh: Flags [S], seq 2237794262, win 65460, options [mss 65460,sackOK,TS val 2783495228 ecr 0,nop,wscale 7], length 0

and

Code:
root@pve2:~# cat pve2.tcpdump
<noise removed>
11:56:04.962860 IP6 fe80::8e:95ff:fef1:621a > fe80::1a:44ff:fe65:dbe0: ICMP6, neighbor solicitation, who has fe80::1a:44ff:fe65:dbe0, length 32
11:56:04.963212 IP6 fe80::1a:44ff:fe65:dbe0 > fe80::8e:95ff:fef1:621a: ICMP6, neighbor advertisement, tgt is fe80::1a:44ff:fe65:dbe0, length 24

This tells you that the SSH SYN packet was handed to the driver on the sender, but never received by the destination.

Then I ran this:

Code:
PVE1 (sender)

root@pve1:~# ip -s -s link show en06
7: en06: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:1a:44:65:db:e0 brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast     
    19047333241 20301503      0       0       0       0
    RX errors:    length    crc   frame    fifo overrun
                       0      0       0       0       0
    TX:   bytes  packets errors dropped carrier collsns     
    15558035141 18392655      7       0       0       0
    TX errors:   aborted   fifo  window heartbt transns
                       0      0       0       0       2


PVE2 (destination)
root@pve2:~# ip -s -s link show en05
74: en05: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:8e:95:f1:62:1a brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast     
    15561305384 19172991      0       0       0       0
    RX errors:    length    crc   frame    fifo overrun
                       0      0       0       0       0
    TX:   bytes  packets errors dropped carrier collsns     
    19050263184 18347026      0       0       0       0
    TX errors:   aborted   fifo  window heartbt transns
                       0      0       0       0       2

Do you see the 7 TX errors? Every time I tried to SSH, those errors ticked up on the sender (not the receiver). This indicated that the driver or hardware was dropping the packets. I then contacted the owner of the code, showed them all this, and they fixed it!
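If you want to watch the counters move while you reproduce the problem, something simple like this works (just one way of doing it):

Code:
# on the sender, refresh the interface stats every second while retrying the SSH
watch -n1 "ip -s -s link show en06"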

Ideally you have 3 captures going - one on the sender, one on the receiver and one on the switch in the middle (or a 3rd sniffer machine you mirror the ports to so it sees all traffic). Then you can compare traces to see whether the packet hit the wire or not. Note I didn't do it that way because this was thunderbolt-net and there is no way to put a switch / sniffer in the middle.

Anyhoo, a little off topic, but hopefully this gives you more tests you can do if moving to a clean, vanilla, no-VLAN switch doesn't work.

--edit--
Oh, and the capture files can be loaded into Wireshark to make them easier to analyse if you have lots of entries.
 
Thanks for the above.

Tonight I moved both machines from a VLAN into the default LAN. I kept the switch and router the same. Still getting the same problem.

Next I will try with the same router but using a different switch...
 
Ok, same router but a different switch and it's working - cluster formed, woop woop!

Now you may initially think "well, it's the switch then", but there is a little more complexity to it than that.

For my router I am using OPNSense. I have 4 ports:
- Port 0 = WAN
- Port 1 = Untagged default LAN traffic
- Port 2 = Tagged VLAN traffic
- Port 3 = A new network I created here, with the other switch connected to it.

So yes it could be the switch but it could also be the new network I created in OPNSense.

Clocking off for the night now. Will ponder on this and decide what to do next.
 
As we all said from the start, take them off the VLANs / firewall / etc. - literally plug them into a lone, basic, unmanaged switch, plug a PC into the switch and just do that: 3 devices, all totally isolated, in the same physical broadcast, multicast and unicast domain. No weird packet options either - allow all packet types to all ports.
 
Fresh information...

From previous tests: port 1 of my router (untagged default LAN) connected to my switch did not work. Cluster = fail.
I have just tried port 1 connected to my old spare switch and it's working. Cluster = pass.

This rules out the router as being the issue as I made zero changes to the router. So we are left with my switch being the issue.

Anyone got any ideas of the kind of things that could be wrong with my switch? It's a new layer 3 managed switch and so has lots of fancy features that I know little about.

Conveniently there is a live demo of the web interface here if you want to see what settings are available:
http://eu.draytek.com:12540/
 
corosync doesn't require "much":
- path MTU discovery (PMTUD) should work
- UDP traffic should pass through (the port range is in our docs)
- the latency should be low (single digit ms range)
- the link should ideally be stable

since in your case the link never comes up, it likely is one of the first two that already fails (MTU or UDP traffic on knet ports)
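if you want to sanity-check those two by hand, something along these lines works (a rough sketch - 5405 is the usual knet default port, the full range is in the docs, and netcat syntax varies a bit between variants):

Code:
# 1) path MTU: a full-size, don't-fragment ping should get through
#    (assumes a 1500 MTU; 1472 bytes of payload + 28 bytes of headers = 1500)
ping -M do -s 1472 -c 3 <other-node-ip>
# 2) UDP on the knet port: start a listener on node 2, then send from node 1
#    node 2:  nc -u -l 5405
#    node 1:  echo test | nc -u <node2-ip> 5405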
 
Some new information. I disabled all these DoS settings in my switch and the two nodes can see each other. So it's one of these that's causing the problem:
[screenshot of the switch's DoS protection settings]

I'm getting closer!!! :)
 
Through trial and error it seems to be the "UDP Blat" setting.

This forum thread seems to offer some kind of answer: https://community.spiceworks.com/topic/400552-blat-attack-question

@fabian I can obviously disable this setting and that does not bother me too much, as this is a home network so it does not need all the security bells and whistles. But is there some kind of bug or flaw in the way Proxmox is doing something? If this was a production environment I may very well want UDP Blat protection. I'm wondering if Proxmox could be tweaked so DoS protection does not flag it as being a threat. It might help some of the enterprise users of Proxmox who need all the security features enabled. It would seem a shame to lose this useful information I have uncovered.

I won't mark this as solved yet as I want to move the machines back into the management VLAN and install from scratch to fully prove to myself it all works. If it does, we party :D

 
if the switch is doing detection just based on "source == destination port" then that seems like a rather crude heuristic that is bound to trip up legitimate traffic - corosync/knet sends from and to the same port, which is exactly what that check flags.
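you can verify that on one of the nodes with a capture filtered on matching ports (assuming the default knet port 5405):

Code:
# knet traffic has source port == destination port, which is what the "blat" heuristic matches
tcpdump -ni <interfacename> 'udp and src port 5405 and dst port 5405'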

in any case, the corosync network should not be exposed to the public and thus also not require DDoS protection.
 
Right then guys I have rebuilt my network after you all had me tearing it apart o_O

Cluster forms fine. So it's 100% the UDP Blat DoS protection setting in my switch that was causing it.

Thank you so much to everyone who helped me reach this point. In particular @fabian and @scyto

Now to go back in time to the start of July and think about what I was even trying to do before I hit this problem. I think it was something to do with computers.
 
If you have UDP DDoS attacks inside your production network you have larger issues to solve. I would argue UDP blat protection is mostly security theatre; if your prod network is that uncontrolled you should think about something more robust, like Suricata filtering between segments.

glad you solved it!
 
Thank you Thank you Thank you!
I have spent several days pulling my hair out on this, rebuilding hosts, and playing with config files.
It turns out that my old HP V1810-48 switch had UDP Blat DoS protection turned on. One tick box unticked and everything worked!
As a bonus, my desktop suddenly has NTP as well.
Thanks for nothing HP devs.
 
