[SOLVED] Create cluster problem - possibly SSL related

not sure my networking skills are good enough
lol, I had help and direction. I've only seen this twice in my career. The first was in 2000, when an Adaptec network driver was corrupting a packet, but only for one machine, and only when the machine name was set to a very specific name in a backup packet. Literally, if the machine was called something like win01 it worked, but if it was called win02 that one packet was corrupted and thrown away, which stopped just a backup app from working. The issue was a bug in their TCP offload engine. They never fixed it, which shows why they went out of business.

The second was in the last few weeks, when it turned out TCP over IPv6 was totally broken in thunderbolt-net.

On both the sending machine and the receiving machine you run tcpdump -i <interfacename> ip6. Note that without a filter you will collect a lot of traffic, so only run the captures for the short time you do a test. Don't blindly copy the ip6 filter like I used, though: it captures only IPv6 traffic, so pick a filter that matches what you are chasing.

Then you get something like this:

Code:
root@pve1:~# cat pve1.tcpdump
<noise removed>
11:56:04.963688 IP6 fe80::8e:95ff:fef1:621a > fe80::1a:44ff:fe65:dbe0: ICMP6, neighbor solicitation, who has fe80::1a:44ff:fe65:dbe0, length 32
11:56:04.963726 IP6 fe80::1a:44ff:fe65:dbe0 > fe80::8e:95ff:fef1:621a: ICMP6, neighbor advertisement, tgt is fe80::1a:44ff:fe65:dbe0, length 24
11:56:05.259237 IP6 xxxx:xxxx:830:81::81.60670 > xxxx:xxxx:830:81::82.ssh: Flags [S], seq 2237794262, win 65460, options [mss 65460,sackOK,TS val 2783495228 ecr 0,nop,wscale 7], length 0

and

Code:
root@pve2:~# cat pve2.tcpdump
<noise removed>
11:56:04.962860 IP6 fe80::8e:95ff:fef1:621a > fe80::1a:44ff:fe65:dbe0: ICMP6, neighbor solicitation, who has fe80::1a:44ff:fe65:dbe0, length 32
11:56:04.963212 IP6 fe80::1a:44ff:fe65:dbe0 > fe80::8e:95ff:fef1:621a: ICMP6, neighbor advertisement, tgt is fe80::1a:44ff:fe65:dbe0, length 24

This tells you that the SSH packet was sent to the driver on the sender, but never received by the destination.
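
A rough sketch of that comparison in Python, for when the captures are too long to eyeball. It assumes tcpdump's default text output (each packet line starts with an HH:MM:SS.ffffff timestamp) and simply strips the timestamps, since the two machines' clocks won't agree. The function names and the SENDER/RECEIVER samples are made up for illustration; the sample lines are the ones from the captures above.

```python
import re

# tcpdump's default text output starts each packet line with HH:MM:SS.ffffff
_TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d+\s+")


def packet_lines(capture_text):
    """Return packet descriptions with the leading timestamp stripped.

    Lines that do not start with a timestamp (prompts, notes) are ignored.
    """
    packets = []
    for line in capture_text.splitlines():
        line = line.strip()
        if _TIMESTAMP.match(line):
            packets.append(_TIMESTAMP.sub("", line))
    return packets


def missing_on_receiver(sender_text, receiver_text):
    """Packets the sender logged that never appear in the receiver's capture."""
    seen = set(packet_lines(receiver_text))
    return [p for p in packet_lines(sender_text) if p not in seen]


# Sample data copied from the two captures in this thread.
SENDER = """\
11:56:04.963688 IP6 fe80::8e:95ff:fef1:621a > fe80::1a:44ff:fe65:dbe0: ICMP6, neighbor solicitation, who has fe80::1a:44ff:fe65:dbe0, length 32
11:56:04.963726 IP6 fe80::1a:44ff:fe65:dbe0 > fe80::8e:95ff:fef1:621a: ICMP6, neighbor advertisement, tgt is fe80::1a:44ff:fe65:dbe0, length 24
11:56:05.259237 IP6 xxxx:xxxx:830:81::81.60670 > xxxx:xxxx:830:81::82.ssh: Flags [S], seq 2237794262, win 65460, options [mss 65460,sackOK,TS val 2783495228 ecr 0,nop,wscale 7], length 0
"""

RECEIVER = """\
11:56:04.962860 IP6 fe80::8e:95ff:fef1:621a > fe80::1a:44ff:fe65:dbe0: ICMP6, neighbor solicitation, who has fe80::1a:44ff:fe65:dbe0, length 32
11:56:04.963212 IP6 fe80::1a:44ff:fe65:dbe0 > fe80::8e:95ff:fef1:621a: ICMP6, neighbor advertisement, tgt is fe80::1a:44ff:fe65:dbe0, length 24
"""

for packet in missing_on_receiver(SENDER, RECEIVER):
    print("sent but never received:", packet)
```

With the sample data the only line reported is the SSH SYN. This is a plain text match, so anything that legitimately changes in flight (for example a rewritten address) would show up as a false "missing" packet.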

then i did this

Code:
PVE1 (sender)

root@pve1:~# ip -s -s link show en06
7: en06: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:1a:44:65:db:e0 brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast     
    19047333241 20301503      0       0       0       0
    RX errors:    length    crc   frame    fifo overrun
                       0      0       0       0       0
    TX:   bytes  packets errors dropped carrier collsns     
    15558035141 18392655      7       0       0       0
    TX errors:   aborted   fifo  window heartbt transns
                       0      0       0       0       2


PVE2 (destination)
root@pve2:~# ip -s -s link show en05
74: en05: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 02:8e:95:f1:62:1a brd ff:ff:ff:ff:ff:ff
    RX:   bytes  packets errors dropped  missed   mcast     
    15561305384 19172991      0       0       0       0
    RX errors:    length    crc   frame    fifo overrun
                       0      0       0       0       0
    TX:   bytes  packets errors dropped carrier collsns     
    19050263184 18347026      0       0       0       0
    TX errors:   aborted   fifo  window heartbt transns
                       0      0       0       0       2

Do you see the 7 TX errors? Every time I tried to SSH, those errors ticked up on the sender (not the receiver). This indicated the driver or hardware was dropping the packets. Then I contacted the owner of the code, showed them all this, and they fixed it!

Ideally you have 3 captures going: one on the sender, one on the receiver and one on the switch in the middle (or a 3rd sniffer machine you mirror the ports to so it sees all traffic). Then you can compare traces to see if the packet hit the wire or not. Note I didn't do it that way because this was thunderbolt-net and there is no way to put a switch / sniffer in the middle.

Anyhoo, a little off topic, but hopefully this gives you more tests you can do if moving to a clean vanilla no-VLAN switch doesn't work.
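
To catch the counters ticking around a single test, one option is to snapshot them before and after instead of re-reading the ip output by eye. A minimal sketch, assuming a Linux host where the kernel exposes per-interface counters under /sys/class/net/<iface>/statistics; the helper names and the COUNTERS selection are hypothetical, not from any tool used above.

```python
from pathlib import Path

# Counter files the kernel exposes per interface; a subset chosen for this sketch.
COUNTERS = (
    "tx_packets", "tx_errors", "tx_dropped",
    "rx_packets", "rx_errors", "rx_dropped",
)


def read_counters(iface, root="/sys/class/net"):
    """Snapshot the selected counters for one interface as a dict of ints."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def counter_delta(before, after):
    """Return only the counters that changed, with how much they moved."""
    return {
        name: after[name] - before[name]
        for name in before
        if after[name] != before[name]
    }
```

Usage would be something like before = read_counters("en06"), then run the failing SSH attempt, then print(counter_delta(before, read_counters("en06"))). If tx_errors moves on the sender while rx_packets on the receiver does not, that points at the sending driver or hardware, as in the post above.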

--edit--
oh and the cap files can be loaded into wireshark to make it easier to analyze if you have lots of entries.
 
Thanks for the above.

Tonight I moved both machines from a VLAN into the default LAN. I kept the switch and router the same. Still getting the same problem.

Next I will try with the same router but using a different switch...
 
OK, same router but different switch and it's working - cluster formed, woop woop!

Now you may initially think "well, it's the switch then", but there is a little more complexity to it than that.

For my router I am using OPNSense. I have 4 ports:
- Port 0 = WAN
- Port 1 = Untagged default LAN traffic
- Port 2 = Tagged VLAN traffic
- Port 3 = Created a new network on here and connected the other switch.

So yes it could be the switch but it could also be the new network I created in OPNSense.

Clocking off for the night now. Will ponder on this and decide what to do next.
 
As we all said from the start, take them off the VLANs / firewall / etc. Literally plug them into a lone basic unmanaged switch, plug a PC into the switch, and just do that: 3 devices all totally isolated in the same physical broadcast, multicast and unicast domain. No weird packet options either, allow all packet types to all ports.
 
Fresh information...

From previous tests: port 1 of my router (untagged default LAN) connected to my switch did not work. Cluster = Fail.
I have just tried port 1 connected to my old spare switch and it's working. Cluster = Pass.

This rules out the router as being the issue, as I made zero changes to the router. So we are left with my switch being the issue.

Anyone got any ideas of the kind of things that could be wrong with my switch? It's a new layer 3 managed switch and so has lots of fancy features that I know little about.

Conveniently there is a live demo of the web interface here if you want to see what settings are available:
http://eu.draytek.com:12540/
 
corosync doesn't require "much":
- pmtud should work
- UDP traffic should pass through (the port range is in our docs)
- the latency should be low (single digit ms range)
- the link should ideally be stable

since in your case the link never comes up, it likely is one of the first two that already fails (MTU or UDP traffic on knet ports)
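
A quick way to check the "UDP traffic should pass through" point without corosync in the picture is a throwaway UDP probe between the two nodes. This is only a sketch: the helper names and the probe payload are invented here, and it says nothing about PMTUD or latency. The optional src_port lets the probe leave from a fixed source port, which matters if something in the path inspects ports.

```python
import socket


def open_listener(bind_addr, port, timeout=2.0):
    """Bind a UDP socket that waits up to `timeout` seconds for a probe."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.bind((bind_addr, port))
    return sock


def send_probe(dst_addr, dst_port, payload=b"udp-probe", src_port=0):
    """Send one UDP datagram; src_port=0 lets the OS pick an ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", src_port))
        sock.sendto(payload, (dst_addr, dst_port))


def probe_arrived(listener, expected=b"udp-probe"):
    """Return True if the expected payload shows up before the timeout."""
    try:
        data, _peer = listener.recvfrom(2048)
    except socket.timeout:
        return False
    return data == expected
```

On the receiving node you would call open_listener("0.0.0.0", port) and then probe_arrived(...), and on the other node send_probe(<receiver address>, port), using a port from the range in the docs (corosync has to be stopped on the receiver, otherwise the port is already taken). Running it once with the default src_port and once with src_port equal to the destination port shows whether the network treats the two cases differently.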
 
Some new information. I disabled all these DoS settings in my switch and the two nodes can see each other. So it's one of these that's causing the problem:
[attached screenshot: 1695481081365.png]

I'm getting closer!!! :)
 
Through trial and error it seems to be the "UDP Blat" setting.

This forum thread seems to offer some kind of answer: https://community.spiceworks.com/topic/400552-blat-attack-question

@fabian I can obviously disable this setting and that does not bother me too much, as this is a home network and so does not need all the security bells and whistles. But is there some kind of bug or flaw in the way Proxmox is doing something? If this was a production environment I may very well want UDP Blat protection. I'm wondering if Proxmox could be tweaked so DoS protection does not flag it as being a threat. It might help some of the enterprise users of Proxmox who need all the security features enabled. It would seem a shame to lose this useful information I have uncovered.

I won't mark this as solved yet, as I want to move the machines back into the management VLAN and install from scratch to fully prove to myself it all works. If it does, we party :D

 
if the switch is doing detection just based on "source==destination port" then that seems like a rather crude heuristic that is bound to trip up..
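
To make the heuristic concrete, here is a minimal sketch of what a check based purely on "source==destination port" amounts to. It is an illustration, not the switch's actual logic (which is not known here). The port numbers are assumptions: 5405 as a corosync/knet port with the node sending from the same port it listens on, and 123 for classic NTP where client and server both use port 123.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UdpDatagram:
    """Just the two header fields the crude heuristic looks at."""
    src_port: int
    dst_port: int


def looks_like_udp_blat(pkt):
    """Crude check: flag any datagram whose source and destination ports match."""
    return pkt.src_port == pkt.dst_port


# Assumed example traffic, see the note above about the port numbers.
examples = {
    "corosync/knet node to node": UdpDatagram(5405, 5405),
    "classic NTP": UdpDatagram(123, 123),
    "DNS query from an ephemeral port": UdpDatagram(40000, 53),
}

for name, pkt in examples.items():
    verdict = "dropped" if looks_like_udp_blat(pkt) else "passed"
    print(f"{name}: {verdict}")
```

Under those assumptions, perfectly legitimate peer-to-peer UDP gets flagged just for using the same port on both ends, which is why a check this simple is bound to trip up.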

in any case, the corosync network should not be exposed to the public and thus also not require DDoS protection.
 
Right then guys, I have rebuilt my network after you all had me tearing it apart o_O

Cluster forms fine. So it's 100% the UDP Blat DoS protection setting in my switch that was causing it.

Thank you so much to everyone who helped me reach this point. In particular @fabian and @scyto

Now to go back in time to the start of July and think what I was even trying to do before I hit this problem. I think it was something to do with computers.
 
If you have UDP DDoS attacks in your production network you have larger issues to solve. I would argue UDP Blat is mostly security theatre. If your prod network is that uncontrolled, you should think about something more robust, like Suricata filtering between segments.

glad you solved it!
 
Thank you Thank you Thank you!
I have spent several days pulling my hair out on this, rebuilding hosts, and playing with config files.
It turns out that my old HP V1810-48 switch had UDP Blat DoS protection turned on. One tick box unticked and everything worked!
As a bonus, my desktop suddenly has NTP as well.
Thanks for nothing HP devs.
 
Just had this same problem on a new 8.2 cluster. It was not, as far as I could tell, related to a network issue. I ended up linking the two initial machines directly by cable to eliminate the switch entirely, and that still did not resolve the issue.

However, based on what was getting logged, it looked like it was actually a corosync issue of some kind. Based on this thread https://forum.proxmox.com/threads/a...after-upgrade-5-4-to-6-0-4.56425/#post-260570 I thought I would try the solution indicated by Shturman:
Code:
totem {
  cluster_name: tv
  config_version: 19
  interface {
    bindnetaddr: 10.10.10.0
    ringnumber: 0
    knet_transport: sctp
  }
  ip_version: ipv4
  secauth: on
  version: 2
  token: 10000
}

That is, add 'token: 10000' and 'knet_transport: sctp' to the totem sections of the corosync.conf file(s) on the initial cluster node. This solved the issue immediately. Hopefully this will help someone coming across the same problem.
 
I'm having this same exact issue. @smokedpixel I tried updating the corosync config and it did not work for me. I've verified my servers can communicate with each other by SSHing into each of them from the other server.
 
I know this is old, but were you able to resolve it? I have the same problem. One thing to note is that I followed the Full Mesh for Ceph guide using the RSTP Loop setup.
I am using pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.8.12-5-pve)
 