I'm having some multicast issues on my Juniper stack network (this isn't to say the problem is Juniper related, just informational). The goal is a 10-node cluster; the servers are in place, and I've been installing Proxmox 5.4.1 through IPMI, doing a dist-upgrade, and then running a script I have that installs openvswitch, configures the network, installs my own utilities and packages, our zabbix agent, and so on. A final reboot and the node is joined in. Everything worked perfectly until node4.
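For anyone unfamiliar with the flow, the join at the end is just the standard Proxmox CLI workflow; a minimal sketch of that step (the address is a placeholder for one of my existing nodes) looks like:
Code:
# on the freshly installed node, after the dist-upgrade and the network/OVS script
# join the cluster by pointing at any existing member (placeholder address)
pvecm add xx.xx.xx.51

# sanity checks after the final reboot
pvecm status
pvecm nodes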
Nodes 1-3 get perfect multicast behavior, flooding the network for 10 minutes at a run with 0% loss.
Node4, however, fails at 100% loss. This has left me ... wait for it ... at a loss.
Note that yes, node4 did join the cluster, but only after I forced the issue with a pvecm e 1, which means it wouldn't have succeeded without that hack. Without it, the join would have been stuck at "getting quorum ..." just like the other 3 times I tried. So yes, I realize the irony of the fact that joining it this way is totally useless.
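(For anyone following along, pvecm e 1 is shorthand for pvecm expected 1, i.e. telling corosync to expect a single vote so the lone node becomes quorate. The status commands below are the obvious ones for comparing what node4 and the other nodes each think afterwards:)
Code:
# force quorum on node4 only -- the hack in question
pvecm expected 1

# then see what the cluster layer actually reports
pvecm status
corosync-quorumtool -s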
I'm not a multicast expert by any stretch, but I've dug into this pretty far, checking forum posts, and am now out of ideas. I've reinstalled node4 three times (properly removing it from the cluster each time and verifying each corosync.conf file was clean) with the same behavior every time. I even got doubtful and counted the number of filled ports on the SRX240 against the number of servers I have. Nope, all accounted for; none of them is plugged straight into the TOR switch and reaching the gateway some other way.
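For reference, the removal between reinstalls was along the lines of the standard delnode-and-cleanup sequence (run from a surviving node; node04r1 is the node's name here):
Code:
# run on a node that is still in the cluster, never on node4 itself
pvecm delnode node04r1

# confirm it is gone from the membership and from corosync.conf
pvecm nodes
cat /etc/pve/corosync.conf

# remove the leftover directory so the GUI stops listing the dead node
rm -rf /etc/pve/nodes/node04r1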
Checking the GUI shows that nodes 1-3 can see each other and report themselves online, while node4 shows as a red X on nodes 1-3. Logging into node4 shows it with a green checkmark, but nodes 1-3 with a red X.
Checking the /etc/pve/nodes directory on nodes 1-3, you see:
Code:
drwxr-xr-x 2 root www-data 0 Apr 26 20:50 node04r1
drwxr-xr-x 2 root www-data 0 Apr 25 06:12 node03r1
drwxr-xr-x 2 root www-data 0 Apr 24 19:01 node02r1
drwxr-xr-x 2 root www-data 0 Apr 24 16:27 node01r1
Checking the /etc/pve/nodes directory on node4 shows:
Code:
drwxr-xr-x 2 root www-data 0 Apr 27 14:51 node04r1
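(That lopsided view is consistent with node4 never getting quorum: /etc/pve is the pmxcfs cluster filesystem, so a node that can't reach the others over corosync only ever sees its own entry. If it helps anyone diagnose this, these are the kinds of checks I can run on node4 and post output from:)
Code:
# is the cluster filesystem mounted and its service healthy on node4?
systemctl status pve-cluster
mount | grep /etc/pve

# what corosync itself thinks about membership and the ring
pvecm status
journalctl -u corosync -u pve-cluster -n 50 --no-pager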
From nodes 1-3 (quick test shown; I have done 10+ minute tests with 0% loss)
Code:
xx.xx.xx.51 : unicast, xmt/rcv/%loss = 19/19/0%, min/avg/max/std-dev = 0.144/0.210/0.261/0.032
xx.xx.xx.51 : multicast, xmt/rcv/%loss = 19/19/0%, min/avg/max/std-dev = 0.185/0.250/0.492/0.066
xx.xx.xx.54 : unicast, xmt/rcv/%loss = 12/12/0%, min/avg/max/std-dev = 0.169/0.220/0.283/0.038
xx.xx.xx.54 : multicast, xmt/rcv/%loss = 12/12/0%, min/avg/max/std-dev = 0.177/0.234/0.288/0.033
xx.xx.xx.57 : unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.087/0.158/0.219/0.066
xx.xx.xx.57 : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.095/0.166/0.227/0.062
xx.xx.xx.51 : unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.168/0.201/0.253/0.034
xx.xx.xx.51 : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.192/0.230/0.255/0.025
xx.xx.xx.52 : unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.146/0.174/0.221/0.030
xx.xx.xx.52 : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.156/0.189/0.259/0.040
xx.xx.xx.54 : unicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.184/0.232/0.253/0.027
xx.xx.xx.54 : multicast, xmt/rcv/%loss = 6/6/0%, min/avg/max/std-dev = 0.236/0.249/0.262/0.011
From Node4
Code:
xx.xx.xx.52 : unicast, xmt/rcv/%loss = 9/9/0%, min/avg/max/std-dev = 0.155/0.199/0.218/0.022
xx.xx.xx.52 : multicast, xmt/rcv/%loss = 9/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
xx.xx.xx.54 : unicast, xmt/rcv/%loss = 11/11/0%, min/avg/max/std-dev = 0.145/0.207/0.251/0.030
xx.xx.xx.54 : multicast, xmt/rcv/%loss = 11/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
xx.xx.xx.57 : unicast, xmt/rcv/%loss = 14/14/0%, min/avg/max/std-dev = 0.174/0.216/0.310/0.044
xx.xx.xx.57 : multicast, xmt/rcv/%loss = 14/0/100%, min/avg/max/std-dev = 0.000/0.000/0.000/0.000
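(The output above is omping; for the quick runs I just start it on all nodes and Ctrl-C after a bit, and the long runs follow the form recommended in the Proxmox multicast notes. Roughly, on every node at the same time:)
Code:
# quick check: run on every node simultaneously, Ctrl-C after ~20 seconds
omping node01r1 node02r1 node03r1 node04r1

# longer test (>5 minutes) to catch IGMP snooping membership timeouts
omping -c 600 -i 1 -q node01r1 node02r1 node03r1 node04r1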
From the SRX240 config
Code:
protocols {
    igmp {
        interface vlan.0 {
            version 2;
        }
        interface vlan.2 {
            version 2;
        }
    }
    stp;
    igmp-snooping {
        vlan all;
    }
}
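It might also be worth comparing what the SRX has actually learned via snooping for node4's port versus a working node's port, and whether it sees an active querier on the VLAN. These show commands are from memory and may vary a bit by Junos version, so treat them as a starting point: membership shows per-port group state learned by snooping, and the igmp commands show querier status and groups the SRX itself sees.
Code:
show igmp-snooping membership
show igmp interface
show igmp group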
Informational
All items were double-checked by me and independently reviewed by one of our network guys
* Network is end-to-end Juniper: an EX4200 switch as the TOR driving several SRX240s
* The nodes are behind an SRX240 by themselves in a single zone, with a single VLAN comprised of every interface on the firewall
* igmp-snooping is verified enabled, as is IGMP v2 on the VLAN, at the SRX240 level
* All nodes are on the same /27 subnet
* All nodes are Proxmox 5.4.1 and have had the same post-install script run, so they're as identical as you can get
* Thinking node4 was just a crappy install, I've reinstalled it 3 times with the same result each time
* All /etc/hosts files are hard-coded, exact, verified copies of each other
* All nodes can ping each other at high rates
* Multicast can run 10+ minutes with 0% loss for nodes 1-3
* Node4 is 100% loss
* iptables is empty and disabled on all nodes, since they are behind a hardware firewall (see the node-side checks sketched after this list)
* No VLAN tagging or anything like that at the moment
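If any of this output would help, here's roughly how I'd dump the node-side multicast state on node4 and on a working node for comparison (vmbr0 stands in for whatever the actual bridge is):
Code:
# multicast groups joined on the bridge (look for the mcastaddr from corosync.conf)
ip maddr show dev vmbr0

# IGMP version/state per interface as Linux sees it
cat /proc/net/igmp

# confirm the firewall really is empty
iptables -L -n -v

# corosync's view of the totem transport and bind address
corosync-cmapctl | grep -i totem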