Hi, sorry for the delay, I've been crazy busy.
Watchguard Firebox M200 ?
Spec sheet says:
Firewall throughput: 3.2 Gbps
VPN throughput: 1.2 Gbps
The LB4Ms and the Dell switches have plenty of capacity.
Just to get this straight:
Right so far.
You are running 3 separate Proxmox-Clusters
Each of those "Proxmox-nodes" has 1x 10G for Ceph traffic via Dell PowerConnect switches.
Each of those "Proxmox-nodes" has 1x 1G for VM traffic via a Quanta LB4M.
Your M200's connect to your Quanta LB4Ms in order for you to get outside access to the datacenter.
You run all your public VM traffic via the Watchguard M200's?
We actually have 2 clusters of 4 nodes per C6100 (so two physical C6100 servers) running in HA mode.
Each HA node has 3x 1TB 7200 RPM drives.
Q1: How many M200's do you have? 1 per Proxmox-Cluster?
No. We have 5 racks. Generally 2 clusters (as above) per rack. Each switch (Quanta) is tied to the M200 we use for VPN traffic for our staff use. There are 5 distinct subnets, 1 per rack.
We have 2 MPLS links, 1 external connection, and another external connection... sigh. Long story short:
One datacenter is closing. We purchased the Watchguard firewalls, and literally the day after we installed the HA M300s into our primary site we were told it was closing. So I've been scrambling for months to prep for the end-of-February closure of that datacenter, but basically what we have now is not what we will have.
Current:
Internal company VPN access: M200 (also main connectivity).
MPLS x1, with HA M300s per link.
Planned:
MPLS x2, with HA M300s per link.
Main connectivity: HA M300s.
Internal company VPN access: M200.
As you can see, I'm running our external traffic out through the M200 but the production servers are still in the other DC. We are virtualizing that junk into our new primary DC, and we'll move/reconfigure the two sets of HA M300s from there to the new primary DC once we go live with the VMs. So basically besides internal traffic, we have no traffic. lol
Q2: Are the M200's connected to the Dell PowerConnects via the Quanta LB4M 10G uplink ports?
No. The PowerConnects are isolated, Ceph-traffic-only switches.
The 10G uplink ports are daisy-chained across each rack from switch to switch, i.e. LB4M to LB4M.
At the end of the line, we have 5 gigabit ports on the M200 bonded to 5 ports on the LB4M in that rack.
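Per node, by the way, the split is nothing fancy; in /etc/network/interfaces it looks roughly like this (interface names and addresses below are made up, just to show the Ceph/VM separation):
Code:
# 1G NIC -> Quanta LB4M, bridged for VM traffic
auto vmbr0
iface vmbr0 inet static
        address 10.10.1.11
        netmask 255.255.255.0
        gateway 10.10.1.1
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

# 10G NIC -> Dell PowerConnect, Ceph only (no gateway, isolated subnet)
auto eth2
iface eth2 inet static
        address 10.10.100.11
        netmask 255.255.255.0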
Q3: Are you talking 4 C6100's per cluster (16 Proxmox nodes each), or 4 Proxmox nodes per cluster (1 C6100)?
8 total nodes, with 4 nodes per server. We started with 1, and added a 2nd C6100 for redundancy.
Q4: How many OSDs do you have per "Proxmox-Cluster"? What capacities?
Each OSD is 1 TB, so an entire drive is being used. 2 per node, so 14 OSDs, with 7 monitors.
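For what it's worth, these are the standard Ceph commands I'd use to double-check those counts and see whether anything is backfilling (nothing specific to our setup):
Code:
# overall health, mon quorum, and whether PGs are recovering/backfilling
ceph -s
# OSD layout per host and up/in state
ceph osd tree
# raw vs. used capacity per pool
ceph df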
Q5: Do any of the VMs do inter-VM communication?
Yes, they are a server stack: 3 Windows VMs talking to each other (JBoss, SQL).
Q6: Do you have the chance to "isolate" a single one of those nodes and have a look at what it is doing when it comes online? E.g. inbound/outbound traffic, syslog messages, etc.
Goodness. I wish my company had hired the extra 2 VM staff I requested 5 months ago. Sadly they have not. So I probably could do so on a distinct setup here in the office, but hey we're moving offices next week and I've been scrambling to make sure the new office is ready.
I'd like to point out we also have about 10 older C6100 Proxmox servers that are running as individual Proxmox nodes.
Two of our systems in two racks are working fine; these are production machines running 1 of the 2 C6100 servers. When we add the 2nd server, things go belly up.
Q7: Do you have a monitoring system you could use to see which "VM / node / service" generates the high traffic, or even better yet where the high traffic on your M200's comes from?
There doesn't appear to be any high traffic. I looked; all of the gigabit links are running at 1% or less. lol That's why this is driving me crazy.
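If I ever do need to pin down which VM or node a spike comes from, the quick-and-dirty check from a node itself would be something like this (assuming iftop is installed; vmbr0 is just the usual default bridge name):
Code:
# cumulative RX/TX byte and error counters per interface
ip -s link show vmbr0
# live per-connection view of traffic crossing the VM bridge
iftop -i vmbr0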
Q8: Do you have a link to the Manufacturer site/model where one could look at the specs for the LB4M's ?
Sort of. These are the best places for info:
https://forums.servethehome.com/ind...-lb4m-48-port-gigabit-switch-discussion.3248/
First post is a link to the manual.
Q9: Linux native vmbrX's or Open vSwitch used on the Proxmox nodes?
Windows. I'm in a Windows shop. The only Linux servers are IT ones (Proxmox, etc.).
Q10: You do not have a single node of the Proxmox cluster configured to run any sort of backend via your M200's, right? Like, e.g., a single node in your office basement used as disaster recovery or some such thing?
Nope. We will eventually do this in our secondary datacenter, once the old (primary) one closes and we get into the new one.
Q11: You do not use the M200's to filter traffic between your clusters' Proxmox nodes, right?
Code:
Proxmox1(VPN) <-> LB4M <-> M200 <-> Lb4M <-> Proxmox2 (VPN)
They are on VLANs.
Be that Proxmox or Ceph traffic. You use the M200's solely as a perimeter firewall to guard your internal (private) datacenter network from the outside (other clients in the datacenter, the regular www), right?
Correct. No other clients in DC are sharing our internet.
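The separation is handled with VLANs on the switches and nodes rather than by the M200s. Where a tag is carried all the way down to a Proxmox node, it looks something like this in /etc/network/interfaces (VLAN ID made up; this is the generic ifupdown style, not necessarily exactly what every rack here does):
Code:
# VLAN 20 tagged on the 1G NIC, bridged for VMs in that VLAN
auto eth0.20
iface eth0.20 inet manual

auto vmbr0v20
iface vmbr0v20 inet manual
        bridge_ports eth0.20
        bridge_stp off
        bridge_fd 0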
Some thoughts I have on what it could be:
1. Could be Ceph trying to balance the Cluster via your 1G links or even worse via your M200s (suggests broken or misconfigured configs)
Some of that was happening on the initial setup, because the Ceph HA guide we bought is sadly incorrect on this point. We have built the others correctly and removed that one from service.
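The thing to verify there, roughly speaking, is that public_network and cluster_network in ceph.conf both sit on the isolated 10G subnet riding the PowerConnects, so nothing can rebalance over the 1G/VM side; the subnet below is just a stand-in:
Code:
# /etc/ceph/ceph.conf (relevant lines only; 10.10.100.0/24 stands in for the 10G Ceph subnet)
[global]
        public_network = 10.10.100.0/24
        cluster_network = 10.10.100.0/24

# quick check of what is actually configured
grep -E 'public_network|cluster_network' /etc/ceph/ceph.conf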
2. Could be Proxmox-HA trying to move VMs back to where they were supposed to be running (once you move the "rogue" node back into the cluster) via the 1G links over your M200s (suggests misconfiguration)
The M200s do not resolve DNS. I suggested we isolate the individual servers to talk directly to the LB4Ms (which I believe can resolve DNS) and isolate each rack from each other. This was always planned, but being overloaded with tasks I haven't been able to get ahead of anything and actually get a federated AD system up to help with this.
3. Could be the Proxmox-node spamming your M200s via LB4M's with some sort of mass multicast traffic (could be all sorts of things)
I think it might have been, but I really have no idea. We have more clusters up now than ever before, and the box isn't blowing up.
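If it ever resurfaces, the usual sanity check for multicast between cluster nodes (corosync typically relies on it) is omping, run on all the nodes at the same time; the hostnames below are placeholders:
Code:
# run simultaneously on every node in the cluster
omping -c 600 -i 1 -q node1 node2 node3
# 0% loss on the multicast lines means multicast is healthy between those nodes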
4. Could be a couple of VMs (for whatever reason) located on the "cluster" trying to move mass amounts of traffic via your M200's.
Agreed. I think the incorrect firmware combined with a bad scanner update to take things down. Once patched, the problem seemed to go away, but we had already blown the 2nd offending Proxmox server cluster away as a precaution.
In any case, it's obvious it's traffic coming from behind your M200's, not from outside, based on your symptom description.
I agree.
Thanks for the excellent questions; my apologies for not answering sooner.