VLAN / Networking

spleenftw

Member
Oct 6, 2023
Hello guys,

I'm planning to build a 3-node Proxmox cluster and I have a few questions:

VLANs:
5 : 10.100.5.0/24 - COROSYNC
10 : 10.100.10.0/24 - MANAGEMENT
15 : 10.100.15.0/24 - VM ADMIN


Each node will have 2 NICs:
- one for Corosync and Ceph, 1G, VLAN 5
- one for the node itself, 2.5G, VLAN aware (10 & 15)

I want a management IP for each node on VLAN 10 (which will not be reachable from VLAN 15), and to put all of my admin VMs (DNS, Docker, nginx, whatever) on VLAN 15 so that they sit in their own VLAN and cannot reach the hosts.

I was wondering if I only need to create 2 virtual bridges:
- one with the address 10.100.10.x/24 + GW
- a second one, VLAN aware, with no address, and simply attach this vmbr to the LXC/VMs and tag VLAN 15


On my switch, the port for the management NIC should not be a full trunk but tagged with VLANs 10 and 15, right?

Thanks !
 
one for Corosync and Ceph, 1G, VLAN 5
Corosync should have its own NIC with nothing else on it that might saturate the link (so no Ceph, migration, backups, ...). Ceph needs good NICs and multiple OSDs per node (starting with 3-4 SSDs). It doesn't make sense with less than 10Gbit, and you'd better have 25+Gbit for anything other than testing. Please read these articles in full to understand the Ceph and cluster basics and requirements:
https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster
https://pve.proxmox.com/wiki/Cluster_Manager#_requirements
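As a rough sketch of how that dedicated corosync link is set up (the cluster name, node names and 10.100.5.x addresses below are placeholders for your VLAN 5 subnet):

  pvecm create mycluster --link0 10.100.5.1       # first node, corosync bound to the 1G NIC
  pvecm add 10.100.10.1 --link0 10.100.5.2        # joining node: existing node's address, then its own VLAN 5 address for link0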

I was wondering if I only need to create 2 virtual bridges:
- one with the address 10.100.10.x/24 + GW
- a second one, VLAN aware, with no address, and simply attach this vmbr to the LXC/VMs and tag VLAN 15
You don't need two bridges. Use a single VLAN aware bridge and then a VLAN interface (like "vmbr1.10") on top, where you set your gateway and IP.

On my switch, the port for the management NIC should not be a full trunk but tagged with VLANs 10 and 15, right?
Yes, tagged with 10 and 15.
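A minimal /etc/network/interfaces sketch of that layout on one node (the NIC name eno1, the node address 10.100.10.11 and the gateway are placeholders; the bridge is named vmbr0 here, matching the later posts):

  auto eno1
  iface eno1 inet manual

  auto vmbr0
  iface vmbr0 inet manual
          bridge-ports eno1
          bridge-stp off
          bridge-fd 0
          bridge-vlan-aware yes
          bridge-vids 10 15

  auto vmbr0.10
  iface vmbr0.10 inet static
          address 10.100.10.11/24
          gateway 10.100.10.1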
 
Hey,

Thanks for your detailed answer!

I don't have 10G NICs, only a 1G and a 2.5G one. I can use the 2.5G one for the cluster interface.

While I understand the rest, I am not sure about this part:
You don't need two bridges. Use a single VLAN aware bridge and then a VLAN interface (like "vmbr1.10") on top, where you set your gateway and IP.

So I just create the VLAN aware bridge vmbr0, then a vmbr0.10 with its management IP + GW, and attach vmbr0 with a tagged VLAN to each VM?
 
So I just create the VLAN aware bridge vmbr0, then a vmbr0.10 with its management IP + GW, and attach vmbr0 with a tagged VLAN to each VM?
Yes.
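For example (a sketch; the VMID/CTID and the container's eth0 settings are placeholders), tagging VLAN 15 on a guest's virtual NIC:

  qm set 100 --net0 virtio,bridge=vmbr0,tag=15
  pct set 101 --net0 name=eth0,bridge=vmbr0,tag=15,ip=dhcp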
I don't have 10G NICs, only a 1G and a 2.5G one. I can use the 2.5G one for the cluster interface.
Then I would consider better suited options like ZFS replication instead of Ceph. But here a third NIC might be good to have too, for the replication traffic, so it won't interfere with your VM traffic when it is syncing all disks every minute.
 
Yes.

Then I would consider better suited options like ZFS replication instead of Ceph. But here a third NIC might be good to have too, for the replication traffic, so it won't interfere with your VM traffic when it is syncing all disks every minute.
Ceph with a 2.5G network won't work well enough?
 
Ceph with a 2.5G network won't work well enough?
Yes, see the manual. Usually you would use 40Gbit + 10Gbit + 1 Gbit. Or even 100Gbit if you care about performance.
You are way below the minimum that would make sense for any productive use. If you can't add at least one dual-port 10Gbit NIC (better get 40Gbit... on the used market they are not that expensive... google for ConnectX-3... but they are already EoL and it's unclear how long they will keep working) I wouldn't consider Ceph.
 
Yes, see the manual. Usually you would use 40Gbit + 10Gbit + 1 Gbit. Or even 100Gbit if you care about performance.
You are way below the minimum that would make sense for any productive use. If you can't add at least one dual-port 10Gbit NIC (better get 40Gbit... on the used market they are not that expensive... google for ConnectX-3... but they are already EoL and it's unclear how long they will keep working) I wouldn't consider Ceph.
I'm kinda limited since it'll be 3 OptiPlex Micro boxes, and I'm getting an M.2 2230 to 2.5G adapter to add a second NIC to each OptiPlex.
It's for homelab use, I don't have a lot of VMs that write tons of data, so I thought 2.5G would be enough to write to Ceph between the 3 OptiPlex NVMe drives.

I'm not sure I can fit a 10G SFP+ / RJ45 card into those OptiPlex Micro boxes.
 
Even with fast enough NICs... you need multiple SSDs as OSDs per node for it to make sense... which you probably won't fit in those thin clients...
Thin clients might be fine for learning/testing, but the hardware is too weak to do anything with Ceph you could rely on (even in a homelab).
 
Even with fast enough NICs... you need multiple SSDs as OSDs per node for it to make sense... which you probably won't fit in those thin clients...
I was simply going to use one 2TB NVMe per node and put them together as OSDs. I did some tests on PVE VMs and it wasn't doing so bad tho. I'm gonna have to replan all of this then, damn.
 
Those PVE VMs then probably also used bridges without packets ever leaving the 2.5Gbit NICs, so those VMs were connected at 20-40Gbit via virtio NICs? Did you also test what happens in case an OSD fails or a node goes down and Ceph is then rebalancing?

I was simply going to use one 2TB NVMe per node and put them together as OSDs. I did some tests on PVE VMs and it wasn't doing so bad tho. I'm gonna have to replan all of this then, damn.
I would stick with ZFS for that if you can live with losing 1 minute of data in case a node fails. ZFS replication is replicated local storage, so it will be way faster when using slow NICs, as reading/writing data won't happen over the network. But in both cases (Ceph or ZFS) you usually want enterprise/datacenter grade SSDs with PLP, and there all 2TB models will require M.2 22110 slots or U.2/U.3. So something in the range of a Samsung PM983 or similar. Often such mini PCs will only support up to M.2 2280.
 
Those PVE VMs then probably also used bridges without packets ever leaving the 2.5Gbit NICs, so those VMs were connected at 20-40Gbit via virtio NICs? Did you also test what happens in case an OSD fails or a node goes down and Ceph is then rebalancing?


I would stick with ZFS for that if you can live with losing 1 minute of data in case a node fails. ZFS replication is replicated local storage, so it will be way faster when using slow NICs, as reading/writing data won't happen over the network. But in both cases (Ceph or ZFS) you usually want enterprise/datacenter grade SSDs with PLP, and there all 2TB models will require M.2 22110 slots or U.2/U.3. So something in the range of a Samsung PM983 or similar.

Oh, that's probably the case for the virtio NICs, you're right. But it was still writing to the consumer NVMe drives (980 Pro) that I was gonna put in my 3 new nodes.

Damn, I wanted to play around with Ceph on the 2.5G NIC / 980 Pro NVMe...
I gotta look into ZFS replication I guess? How does it work tho? Because I know ZFS locally, but in a cluster?
 
I gotta look into ZFS replication I guess? How does it work tho? Because I know ZFS locally, but in a cluster?
Every VM will access the local ZFS pool. PVE will sync those pools incrementally every minute so all pools store the same data (+/- one minute). It's not shared storage, but replication will be fast (as only the last minute of changes has to be transferred) and HA will work (losing the last minute of data).
So with 3x 2TB across the nodes you get ~1.6TB of storage for all 3 nodes combined.

And while consumer SSDs like the 980 Pro will work, it's not recommended. ZFS and virtualization cause massive write amplification and might shred your SSDs very fast, depending on the workload. Especially when there are lots of sync writes, which a consumer SSD without PLP can't cache in DRAM and therefore can't optimize before writing to the NAND. In addition to the bad durability, those consumer SSDs will also be orders of magnitude (like 500x) slower for sync writes because of the missing PLP. Better to get some proper SSDs in the first place than to buy twice. ;)
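A sketch of what that replication looks like on the CLI (the VMID 100, target node name pve2 and every-minute schedule are placeholders; the same can be configured in the GUI under Datacenter > Replication):

  pvesr create-local-job 100-0 pve2 --schedule "*/1"   # replicate VM 100 to node pve2 every minute
  pvesr status                                         # show last sync time, duration and any errors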
 
Every VM will access the local ZFS pool. PVE will sync those pools incrementally every minute so all pools store the same data (+/- one minute). It's not shared storage, but replication will be fast (as only the last minute of changes has to be transferred) and HA will work (losing the last minute of data).
Oh, so in theory:

I build my ZFS pool with my 980 Pro NVMe locally on each node, then replicate between the 3 nodes over the 2.5G link every minute?
While Ceph will constantly write to the shared pool on the 3 disks through the network?

Isn't it the same, except one does it every minute and the other one for every "bit"?
 
Every VM will access the local ZFS pool. PVE will sync those pools incrementally every minute so all pools store the same data (+/- one minute). It's not shared storage, but replication will be fast (as only the last minute of changes has to be transferred) and HA will work (losing the last minute of data).
So with 3x 2TB across the nodes you get ~1.6TB of storage for all 3 nodes combined.

And while consumer SSDs like the 980 Pro will work, it's not recommended. ZFS and virtualization cause massive write amplification and might shred your SSDs very fast, depending on the workload. Especially when there are lots of sync writes, which a consumer SSD without PLP can't cache in DRAM and therefore can't optimize before writing to the NAND. In addition to the bad durability, those consumer SSDs will also be orders of magnitude (like 500x) slower for sync writes because of the missing PLP. Better to get some proper SSDs in the first place than to buy twice. ;)
I keep looking at enterprise grade NVMe drives like the PM983, but they're almost 400 bucks each, damn.
Or will this one do the job: SSD PM983 MZ-1LB1T90 1.88T PCIe Gen4x4 NVMe M.2 22110?
 
I build my ZFS pool with my 980 Pro NVMe locally on each node, then replicate between the 3 nodes over the 2.5G link every minute?
While Ceph will constantly write to the shared pool on the 3 disks through the network?
Yes.
Isn't it the same, except one does it every minute and the other one for every "bit"?
No. ZFS will only read/write to local storage, so with the full PCIe performance (32 or 64 Gbit). Once the local SSD has stored a write, that write operation is finished/acknowledged.
Once a minute it will sync those pools over the network but this won't affect VM storage performance in any way.

When using Ceph it has to do the reads/writes over the network, so the 32/64Gbit of your NVMe will be bottlenecked to the 2.5Gbit of your NIC (or actually more like 1.25Gbit, as Ceph needs to write a copy to both other nodes through that single NIC).
If a VM does a write, it will only be acknowledged once all 3 nodes have written a copy of it. So every write carries the additional latency of your network and is limited by the NIC's bandwidth.
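If you want to measure that difference yourself, a quick fio sketch like the following (the test file path and runtime are placeholders), run once inside a VM on Ceph and once inside a VM on local ZFS, will show the sync write latency gap:

  fio --name=syncwrite --filename=/root/fio.test --size=1G \
      --rw=randwrite --bs=4k --iodepth=1 --ioengine=libaio \
      --fsync=1 --runtime=60 --time_based --group_reporting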
 
I keep looking at enterprise grade NVMe drives like the PM983, but they're almost 400 bucks each, damn.
Or will this one do the job: SSD PM983 MZ-1LB1T90 1.88T PCIe Gen4x4 NVMe M.2 22110?
Have a look at the price per TB written (the TBW rating, so how long it should last) and not at the price per TB of capacity. Then those enterprise SSDs are cheap, not expensive ;)

You could buy a cheap paper bag for 30 cents that breaks the first time it rains. Or you buy a durable cloth one for 1 dollar. Both have the same capacity, but the cloth one can carry way heavier stuff and lasts waaay longer. In the long term the cloth one is cheaper.
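To put rough numbers on it (the prices and TBW ratings below are placeholders, check the actual datasheets and shop prices yourself):

  150 EUR / 600 TBW  = 0.25 EUR per TB written   (consumer drive)
  400 EUR / 2700 TBW ~ 0.15 EUR per TB written   (enterprise drive)

So the drive that looks expensive per TB of capacity can be the cheaper one per TB you can actually write to it.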
 
Have a look at the price per TB written (the TBW rating, so how long it should last) and not at the price per TB of capacity. Then those enterprise SSDs are cheap, not expensive ;)
I found some PM983 MZ-1LB1T90 drives; will they do the job? Because I can't find the difference from the PM983 MZ1LB1T9HALS.
 
Yes.

No. ZFS will only read/write to local storage, so with the full PCIe performance (32 or 64 Gbit). Once the local SSD has stored a write, that write operation is finished/acknowledged.
Once a minute it will sync those pools over the network but this won't affect VM storage performance in any way.

When using Ceph it has to do the reads/writes over the network, so the 32/64Gbit of your NVMe will be bottlenecked to the 2.5Gbit of your NIC (or actually more like 1.25Gbit, as Ceph needs to write a copy to both other nodes through that single NIC).
If a VM does a write, it will only be acknowledged once all 3 nodes have written a copy of it. So every write carries the additional latency of your network and is limited by the NIC's bandwidth.

Okay, I think I'll still try Ceph a bit once I've got everything I need; if I ever find it slow / high latency I'll go for ZFS replication then :)
 
Okay, I think I'll still try Ceph a bit once I've got everything I need; if I ever find it slow / high latency I'll go for ZFS replication then :)
Especially try those cases where a node fails and Ceph needs to transfer hundreds of GBs of data to rebalance. You often don't see these bottlenecks while everything is running healthy, but you get into big trouble once there is a problem.
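A rough way to test that on purpose (a sketch; the OSD id 0 is a placeholder) is to take an OSD out and watch the recovery traffic:

  ceph osd out 0        # mark one OSD out (or stop its service / power off a node)
  watch ceph -s         # degraded PGs and recovery/backfill throughput
  ceph osd df           # per-OSD usage while data rebalances
  ceph osd in 0         # bring it back in when done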
 
Especially try those cases where a node fails and Ceph needs to transfer hundreds of GBs of data to rebalance. You often don't see these bottlenecks while everything is running healthy, but you get into big trouble once there is a problem.
I mean, if I'm using the Ceph storage for my VMs, doesn't every node already have the VM disks on it? What is there left to transfer that adds up to hundreds of GB?
 
