VLAN / Networking

spleenftw

Member
Oct 6, 2023
Hello guys,

I'm planning to build a 3-node Proxmox cluster and I have a few questions:

VLANs:
5 : 10.100.5.0/24 - COROSYNC
10 : 10.100.10.0/24 - MANAGEMENT
15 : 10.100.15.0/24 - VM ADMIN


Each node will have 2 NICs:
- one for Corosync and Ceph, 1G, VLAN 5
- one for the node itself, 2.5G, VLAN aware (10 & 15)

I want a management IP for each node on VLAN 10 (which will not be reachable from VLAN 15), and to put all of my admin VMs (DNS, Docker, nginx, whatever) on VLAN 15 so that they sit in their own VLAN and cannot reach the hosts.

I was wondering if I only need to create 2 virtual bridges:
- one with the address 10.100.10.x/24 + GW
- a second one, VLAN aware, with no address, and simply attach this vmbr to the LXC/VMs and tag VLAN 15


On my switch, the port for the management NIC should not be a full trunk but tagged with VLANs 10 and 15, right?

Thanks !
 
one for Corosync and Ceph, 1G, VLAN 5
Corosync should have its own NIC with nothing else on it that might saturate the link (so no Ceph, migration, backups, ...). Ceph needs good NICs and multiple OSDs per node (starting with 3-4 SSDs). It doesn't make sense with less than 10Gbit, and you'd better have 25+Gbit for anything other than testing. Please read these articles in full to understand the Ceph and cluster basics and requirements:
https://pve.proxmox.com/wiki/Deploy...r#_recommendations_for_a_healthy_ceph_cluster
https://pve.proxmox.com/wiki/Cluster_Manager#_requirements
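As a rough sketch of how that dedicated corosync link is set up (the cluster name, node names and 10.100.5.x addresses below are placeholders for your VLAN 5 subnet):

  pvecm create mycluster --link0 10.100.5.1       # first node, corosync bound to the 1G NIC
  pvecm add 10.100.10.1 --link0 10.100.5.2        # joining node: existing node's address, then its own VLAN 5 address for link0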

I was wondering if I only need to create 2 virtual bridges:
- one with the address 10.100.10.x/24 + GW
- a second one, VLAN aware, with no address, and simply attach this vmbr to the LXC/VMs and tag VLAN 15
You don't need two bridges. Use a single VLAN aware bridge and then a VLAN interface (like "vmbr1.10") on top, where you set your gateway and IP.

On my switch, the port for the management NIC should not be a full trunk but tagged with VLANs 10 and 15, right?
Yes, tagged with 10 and 15.
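A minimal /etc/network/interfaces sketch of that layout on one node (the NIC name eno1, the node address 10.100.10.11 and the gateway are placeholders; the bridge is named vmbr0 here, matching the later posts):

  auto eno1
  iface eno1 inet manual

  auto vmbr0
  iface vmbr0 inet manual
          bridge-ports eno1
          bridge-stp off
          bridge-fd 0
          bridge-vlan-aware yes
          bridge-vids 10 15

  auto vmbr0.10
  iface vmbr0.10 inet static
          address 10.100.10.11/24
          gateway 10.100.10.1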
 
Hey,

Thanks for your detailed answer!

I don't have 10G NICs, only a 1G and a 2.5G one. I can use the 2.5G one for the cluster interface.

While I understand the rest, I am not sure about this part:
You don't need two bridges. Use a single VLAN aware bridge and then a VLAN interface (like "vmbr1.10") on top, where you set your gateway and IP.

So I just create the VLAN aware bridge vmbr0, then a vmbr0.10 with its management IP + GW, and attach vmbr0 with a tagged VLAN to each VM?
 
So I just create the VLAN aware bridge vmbr0, then a vmbr0.10 with its management IP + GW, and attach vmbr0 with a tagged VLAN to each VM?
Yes.
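For example (a sketch; the VMID/CTID and the container's eth0 settings are placeholders), tagging VLAN 15 on a guest's virtual NIC:

  qm set 100 --net0 virtio,bridge=vmbr0,tag=15
  pct set 101 --net0 name=eth0,bridge=vmbr0,tag=15,ip=dhcp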
I don't have 10G NICs, only a 1G and a 2.5G one. I can use the 2.5G one for the cluster interface.
Then I would consider better suited options like ZFS replication instead of Ceph. But here a third NIC might be good to have too, for the replication traffic, so it won't interfere with your VM traffic when it is syncing all disks every minute.
 
Yes.

Then I would consider better suited options like ZFS replication instead of Ceph. But here a third NIC might be good to have too, for the replication traffic, so it won't interfere with your VM traffic when it is syncing all disks every minute.
Ceph with a 2.5G network won't work well enough?
 
Ceph with a 2.5G network won't work well enough?
Yes, see the manual. Usually you would use 40Gbit + 10Gbit + 1 Gbit. Or even 100Gbit if you care about performance.
You are way below the minimum that would make sense for any productive use. If you can't add at least one dual-port 10Gbit NIC (better get 40Gbit... on the used market they are not that expensive... google for ConnectX-3... but they are already EoL and it's unclear how long they will keep working) I wouldn't consider Ceph.
 
Yes, see the manual. Usually you would use 40Gbit + 10Gbit + 1 Gbit. Or even 100Gbit if you care about performance.
You are way below the minimum that would make sense for any productive use. If you can't add at least one dual-port 10Gbit NIC (better get 40Gbit... on the used market they are not that expensive... google for ConnectX-3... but they are already EoL and it's unclear how long they will keep working) I wouldn't consider Ceph.
I'm kinda limited since it'll be 3 OptiPlex Micro boxes, and I'm getting an M.2 2230 to 2.5G adapter to add a second NIC to each OptiPlex.
It's for homelab use, I don't have a lot of VMs that write tons of data, so I thought 2.5G would be enough to write to Ceph between the 3 OptiPlex NVMe drives.

I'm not sure I can fit a 10G SFP+ / RJ45 card into those OptiPlex Micro boxes.
 
Even with fast enough NICs... you need multiple SSDs as OSDs per node for it to make sense... which you probably won't fit in those thin clients...
Thin clients might be fine for learning/testing, but the hardware is too weak to do anything with Ceph you could rely on (even in a homelab).
 
Even with fast enough NICs... you need multiple SSDs as OSDs per node for it to make sense... which you probably won't fit in those thin clients...
I was simply going to use one 2TB NVMe per node and put them together as OSDs. I did some tests on PVE VMs and it wasn't doing so bad tho. I'm gonna have to replan all of this then, damn.
 
Those PVE VMs then probably also used bridges without packets ever leaving the 2.5Gbit NICs, so those VMs were connected at 20-40Gbit via virtio NICs? Did you also test what happens in case an OSD fails or a node goes down and Ceph is then rebalancing?

I was simply going to use one 2TB NVMe per node and put them together as OSDs. I did some tests on PVE VMs and it wasn't doing so bad tho. I'm gonna have to replan all of this then, damn.
I would stick with ZFS for that if you can live with losing 1 minute of data in case a node fails. ZFS replication is replicated local storage, so it will be way faster when using slow NICs, as reading/writing data won't happen over the network. But in both cases (Ceph or ZFS) you usually want enterprise/datacenter grade SSDs with PLP, and there all 2TB models will require M.2 22110 slots or U.2/U.3. So something in the range of a Samsung PM983 or similar. Often such mini PCs will only support up to M.2 2280.
 
Those PVE VMs then probably also used bridges without packets ever leaving the 2.5Gbit NICs, so those VMs were connected at 20-40Gbit via virtio NICs? Did you also test what happens in case an OSD fails or a node goes down and Ceph is then rebalancing?


I would stick with ZFS for that if you can live with losing 1 minute of data in case a node fails. ZFS replication is replicated local storage, so it will be way faster when using slow NICs, as reading/writing data won't happen over the network. But in both cases (Ceph or ZFS) you usually want enterprise/datacenter grade SSDs with PLP, and there all 2TB models will require M.2 22110 slots or U.2/U.3. So something in the range of a Samsung PM983 or similar.

Oh, that's probably the case for the virtio NICs, you're right. But it was still writing to the consumer NVMe drives (980 Pro) that I was gonna put in my 3 new nodes.

Damn, I wanted to play around with Ceph on the 2.5G NIC / 980 Pro NVMe...
I gotta look into ZFS replication I guess? How does it work tho? Because I know ZFS locally, but in a cluster?
 
I gotta look into ZFS replication I guess? How does it work tho? Because I know ZFS locally, but in a cluster?
Every VM will access the local ZFS pool. PVE will sync those pools incrementally every minute so all pools store the same data (+/- one minute). It's not shared storage, but replication will be fast (as only the last minute of changes has to be transferred) and HA will work (losing the last minute of data).
So with 3x 2TB across the nodes you get ~1.6TB of storage for all 3 nodes combined.

And while consumer SSDs like the 980 Pro will work, it's not recommended. ZFS and virtualization cause massive write amplification and might shred your SSDs very fast, depending on the workload. Especially when there are lots of sync writes, which a consumer SSD without PLP can't cache in DRAM and therefore can't optimize before writing to the NAND. In addition to the bad durability, those consumer SSDs will also be orders of magnitude (like 500x) slower for sync writes because of the missing PLP. Better to get some proper SSDs in the first place than to buy twice. ;)
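A sketch of what that replication looks like on the CLI (the VMID 100, target node name pve2 and every-minute schedule are placeholders; the same can be configured in the GUI under Datacenter > Replication):

  pvesr create-local-job 100-0 pve2 --schedule "*/1"   # replicate VM 100 to node pve2 every minute
  pvesr status                                         # show last sync time, duration and any errors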
 
Every VM will access the local ZFS pool. PVE will sync those pools incrementally every minute so all pools store the same data (+/- one minute). It's not shared storage, but replication will be fast (as only the last minute of changes has to be transferred) and HA will work (losing the last minute of data).
Oh, so in theory:

I build my ZFS pool with my 980 Pro NVMe locally on each node, then replicate between the 3 nodes over the 2.5G link every minute?
While Ceph will constantly write to the shared pool on the 3 disks through the network?

Isn't it the same, except one does it every minute and the other one for every "bit"?
 
Every VM will access the local ZFS pool. PVE will sync those pools incrementally every minute so all pools store the same data (+/- one minute). It's not shared storage, but replication will be fast (as only the last minute of changes has to be transferred) and HA will work (losing the last minute of data).
So with 3x 2TB across the nodes you get ~1.6TB of storage for all 3 nodes combined.

And while consumer SSDs like the 980 Pro will work, it's not recommended. ZFS and virtualization cause massive write amplification and might shred your SSDs very fast, depending on the workload. Especially when there are lots of sync writes, which a consumer SSD without PLP can't cache in DRAM and therefore can't optimize before writing to the NAND. In addition to the bad durability, those consumer SSDs will also be orders of magnitude (like 500x) slower for sync writes because of the missing PLP. Better to get some proper SSDs in the first place than to buy twice. ;)
I keep looking at enterprise grade NVMe drives like the PM983, but they're almost 400 bucks each, damn.
Or will this one do the job: SSD PM983 MZ-1LB1T90 1.88T PCIe Gen4x4 NVMe M.2 22110?
 
I build my ZFS pool with my 980 Pro NVMe locally on each node, then replicate between the 3 nodes over the 2.5G link every minute?
While Ceph will constantly write to the shared pool on the 3 disks through the network?
Yes.
Isn't it the same, except one does it every minute and the other one for every "bit"?
No. ZFS will only read/write to local storage, so with the full PCIe performance (32 or 64 Gbit). Once the local SSD has stored a write, that write operation is finished/acknowledged.
Once a minute it will sync those pools over the network but this won't affect VM storage performance in any way.

When using Ceph it has to do the reads/writes over the network, so the 32/64Gbit of your NVMe will be bottlenecked to the 2.5Gbit of your NIC (or actually more like 1.25Gbit, as Ceph needs to write a copy to both other nodes through that single NIC).
If a VM does a write, it will only be acknowledged once all 3 nodes have written a copy of it. So every write carries the additional latency of your network and is limited by the NIC's bandwidth.
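If you want to measure that difference yourself, a quick fio sketch like the following (the test file path and runtime are placeholders), run once inside a VM on Ceph and once inside a VM on local ZFS, will show the sync write latency gap:

  fio --name=syncwrite --filename=/root/fio.test --size=1G \
      --rw=randwrite --bs=4k --iodepth=1 --ioengine=libaio \
      --fsync=1 --runtime=60 --time_based --group_reporting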
 
I keep looking at enterprise grade NVMe drives like the PM983, but they're almost 400 bucks each, damn.
Or will this one do the job: SSD PM983 MZ-1LB1T90 1.88T PCIe Gen4x4 NVMe M.2 22110?
Have a look at the price per TB written (the TBW rating, so how long it should last) and not at the price per TB of capacity. Then those enterprise SSDs are cheap, not expensive ;)

You could buy a cheap paper bag for 30 cents that breaks the first time it rains. Or you buy a durable cloth one for 1 dollar. Both have the same capacity, but the cloth one can carry way heavier stuff and lasts waaay longer. In the long term the cloth one is cheaper.
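To put rough numbers on it (the prices and TBW ratings below are placeholders, check the actual datasheets and shop prices yourself):

  150 EUR / 600 TBW  = 0.25 EUR per TB written   (consumer drive)
  400 EUR / 2700 TBW ~ 0.15 EUR per TB written   (enterprise drive)

So the drive that looks expensive per TB of capacity can be the cheaper one per TB you can actually write to it.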
 
Have a look at the price per TB written (the TBW rating, so how long it should last) and not at the price per TB of capacity. Then those enterprise SSDs are cheap, not expensive ;)
I found some PM983 MZ-1LB1T90 drives; will they do the job? Because I can't find the difference from the PM983 MZ1LB1T9HALS.
 
Yes.

No. ZFS will only read/write to local storage, so with the full PCIe performance (32 or 64 Gbit). Once the local SSD has stored a write, that write operation is finished/acknowledged.
Once a minute it will sync those pools over the network but this won't affect VM storage performance in any way.

When using Ceph it has to do the reads/writes over the network, so the 32/64Gbit of your NVMe will be bottlenecked to the 2.5Gbit of your NIC (or actually more like 1.25Gbit, as Ceph needs to write a copy to both other nodes through that single NIC).
If a VM does a write, it will only be acknowledged once all 3 nodes have written a copy of it. So every write carries the additional latency of your network and is limited by the NIC's bandwidth.

Okay, I think I'll still try Ceph a bit once I've got everything I need; if I ever find it slow / high latency I'll go for ZFS replication then :)
 
Okay, I think I'll still try Ceph a bit once I've got everything I need; if I ever find it slow / high latency I'll go for ZFS replication then :)
Especially try those cases where a node fails and Ceph needs to transfer hundreds of GBs of data to rebalance. You often don't see these bottlenecks while everything is running healthy, but you get into big trouble once there is a problem.
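A rough way to test that on purpose (a sketch; the OSD id 0 is a placeholder) is to take an OSD out and watch the recovery traffic:

  ceph osd out 0        # mark one OSD out (or stop its service / power off a node)
  watch ceph -s         # degraded PGs and recovery/backfill throughput
  ceph osd df           # per-OSD usage while data rebalances
  ceph osd in 0         # bring it back in when done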
 
Especially try those cases where a node fails and Ceph needs to transfer hundreds of GBs of data to rebalance. You often don't see these bottlenecks while everything is running healthy, but you get into big trouble once there is a problem.
I mean, if I'm using the Ceph storage for my VMs, doesn't every node already have the VM disks on it? What is there left to transfer that adds up to hundreds of GB?
 
