Project: Fileserver on Proxmox

Hi,

I'm considering moving a file server (90TB, Debian bare-metal) to a new Supermicro server (320TB with an LSI MegaRaid controller) where I will use this file server as a VM in Proxmox, with a second identical server for an HA cluster.

I don't want to use shared storage as I want a small setup. Just 2 identical servers with a replication every minute and 2 VMs on 2 Synologys as quorum.

The reason I want to use Proxmox is the automatic failover and the efficient replication.

Do you think replication will work well on a 320TB LVM? The workload is not that big as the 320TB will be mostly an archive. I estimate the replication size per minute could be up to 5GB, but probably much less. In the current setup (without Proxmox) I use lsyncd which works well.
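For reference, here is roughly what such a per-minute job would look like with Proxmox VE's pvesr tool (only a sketch; the VM ID, target node name and rate limit are placeholders, and as discussed further down this requires ZFS-backed storage, not LVM):

# replicate guest 100 to the second node every minute, capped at ~100 MB/s
pvesr create-local-job 100-0 node2 --schedule "*/1" --rate 100

# list configured jobs and check their status
pvesr list
pvesr status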
 
A 320 TB archive doesn't look like VM/LXC images, it's more like real data in a filesystem ... so why put LVM under the hood at all?
Syncing inside VMs could be slower than your current bare-metal 90 TB fileservers, but that also depends on your disks and controllers.
It looks like you don't sync any data between the Supermicros and the Synologys (they are quorum only) ... so they won't help with your sync plans.
Is a 10 Gb network a given, for roughly 1 GB/s of sync traffic between the two Supermicros?
Syncing the LVM underneath a filesystem needs a filesystem freeze to be consistent, while syncing at the filesystem level may or may not.
Alternatively, you could export the disks on the second Supermicro via iSCSI (not mounted there) and build an mdadm RAID 1 on the first server, so both sides would always be identical.
If you sync at the filesystem level, use XFS with an external metadata/log device.
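A rough sketch of that iSCSI + mdadm idea, assuming the second Supermicro already exports the storage as an iSCSI target (e.g. via targetcli/LIO) and does not use it locally; the IP, IQN and device names below are placeholders:

# on server 1: discover and log in to the target exported by server 2
iscsiadm -m discovery -t sendtargets -p 192.0.2.2
iscsiadm -m node -T iqn.2025-01.example:archive -p 192.0.2.2 --login

# build a RAID 1 from the local array and the iSCSI LUN; --write-mostly
# keeps reads on the local leg so the network mostly carries writes
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/sdb --write-mostly /dev/sdc

# XFS on top, as suggested above
mkfs.xfs /dev/md0
mkdir -p /srv/archive
mount /dev/md0 /srv/archive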
 
Do you think replication will work well on a 320TB LVM?
I love PVE (and ZFS - no LVM, never). But with that goal I would look for TrueNAS. And not VM, bare metal that is!
 
Strange and interesting idea. I would use Ceph for that; it seems it would suit better. I've never seen a single-node ZFS pool of that size on one machine in Proxmox, but if you stick 20 x 20 TB drives in RAIDZ2/3, I guess you could do it. And replication only works on ZFS, not on LVM.
Interesting either way.
 
Strange and interesting idea. I would use Ceph for that; it seems it would suit better. I've never seen a single-node ZFS pool of that size on one machine in Proxmox, but if you stick 20 x 20 TB drives in RAIDZ2/3, I guess you could do it. And replication only works on ZFS, not on LVM.
Interesting either way.
I think I'll use Raidz3 with 16 x 22 TB HDDs in TrueNAS, makes more sense I guess.
 
I think I'll use Raidz3 with 16 x 22 TB HDDs in TrueNAS, makes more sense I guess.
Just keep in mind that you'll get the IOPS of a single drive. (If you put all 16 drives in a single vdev, which is probably not recommended.)

In any case I recommend adding a "special device" consisting of NVMe (or maybe SSD) drives as a mirrored vdev before filling in too much data. To match the redundancy of a RaidZ3 - where three devices are allowed to fail - it should be a four-way mirror, not just two devices; for example, 4 x 2 TB as a single mirror vdev. Oh... and please use enterprise-class devices with PLP.
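For illustration, such a layout could be created roughly like this (just a sketch; pool name, device names and the special_small_blocks value are placeholders, and /dev/disk/by-id paths should be used in practice):

# 16-wide RAIDZ3 data vdev plus a 4-way mirrored special vdev
# (matching the triple-parity redundancy of the data vdev)
zpool create tank \
  raidz3 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp \
  special mirror nvme0n1 nvme1n1 nvme2n1 nvme3n1

# optionally let small blocks (not only metadata) land on the special vdev
zfs set special_small_blocks=16K tank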
 
Strange and interesting idea. I would use Ceph for that; it seems it would suit better. I've never seen a single-node ZFS pool of that size on one machine in Proxmox, but if you stick 20 x 20 TB drives in RAIDZ2/3, I guess you could do it. And replication only works on ZFS, not on LVM.
Interesting either way.

I would strongly suggest thinking twice before going the Ceph route. First, the OP plans to use two nodes, not three (Ceph needs at least three nodes), and even with three nodes he would need to be cautious (see UdoB's writeup https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/ ).

Another problem with Ceph would be that only a fraction of the available space could actually be used. I used https://florian.ca/ceph-calculator/ for some calculations and assumed that the existing 90 TB server would be reused.
Scenario One: two nodes with 320 TB, one with 90 TB. Then 205 TB could safely be used as the "safe cluster size".
Scenario Two: move discs from the large 320 TB fileservers to the 90 TB node. We then have 730 TB in the cluster; 730 TB / 3 means around 243 TB per node. I assumed 22 TB per disc (because of the OP's last post with his planned RAIDZ setup), so in practice we only have 726 TB in the system:
Node 1: 242 TB (11 x 22)
Node 2: 242 TB (11 x 22)
Node 3: 242 TB (11 x 22)

With these numbers the calculator says we have around 242 TB as the safe cluster size. So if the goal is to utilize as much space as possible, we will always end up with unused space or discs.

I don't want to use shared storage as I want a small setup. Just 2 identical servers with a replication every minute and 2 VMs on 2 Synologys as quorum.

This setup screams for trouble, since the clustering (with the two QDevices) adds additional complexity. The docs recommend against using more than one QDevice, and against using one at all if you have an odd number of nodes: https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support
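For reference, the supported single-QDevice setup for a two-node cluster looks roughly like this (a sketch; the quorum host IP is a placeholder):

# on the external quorum host (e.g. one VM on a Synology)
apt install corosync-qnetd

# on both Proxmox VE nodes
apt install corosync-qdevice

# from one cluster node, register the QDevice and verify the votes
pvecm qdevice setup 192.0.2.10
pvecm status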

The reason I want to use Proxmox is the automatic failover and the efficient replication.

I'm not sure about the automatic failover, but the actual replication could also be achieved with TrueNAS, like Udo said. And in the end, in my book it wouldn't be too much of a problem to have two fileservers that replicate their contents on a regular schedule, with both of their network shares mounted on the clients (maybe one with read-only permissions, to be switched to read-write if one node goes offline).

Do you think replication will work well on a 320TB LVM? The workload is not that big as the 320TB will be mostly an archive. I estimate the replication size per minute could be up to 5GB, but probably much less. In the current setup (without Proxmox) I use lsyncd which works well.

On LVM, replication won't work at all, since Proxmox VE's (and TrueNAS's) replication functionality is based on the ZFS send/receive mechanism. Whether this is feasible depends mainly on your network (which hardware do you have? Is the transfer network just for the archive transfer or also for other uses? etc.). So please post the parameters of your environment (10/25/100 Gb, pure storage network or whether clients use it too, etc.).
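Under the hood this boils down to snapshot-based send/receive, roughly like the following (pool, dataset and host names are placeholders):

# initial full transfer of the archive dataset to the second node
zfs snapshot tank/archive@rep-0
zfs send tank/archive@rep-0 | ssh node2 zfs receive tank/archive

# later runs only send the delta since the last common snapshot
# (-F rolls back any local changes on the target side)
zfs snapshot tank/archive@rep-1
zfs send -i tank/archive@rep-0 tank/archive@rep-1 | ssh node2 zfs receive -F tank/archive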

Thanks, TrueNAS looks good from what I've read about it. I've heard about it before, but I thought it was nothing serious.
Why? I can't remember a time when TrueNAS wasn't designed for use as an enterprise solution. Yes, they switched from FreeBSD to Linux at some point to allow hosting of apps (which, to be honest, is more something for homelabbers), but most of its advanced features are aimed at professional environments. @UdoB, I might be wrong, but TrueNAS replication works with ZFS send/receive too, doesn't it? This is not meant as a diss, but it would mean that at least the technology behind the replication is the same as with Proxmox VE. I agree with you that a bare-metal TrueNAS will be better suited and more performant for @m4rtin's use case.
I think I'll use Raidz3 with 16 x 22 TB HDDs in TrueNAS, makes more sense I guess.

I second Udo's suggestion to think again about your RAID levels. RAIDZ3 won't have the best performance, and a rebuild after a disk failure will take quite some time. A RAID10-like setup (mirror pairs, striped together) will give you only 50% of the raw capacity, but the read performance will be better than with RAIDZ3. Udo's suggestion to use SSDs with PLP as a special device mirror to speed up metadata and small-file access is also a good one.
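As a sketch (device and pool names are placeholders), such a striped-mirror pool with a mirrored special device could look like this:

# eight 2-way mirrors striped together (RAID10-style), ~50% usable capacity,
# plus a mirrored special vdev for metadata / small blocks
zpool create tank \
  mirror sda sdb  mirror sdc sdd  mirror sde sdf  mirror sdg sdh \
  mirror sdi sdj  mirror sdk sdl  mirror sdm sdn  mirror sdo sdp \
  special mirror nvme0n1 nvme1n1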
 
Another point regarding the Ceph suggestion: ETH Zürich had some trouble getting sizing and capacities right. I saw a talk on that a few weeks ago; it's online: https://fosdem.org/2025/schedule/ev...etic-benchmarks-to-real-world-user-workloads/ There were also some other talks on Ceph, although I didn't manage to get a seat (the room was quite small, unfortunately): https://fosdem.org/2025/schedule/room/k3401/

Edit: Wasn't CERN but ETH Zürich, I mixed them up (due to both institutions being in Switzerland I suspect)
 
Use case
Fileserver / archive for Adobe CS work. The amount of data is 80 TB at the moment, so the current fileserver is coming to an end.
Clients connect via SMB on macOS.

My setup is:

Current setup:
2 x potato servers with LSI MegaRAID controller, 8 x 14 TB HDDs (RAID 5), bare-metal openSUSE / Debian (Btrfs / ext4) and a 10 Gbit LAN connection (clients use 1 Gbit).

New hardware (planned):
2 x Supermicro 4U servers (36 x 3.5" slots) with 10 Gbit LAN connection
8 x 22 TB HDDs (for a start, maybe more for RAID-Z3)
2 x NVMe (size?)

Backup

Daily to tape
Daily to a Synology (rsync) in a datacenter
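For the Synology leg, a nightly job along these lines would be typical (paths and the hostname are placeholders):

# push the archive, preserving hard links and deleting files removed locally
rsync -aH --delete /srv/archive/ backup@synology.example.com:/volume1/archive/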

I'll dispose of the old servers (some of them are more than 10 years old) but reuse the HDDs in the new servers.

Why do you think bare metal is so important? I thought VM performance is almost as good as the host these days?

What configuration would you recommend in Proxmox? (I just read about TrueNAS Enterprise licensing fees...)
What about RAID-Z3 with 2 vdevs of 4 HDDs each?
 
I'll address some misconceptions from Johannes and give some recommendations:
1) Ceph demands 3+ nodes; with more than 100 TB of data I would recommend probably 5, 7 or 9 nodes. So it's out of your scope.
2) Clustering doesn't add complexity since it is already included, and adding quorum nodes is just fine.
3) With 2 nodes and ZFS replication you don't have automatic failover; that only comes with HA storage (Gluster, Ceph, NFS, etc.). You can have manual failover, where you copy the .conf file from one node to another and spin up the VM.
4) TrueNAS Core is still based on BSD; TrueNAS Scale is on Debian. For home users OMV is enough, without the added complexity of Scale.
5) Using RAIDZ is okay if you don't have read- or write-intensive apps (you said something like 5 GB a day, that is not much). Keep in mind that once you add a special device (metadata), you cannot remove it from a RAIDZ pool; from a RAID10-style pool you can. Only you can answer performance requirements like this, but 90 TB for a VM is okay; recovery from backups if something crashes may take a while (at 1 Gbit/s; at 10 Gbit/s I guess it could be faster).
6) For ease of administration, I would always recommend RAID10 with 2 x enterprise SSD or NVMe as a special device. But if you are limited by space or money, this is it.
 
I'll address some misconceptions from Johannes and give some recommendations:
1) Ceph demands 3+ nodes; with more than 100 TB of data I would recommend probably 5, 7 or 9 nodes. So it's out of your scope.
2) Clustering doesn't add complexity since it is already included, and adding quorum nodes is just fine.
3) With 2 nodes and ZFS replication you don't have automatic failover; that only comes with HA storage (Gluster, Ceph, NFS, etc.). You can have manual failover, where you copy the .conf file from one node to another and spin up the VM.
4) TrueNAS Core is still based on BSD; TrueNAS Scale is on Debian. For home users OMV is enough, without the added complexity of Scale.
5) Using RAIDZ is okay if you don't have read- or write-intensive apps (you said something like 5 GB a day, that is not much). Keep in mind that once you add a special device (metadata), you cannot remove it from a RAIDZ pool; from a RAID10-style pool you can. Only you can answer performance requirements like this, but 90 TB for a VM is okay; recovery from backups if something crashes may take a while (at 1 Gbit/s; at 10 Gbit/s I guess it could be faster).
6) For ease of administration, I would always recommend RAID10 with 2 x enterprise SSD or NVMe as a special device. But if you are limited by space or money, this is it.
3) I currently use this setup for some other VMs (firewall, ...); I configured HA (Datacenter -> HA) and it works. I use a shared ZFS pool.
5) Up to 5 GB per minute is a maximum guess, but usually maybe 500 MB. The 2 servers will have a 10 Gbit LAN connection.
6) RAID10 reduces the space by about 30% compared to RAID-Z3, but ease of administration is a valid point; I'll think about that...
 
@UdoB, I might be wrong, but TrueNAS replication works with ZFS send/receive too, doesn't it?
Actually I've never used it, but yes, it should work. My memory may be old, but replication and High-Availability may be a function of the paid model they offer. Contrary to Proxmox not all features are (were?) enabled on the free version. Again: my information may be outdated.

For me an important difference is that FreeNAS is an appliance. It is meant to be used as it is delivered - and iX enforces this. Modifying the base OS is difficult, unwanted and not always persistent. PVE is open in this additional sense, as it allows you to do everything that Debian offers.

From my limited point of view FreeNAS is a high quality product. But it is different...

(( I've used "classic" FreeNAS several years ago for a long time, mostly before the Corral disaster. Currently I have only a test-machine with SCALE - this now being Linux, not BSD anymore. ))
 
I'll address some misconceptions from Johannes and give some recommendations:
1) Ceph demands 3+ nodes; with more than 100 TB of data I would recommend probably 5, 7 or 9 nodes. So it's out of your scope.
Agreed.
2) Clustering doesn't add complexity since it is already included, and adding quorum nodes is just fine.
It's more complicated than "just fine":
We support QDevices for clusters with an even number of nodes and recommend it for 2 node clusters, if they should provide higher availability. For clusters with an odd node count, we currently discourage the use of QDevices. The reason for this is the difference in the votes which the QDevice provides for each cluster type. Even numbered clusters get a single additional vote, which only increases availability, because if the QDevice itself fails, you are in the same position as with no QDevice at all.

On the other hand, with an odd numbered cluster size, the QDevice provides (N-1) votes — where N corresponds to the cluster node count. This alternative behavior makes sense; if it had only one additional vote, the cluster could get into a split-brain situation. This algorithm allows for all nodes but one (and naturally the QDevice itself) to fail. However, there are two drawbacks to this:

If the QNet daemon itself fails, no other node may fail or the cluster immediately loses quorum. For example, in a cluster with 15 nodes, 7 could fail before the cluster becomes inquorate. But, if a QDevice is configured here and it itself fails, no single node of the 15 may fail. The QDevice acts almost as a single point of failure in this case.

The fact that all but one node plus QDevice may fail sounds promising at first, but this may result in a mass recovery of HA services, which could overload the single remaining node. Furthermore, a Ceph server will stop providing services if only ((N-1)/2) nodes or less remain online.

If you understand the drawbacks and implications, you can decide yourself if you want to use this technology in an odd numbered cluster setup.
(https://pve.proxmox.com/wiki/Cluster_Manager#_corosync_external_vote_support )

I also remember forum discussions where a staff member recommended against using more than one qdevice.

3) With 2 nodes and ZFS replication you don't have automatic failover; that only comes with HA storage (Gluster, Ceph, NFS, etc.). You can have manual failover, where you copy the .conf file from one node to another and spin up the VM.

You can also have automatic failover if (and only if) you have a QDevice and ZFS-based storage replication:

High-Availability is allowed in combination with storage replication, but there may be some data loss between the last synced time and the time a node failed. https://pve.proxmox.com/wiki/Storage_Replication


4) TrueNAS Core is still based on BSD; TrueNAS Scale is on Debian. For home users OMV is enough, without the added complexity of Scale.

Agreed on the scope of OMV versus TrueNAS. My homelabbing jest was mainly due to the apps/Docker support in recent versions of TN Scale. To be fair, for SMBs that's useful as well.

6) For ease of administration, I would always recommend RAID10 with 2 x enterprise SSD or NVMe as a special device. But if you are limited by space or money, this is it.
Agreed
 
Actually I've never used it, but yes, it should work. My memory may be old, but replication and High-Availability may be a function of the paid model they offer. Contrary to Proxmox not all features are (were?) enabled on the free version. Again: my information may be outdated.
You seem to be correct:
https://www.truenas.com/compare/ says HA is an Enterprise edition feature.
Regular replication isn't, though:
https://www.truenas.com/docs/scale/...rotection/replication/remotereplicationscale/

So one doesn't need to pay for replication.
 
As for storage replication: HA is working, yes, but automatic failover isn't, because it is not HA storage, and as you can see you can lose data compared to regular HA storage (Ceph etc.).
 
I'll try the setup where 2 nodes each have a local ZFS pool with the same name and the synchronization runs every 1 minute. In the case of a failover, it doesn't matter if one or 5 minutes are lost. In my experience, HA works then too. I just wasn't sure if there would be performance problems, but we'll see how it works.
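A sketch of how that combination is wired up in PVE (VM ID, node names and the HA group name are placeholders):

# replicate the guest's ZFS volumes to the other node every minute
pvesr create-local-job 100-0 node2 --schedule "*/1"

# put the guest under HA management; on a node failure it is restarted on the
# other node from the last replicated state (so up to ~1 minute can be lost)
ha-manager groupadd fileserver --nodes "node1:2,node2:1"
ha-manager add vm:100 --group fileserver --state started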
 
Come back with your results once it's running and holding all your previous data (>50 of the 90 TB??). As you know, sync time is data dependent, and an empty 320 TB pool without files syncs in milliseconds ... but don't forget you want to support up to 300 TB when it's full (which is even more data if compressed or deduped) ... :) Good luck!