[SOLVED] BIG DATA ZFS (~200TB)

Baader-IT

Hello!

We are planning to create a big-data 2-node cluster with replication.
Our intention is to build it with ~200 TB of ZFS storage.

This system is intended for archiving data for more than 10 years.
There will only be 2 VMs on one node, and the task of the second node is failover.

We want to start with ~80 TB for each VM, using LVM with XFS.

Should we place each VM on a single 80 TB virtual disk, or is it better to use 4 disks of 20 TB each?
I would also like to know whether there is a maximum size for a single virtual disk.
Are there any known performance issues?

Greetings
Tobi
 
Hi,

at the moment the GUI has a limit of 128T.
 
Hi @Baader-IT

I am curious how you want to design this big ZFS pool (ashift/vdev config)?
How will you make the backups (rsync, etc.), and over what kind of network(s) (Gbit?)?
What kind of files do you want to store (many small files, or big files like video)?
Have you planned pre-production tests (for each disk, and for ZFS scrub / disk-replacement scenarios)?

Good luck, you will need it ;)
 
@guletz :)
-ashift 12
-backups: only replication to the other node
-Network: 10 Gbit fiber, direct cabling between the two nodes
-Files: many small TAR files
-pre-production tests will be done soon
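Roughly what we have in mind for those tests (pool and device names below are just placeholders, not our final layout):
Code:
# per-disk health check plus a short SMART self-test
smartctl -a /dev/sda
smartctl -t short /dev/sda
# non-destructive sequential read test of a single disk
fio --name=seqread --filename=/dev/sda --rw=read --bs=1M --direct=1 --runtime=60 --time_based
# once the pool exists: scrub it and rehearse a disk replacement
zpool scrub tank
zpool offline tank sda
zpool replace tank sda sdx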
 
@guletz :)
-ashift 12
-backups: only replication to the other node
-Network: 10 Gbit fiber, direct cabling between the two nodes
-Files: many small TAR files
-pre-production tests will be done soon

Let me try again ;)

-how will the pool be laid out (raidz2 with n disks, disks with 512-byte or 4K sectors, or ...)?
-how will you copy these many small TAR files onto the VM (only once, or as a scheduled task)?

For many TAR files on a big ZFS pool I would use a CT instead of a VM! Even better, use no VM/CT at all (just a few separate ZFS datasets instead of a VM/CT), as sketched below.
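Just to illustrate the idea (pool and dataset names are only examples), a couple of plain datasets instead of a VM could be enough:
Code:
# one dataset per archive type, each with its own properties
zfs create -o compression=lz4 tank/archive
zfs create -o recordsize=1M tank/archive/daily-tars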
 
how will the pool be laid out (raidz2 with n disks, disks with 512-byte or 4K sectors, or ...)?
raidz3 with 24 disks with 4K sectors
-how will you copy these many small TAR files onto the VM (only once, or as a scheduled task)?
We plan to add about 100 TAR files once a day in the future.
We will not overwrite or delete existing files.

For many TAR files on a big ZFS pool I would use a CT instead of a VM! Even better, use no VM/CT at all (just a few separate ZFS datasets instead of a VM/CT).
We don't want to use CTs because of our company policy.

Greetings
 
raidz3 with 24 disks with 4K sectors

In my opinion that would be a bad decision with poor performance: bad IOPS, a very long time to finish a scrub task, bad replication speed to the second node (you could even use a 1 Gbit link, because you will not be able to use more than roughly the bandwidth of one HDD), and a long time to replace a single broken disk.

As a general principle, it is not OK to have one huge vdev (6-8 disks per vdev is OK). So in your case it would be better to use a stripe of 4 raidz2 vdevs (8 disks in each raidz2) plus 2 spare disks.
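A rough sketch of such a layout (pool and disk names are placeholders; use your real /dev/disk/by-id paths):
Code:
zpool create -f -o ashift=12 tank \
  raidz2 d01 d02 d03 d04 d05 d06 d07 d08 \
  raidz2 d09 d10 d11 d12 d13 d14 d15 d16 \
  raidz2 d17 d18 d19 d20 d21 d22 d23 d24 \
  raidz2 d25 d26 d27 d28 d29 d30 d31 d32 \
  spare d33 d34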

Now let's do some math:

IOPS in this case will be 4 x the IOPS of one disk (compared with the IOPS of a single disk for your variant).

A scrub will also be 4x faster.

And any disk replacement will be faster than with your 24-disk raidz3.

Now let's take the VM case. For a 16k volblocksize in the VM, for each HDD we have:

24 - 3 (parity) = 21 data disks
16k / 21 ≈ 0.76k per disk ... not a lucky value
Because each disk can write blocks of no less than 4k (ashift 12), every single VM write block ends up as 21 x 4k blocks on disk. The same goes for reads. And think about doing a simple ls -l on your multi-billion-file TAR folder (mostly random reads): if you are lucky, your pool will manage to finish it after a very, very long time (one day?) ;)

So in my opinion you must think again about your ZFS pool design (what about ashift 13, multi-TAR archives of, let's say, 4 GB each, and so on).
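For example (pool, path and disk names below are made up, and ashift=13 only makes sense if it matches your disks):
Code:
# ashift is fixed at pool creation time and cannot be changed later
zpool create -f -o ashift=13 tank raidz2 d01 d02 d03 d04 d05 d06 d07 d08
# bundle each day's many small TARs into one big archive before storing it
tar -cf /tank/archive/daily-$(date +%F).tar /incoming/tars/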
 
@guletz
Okay, we changed our config to test some other setup.
Using a RAID controller with 2 GB of cache, we created 2 virtual disks, each built from 12 HDDs in RAID 6.

So we end up with 2 x 106 TB virtual disks.

Then we created a zfs striped pool using these 2 disks.
Code:
zpool create  -f -o ashift=12 HDD-Pool sda sdb

So we are no longer using ZFS raidzX.

What do you think about this setup now?
We can already see a huge performance increase between the two setups (+70% speed).
The available pool size is also significantly higher (about 70% more available space than before).

Greetings and thanks for your help!
 
In my opinion I suggest:
First solution
1. >> a cluster of 3 simple Proxmox nodes, to guarantee quorum and keep the VMs active
2. >> master and slave redundant networks with serious network equipment
3. >> 2 storages (Synology) in active-active, connected to each other with at least 40 Gbit, using serious datacenter disks
4. >> and finally one more storage server for disaster-recovery offline backup. That's all.
///////////////////////////////////////////////////////////////////
Second solution
Use 4 nodes with 24 datacenter SSDs per node and combine all disks via Ceph; of course your network must have at least 40 Gbit, or better >> 100 Gbit.
 
Using a RAID controller with 2 GB of cache, we created 2 virtual disks, each built from 12 HDDs in RAID 6.

Haven't you read the what-not-to-do-with-ZFS guide? Using hardware RAID is the first item. It is the worst decision for long-term data storage, really. There is nothing worse than that. You need protection against silent data corruption in an archive server, and hardware RAID breaks that.

@guletz is totally right. You really want as much raw power as possible, so as many vdevs as possible.

Having the archive server as a VM on ZFS is also not a good idea. Best would be to use ZFS directly ON your archive server, i.e. use the server itself as the backend, or create the ZFS pool inside your VM (on top of some RAID if you want), so that you can send/receive the actual files and not a zvol with another filesystem on top. But I'd suggest going with plain ZFS as the backend.
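A rough sketch of what sending the actual dataset (instead of a zvol) between the two nodes could look like (dataset, snapshot and host names here are made up):
Code:
# initial full replication of the archive dataset to the second node
zfs snapshot HDD-Pool/archive@2018-11-01
zfs send HDD-Pool/archive@2018-11-01 | ssh node2 zfs receive backup/archive
# later runs only send the incremental difference between two snapshots
zfs snapshot HDD-Pool/archive@2018-11-02
zfs send -i @2018-11-01 HDD-Pool/archive@2018-11-02 | ssh node2 zfs receive backup/archive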

Another thing is storing TAR files on top of ZFS. Yes, it's possible, but it is really not what you want with backups that change a little but are mostly the same. I have no idea what you're storing there, but keeping the raw files and using e.g. rsync with special ZFS-friendly options gives you a CoW-friendly setup in which you only store the difference, not complete sets of backups. Could you elaborate more on what you're going to store there?
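For example, something along these lines (paths are placeholders; --inplace keeps rsync from rewriting whole files, so ZFS only has to store the changed blocks):
Code:
rsync -a --inplace --no-whole-file /source/data/ /HDD-Pool/archive/
zfs snapshot HDD-Pool/archive@$(date +%F)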
 
Do not forget to check whether the filesystem is capable of storing that many files. I recommend testing it beforehand rather than only checking the theoretical limits.
A customer at the datacenter where I work built a backup server with around 60 TB of storage. They copied all of their systems onto it with rsync (not a good choice) and the ext4 filesystem was not able to store that many files. The customer had many problems with this server; we checked and replaced many parts until we found out how the data was being stored. He changed the way he stores his backups and now everything is okay.
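A simple way to check this up front (paths here are placeholders) is to watch the inode usage while loading test data, for example:
Code:
# ext4/XFS: shows how many inodes are used and how many remain
df -i /srv/backup
# rough smoke test: create a million empty files and check df -i again
mkdir -p /srv/backup/inode-test
for d in $(seq 1 1000); do mkdir /srv/backup/inode-test/$d; for f in $(seq 1 1000); do : > /srv/backup/inode-test/$d/$f; done; done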
 
Use 4 nodes with 24 datacenter SSDs per node and combine all disks via Ceph; of course your network must have at least 40 Gbit, or better >> 100 Gbit.


Read again: @Baader-IT wants a backup-only system in the same location. In this case he only needs decent write/read performance, because he will need to restore some of his TAR archives, but not all of them. So your solution is too much for his goals.
 
Hi all,

We tested the solution with our hardware RAID controller.
This solution is perfect for us.
We do not have any overhead (file size) and the performance is sufficient for our purposes.

We know that a hardware RAID controller shouldn't be used with ZFS, but we rely on it on all our systems.
That's why we are now using the RAID controller.

@guletz - thanks for your first comment; we hadn't noticed the big overhead at first, but it is solved now :)
 
We know that a hardware RAID controller shouldn't be used with ZFS, but we rely on it on all our systems.
That's why we are now using the RAID controller.

It is still a bad idea. You're not using the features ZFS offers AND you're risking a RAID controller hardware failure.
 
