Proxmox with GRAID - data replication

simoncechacek

Hello,

We purchased a server setup with 24 NVMes and a GRAID card. We planned to use VMware and NVMe-oF, but as support for this option is being delayed by the GRAID team, we decided to try to migrate to Proxmox completely.

Proxmox is supported by GRAID (currently VE 7, and their driver is limited to the 5.x kernel).

We want to create a fast and reliable setup. My ideal plan is something like this:
[attached diagram: servers-proxmox.png]


What I am unsure of, and what the GRAID team sent me here to ask, is how to replicate those VMs and how to keep everything safe. I know that for Proxmox replication to work I need ZFS, but can I run ZFS on a single virtual drive when the data protection itself is handled and accelerated by the GRAID card on one end? On server 2 there will be no GRAID, so I can use an NVMe RAID controller or software RAID.
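(For clarity, by "ZFS on a single virtual drive" I mean a single-vdev pool created directly on the device GRAID exposes, roughly like the sketch below; the pool name and device path are just placeholders.)

Code:
# single-vdev pool on the GRAID virtual drive; the device path is a placeholder
zpool create -o ashift=12 graidpool /dev/graid-virtual-drive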



If this is not possible: do you know if there is an option to copy those VMs frequently (copying only changes), so I at least have pretty recent cold-standby VMs on the other server that I can manually start if the main server dies?

On the machine where GRAID is currently installed for testing (the hardware of server 1, but running Ubuntu) I can get around 68 GB/s for sequential reads and 36 GB/s for sequential writes, and I was able to get around 10M IOPS for random reads. We do not want to lose that performance, so that's why iSCSI or NFS look like bad options.


Thank you very much for your time and answers. This platform is new to me, so I am being forced to create a functional system with no previous experience.
 
I would venture to say that this platform is new to everyone, so it's unlikely that you will find an implementation blueprint that would address all your wants. Storage vendor support is the critical piece here.

You are correct on replication: the built-in replication in PVE relies on ZFS being the base storage. ZFS is not a cluster-aware filesystem, so essentially it's local-storage replication to another local storage. How that fits into your design is hard to say. Trying to map everything out and consider all failure scenarios requires a deeper dive/discussion and a bigger time commitment than the forum allows.
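To make it concrete, a replication job for a VM stored on ZFS is a one-liner; VM ID, target node and schedule below are placeholders:

Code:
# replicate VM 100 to node pve2 every 15 minutes; IDs and names are placeholders
pvesr create-local-job 100-0 pve2 --schedule "*/15"
# inspect configured jobs and their status
pvesr list
pvesr status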

I would ask the manufacturer what their recommendation is for ZFS on top of a GRAID volume. While the common wisdom is "no ZFS on RAID", perhaps they have a different view of it.

Besides ZFS replication, you can employ constant backups/vzdumps/PBS and transfer the data to the secondary server. The RPO and RTO will be non-zero, but so are they with ZFS replication.
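A minimal sketch of that approach (VM ID, paths and hostname are placeholders):

Code:
# snapshot-mode backup of VM 100, then ship the archive to the secondary server
vzdump 100 --mode snapshot --compress zstd --dumpdir /mnt/backup
rsync -av /mnt/backup/ root@secondary-host:/var/lib/vz/dump/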


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
I may have seen an NVMe RAID controller once, if this is the same type of hardware. I wondered if you could just install PVE on ZFS without GRAID and use the disks as you would have used them with GRAID. Could be possible, yet as @bbgeek17 already wrote, the platform is new to everyone.
 
Hi there and thank you for your answers.

GRAID said they think ZFS will not work on their RAID, but were unable to answer why; they haven't tested it.

The reason we purchased the (pretty expensive) GRAID card and license is the acceleration of software RAID, so you can get insane speeds and IO from NVMe drives while offloading the RAID calculation from the CPU. So the team wants to use the card as we have it :).

So I see my options as this:

1) Test ZFS on GRAID and see if the performance is OK. If yes, then use ZFS replication to get the option to fail over.

2) Use a backup solution that will create and update copies of the VM as often as possible.

I will test the first option, but I have a few questions about the second one. I read up on the options for backing up / cloning VMs from Proxmox, and to be honest, there is some stuff I do not understand.

On VMware, our current backup solution (Nakivo) creates a snapshot of the VM; the snapshot is then transferred to the second host and applied. Somehow, they are able to transfer only the new/changed data when the replication process runs repeatedly.

On Proxmox, however, I read something about the need to stop the VM to clone it, and I was unable to find anything about an option to sync only the new data, so as to keep the copy of the VM as up to date as possible without stopping the VM or transferring all the data again (our main VM is around 2 TB in size now).

If something like this is possible, do you have a recommendation for a solution that can do that? In the VMware world, there are huge players like Veeam or Nakivo. But to be honest, I do not know any solutions for Proxmox other than the Proxmox Backup Server that I saw on the site.

Sorry for possibly being stupid; I am just new to this world and trying to get the setup right :)
 
On Proxmox, however, I read something about the need to stop the VM to clone it, and I was unable to find anything about an option to sync only the new data, so as to keep the copy of the VM as up to date as possible without stopping the VM or transferring all the data again (our main VM is around 2 TB in size now).
This only works with ZFS as a basis, because ZFS does the replication and there is nothing faster than this.

The reason we purchased the (pretty expensive) GRAID card and license is the acceleration of software RAID, so you can get insane speeds and IO from NVMe drives while offloading the RAID calculation from the CPU.
So, you're not going with mirrors then.
 
Backups to PBS always copy only new/changed data; the same data is never stored twice. Dedup can't be disabled.
Hourly backups are OK even with a lot of data, thanks to the dirty bitmap (like CBT in VMware).
If the VM is shut down, all VM data needs to be read, but only new data will be transferred.
A restore is needed to start the VM, so restoring 2 TB is slower than a backup...
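For reference, a PBS target on the PVE side is just a storage entry; a sketch (server, datastore and fingerprint are placeholders):

Code:
# /etc/pve/storage.cfg on the PVE host; the values below are placeholders
pbs: pbs-backup
        server 192.0.2.10
        datastore vmstore
        username backup@pbs
        fingerprint <PBS certificate fingerprint>
# scheduled backups to "pbs-backup" then only transfer blocks flagged in the
# dirty bitmap while the VM keeps running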
 
Hello @_gabriel and thank you for your answer. I am just unsure if I understand it correctly.

I do not want to shut down the VM for the backup/failover. I want to start the second one as FAST as possible if the first machine fails. Then, after fixing the main machine, I want to update all the data on the primary server and then start the VM there.
 
Of course hot backup is supported.
Frequent backups require PBS, with which only changed data is copied.
Fast restores need backups stored on fast drives too, i.e. SSDs.
 
I evaluated GRAID in the past and found it to be effectively pointless.
The reason we purchased the (pretty expensive) GRAID card and license is the acceleration of software RAID, so you can get insane speeds and IO from NVMe drives while offloading the RAID calculation from the CPU. So the team wants to use the card as we have it :).
That's just it; the performance wasn't all that different from software RAID, especially with striped mirrors. If you create a parity RAID it's a little faster, but XOR operations are not very stressful on modern CPUs. You will get worse performance for virtualization with parity RAID vs striped mirrors in any case due to the single disk queue. There is also the matter of rebuilds: GRAID recalculates parity but does not compare against the existing on-disk checksum; ZFS is much more robust from a data-integrity standpoint.

If you simply eliminate it from your config, you will:
- not be dependent on a piece of hardware and a DKMS module being available to mount your filesystem (e.g. for recovery)
- be able to upgrade to PVE 8
- be able to run ZFS (yes, it'll be slower, but worth it)

I understand that sunk cost is a difficult hurdle here, but consider what's really important to you.
 
I do not want to shut down the VM for the backup/failover. I want to start the second one as FAST as possible if the first machine fails. Then, after fixing the main machine, I want to update all the data on the primary server and then start the VM there.
The problem here is your hardware, because "FAST as possible" implies a dedicated or distributed shared storage, which is not what you bought.
 
Hi there and thank you for your answers.

Unfortunately for us, it looks like we will be ditching the GRAID card at least for now. Their compatibility is not as great as was presented and their kernel support is being delayed.

So now I want to ask for ideas on software and techniques to use.

1) We will have the main server on ZFS instead of GRAID, so it would be ready for replication when available.

2) We have our old server, which will not be able to run ZFS, but we want to use it as a backup machine just in case while we work on a second ZFS-capable machine.

3) We also still have our storage server and Hetzner cloud storage.

I would like to ask whether we are able to run periodic backups from the main Proxmox machine to the storage server (NFS/CIFS share) and to Hetzner (NFS/CIFS/FTP), plus automatically create replicas of the running VMs on the secondary Proxmox server (which is not running ZFS) with the Proxmox Backup Server, or whether we need 3rd-party software (in which case I would ask for recommendations).
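For the NFS part, this is roughly what I have in mind (host and export path are placeholders):

Code:
# add the storage server's NFS export as a backup target; host and export are placeholders
pvesm add nfs storage-backup --server 192.0.2.20 --export /export/pve-backups --content backup
# scheduled backup jobs to this storage can then be set up under Datacenter -> Backup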

I understand this solution is not able to run like HA, and when starting the replica there will always be some de-sync of data. We aim to have replication, or even an HA solution, as we grow the hardware.

Thank you for all your ideas and time. I will take my day to cry in the corner for the money lost on GRAID and then will be back to get this done :).
 
Why not? If it were, you'd be able to use the PVE-supported ZFS replication method (poor man's HA).
It's built on a hardware RAID card; there are not enough slots to connect those drives directly. We will also replace this machine with another one that will be ZFS-ready. Then we will make the old one a controller for the HA.
 
OK, so we decided to slowly migrate the storage to ZFS. I started my tests and the best I was able to get is 12-15 GB/s for sequential reads and around 8 GB/s for writes when using RAID10 ZFS.

Can I ask what would be the best setup to get at least close to the GRAID results of 60 GB/s reads and 25-30 GB/s writes?

The setup is 20× 1.92 TB SAMSUNG MZWLJ1T9HBJR-00007 drives, an AMD EPYC 7H12 64-core CPU and 1024 GB of DDR4 ECC RAM.
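The pool was created as striped mirrors, roughly like this (device names are placeholders; the remaining pairs follow the same pattern):

Code:
# RAID10-style ZFS pool: a stripe of mirrored NVMe pairs; device names are placeholders
zpool create -o ashift=12 tank \
    mirror /dev/nvme0n1 /dev/nvme1n1 \
    mirror /dev/nvme2n1 /dev/nvme3n1 \
    mirror /dev/nvme4n1 /dev/nvme5n1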
 
With respect, sequential reads are a nonsensical benchmark. If your use case is virtualization (and you are on a virtualization forum) you should be looking at 4k random IOPS. Adjust your fio command accordingly :)
Sorry, you are right. I understand. Will try to adjust it :). This was just a number I had from the previous test, so it was something I was able to compare.

I tried to measure the 4K random IOPS when testing GRAID, but never got results even close to theirs. I asked them about it; they confirmed those numbers are not right, but declined to help with editing the fio command.

The best I was able to create with the fio wiki is:

Code:
fio --name=randread --rw=randread --direct=1 --ioengine=libaio --bs=8k --numjobs=20 --size=1G --runtime=600 --group_reporting
Jobs: 11 (f=9): [r(6),f(1),_(1),f(1),_(6),r(1),_(1),r(2),_(1)][76.9%][r=2446MiB/s][r=313k IOPS][eta 00m:03s]
randread: (groupid=0, jobs=20): err= 0: pid=16364: Mon Dec 11 18:08:09 2023
read: IOPS=274k, BW=2140MiB/s (2244MB/s)(20.0GiB/9569msec)
slat (usec): min=4, max=9688, avg=62.37, stdev=83.31
clat (nsec): min=660, max=213522, avg=787.56, stdev=373.24
lat (usec): min=5, max=9691, avg=63.16, stdev=83.39
clat percentiles (nsec):
| 1.00th=[ 668], 5.00th=[ 684], 10.00th=[ 700], 20.00th=[ 724],
| 30.00th=[ 732], 40.00th=[ 740], 50.00th=[ 740], 60.00th=[ 748],
| 70.00th=[ 772], 80.00th=[ 844], 90.00th=[ 908], 95.00th=[ 1004],
| 99.00th=[ 1336], 99.50th=[ 1496], 99.90th=[ 2024], 99.95th=[ 2640],
| 99.99th=[10432]
bw ( MiB/s): min= 1772, max= 4195, per=100.00%, avg=2485.02, stdev=33.13, samples=317
iops : min=226914, max=537008, avg=318080.42, stdev=4240.08, samples=317
lat (nsec) : 750=51.77%, 1000=43.02%
lat (usec) : 2=5.10%, 4=0.08%, 10=0.02%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%
cpu : usr=1.89%, sys=55.48%, ctx=393285, majf=0, minf=247
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=2140MiB/s (2244MB/s), 2140MiB/s-2140MiB/s (2244MB/s-2244MB/s), io=20.0GiB (21.5GB), run=9569-9569msec



and

Code:
fio --name=randwrite --rw=randwrite --direct=1 --ioengine=libaio --bs=64k --numjobs=20 --size=1G --runtime=600 --group_reporting
...
fio-3.33
Starting 20 processes
Jobs: 20 (f=19): [w(9),f(1),w(10)][100.0%][w=4747MiB/s][w=76.0k IOPS][eta 00m:00s]
randwrite: (groupid=0, jobs=20): err= 0: pid=15763: Mon Dec 11 18:05:43 2023
write: IOPS=67.8k, BW=4239MiB/s (4445MB/s)(20.0GiB/4831msec); 0 zone resets
slat (usec): min=11, max=27328, avg=286.13, stdev=439.57
clat (nsec): min=700, max=1386.9k, avg=1663.03, stdev=8699.62
lat (usec): min=12, max=27334, avg=287.80, stdev=440.46
clat percentiles (nsec):
| 1.00th=[ 804], 5.00th=[ 868], 10.00th=[ 900], 20.00th=[ 964],
| 30.00th=[ 1032], 40.00th=[ 1096], 50.00th=[ 1176], 60.00th=[ 1320],
| 70.00th=[ 1528], 80.00th=[ 1768], 90.00th=[ 2064], 95.00th=[ 2544],
| 99.00th=[ 6432], 99.50th=[ 10432], 99.90th=[ 30848], 99.95th=[ 97792],
| 99.99th=[536576]
bw ( MiB/s): min= 3646, max= 4792, per=100.00%, avg=4262.18, stdev=19.63, samples=180
iops : min=58344, max=76672, avg=68192.56, stdev=314.07, samples=180
lat (nsec) : 750=0.04%, 1000=24.22%
lat (usec) : 2=64.53%, 4=8.83%, 10=1.85%, 20=0.38%, 50=0.08%
lat (usec) : 100=0.03%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%
cpu : usr=1.61%, sys=28.12%, ctx=227302, majf=0, minf=195
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,327680,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=4239MiB/s (4445MB/s), 4239MiB/s-4239MiB/s (4445MB/s-4445MB/s), io=20.0GiB (21.5GB), run=4831-4831msec



My current drive setup is just RAID 10 from those 10 drives in ZFS.
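For the next round I will try a 4k random-read run along these lines (queue depth and job count are my own guesses):

Code:
fio --name=randread4k --rw=randread --direct=1 --ioengine=libaio --bs=4k \
    --iodepth=32 --numjobs=20 --size=1G --runtime=60 --time_based --group_reporting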
 
I'd still suggest looking at 4k, but that looks pretty nice to me... what are your operational requirements?

The Proxmox server will run several VMs. The main one is a Plesk webhosting and MySQL server (I want to run MySQL in a secondary VM; Plesk will have its own small DB, but that will not be used for client data). The other VMs are our mail server, Mattermost and a Windows VM for accounting.

Our main goal is to run the webhosting server as fast as possible with the lowest response times, so we can host fast websites :) We can use a lot of the RAM in the machine for caching.
 
