Proxmox VE PreInstall Sanity Check

rbeard.js

Member
Aug 11, 2022
Hi All,

I was hoping someone could look over our proposed Proxmox and Proxmox Backup Server configuration before we order everything we need and begin rolling it out to our servers. We are looking to have 3 servers running Proxmox set up in a cluster with failover, and 2 servers running Proxmox Backup Server (1 local and 1 remote).

Here are our specs on each machine at the moment:
Server 1 - 2x 11-core Xeons, 128GB of RAM, 8TB SSD in RAIDZ2
Server 2 - 2x 11-core Xeons, 256GB of RAM, 16TB SSD in RAIDZ2
Server 3 - 2x 6-core Xeons, 128GB of RAM, 16TB SSD in RAIDZ2
PBS 1 and 2 - 1x 4-core Xeon, 32GB of RAM, 16TB SSD (remote site will have HDDs) in RAIDZ2

Each OS will be installed on the Dual SD card reader inside the Dell Servers

We were planning on using ZFS on each machine. Would it be better to create one large ZFS pool of all the disks on each machine or create a few smaller pools? I realize we could always expand our storage by adding enough disks to create another ZFS pool, but we are trying to max out our storage now.

Server 1 is for our BI team, running a few Windows VMs. Mostly for running some heavy reports.
Server 2 is going to be running DNS, a DC, logging, a small file share, a print server, and some other VMs.
Server 3 is mostly going to be a testbed. It's also going to have the other two servers replicate to it, so if we need to move the VMs over for whatever reason, it's on standby.
The PBS role is obvious. I believe you can replicate your PBS server to another PBS server. Please correct me if I'm wrong, but that is the idea behind having two in two different locations.
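From what I've read, the offsite PBS would pull from the local one via a sync job, configured roughly like this on the remote box (untested, and the host/store/job names here are made up; a password/fingerprint may also be needed for the remote):

Code:
# on the offsite PBS: point it at the local PBS, then pull weekly
proxmox-backup-manager remote create main-pbs --host pbs1.example.com --auth-id sync@pbs
proxmox-backup-manager sync-job create offsite-sync --store offsite-store --remote main-pbs --remote-store main-store --schedule 'sun 01:00'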

Each of the 3 main servers will have a dual-port 10G NIC. One port will be used exclusively for Proxmox management and the other will be shared by the VMs. I'm hoping this will make backups, snapshots, and the like very quick by having that dedicated link.
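For reference, this is roughly the per-node layout we had in mind for /etc/network/interfaces, as two bridges (NIC names and addresses below are placeholders):

Code:
auto vmbr0
iface vmbr0 inet static
        address 192.168.1.11/24
        gateway 192.168.1.1
        bridge-ports enp65s0f0
        bridge-stp off
        bridge-fd 0
# second 10G port, VM traffic only
auto vmbr1
iface vmbr1 inet manual
        bridge-ports enp65s0f1
        bridge-stp off
        bridge-fd 0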

I'm a little worried about our current storage sizes on each server.
Server 1 - 2 VMs, at about 1TB total right now
Server 2 - About 10 VMs, about 7TB estimated right now

I know backups can be just the changes after the first full backup of each VM.
Also, all of these Windows VMs are running on much larger disks than they probably should be.

Does anyone have a recommended way of shrinking down those Windows disks before I port them into Proxmox? That would really help with the sizes.
Also, a good tool recommendation for pulling the Windows servers off the bare-metal machines would be appreciated! Some of these servers have multiple drives attached, like an E: drive for data, that we need to work around when moving them too. I'm sure that will complicate the moving process.

Anyways, please let me know any feedback you can give! We are really trying to set this up right the first time and save a ton of headaches by preplanning everything.
Thank you for your time!
 
Server 1 - 2x 11-core Xeons, 128GB of RAM, 8TB SSD in RAIDZ2
Server 2 - 2x 11-core Xeons, 256GB of RAM, 16TB SSD in RAIDZ2
Server 3 - 2x 6-core Xeons, 128GB of RAM, 16TB SSD in RAIDZ2
Did you read about padding overhead when using any raidz? You will most likely need to increase the volblocksize to minimize the padding overhead stealing capacity, and this will make small random reads and all small writes very slow. So it's not great for running stuff like PostgreSQL or MySQL DBs.

Good article on padding overhead: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz

An example:
You've got a raidz2 of 6x 8TB disks, so a total raw capacity of 48TB. You will lose 33% of raw capacity to parity data, so only 32TB is usable. And a ZFS pool shouldn't be filled more than 80% for best performance, so in reality only 25.6TB of usable capacity. And when using the default volblocksize of 8K, everything stored on a virtual disk (zvol) will consume 200% space, because there is 100% padding overhead. So the 25.6TB will be full after storing only 12.8TB of actual data. In other words, you lose 73% of the raw capacity.
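You can verify this on a live system by comparing a zvol's logical size with what it actually allocates (the dataset name here is just an example):

Code:
zfs get volsize,volblocksize,used rpool/data/vm-100-disk-0

If "used" is roughly double "volsize" on a raidz2 with the default 8K volblocksize, that's the padding overhead at work.
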
PBS 1 and 2 - 1x 4-core Xeon, 32GB of RAM, 16TB SSD (remote site will have HDDs) in RAIDZ2
Your offsite PBS still needs to run the GC, which will need days to complete without SSD caching for metadata. So backup/restore/verify performance will be bad, because the GC will fully utilize the HDDs for a very long time.
Each OS will be installed on the Dual SD card reader inside the Dell Servers
PVE/PBS shouldn't be installed on SD cards/pen drives. They write a lot (10GB/day wouldn't be that unusual), and those drives are not durable enough and will fail sooner or later. Not sure if industrial SLC NAND SD cards would be up to the task.


Would it be better to create one large zfs pool of all the disks on each machine or create a few smaller pools?
Depends. Usually it's better to have one big pool for better performance, but there are also cases where you might want a smaller pool. For example, a small pool just for databases, because the more disks your pool consists of, the bigger your volblocksize will have to be. There, a smaller striped mirror would be a better choice than a raidz2.
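Such a dedicated DB pool could be as simple as this (disk ids are placeholders):

Code:
zpool create -o ashift=12 fastpool \
    mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2 \
    mirror /dev/disk/by-id/ata-SSD_3 /dev/disk/by-id/ata-SSD_4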
 
Did you read about padding overhead when using any raidz? You will most likely need to increase the volblocksize to minimize the padding overhead stealing capacity, and this will make small random reads and all small writes very slow. So it's not great for running stuff like PostgreSQL or MySQL DBs.

Good article on padding overhead: https://www.delphix.com/blog/delphi...or-how-i-learned-stop-worrying-and-love-raidz
That link seems to be broken, but I did some research into the topic. What would be a good volblocksize for our application? Is there any downside to going larger and larger? The 8TB I listed above was the total raw capacity, so there are only 8x 1TB drives in that server. Same with the others.
Only server 1 will have a SQL database on it, and at the moment it's not very large. Your suggestion would be to make a separate pool just for the database?
I'm sure we can deal with the degraded performance if we need to fail over to server 3 temporarily. If we ran the SQL server on the larger volblocksize, is it a crippling impact to performance or are we talking small percentages? They are also getting triple the cores and triple the RAM in this upgrade, so I'm wondering if they would even notice a performance hit.


PVE/PBS shouldn't be installed on SD cards/pen drives. They write a lot (10GB/day wouldn't be that unusual), and those drives are not durable enough and will fail sooner or later. Not sure if industrial SLC NAND SD cards would be up to the task.
So, I like the idea of the dual SD card reader so we would always have redundancy. I'm assuming we could mirror some NVMe drives, maybe in a PCIe slot? If one of the NVMe drives fails, replacing it would fix the mirror, correct?
We could use normal SSDs, but they would have to be taped somewhere XD
Something small with a cache should work fine, no?
I'm not 100% sure this will work on our Dells anyway, as I don't recall seeing options for bifurcation. Some of the cards say they will work without it. However, if we only did a single SSD and it dies, could we easily recover our Proxmox install?


Your offsite PBS still needs to run the GC, which will need days to complete without SSD caching for metadata. So backup/restore/verify performance will be bad, because the GC will fully utilize the HDDs for a very long time.
The offsite would only back up once a week, which would give it a lot of time between backups for garbage collection. Do you think with that additional information it would be okay as is?


Also, I started using the VMware tool to convert the servers to a virtual disk. The disk is obviously a lot smaller than the original install was. If the new VHD is 96GB, could I tell Proxmox that it's just a 125GB disk and somehow shrink the disk within Windows?
 
That link seems to be broken, but I did some research into the topic.
That's sad. That was an article by the ZFS head developer explaining padding overhead on the block level :(
What would be a good volblocksize for our application?
That depends on the sector size of the disks, the ashift you choose, the number of disks per vdev, the number of striped vdevs, whether you care more about performance or capacity, ...
Is there any downside to going larger and larger?
The bigger you choose your volblocksize, the more read/write amplification you will see when doing small reads/writes. Let's say, for example, you do 1000x 4K random sync writes to a zvol with a 64K volblocksize. Then it will write 1000x 64K instead of 1000x 4K of data. So your throughput performance will be 1/16th and the SSDs will wear 16 times faster. So the volblocksize shouldn't be bigger than the blocksize your workload is usually writing/reading with, to prevent write/read amplification.
So you need to find the sweet spot between losing too much capacity because of padding overhead and losing too much performance caused by write/read amplification. You can't have both.
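If you settle on a value like 16K, in PVE the volblocksize is set per ZFS storage and only applies to newly created virtual disks, e.g.:

Code:
pvesm set local-zfs --blocksize 16k

Existing zvols keep their volblocksize; you would have to recreate or move the disks for the new value to apply.
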
So, I like the idea of the dual SD card reader so we would always have redundancy. I'm assuming we could mirror some NVMe drives, maybe in a PCIe slot? If one of the NVMe drives fails, replacing it would fix the mirror, correct?
If your server supports booting from NVMe, which not all servers do, then yes. There are small Intel Optane NVMes that are perfect as boot disks.
Replacing a boot disk is a bit more complicated: you need to clone the partition table and sync the bootloader first. See the chapter "Replacing a failed bootable device": https://pve.proxmox.com/wiki/ZFS_on_Linux#_zfs_administration
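Condensed, the procedure from that chapter looks like this (device names are examples):

Code:
sgdisk /dev/nvme0n1 -R /dev/nvme1n1   # copy the partition table from the healthy disk to the new one
sgdisk -G /dev/nvme1n1                # randomize the GUIDs on the new disk
zpool replace -f rpool <failed-partition> /dev/nvme1n1p3
proxmox-boot-tool format /dev/nvme1n1p2   # the new disk's ESP
proxmox-boot-tool init /dev/nvme1n1p2
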
However, if we only did a single SSD and it dies, could we easily recover our Proxmox install?
PVE doesn't have any host backup/restore feature yet. Best would be to mirror them, for less downtime and work.
The offsite would only back up once a week, which would give it a lot of time between backups for garbage collection. Do you think with that additional information it would be okay as is?
You can estimate that. 16TB of backups with an average chunk size of 2MB would be 8 million chunk files. The GC needs to read+write the metadata of all 8 million files, so at least 16 million IOs. A raidz2's IOPS performance doesn't scale with the number of disks, only with the number of vdevs, so the whole pool will be as slow as a single disk. Let's say that disk can handle 100 IOPS: 16 million IOs / 100 IOPS = 1.85 days. So it might be fine if you only back up once per week, but I would still add 2 or 3 small SSDs as special metadata devices in a mirror. Then that GC would be done in minutes... and backup/restore/verify performance would also benefit a bit, as the HDDs would then only be hit by data and not the metadata too.
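You can also just measure it instead of estimating (datastore name assumed):

Code:
proxmox-backup-manager garbage-collection start offsite-store
proxmox-backup-manager garbage-collection status offsite-store

and schedule the GC into the quiet window between your weekly syncs via the datastore's gc-schedule option.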
 

That's sad. That was an article by the ZFS head developer explaining padding overhead on the block level :(

That depends on the sector size of the disks, the ashift you choose, the number of disks per vdev, the number of striped vdevs, whether you care more about performance or capacity, ...

The bigger you choose your volblocksize, the more read/write amplification you will see when doing small reads/writes. Let's say, for example, you do 1000x 4K random sync writes to a zvol with a 64K volblocksize. Then it will write 1000x 64K instead of 1000x 4K of data. So your throughput performance will be 1/16th and the SSDs will wear 16 times faster. So the volblocksize shouldn't be bigger than the blocksize your workload is usually writing/reading with, to prevent write/read amplification.
So you need to find the sweet spot between losing too much capacity because of padding overhead and losing too much performance caused by write/read amplification. You can't have both.

mr44er posted a Wayback Machine link.
I clearly need to do some more research on the best combination of settings here. The SSD I'm using has a 4096-byte block size, and according to what I've seen so far, ashift 12 is correct for it.
I found one article saying that matching your blocksize to the volblocksize is preferred, so 4K, and another saying to use 16K with SSDs lol. Thank you for the information thus far though!

But I would still add 2 or 3 small SSDs as special metadata devices in a mirror. Then that GC would be done in minutes... and backup/restore/verify performance would also benefit a bit, as the HDDs would then only be hit by data and not the metadata too.
Oh! I misunderstood. I thought you were talking about shifting the entire server to pure SSD. I didn't realize you could cache on ZFS. Okay, sure, I can look into getting this added.

Could you explain a little about how I can pull data out of my ZFS pools? I'm still trying to wrap my mind around it. I know I can use zfs list to see all the contents of my pool.
How can I copy off the .raw disk for a VM to port it to a new machine?
I know I can back up the VM, which creates a .vma file that I should be able to restore to Proxmox.
I guess I'm describing a scenario where the backup function isn't working.
When I look through my file structure, I can see the VM disks, but they are all links and in several different parts, and definitely not something I can copy off.
I'm sorry if I'm not making sense. My coworker knows more about Linux than I do. I'm just used to seeing all the files and being able to manipulate them at will. I'm sure the same is possible with ZFS; it's probably just a different way of seeing/using the data.
 
Could you explain a little about how I can pull data out of my ZFS pools? I'm still trying to wrap my mind around it. I know I can use zfs list to see all the contents of my pool.
How can I copy off the .raw disk for a VM to port it to a new machine?
When working with ZFS, you use zvols for your virtual disks. They are block devices, not files, so more like a partition, or an LV in case you are familiar with LVM. So there is no file you could copy. To move data between ZFS pools you can use "zfs send | zfs recv". When piping it through SSH you can also move data between different hosts. Or you pipe zfs send to a file, copy that file, and then pipe it back from the file to zfs recv:
https://docs.oracle.com/cd/E18752_01/html/819-5461/gbchx.html
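A minimal example (pool, host, and VM names made up):

Code:
zfs snapshot rpool/data/vm-100-disk-0@move
# stream directly to another host:
zfs send rpool/data/vm-100-disk-0@move | ssh root@otherhost zfs recv tank/vm-100-disk-0
# or dump to a file and restore it later:
zfs send rpool/data/vm-100-disk-0@move > /mnt/usb/vm-100-disk-0.zfs
zfs recv tank/vm-100-disk-0 < /mnt/usb/vm-100-disk-0.zfs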
 
Your offsite PBS still needs to run the GC, which will need days to complete without SSD caching for metadata. So backup/restore/verify performance will be bad, because the GC will fully utilize the HDDs for a very long time.
Sorry to pull you back to this thread,
you mentioned using an SSD mirror (raidz1?) for our offsite backups to hold just the metadata.
I was looking around for a primer on setting this up, but "SSD cache" just leads to a bunch of articles about using SSDs for read and write caches. Would you happen to have a how-to or a quick summary on how I can set this up for our offsite? We like the idea of quicker GC after all!
 
you mentioned using an SSD mirror (raidz1?) for our offsite backups to hold just the metadata.
Special metadata devices don't support raidz1/2/3, only mirrors. So your data can be on raidz1/2/3, but the metadata has to be stored on a mirror. If you want the same reliability as raidz2 (so two disks may fail), you could mirror 3 SSDs. And it is not a cache: if you lose your SSD mirror, you also lose all data on the HDDs.

Would you happen to have a how-to or a quick summary on how I can set this up for our offsite?
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
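The gist of it is a single command (pool name and disk ids below are examples). Keep in mind that only newly written metadata lands on the special devices; existing metadata stays on the HDDs until the data is rewritten:

Code:
zpool add tank special mirror /dev/disk/by-id/ata-SSD_1 /dev/disk/by-id/ata-SSD_2 /dev/disk/by-id/ata-SSD_3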
 
And it is not a cache: if you lose your SSD mirror, you also lose all data on the HDDs.
Well, that's terrifying. I guess it is an offsite backup and not super critical. We could also rebuild the SSD mirror if needed. I'm going to consider a third SSD now just to be safe. Thank you!
 
Special metadata devices don't support raidz1/2/3, only mirrors. So your data can be on raidz1/2/3, but the metadata has to be stored on a mirror. If you want the same reliability as raidz2 (so two disks may fail), you could mirror 3 SSDs. And it is not a cache: if you lose your SSD mirror, you also lose all data on the HDDs.


https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
So, I was setting up my special device and I'm getting an error. It says I can use the force flag to ignore it and proceed, but I wanted to check with you before doing that. The zpool carbon is the target for the metadata cache and is set up as a raidz2.

Also, just to confirm, I can rebuild this special device mirror if one of the SSDs fails, right? I could slap in a new disk and run some commands to rebuild the mirror?

(Attached: screenshot of the zpool add warning and the command used.)
 
Also, just to confirm, I can rebuild this special device mirror if one of the SSDs fails, right? I could slap in a new disk and run some commands to rebuild the mirror?
Jup.
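Something like this, with your pool name (disk ids are examples):

Code:
zpool replace carbon /dev/disk/by-id/ata-FAILED_SSD /dev/disk/by-id/ata-NEW_SSD
zpool status carbon   # watch the resilver progress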

So, I was setting up my special device and I'm getting an error. It says I can use the force flag to ignore it and proceed, but I wanted to check with you before doing that.
Looks fine to me. That is just a warning to prevent you from doing something stupid by accident. It still allows you to do something "stupid" on purpose; you just have to set the "-f" flag to force it.

Again:
Keep in mind that special devices are not a cache. Lose both special devices and all data on those HDDs is lost too.
And once added, you won't be able to remove the special devices without destroying the entire pool and recreating it from scratch.

Your command should work, but you are weakening your reliability.
With your raidz2 of HDDs, any 2 HDDs may fail without data loss. And with just a single HDD lost, ZFS can still repair corrupted data, because a HDD with parity data is still left.

With just 2 SSDs in a mirror, only 1 SSD may fail. And once that single SSD has failed, any small error will corrupt data and ZFS won't be able to repair it, as no other copy exists. So the whole pool is very vulnerable while resilvering or while waiting for a replacement SSD.

So the way to not weaken your reliability would be to add a mirror of 3 SSDs.

And I wouldn't add the SSDs as "/dev/sdg" but as "/dev/disk/by-id/YourDisk". That makes it much easier to identify a failed disk when replacing it, as ZFS will then show you the wwn or disk serial of the failed disk, which is also printed on the disk itself.
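For example:

Code:
ls -l /dev/disk/by-id/ | grep sdg   # shows which wwn/serial sdg maps to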
 
Your command should work, but you are weakening your reliability.

So the way to not weaken your reliability would be to add a mirror of 3 SSDs.

Okay, awesome, it worked just perfectly!
I actually thought about that too. I went down to a raidz1 actually; I mistyped above. We can afford to lose a backup of a backup. It's easy enough to repair if something goes wrong with it. I removed the fourth drive and added it to our Unraid server, where it will find more use XD
Thanks again for the help!