Since there are many members here with quite some experience and knowledge of ZFS, I'm not only trying to find the best/optimal configuration for my ZFS setup, but also want to collect tests and information about Proxmox and ZFS in one place, rather than having it scattered across different threads, posts and websites on the internet.
Apologies for the many paste links, but the forum does not allow posting such long content as of now (it complains "Please enter a message with no more than 10000 characters" even though I had around 9k).
The system:
- CPU: AMD EPYC 7371
- RAM: 256 GB running at 2666 MHz (M393A4K40CB2-CVF)
- HDD: 2× 6TB HDD SATA Soft RAID (HGST_HUS726T6TALE6L1)
- NVME: 2× 960GB SSD NVMe (SAMSUNG MZQLB960HAJR-00007)
- Network: 10 Gbps
ARC settings (as per the docs): https://ghostbin.co/paste/8pkku
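In case the paste expires: the ARC limit from that chapter of the docs is just a module option in /etc/modprobe.d/zfs.conf (the 64 GiB value below is only an example of the format, not necessarily the value I use):
Bash:
# cat /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=68719476736
# update-initramfs -u
Since the root file system here is ZFS, the initramfs needs to be refreshed (the last command) for the change to apply at boot.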
The goal:
- Run as many KVM/WIN machines as possible, without lag inside them
- Keep the server load as low as possible, without dedicating all the RAM to ARC and L2ARC
- Prevent the server from crashing when running many VMs (obviously)
What I have learned so far:
- ZFS (Zettabyte File System) is an amazing and reliable file system.
- ZIL stands for ZFS Intent Log. Its purpose is to log synchronous operations to disk before they are written to your array.
- ZIL SLOG is essentially a fast persistent (or essentially persistent) write cache for ZFS storage.
- ARC is the main ZFS memory cache (in DRAM), which can be accessed with sub-microsecond latency.
- L2ARC sits in-between, extending the main memory cache using fast storage devices, such as flash memory based SSDs (solid state disks).
- A special device (the "special allocation class") is essentially a fast SSD vdev that holds the pool's metadata (and optionally small blocks), so it speeds up the slow disks?! Not much info out there. It cannot be removed.
- KVM machines use synchronous writes instead of asynchronous ones.
- The block size should match the hardware; for normal HDDs (and in my case) that is 4k.
- A ZIL SLOG should be added "if needed", yet some posts/websites say it is a must. It is also mentioned that the SLOG must be mirrored (RAID1), but some say that is not required.
- L2ARC should be added "if needed", and it eats into the ARC (RAM), since the L2ARC headers are kept in memory. It is not clear what the exact ratio is; for example, if you add a 400 GB SSD, how much ARC will it use? (A rough estimate is sketched right after this list.)
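On that last point, a rough back-of-the-envelope estimate of the ARC overhead (the ~70 bytes per cached record is a commonly quoted figure and depends on the ZFS version, so treat this purely as an illustration):
Code:
L2ARC size                = 400 GB
Average cached block size = 8 KB
Records in L2ARC          = 400 GB / 8 KB ≈ 50 million
ARC header per record     ≈ 70 bytes (version dependent)
ARC consumed by headers   ≈ 50,000,000 × 70 B ≈ 3.5 GB
So the smaller the blocks on the pool, the more ARC the L2ARC headers eat.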
Questions:
- How do I measure/find the optimal block size for the pool, running only KVM/WIN machines?
- Do I really need a SLOG and L2ARC?
- How do I set the SLOG/L2ARC sizes to be optimal?
- Do I need to add a special device?
- Can you mix SLOG/L2ARC with a special device?
Current pool status: https://ghostbin.co/paste/ups72
Current ARC summary: https://pastebin.com/y4WRCvbb
ZDB report: https://pastebin.com/eW9JRy1N
PVEPERF: https://ghostbin.co/paste/gpmos
Small tests using fio; the options are explained below, and the full command is assembled after the list:
- --direct=0 # O_DIRECT is not supported in synchronous mode
- --name=seqread # Job name (fio also uses it as the base of the file name unless --filename is set)
- --rw=read # Type of test
- --ioengine=sync # Defines how the job issues I/O. We will be using SYNC since our pool has sync set to "standard"
- --bs=4k # Block size; matches the 4k sectors of these HDDs
- --numjobs=1 # Run a single job
- --size=1G # File size
- --runtime=600 # Terminate processing after the specified number of seconds
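Putting those options together, the sequential read job looks roughly like this (my reconstruction of the command; the other tests only swap --rw and --name):
Bash:
# fio --direct=0 --name=seqread --rw=read --ioengine=sync --bs=4k --numjobs=1 --size=1G --runtime=600
For the other runs, --rw becomes write, randread, randwrite or randrw respectively.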
Results:
- Sequential Reads – SYNC mode: https://pastebin.com/YgzpD1mr
- Sequential Writes – SYNC mode: https://pastebin.com/0jTyp5h8
- Random Reads – SYNC mode: https://pastebin.com/zSa8H9yL
- Random Writes – SYNC mode: https://pastebin.com/Ahe7m9Lb
- Random Read/Writes – SYNC mode: https://pastebin.com/Zt7NpMQs
Now let's check ARC summary after these small tests: https://pastebin.com/Gq9jkJS6
As you can see, the Cache Hit Ratio is 99.46%. But this is (I'm guessing) because the tests used small files, small enough to fit in the ARC. So let's run a bigger test, with something that cannot fit in the ARC.
Random Read/Writes – SYNC mode with a 100 GB file and 4k block size. During fio's layout (file creation) phase the ARC kept growing, IO delay was around 5% and CPU usage 4-5%; however, once the ARC was full, IO delay jumped to 10-19% and even 28% (I'm guessing because it was now writing directly to the disks).
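For reference, this bigger test is essentially the same command with random mixed I/O and the size bumped up (job name and exact flags are my reconstruction):
Bash:
# fio --direct=0 --name=randrw100g --rw=randrw --ioengine=sync --bs=4k --numjobs=1 --size=100G --runtime=600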
- First run: https://pastebin.com/N0PqXtFA As we know, a spinning HDD averages around 75-100 IOPS; here we see an increase to about 135 IOPS on average. This is not the same as with the small 1 GB files, and my guess is that it is because the ARC is full.
- Second run: https://pastebin.com/GtiQYHC8 This run started instantly, because the data is already cached in the ARC. We see an increase to about 150 IOPS on average.
- Third run: https://pastebin.com/nAGFrGyj This time we see even fewer IOPS, and I don't understand exactly why. In my opinion it should be the same as run #2 or even better? The average IOPS is now like run #1, or even lower.
Now let's check ARC summary after these bigger tests: https://ghostbin.co/paste/xcsz5 (sorry, I reached my limit on pastebin and I will not create an account there)
I do not understand everything in these stats and what they represent, so I will leave that to someone willing to explain. All I understand is that the cache hit ratio is a bit higher now.
Not quite happy with the results, so I will now try to add a ZIL SLOG. The docs (https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#_limit_zfs_memory_usage) say to create partitions for the ZIL and L2ARC.
The partition for the ZIL should be half of the system memory (as the docs say), so in my case I made it 125 GB, and the remaining 769 GB I will dedicate to L2ARC.
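For completeness, splitting the NVMe like that can be done with something like sgdisk (shown as one option; the partition numbers match what I use below, and the device name should be double-checked on a blank disk before running anything):
Bash:
# sgdisk -n 1:0:+125G /dev/nvme0n1   # nvme0n1p1 -> 125 GB for the SLOG
# sgdisk -n 2:0:0 /dev/nvme0n1       # nvme0n1p2 -> the rest of the disk for L2ARC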
Adding the SLOG, without a mirror, just for testing:
Code:
# zpool add rpool log nvme0n1p1
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sda2     ONLINE       0     0     0
            sdb2     ONLINE       0     0     0
        logs
          nvme0n1p1  ONLINE       0     0     0
errors: No known data errors
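To double-check that sync writes actually land on the new log vdev, per-vdev I/O can be watched while a sync workload runs (1-second interval here):
Bash:
# zpool iostat -v rpool 1
The "logs" section of the output should show write activity during the fio runs.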
Now let's get back to testing, again with Random Read/Writes – SYNC mode, a 100 GB file and 4k block size. The ARC was cleared, the test file was removed, and we start fresh with fio.
- First run: https://ghostbin.co/paste/brsqd
- Second run: https://ghostbin.co/paste/7w7xj (performance got worse?)
- Third run: https://ghostbin.co/paste/bf362 (performance got worse?)
Overall, after adding the ZIL/SLOG NVMe SSD, the performance has not increased at all from what I can see. The next step is to add L2ARC and test:
Bash:
# zpool add rpool cache nvme0n1p2
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME         STATE     READ WRITE CKSUM
        rpool        ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            sda2     ONLINE       0     0     0
            sdb2     ONLINE       0     0     0
        logs
          nvme0n1p1  ONLINE       0     0     0
        cache
          nvme0n1p2  ONLINE       0     0     0
errors: No known data errors
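While these runs are going, the ARC and L2ARC hit rates can also be watched live instead of only checking arc_summary afterwards (field names may differ slightly between arcstat versions):
Bash:
# arcstat -f time,read,hit%,l2read,l2hit%,l2size,l2asize 5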
Now let's get back to testing once more, again with Random Read/Writes – SYNC mode, a 100 GB file and 4k block size. The ARC was cleared, the test file was removed, and we start fresh with fio.
- First run: https://ghostbin.co/paste/7png2 I can finally see some improvement! 160 IOPS on average, and what is more impressive is that the CPU load stayed under 1% after fio's layout phase. IO delay was constantly around 3%. Good(?). Here are some arcstats towards the end: https://ghostbin.co/paste/b4c9o
- Second run: https://ghostbin.co/paste/c7x8x Things are starting to look good! The average is now 270 IOPS, almost double. At the beginning it even jumped to 4k IOPS, but I guess that is just because of the ARC.
- Third run: https://ghostbin.co/paste/zpkuq Even more of an increase this time, with the average at 335 IOPS, which shows a good improvement.
Now let's check ARC summary after these bigger tests: https://ghostbin.co/paste/ktd8h
Interestingly, the Cache Hit Ratio is now 99.52%, which is an increase compared to running without L2ARC and SLOG. The L2ARC Hit Ratio is 65.01%; I'm not sure whether this can or will increase further.
Conclusion:
- ZIL/SLOG without L2ARC does not improve anything
- Server load (mostly CPU) is much lower with the ZIL/SLOG + L2ARC cache.
More questions:
- Is this setup optimal? Is there anything that needs to be, or can be, tweaked? Without adding hardware, of course.
- Would adding a special device on top of this be a bonus? The second NVMe just sits there doing nothing right now. (A sketch of what that would look like follows after this list.)
- Does the NVMe need to be mirrored (RAID1) or can it stay as it is? It is datacenter-class, so lifetime should not be an issue for a good while.
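Regarding the special device question above, and purely as a sketch of the syntax (the partitions named here are hypothetical, and given my note earlier that a special vdev cannot be removed, it should really be mirrored, since losing it means losing the pool):
Bash:
# zpool add rpool special mirror nvme0n1p3 nvme1n1p1
# zfs get special_small_blocks rpool   # 0 by default = metadata only; raising it sends small data blocks to the SSD too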
Thank you to those taking the time to read my (long) post and questions.