(Developer) question about .chunk folder

Why does it make sense - on any filesystem - to run millions of readdir(3) and fstat(2)?

Please explain.
Because ZFS has the instruments for that as a kernel extension. ZFS read operations are cached; write operations are synchronous ("on premise") if needed, but can be cached if not important. Also, ZFS solves a lot of other issues and offers the redundancy that people a few years ago had to get from hardware RAID or mdadm. I mean: of course it is possible to use a database like SQLite on ZFS or any other filesystem. But again, as soon as you set up your database to write synchronously, it will get slow.
I have to admit that I didn't benchmark SQLite with synchronous writes vs. ZFS write operations ignoring the cache, but it won't change much - it will be slow. Just search for Firebird database issues (synchronous writes by default)... which I did benchmark.
Send me a private message, I'll be happy to go through a few things with you when I get the chance.
 
what are we talking about then? the *only* operation that needs to access only the chunk metadata is garbage collection; all other operations need to write or read the chunks anyway, so a metadata cache would not help at all

Linear read/write/unlink are not the slow operations.

Why?

Check my first post. It took 37 seconds to "find" 450.000 files.

Bash:
root@pbs:~# time du -hs /pbs
time find /pbs -type f -print | wc -l
707G    /pbs


real    0m37.253s
user    0m0.377s
sys     0m9.537s

Same test for linear read/writes

Bash:
root@pbs:/pbs# time dd if=/dev/zero of=1gb.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 26.9374 s, 39.9 MB/s

real    0m28.747s
user    0m0.000s
sys     0m1.617s

# I rebooted here to have a "cold" read
root@pbs:/pbs# time cat 1gb.img >/dev/zero

real    0m3.038s
user    0m0.002s
sys     0m1.072s

root@pbs:/pbs# time dd if=/dev/zero of=lots-of-writes.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 1.00075 s, 512 kB/s

real    0m1.127s
user    0m0.003s
sys     0m0.019s
 
because zfs has the instruments for that as kernel extention. zfs read operations are cached,

On any FS operations are cached :)



That doesn't mean it's a good idea to run fstat(2) / readdir(3) over and over again.

The pure fact that ZFS has some extra storage for the inode index - and doesn't need to run a seek operation - is no rational argument for running fstat(2) and readdir(3) more than once.

Correct me if I am wrong here. What do you "gain" from these multiple calls? Nothing has changed. You jump from user space into kernel space (which is in itself a costly operation). Why not cache it in RAM? I showed you - in math - that it's about 60 MB for half a million files.
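(As a rough sanity check of that figure, under my own assumption of roughly 120 bytes per cached entry - a 32-byte SHA-256 digest plus the relevant stat fields and hash-map overhead: half a million entries x ~120 bytes is about 60 MB of RAM.)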
 
Probably even a local redis would make sense to just store the key / fstat / path / node info information
Oh wait, Redis is a RAM-based database! That is why it is fast in I/O, but you don't want a crash while it hasn't finished saving the file to disk. Redis is only for small and unimportant data, like configs (where, in case of a crash, the old config is still readable).

On any FS operations are cached :)



That doesn't mean it's a good idea to run fstat(2) / readdir(3) over and over again.

The pure fact that ZFS has some extra storage for the inode index - and doesn't need to run a seek operation - is no rational argument for running fstat(2) and readdir(3) more than once.

Correct me if I am wrong here. What do you "gain" from these multiple calls? Nothing has changed. You jump from user space into kernel space (which is in itself a costly operation). Why not cache it in RAM? I showed you - in math - that it's about 60 MB for half a million files.
... no words, except insults. Mea culpa. Your cache info is for I/O that is cached, and of course there is always a cache available and used... except when you write synchronously ("on premise"), which goes directly to disk for every single write op - but that doesn't invalidate your already existing cache or any other op that is not synchronous.

Harry, I stumbled into this topic by accident while searching for my own issues, but now, please, RTFM at least about the stuff you think you know.
 
oh, wait, redis is a RAM based database! this is why it is fast in i/o but you don't want a crash, while it wasn't finished saving the file on disk. redis is only for small and unimportant data, like configs (where in case of crash the old config still readable).

You don't have to persist it. That is the point.

You can even create the index on the fly and cache it.
 
You don't have to persist it. That is the point.

You can even create the index on the fly and cache it.
Wrong, you have to persist, since chunks that might be removed from the index in your non-persistent dream case are also needed for older backups.
A file-based database might get corrupted as a whole when it is not persistent. As a counterpart to file-based databases there are real databases (I don't mean SQL or NoSQL specifically; you could suggest Mongo, which is persistent) - they use a tracking log for every single query. And those databases are real DB servers, like MS SQL with its transaction log, using the database file with a cache to make sure your complicated queries (those with lots of joins) are fast, while writing the log synchronously, which does not block the main CPU threads... I am out for today.
 
Wrong, you have to persist, since chunks that might be removed from the index in your non-persistent dream case,

You don't have to persist the meta information on chunks!

I covered 3 possible problems with examples.

- File exists in the cache only (but something removed it from disk)
- File exists on the disk only (but is not in the cache)
- File was removed by a crazy user operation / or is not yet on the disk (rm, disk failure, ... PBS already handles this - it's called "disk failure", "user error" - or, for tapes, "file needs to be recovered from the tape, because it's not yet in the disk cache"). Check the source code I am referencing.


In the worst case scenario you do an fstat(2) instead of using the cache. There is no "data corruption".

Very very simple.

There is no problem. We don't "replace" the filesystem - we only want to reduce the fstat(2) and readdir(3) calls. A minimal sketch of the idea follows below.
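To make that concrete, here is a minimal sketch (my own illustration with made-up names, not PBS code; it assumes the .chunks/<first 4 hex chars>/<digest> layout discussed in this thread) of a non-persistent stat cache whose only fallback on a miss is a plain os.stat() against the filesystem:

Python:
import os

class ChunkStatCache:
    """Non-persistent, in-RAM cache of chunk stat results (illustration only)."""

    def __init__(self, base_dir):
        self.base_dir = base_dir
        self._cache = {}                      # digest -> os.stat_result

    def _path(self, digest):
        return os.path.join(self.base_dir, ".chunks", digest[:4], digest)

    def stat(self, digest):
        st = self._cache.get(digest)
        if st is None:
            st = os.stat(self._path(digest))  # cache miss: ask the filesystem
            self._cache[digest] = st
        return st

    def invalidate(self, digest):
        self._cache.pop(digest, None)         # e.g. after pruning/unlinking a chunk

The contentious part, as the replies below make clear, is what happens when such a cached entry goes stale.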
 
Linear read/write/unlink are not the slow operations.
but chunk reads/writes are not linear but random

root@pbs:/pbs# time dd if=/dev/zero of=1gb.img bs=1G count=1 oflag=dsync
that's the equivalent of ~256-512 chunks, not 450.000

so when we have to read/write chunks, the stat beforehand for each one can't really hurt in comparison to reading/writing 1-4 MiB
and you said we're not talking about garbage collection, but that's the only high-level operation that purely relies on the chunk metadata (all others must read or write)
so if the metadata cache is not meant to improve garbage collection, why do this at all, since it'll have little to no effect on the other processes like verification/backup/restore/etc.

You don't have to persist the meta information on chunks!

I covered 3 possible problems with examples.

- File exists in the cache only (but something removed it from disk)
- File exists on the disk only (but is not in the cache)
- File was removed by a crazy user operation / or is not yet on the disk (rm, disk failure, ... PBS already handles this - it's called "disk failure", "user error" - or, for tapes, "file needs to be recovered from the tape, because it's not yet in the disk cache"). Check the source code I am referencing.

this doesn't nearly cover all cases... in each case we must differentiate between functions/methods that only rely on the metadata and those that actually read/write the chunk. and for methods that only rely on the metadata (e.g. a check for existence when we want to avoid a write), if the cache is inconsistent with the underlying filesystem this is catastrophic! as in: if we check whether the chunk exists so we don't write it, but in reality it's gone (even though it's in the cache), we wouldn't write although we would need to


There is no problem. We don't "replace" the filesystem - we only want to reduce the fstat(2) and readdir(3) calls.
again, that only really helps garbage collection, which you mentioned this thread is not about, so this won't bring any improvement.

I'd recommend (as we already do in the documentation and everywhere else): don't use slow or remote storage for datastores, as you won't be happy with it once the datastore gets bigger.
If your budget absolutely does not allow for that, there are solutions for some slow storages, like a ZFS metadata special device.
 
but chunk read writes are not linear but random


thats the equivalent of ~256-512 chunks not 450.000
...

Dear Dominik,

the problem is much more severe.

I am about to publish test code and helper scripts. This will give us real-world measurements on real hardware comparing these scenarios:

- zfs
- ext4
- nfs
- smb
- sata ssd
- sata
- nvme

Tests:

- create 0000-ffff directories
- create 500.000 random sha256 filenames, every file is created via (the equivalent of) "echo -n $x | sha256sum"
- query all files (readdir)
- fstat all files (readdir+)
- random access n files (sha256 name)
- read the data from n files

(I started with shell scripts, but the pipes and child processes take most of the time - so I went into real coding. A sketch of the name generation follows below.)
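For reference, the name generation boils down to this (a simplified Python sketch of what the test does; BASE_DIR and FILES_TO_WRITE are placeholder values, and the 0000-ffff bucket directories are assumed to have been created by the create_buckets step already):

Python:
import hashlib
import os

BASE_DIR = "/mnt/target"     # placeholder
FILES_TO_WRITE = 500_000     # placeholder

for x in range(FILES_TO_WRITE):
    # the equivalent of `echo -n $x | sha256sum`
    sha = hashlib.sha256(str(x).encode()).hexdigest()
    prefix = sha[:4]         # one of the 0000-ffff bucket directories
    filename = os.path.join(BASE_DIR, "buckets", prefix, sha)
    with open(filename, "w") as f:
        f.write(str(x))      # tiny payload on purpose: we measure the fs, not throughput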

To be honest - I am shocked by the numbers so far.

It will take some time to have useful numbers. There is a dramatic improvement possible here!



Willing to agpl the script and create a tutorial here.
 
if you're really set on producing some solution here, i'd recommend going via our developer list so that the code can be integrated,
i'd personally recommend writing there before actually starting any implementation though.

also i can't promise you will get a different answer than mine there...

see https://pve.proxmox.com/wiki/Developer_Documentation
for details on how to contribute code
 
if you really set on producing some solution here, i'd recommend going via our developer list so that code could be integrated,
i'd personally recommend writing there before actually starting any implementation though.

I am not implementing any code here. I still feel like the caveman throwing rocks at a space shuttle.

After the test suite I am currently writing, I am sure we are going to have a better foundation for any possible action steps.
 
if you really set on producing some solution here, i'd recommend going via our developer list so that code could be integrated,
i'd personally recommend writing there before actually starting any implementation though.

I created a performance tester. I tested it on quite old hardware (SSD, SATA, USB, multiple CPUs, multiple computers, multiple filesystems, multiple remote filesystems, ext4, zfs, ...).

That is a solid foundation to remove any myths / BS / assumptions people have:

https://github.com/egandro/pbs-storage-perf-test

This is my conclusion (there is a chapter on what Proxmox GmbH can do):

https://github.com/egandro/pbs-storage-perf-test/blob/main/conclusion.md


There are two coding issues with PBS I found:

- The biggest (coding) impact: the way PBS treats buckets. I think 0000-ffff is too much, and the numbers will show that. Squid uses a different directory structure. Try to experiment - this is a major performance thing.
- Also, please detect smb/nfs/webdav datastores. They are the worst thing you can do! There are tons of tutorials and YouTube videos on how to mount this BS in PBS. PBS behaves like O(e^x) there compared to a local filesystem.


The code is AGPLed. Please ping me if you need something else. I found my answers.
 
note that i know that benchmarking storage is hard and error prone, but it has to be done correctly to produce results that are usable:
i looked over the code and the results a bit, and a few things stand out that might not be showing the full or correct picture:

about the methodology:

* i'd use 'fio' as a base benchmark for disks (not only because i find bonnie++ output confusing, but it does not really show the random read/write IOPS performance for e.g. 4k blocks, and i don't really know how bonnie behaves with the cache)
* you only let the benchmark run for one round; in general, but especially for benchmarking storage, i'd recommend using at least 3-5 passes for each test so you can weed out outliers (with one pass you don't know if any of them is an outlier)
* the benchmark parameters of the filesystems/mounts would be nice: which options did you use (if any), how did you create the filesystem, etc. especially for zfs that would be interesting because it can make a big difference
* your 'drop_caches' is correct for "normal" linux filesystems, but not for zfs since it has its own ARC (whose size would also be interesting); basically you'd have to limit the ARC to something very small to properly benchmark it (also there is an extra option for zfs cache dropping)
* you have an error in the 'find_all_files' function: you list each chunk bucket folder for every file in it, so for 500.000 files you list each of the ~65000 folders about 7-8 times [wrong, see edit below]
* you list 50000 files, not 500000, but the 'no_buckets' version lists all?
* each "chunk" you create only has 1-6 bytes (the index itself), which skews the benchmark enormously. please try again using "real" chunks in the order of 128k - 16MiB (most often the chunks are between 1 and 4 MiB)
this is the biggest issue i have with this, because the benchmark does not produce any "real" metric. you probably benchmark more how many files you can open/read, not how much throughput you get in a real pbs scenario

as for the benchmarks themselves:

only two of the benchmarks have a relation to real world usage of the datastore:
* create random chunks, mostly behaves like a backup
* stat_file_by_id: similarly behaves like a garbage collection, but not identically, since there are two passes there: one pass where we update the atime of the files from the indices and one pass to check and delete chunks that are not needed any more

'read_file_content_by_id' is a suboptimal attempt to recreate a restore, because you read the files in the exact same order as they were written, which can make a huge difference depending on disk type and filesystem
'find_all_files' is IMHO unnecessary since that is not an operation that is done anywhere in pbs (we don't just list chunks anywhere)

as for the results:

there are some results that are probably not real and are masked by some cache (probably the drives'): e.g. you get ~200000 20000 IOPS on a QLC ssd, i very much doubt that is sustainable, i guess there is some SLC cache or similar going on (which skews your benchmark)
please don't use usb disks, the usb connection/adapter alone will introduce any number of caches and slowdowns that skew the benchmark

now for the conclusions (at least some thoughts on some of them):

* yes i agree, we should more prominently in the docs tell users to avoid remote and slow storages
* zfs has other benefits aside from "speed"; actually it's the reverse, it's probably slower but has more features (like parity, redundancy, send/receive capability, etc.)
* SATA has nothing to do with performance, there are enterprise SATA SSDs as well
* there is already a 'benchmark' option in the pbs client, we could extend that to test other things besides chunk uploading of course
* i'm not convinced the number of buckets makes a big difference in the real world (because of the methodology points i pointed out above)
* i'm not sure sshfs is a good choice for remote storage, every time i use it i find it much slower than nfs/smb, and it may not support all filesystem features we need (it does not have strict POSIX semantics)

thanks for taking the time to try to improve PBS!

EDIT: fixed some things i misread (FILES_TO_READ vs FILES_TO_WRITE)
 
note that i know that benchmarking storage is hard and error prone, but that has to be correctly done to produce results that are usable:
i looked over the code and the results a bit and a few things stand out that might be not showing the full or correct picture:

about the methodology:

* i'd use 'fio' as a base benchmark for disks (not only because i find bonnie++ output confusing, but it does not really show the rand read/write IOPS performance for e.g. 4k blocks, and i don't reallly know how bonnie behaves with cache)


Please read the test carefully. We are not benchmarking the "disk IO" - that is just additional information. It's not about the speed of the disk, it's about the speed of the filesystem.

* you only let the benchmark run for one round, in general, but especially for benchmarking storage, i'd recommend using at least 3-5 passes for each test, so you can weed out outlier (with one pass you don't know if any of them is an outlier)

What would you gain from multiple passes? We see the local filesystem takes 1.2 seconds (for 65536 directories), while Samba takes 18 minutes, NFS 40 seconds and sshfs 20 seconds. Seeing is believing. Even if a 3rd or 5th pass had a 10% variance - what difference would it make?

Caches are eliminated in every pass. We are not benchmarking the ram. (echo "1" .... "3" > /proc/sys/vm/drop_caches)

target dir       | sha256_name_generation | create_buckets
.                | 0.25s                  | 1.19s
/nfs             | 0.27s                  | 38.14s
/smb             | 0.30s                  | 1129.43s
/sshfs           | 0.30s                  | 19.50s
/iscsi           | 0.26s                  | 1.87s
/ntfs            | 0.24s                  | 1.99s
/loopback-on-nfs | 0.25s                  | 1.50s


* your 'drop_caches' is correct for "normal" linux filesystems, but not for zfs since it has it's own arc (which size would also be interesting), basically you'd have to limit the arc to something very small to properly benchmark it (also there is an extra option for zfs cache dropping)
* you have an error in the 'find_all_files' function, you list each chunk bucket folder for every file in it, so for 500.000 files you list each of the ~65000 folder about ~7-8 times Wrong see below

With all respect, that is totally irrelevant. ext4 vs. zfs are equal (line 1 and line 2). I don't care if 8-year-old hardware needs 60 sec or 80 sec to create 500.000 random files in 65536 buckets. That's irrelevant.

NFS and SMB take about 2 hours for the same operation. That is crazy. Only sshfs comes near something I would expect - if I were forced to pick a remote fs.

target dir | filesystem detected by stat(1) | create_buckets | create_random_files | read_file_content_by_id | stat_file_by_id | find_all_files
.          | ext2/ext3                      | 6.39 sec       | 67.58 sec           | 24.99 sec               | 10.98 sec       | 42.25 sec
/zfs       | zfs                            | 2.43s          | 85.24s              | 24.86s                  | 16.52s          | 76.11s
/nfs       | nfs                            | 496.37 sec     | 7256.65 sec         | 97.60 sec               | 59.51 sec       | 189.97 sec
/smb       | smb2                           | 3455.01s       | 6120.51s            | 171.50s                 | 169.93s         | 1023.71s
/sshfs     | fuseblk                        | 170.25 sec     | 2434.56 sec         | 258.87 sec              | 183.00 sec      | 1469.04 sec
/iscsi     | ext2/ext3                      | 8.01 sec       | 132.06 sec          | 49.01 sec               | 23.55 sec       | 55.53 sec

* the benchmark parameters of the filesystems/mounts would be nice, which options did you use (if any), how did you create the filesystem, etc. especially for zfs that would be interesting because it can make a big difference

Feel free to optimize that. I personally see no "magic mount option sauce" that reduces the 2 hours of Samba/NFS to anywhere near the 60-80 sec timings of the local filesystem. That magic option would have to deliver a factor-1000 speedup. Again - we are dropping caches in every pass to have the worst case.



* you list 50000 files not 500000, but the 'no_buckets' version lists all?

Yes, we saturate the filesystem with a "real-world-ish" 500.000 files. That is what I have as chunks in my working set on most of my systems after a few months. We run a buckets vs. no-buckets comparison. For the "read/stat files" tests we take 10% of those files - a restore operation.

Again - we are not interested in the disk speed; that is why the files contain only a simple integer. We are measuring the open() (and internal fs seek operation) and stat() methods of the OS.

The find_all_files test is what the garbage collection method does: the equivalent of a "find /foo -type f" call.


* each "chunk" you create only has 1-6 bytes (he index itself) which skews the benchmark enomoursly. please try again by using "real" chunks in the order of 128k - 16MiB (most often the chunks are between 1 and 4 MiB)
this is the biggest issue i have with this, because the benchmark does not produce any "real" metric with that. you probably benchmark more how many files you can open/read not how much throughput you get in a real pbs scenario

I totally agree that the chunk size is a relevant factor. That is why we tried to eliminate that factor by all means and to have as few read/write operations as possible. We are not interested in a speed competition of "which hardware is the most capable of linear block reads and writes". We only have a single read/write with roughly the size of an inode of the filesystem.

That is done on purpose - this is a filesystem benchmark, not a read/write throughput benchmark.

I totally agree one can do such tests and that they are beneficial. We are still convinced that NFS and Samba are the worst filesystems you can pick, as you have to wait 2 hours to create 500.000 files of 4 bytes each. Just imagine we created 500.000 x 4 MB files - that would take an eternity...


only two of the benchmarks have a relation to real world usage of the datastore:
* create random chunks, mostly behaves like a backup
* stat_file_by_id: similarly behaves like a garbage collect but not identically, since there are two passes there: a pass where we update the atime of the files from indices and one pass to check and delete chunks that are not necessary any more)

Check

'read_file_content_by_id' is a suboptimal try to recreate a restore because you read the files in the exact same order as they were written, which can make a huge difference depending on disk type and filesystem

Fun fact :) that is exactly what is happening, and on purpose.

Files 1...500.000 (and reads for files 1...50.000). So we know exactly that the 50.000 files we read are the same 50.000 that are "connected", like in a real-world backup.

Python:
# write
for x in range(FILES_TO_WRITE):
    # ...
    filename = BASE_DIR + "/buckets/" + prefix + "/" + sha  
    # ...
    f.write(str(x))

# read
for x in range(FILES_TO_READ):
    # ...
    filename = BASE_DIR + "/buckets/" + prefix + "/" + sha  
    # ...
    bytes = f.read()


'find_all_files' is IMHO unnecessary since that is not an operation that is done anywhere in pbs (we don't just list chunks anywhere)

I am not sure about the pbs code - but you have some read_dir() calls: https://github.com/search?q=repo:proxmox/proxmox-backup+read_dir&type=code

* stat_file_by_id: similarly behaves like a garbage collect but not identically, since there are two passes there: a pass where we update the atime of the files from indices and one pass to check and delete chunks that are not necessary any more

I totally agree. The atime update and the unlink test are not implemented. That is on purpose. The required step is the stat() call - and as you can see in the numbers, it is sort of OK across the filesystems for this operation.

Best / Worst:

3s vs. 45s (nvme)
611s vs. 674s (usb sata)
10s vs. 183s (old m.2)

But what is worse is the hours you wasted until you came to this point of having the 500.000 files on the disk. Here is what I expect: updating the atime and unlinking will get a similar "hell no!" recommendation for NFS and Samba, and sshfs will behave OK-ish. There will not be much difference between zfs and ext4.

Do you think it's worth adding that test?

'read_file_content_by_id' is a suboptimal try to recreate a restore because you read the files in the exact same order as they were written, which can make a huge difference depending on disk type and filesystem

As mentioned - that's on purpose.

Should I read this in inverse order?


there are some results that are probably not real and is masked by some cache (probably the drives): e.g. you get ~200000 20000 IOPS on a QLC ssd, i very much doubt that is sustainable, i guess there is some SLC cache or similar going on (which skews your benchmark)
please don't use usb disks, the usb connection/adapter alone will introduce any number of caches and slowdown that skews the benchmark

I used - on purpose - the worst hardware that I could find. From Big O notation we learned that hardware just changes the coefficient. It is very hard to change a factor by a power of 10 with hardware, e.g. "this is 1000x faster".

Thanks for your rant on usb for any proxmox product :) :) :)

Yes, the internal caches of the hardware are an additional factor that you have to eliminate via the content size of the blocks.

Anyway :) That is what makes this test optimal. For the localhost/localhost NFS, SMB, sshfs... we eliminated the hardware factor, since we run on the same disk as the ext4 test. For benchmarking the filesystem it's quite accurate.

I hope you don't live under the impression that putting an NFS/Samba server on top of a ramdisk would make it "fast" :) I can do such a test. We will still see the 40 sec vs. 2 hours.



now for the conclusions (at least some thoughts on some of them):

* yes i agree, we should more prominently in the docs tell users to avoid remote and slow storages
* zfs has other benefits aside from "speed", actually it's the reverse, it's probably slower but has more features (like partiy, redunancy, send/receive capability, etc.)
* SATA has nothing to do with performance, there are enterprise SATA SSDs as well
* there is already a 'benchmark' option in the pbs client, we could extend that to test other things besides chunk uploading of course
* i'm not conviced the number of buckets make a big difference in the real world (because of the methodology points i pointed out above)
* i'm not sure sshfs is a good choice for remote storage, everytime i use it find it much slower than nfs/smb, also it may not support all filesystem features we need (it does not have strict POSIX semantics)

  • ZFS might be faster with more disks / RAID (I was too lazy to measure that) - I couldn't find any "dramatic" gain over ext4
  • We used SATA spinning rust. If you have the option to choose, use flash for backup/restore and spinning rust for backups of backups
  • The number of buckets (and how you create them) is a major issue (on remote filesystems) - the tests clearly indicate that
  • sshfs is the only remote fs the numbers would make me choose - do you really recommend waiting 2h for Samba / NFS while sshfs is done in 5 minutes?

Can you please elaborate on what made you guys use buckets? That would be interesting. Here is what squid does:

Code:
# cache_dir ufs /var/spool/squid XXXXX 16 256
/var/spool/squid/00/00
/var/spool/squid/00/BA
...
/var/spool/squid/00/98
/var/spool/squid/00/2B

You get 16 top-level buckets, each with 256 child buckets. So the user can decide what's "optimal".
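To make the comparison concrete, here is a quick sketch (my own illustration, not how squid internally assigns file numbers) of how the same chunk digest could map to a directory in a flat 0000-ffff layout versus a squid-like two-level 16x256 layout:

Python:
import hashlib

digest = hashlib.sha256(b"example-chunk").hexdigest()

# flat layout: 65536 directories named 0000-ffff, keyed by the first 4 hex chars
flat_path = ".chunks/" + digest[:4] + "/" + digest

# squid-like layout (cache_dir ... 16 256): 16 top-level dirs, 256 subdirs each,
# i.e. 4096 directories spread over two levels instead of 65536 on one level
two_level_path = "buckets/" + digest[0] + "/" + digest[1:3] + "/" + digest

print(flat_path)
print(two_level_path)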
 
ok so i did read the whole post, but i will not go into detail on every point here, i'll summarize my opinion though:

* you disregard my criticism of your methodology; if the methodology is flawed, you don't need to benchmark at all. your smb create-bucket tests vary from 1100s to over 3000s, but you tell us that it's not about disk performance
locally here i could create the directories on a samba share in about 50s, so what does that tell us?

* we always said "don't use slow and/or remote storage" that didn't and probably will not change, and there is no silver bullet to "fix" it because of the nature of these filesystems

* theoretical benchmarks of disk operations are nice, but only if the method is fine, and if you don't benchmark things that are relevant in real-world scenarios (e.g. chunk size) it does not say anything about the system you want to use

all in all, i stand by my initial post in the thread: any type of cache for the chunks not done by the os is too complicated and brittle to work properly and probably won't bring any real-world performance gain, and we don't recommend slow or remote filesystems
(in fact i did send a patch for the docs to spell that out: https://lists.proxmox.com/pipermail/pbs-devel/2024-June/009804.html )
 
ok so i did read the whole post, but i will not go into detail on every point here, i'll summarize my opinion though:

* you disregard my critic about your methodology, if the methodology is flawed, you don't need to benchmark at all. your smb create bucket tests vary from 1100s to over 3000s but tell us that it's not about disk performance
locally here i could create the directories on a samba share in about 50s, so what does that tell us?

Yes - that is the point. It's relative to one disk / one piece of hardware. The hardware is eliminated. It's not the absolute numbers that are the issue.

  • if you use ext4 / zfs on the same disk, you get 20 sec
  • if you use smb/nfs (mounted on localhost) on the same ext4 / zfs, you get 2 hours
I don't care if on different hardware it's 80 sec vs. 5 hours or 10 sec vs. 1.5 hours.

The conclusion "smb/nfs (as a filesystem) is evil" is still the same.


* we always said "don't use slow and/or remote storage" that didn't and probably will not change, and there is no silver bullet to "fix" it because of the nature of these filesystems

I hope we have an agreement that any YouTube tutorial on "how to use NFS / SMB with your Proxmox Backup Server" is very bad advice.

* theoretical benchmarks of disk operations are nice, but only if the method is fine, and if you don't benchmark things that are relevant in real-world scenarios (e.g. chunk size) it does not say anything about the system you want to use

100% agreed - we are not comparing hardware (you can always throw money at that) - we are comparing filesystems (which is a choice for a given set of hardware).

all in all, i stand by my initial post in the thread: any type of cache for the chunks not done by the os is too complicated and brittle to work properly and probably won't any real world gain on performance and we don't recommend slow or remote filesystems

Thank you for that statement :) and because of my numbers I still stand by the advice: >> if << you have to pick a remote filesystem, never ever go with NFS / Samba. The least evil option is sshfs (for a homelab or as a backup of the PBS backup). Better use iSCSI or similar - or a PBS VM with the datastore qcow2 disk on NFS.

(in fact i did send a patch for the docs to spell that out: https://lists.proxmox.com/pipermail/pbs-devel/2024-June/009804.html )

That is a very nice insight! It would be very cool to put this in the PBS UI and add a warning indicator to the datastore. NFS / SMB can be easily detected.

https://man7.org/linux/man-pages/man2/statfs.2.html
https://stackoverflow.com/questions...ode-whether-a-directory-is-on-nfs-file-system
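For illustration, a rough sketch of how such a check could look (my own sketch, not PBS code; the set of filesystem types is illustrative, and reading /proc/mounts is just one of several ways - the statfs(2) magic numbers from the man page above would work too):

Python:
import os

NETWORK_FS = {"nfs", "nfs4", "cifs", "fuse.sshfs"}   # illustrative, not exhaustive

def mount_fstype(path):
    """Return the filesystem type of the mount that contains `path` (from /proc/mounts)."""
    path = os.path.realpath(path)
    best_mnt, best_type = "", "unknown"
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _dev, mnt, fstype, *_rest = line.split()
            if path == mnt or path.startswith(mnt.rstrip("/") + "/"):
                if len(mnt) > len(best_mnt):          # longest matching mount point wins
                    best_mnt, best_type = mnt, fstype
    return best_type

if mount_fstype("/pbs") in NETWORK_FS:
    print("warning: this datastore path is on a network filesystem")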


I will probably create a test where we can play with buckets and chunk size. The bucket organization is a parameter, and I stick with "it matters" :)

Is there a "reason" you can remember, why you guys picked 0000-ffff ? (That is something something very smart someone decided!)
 
I could add nothing to this thread, but: thank you, it was really interesting! Shedding light on these details is really not easy it seems :)
 
Please carefully read the test. We are not benchmarking the "disk io" that is just an additional information. It's not about the speed of the disk - it's about the speed of the filesystem.

In that case I must ask: what is, in your opinion, the purpose of measuring the speed of the filesystem? I mean no disrespect, but in every "real world" scenario IO is a much greater factor and ultimately the actual bottleneck.

This doesn't mean that your tests are without merit, but at this point in the conversation I feel like I need to point out that any potential improvements / optimizations / etc. that might arise from this thread will most likely be negligible. Worse, such optimizations can sometimes even backfire by introducing additional complexity and maintenance burdens.

Since you initially suggested introducing an on-disk cache, I want to elaborate on what (unintended) consequences the introduction of another cache level can have:

Cache Limitations

  • A cache is always limited by its size, you cannot have a cache with an indefinite amount of memory.
  • The data that is cached must have temporal locality in order for the cache to be actually effective, which means that data that was accessed and cached once is likely to be accessed again in a short timeframe. (this time from the cache).
  • This timeframe is limited by the size of the cache, which, as stated above, is not unlimited - in fact it's generally small compared to the main storage. (Compare the size of your CPU cache with the size of your drives, for example.)
This means that we must be certain that the data that we're accessing is accessed frequently. In the case of the Squid web cache that you mentioned, that might be the case - most visitors of someone's webpage will usually land on e.g. index.html first, and because that file is therefore always accessed, the cache remains hot and it is looked up faster.

However, if the data we're looking up does not have any temporal locality (or only little), then the whole point of caching is useless and even harmful, because the system will waste CPU cycles, reads, writes, etc. on cache misses. Furthermore, we would need to maintain the state of the cache as well: If the cache is on disk, you have to make more syscalls for just maintaining the cache - in turn, you interact with the filesystem again. Do you see where the problem is now?

So, in order for caching to be effective on PBS, we must be sure that the same data is being accessed repeatedly in a short timeframe and that our cache is large enough to effectively store whatever information you'd like to store there. We would also have to decide whether that cache is in memory or on disk, and whatever you choose, it is to the detriment of something else. Unless someone provides proof that caching does have an effective benefit - this proof being a fully working and sound implementation of caching in regards to PBS, that has been thoroughly tested in real-world applications - I will conclude that adding a cache is pointless.

Furthermore, there are filesystem caches in place already as you saw, why do you think they exist? Sure, by clearing the FS cache you'll get raw results, but those caches have been implemented for a reason, because they have actually been found to have a substantial benefit. This is why ZFS for example comes with its ARC. For what purpose should we add another cache layer that has to be maintained?

Using the Big O notation we learned, that hardware just changes a coefficient. It is very hard to change a factor in a power of 10 e.g. "this is 1000x faster".

Interestingly enough, this is exactly where Big O notation fails to provide any meaningful insight, because hardware matters. A lot. SSDs do not have a seek time and spinup time, HDDs do. The list goes on. Another good example is a performance comparison between vectors and linked lists. Guess which one is faster in practice.


I hope this clears some more things up. All that being said, I do agree that one should avoid NFS / SMB whenever possible, but I must say that your tests only proved what was already known, unfortunately. You're always welcome to come up with more tests and perform them with different filesystems and perhaps also hardware constellations, however.
 
100% agreed - we are comparing not hardware (you can always throw money on that) - we are comparing filesystems (that is a choice for a given set of hardware).
Actually, there is one more thing I want to mention here: The choice of hardware matters too, especially in relation to how your filesystem is set up. That is why all modern filesystems provide so many configuration options - just look at how much there is to configure in e.g. ZFS.

The choice of hardware in general makes a huge difference not just in regards to filesystems. Algorithms like SHA256 are usually also hardware-accelerated, which is one of the reasons to use it if you can, instead of coming up with your own brittle hash function.

My point is that you cannot effectively single out all of these factors, because these factors are not independent. If you want an accurate representation on the overall performance of a filesystem, you must test all kinds of different constellations and configurations on all kinds of different hardware. Regarding filesystems over the network, you'll have to do the same for both client and server.
 
Actually, there is one more thing I want to mention here: The choice of hardware matters too, especially in relation to how your filesystem is set up. That is why all modern filesystems provide so many configuration options - just look at how much there is to configure in e.g. ZFS.

Yes - of course hardware matters.

But as mentioned... you learn Big O notation in semester 1 or 2 of computer science (https://en.wikipedia.org/wiki/Big_O_notation).
You learn that if a result is 0.12 * 10^2, hardware changes the coefficient... so it can become 1.24 * 10^2 or 0.03 * 10^2.
The exponent of the 10^2 changes very, very little. A factor of 10, 100 or 1000 is very hard to achieve with hardware.
Usually a professor teaches the NP problem. NP problems can't be solved with the "fastest" hardware.

What we test here is filesystems, compared relative to each other on a given piece of hardware.

- you can't change the given hardware. Why? Because it's a given.
- you can - freely, at your will - pick a filesystem
- you can - freely, at your will - test which remote filesystem is the best

-> That is what the test is about.

I don't care if 500.000 SHA256 sums are calculated in 0.25 sec or in 1 sec. That is an irrelevant number, just showing "hey, this is not a problem".

We care very much that we have to wait 2 hours on NFS shares to create 500.000 files, while a local filesystem takes 30 sec (or zfs 40 sec).
We don't care if faster hardware "only" takes 80 minutes on NFS :) Either way, I would still avoid NFS like the plague.

NFS is the problem. SMB is even worse.

My point is that you cannot effectively single out all of these factors, because these factors are not independent.

Yeah :) I changed the factors to percentages. I personally don't care if, on a 128-core Epyc CPU running Samba on a ZFS raid, it is "only 20000%" slower than native ext4. On this hardware it was 95000% slower.

Correct me if I am wrong here.

For me it's a good indicator what to avoid.

target dir       | filesystem detected by stat(1) | sha256_name_generation | create_buckets | create_random_files | read_file_content_by_id | stat_file_by_id | find_all_files
.                | ext2/ext3                      | 100.00%                | 100.00%        | 100.00%             | 100.00%                 | 100.00%         | 100.00%
/nfs             | nfs                            | 108.00%                | 3205.04%       | 5292.29%            | 317.43%                 | 282.47%         | 309.09%
/smb             | smb2                           | 120.00%                | 94910.08%      | 5298.00%            | 1078.00%                | 1488.31%        | 2528.64%
/sshfs           | fuseblk                        | 120.00%                | 1638.66%       | 1985.52%            | 516.12%                 | 566.56%         | 1015.23%
/iscsi           | ext2/ext3                      | 104.00%                | 157.14%        | 262.19%             | 166.23%                 | 171.43%         | 129.43%
/ntfs            | fuseblk                        | 96.00%                 | 167.23%        | 495.62%             | 397.39%                 | 399.68%         | 662.84%
/loopback-on-nfs | ext2/ext3                      | 100.00%                | 126.05%        | 620.76%             | 140.52%                 | 130.84%         | 108.52%
 
