(Developer) question about .chunk folder

Der Harry

Sep 9, 2023
I am trying to understand why the performance of PBS is poor on WAN connections or on devices with a very slow seek time.


My question is about the chunk organization and the chunk folder in general.

It is mentioned here: https://github.com/proxmox/proxmox-backup/blob/master/docs/technical-overview.rst#chunks

These are the stats on my system. I am using an NFS share (via TrueNAS, SATA disks with ZFS):

Bash:
root@pbs:~# time du -hs /pbs
707G    /pbs


real    0m37.253s
user    0m0.377s
sys     0m9.537s
root@pbs:~# time find /pbs -type f -print | wc -l
423319


real    0m15.918s
user    0m0.394s
sys     0m3.989s

We see there are a lot of files. Wouldn't it make sense to create a small sqlite database in the chunk folder and just store which chunks are already on the disk?
Sort of as an L2 cache.

There are a number of fstat calls in the source code: https://github.com/search?q=repo%3Aproxmox%2Fproxmox-backup%20fstat&type=code - that, times 423,319, would explain the poor performance (on a disk with nowhere near O(1) seek performance).

If there are good reasons not to create a 2nd index, what about creating an in-memory cache that is built at PBS service bootup and maintained during the backup cycle?
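To make the idea concrete, here is a minimal sketch of such a "what chunks do I have" index, using Python's stdlib sqlite3 module. This is purely illustrative - PBS itself is written in Rust, and the table name and function names here are my own invention:

```python
import sqlite3

def open_index(path=":memory:"):
    """Open (or create) a tiny presence index: one row per known chunk digest."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS chunks (digest TEXT PRIMARY KEY)")
    return con

def record_chunk(con, digest):
    """Remember that a chunk exists on disk (called after a successful write)."""
    con.execute("INSERT OR IGNORE INTO chunks (digest) VALUES (?)", (digest,))
    con.commit()

def has_chunk(con, digest):
    """Answer 'is this chunk already on disk?' without touching the chunk store."""
    row = con.execute("SELECT 1 FROM chunks WHERE digest = ?", (digest,)).fetchone()
    return row is not None
```

A single B-tree lookup would replace one fstat(2) round trip per chunk; the open question, of course, is how to keep this consistent with the filesystem.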

Thank you.
 
We see there are a lot of files. Wouldn't it make sense to create a small sqlite database in the chunk folder and just store which chunks are already on the disk?
Sort of as an L2 cache.
i mean it wouldn't be impossible, but i don't think it makes much sense either (please someone correct me on my arguments if i'm wrong)

1. the filesystem already is a database (specialized for files ;) )
so i doubt that accessing that random info in a single sqlite file is faster than hitting the filesystem directly (well, that would need to be shown, but sqlite must also read/write from disk when one does not want to keep the whole database in memory,
which can also be quite large, depending on how much info per file is needed)
2. there is the page cache; once the inode of a file is there, the fstat calls shouldn't cost as much
3. the possibility of divergence: nothing guarantees that the filesystem + a sqlite db are consistent (e.g. someone deletes/overwrites random chunks)
4. for zfs there is already the 'special device' where the metadata can be put on much faster ssds if that's necessary
5. the whole architecture of pbs is more or less built on fast access & seek times (as in ssd/nvme storage)
this might be inconvenient for some, but has some clear advantages (e.g. it makes it possible to have fully deduplicated chunks across different backup groups -> saves a lot of space)

maybe there are even more reasons i can't remember now but i think this should already make a good argument ;)
 
i mean it wouldn't be impossible, but i don't think it makes much sense either (please someone correct me on my arguments if i'm wrong)

1. the filesystem already is a database (specialized for files ;) )

That was my biggest fear - that you would answer this way :)

I had the big chance to start in IT when AVL and B-trees were still a thing.

- https://en.wikipedia.org/wiki/B-tree
- https://en.wikipedia.org/wiki/AVL_tree

The only place where they survived is in filesystems.

My numbers prove that the filesystem is "the worst" database for keeping fstat(2) information on files.

450,000 files - it takes (cold) 37.253s to stat them.

That is insane.

450,000 files (with SHA-256 names) - just for testing their existence - that is the size of your CPU's L1 cache.

So :) yeah - q.e.d., the filesystem is the worst way to store this information.

2. there is the page cache, once the inode of file is there, the fstat calls shouldn't cost as much
3. the possibilty of divergence, nothing guarantees that the filesystem + a sqlite db are consistent (e.g. someone deletes/overwrites random chunks)
4. for zfs there is already the 'special device' where the metadata can be put on much faster ssds if that's necessary

Yeah, but NFS/SATA is no SSD.

5. the whole architecture of pbs is more or less built on fast access & seek times (as in ssd/nvme storage)
this might be inconvenient for some, but has some clear advantages (e.g. it makes it possible to have fully deduplicated chunks across different backup groups -> save a lot of space)

That's why I point the finger where it hurts - at every location where you do an fstat(2) call :)

So in other words - replacing that with a consistent database would be possible.


It would be nice to have some numbers. I will try wrapping the fstat code and do some timings to measure the number of calls and the time spent.

-> I bet this is a factor of 1000/10000 (at minimum) between sata and ssd.



Why am I asking this (probably) stupid question?

1) The performance of PBS with a Hetzner Storagebox is out of the question. As it stands, there is really no way to use a Hetzner storage box as "long-term" storage or as PBS<->PBS storage, because firing 450,000 files over sshfs/cifs is really bad.

Feature-wise, I can imagine a mountpoint with a "slow filesystem" or "enable local index cache" flag.

2) I had PVE crashes because of mounted-PBS crashes caused by filesystem "slowness". So wrapping fstat(2) might be a good idea anyway.


What do you think?
 
just shortly chiming in here but indexes in sqlite are literally B-Trees.

Yes :)

So now give me an explanation why a sqlite B-tree index for 450,000 keys is fast, and fstat(2) in the filesystem tree is freaking slow.

Here is my explanation:

- Sqlite has all 450,000 keys stored in one place. The disk can do a linear read, and very early on it's all in memory.

Your Filesystem Database:

Bash:
root@pbs:/pbs/.chunks# find . -type d | wc -l
65537

You need to:

- cd into a directory
- readdir(2) the contents
- fstat(2) and filter for the files
- cd back ... go to the next directory

Unfortunately this is not a linear read. The inodes for all this information are scattered all over the disk. And if there is no local disk - say it's NFS/CIFS - you are pumping a lot of "stuff" over the network, and a lot of ping-pong is required for very, very little information!
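The per-bucket walk described above can be sketched like this (Python, illustrative only - the real chunk store code is Rust). Note the access pattern: one readdir per bucket directory plus one stat per chunk file, each of which is a network round trip on NFS/CIFS:

```python
import os

def enumerate_chunks(chunk_dir):
    """Walk a .chunks-style layout: one readdir(2) per bucket directory,
    one stat(2) per chunk file. This is the access pattern that hurts on
    NFS/CIFS, because each call is a separate round trip."""
    chunks = {}
    for bucket in sorted(os.listdir(chunk_dir)):           # readdir the top level
        bucket_path = os.path.join(chunk_dir, bucket)
        if not os.path.isdir(bucket_path):
            continue
        for entry in os.scandir(bucket_path):              # readdir per bucket
            if entry.is_file():
                chunks[entry.name] = entry.stat().st_size  # stat per file
    return chunks
```

(os.scandir at least returns directory entries that can often answer is_file() without an extra stat; a plain listdir + stat loop would be even worse.)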

We just need to know:

- "Hey does 4355a46b19d348dc2f57c046f8ef63d4538ebb936000f3c9ee954a27460dd865 exist?"
- "What IDs do I know"?
- "Is 32c2643e0dc65524c9f1f6f9f00937322fd68d59986bc381d9ff2285d23e353d encrypted?"

So yes - a local sqlite database would be awesome, stripping these "database queries" down to a handful of index lookups.

Willing to help!

I guess you guys would love a "PBS is freaking faster than Veeam" marketing line?
 
Is garbage collection using atime only, or is there actually a database of some kind in the backend (i mean, not just ZFS)?
 
Is garbage collection using atime only, or is there actually a database of some kind in the backend (i mean, not just ZFS)?

Please elaborate (if you are a developer).

We are not talking about garbage collection.
 
I am not a Proxmox product developer, if that is the question, but I am a developer.
I am trying to understand in depth how the chunk store (datastore) works, from the view of ZFS, backup, pruning and garbage collection as a whole concept.
What brought me to this topic is the question of how to clean up a full datastore (fixed without losing backups) when garbage collection is failing (reasons aside, monitoring failed unexpectedly; sorry for hijacking the topic, set it to read-only for me).
 
I am trying to understand in depth how the chunk store (datastore) works, from the view of ZFS, backup, pruning and garbage collection as a whole concept.

I asked no ZFS question.

I am not even using ZFS for pbs.

This question was not about the pruning and garbage collection part of pbs.

----

The question is about the programming practice of "querying" the filesystem via a gazillion readdir(2) and fstat(2) operations.

These operations are slow, not required for any of PBS's core functions, and - independent of the underlying filesystem - not necessary.

In case the FS is NFS or CIFS, you send gazillions of ping-pongs from PBS to the storage server.

This causes major issues with non-SSD disks "just because".



A better solution would be a local sqlite file (or similar).

Since all read/write/delete operations pass through PBS code, it is easy to keep it consistent. It would even be OK to rebuild this L2 index at PBS boot time or when running prunes.

Worst case (for an inconsistent storage): we would try to write the file again, detect "oh, I already have that file", and run one more readdir and one more fstat.

Please check the numbers I posted. For 450,000 chunk files my average time is about 38s (!) on an NFS store using SATA disks - imagine how slow CIFS is (we query 65,000 directories).


Again :) it's not "writing a file" or "reading a file" that is slow! What is slow are the query operations ("what files do we have?"), and this could be super easily cached locally on the PBS.
 
the biggest issue i posted is still unanswered: how do you make sure the cache is consistent? if some rogue process (or bug) deletes the chunk files, but not the cache entries, how would you notice?
the danger of having inconsistent backups is IMHO not worth optimizing for a setup PBS was never intended to run on (i.e. it was designed for fast local storage, not slow and for sure not remote storage)
not every setup must make sense for pbs, and what we lose in performance on some setups, we gain in maintainability and simplicity in the code for the setups we designed for

i'm not saying we can't improve performance at all, i just think that any type of cache for the chunks is probably not it since it explodes in complexity fast
 
the biggest issue i posted is still unanswered: how do you make sure the cache is consistent? if some rogue process (or bug) deletes the chunk files, but not the cache entries, how would you notice?

I already covered that :)

It's dirt cheap simple.

I'll create an architecture diagram
 
It is very hard to argue for a backup system that needs big but fast SSD storage if the customer is tight on budget (which all are).

I am not "anti" SSD.

PBS is not (only) a backup system. It's also a super nice and fast way to restore a VM within a short time. So for that reason an SSD is ok-ish.

Using an SSD because of unoptimized programming - that is a bad reason.

It is also bad to have super bad performance on slower disks just because of fstat(2). That doesn't make a lot of sense.
 
I already covered that :)

It's dirt cheap simple.

I'll create an architecture diagram

wow! why do you not contribute to the open source repository, if it's dirt cheap? a diagram is also a good idea, at the very least.

btw. Dominik: though my brain tends to support the idea of an independent database for chunk indexing, i do see the issues. no ram caches please :)
 
i'm not saying we can't improve performance at all, i just think that any type of cache for the chunks is probably not it since it explodes in complexity fast

As mentioned - that's dirt cheap.


Goal: reduce the number of calls of readdir(2) and fstat(2)


Sources:

https://github.com/proxmox/proxmox-backup/blob/master/docs/technical-overview.rst#chunks
https://github.com/proxmox/proxmox-...d47c1fa41b537/pbs-datastore/src/chunk_stat.rs
https://github.com/proxmox/proxmox-...47c1fa41b537/pbs-datastore/src/chunk_store.rs

What is a chunk?
  • A chunk is a file with a 64-character name (its SHA-256 sum) - this is a “primary key” and its (unique) identifying information
  • A chunk is placed in a directory 0000-ffff (= 65536 buckets)

Data of a chunk
  • A chunk has a crc32 at the end
  • A chunk has a type marker at the beginning

Additional information
  • A chunk exists or doesn’t exist
  • A chunk is a file, so it has a c-time and m-time (probably also a meaningful a-time, depending on the FS)

Operations on a chunk
  • Create, Read, Delete
  • Touch (with an assert if the chunk doesn’t exist) - from my understanding this is for PBS-to-PBS backups or tape restores only.
  • There is locking during Create to avoid write/write hazards. This is also an optimization (we only need to write a chunk once, since it’s unique).
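The bucket layout above can be sketched as follows (Python, illustrative only; this mirrors the directory scheme described in the linked technical overview - the first four hex digits of the SHA-256 digest select the bucket, and the full digest is the file name):

```python
import hashlib
import os

def chunk_path(base, data):
    """Map raw chunk data to its on-disk location:
    <base>/.chunks/<first 4 hex digits>/<full 64-char digest>"""
    digest = hashlib.sha256(data).hexdigest()  # the 64-char "primary key"
    return os.path.join(base, ".chunks", digest[:4], digest)
```

Because the path is a pure function of the content hash, "does this chunk exist?" is exactly one stat of a known path - which is why the per-call latency of that stat dominates everything.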
----

How to create/rebuild the index?
  • When the PBS Server is started
    • datastore.cfg is loaded
    • PBS knows all datastores
    • create per datastore an index for the chunks
    • an index might be implemented as a file, sqlite, redis, or RAM (64 bytes * 1 million files = 64 MB!)
  • On maintenance tasks with heavy I/O e.g. verification / garbage collection
  • Or simply don't create a dedicated index! Just cache the readdir(2) fstat(2) results in memory.

What to store in the index?
  • filename (64 chars) (the sha256 sum)
  • optional (that will need some more work and I don't know if this benefits the goal of reducing readdir(2) and fstat(2) calls)
    • c-time
    • m-time
    • crc32
    • type
How to update the index?
  • We know all operations on a chunk that modify it (delete, create, touch)
  • All of these operations run in isolated methods in either chunk_stat.rs or chunk_store.rs
  • The index is independent of any read information. We just need to wrap the fstat(2) “do I know chunk xyz?” check to use the index.
What can go wrong with the index?
  • Problem A: Inconsistencies during Create, Delete operations.
  • Problem B: The chunk exists on the disk, but not in the index.
  • Problem C: The chunk exists in the index, but not on the disk.
Problem A:
  • If we reboot the server, the index is always consistent. Until the index is built, PBS (on that datastore) is not available.
  • Every Create/Delete operation is handled by chunk_stat.rs / chunk_store.rs. A coder writing sane code knows how and when to create/update/delete the key in the index.
Problem B:
  • Very simple. Run the fstat(2) operation on the filesystem anyway! If it’s a success, update the index - if not, it was not a problem in any case. We only want to reduce the number of readdir(2) and fstat(2) calls.
Problem C:
  • Very simple. Run the read(2) operation on the filesystem anyway! It will fail! We delete the ID from our index.
  • However, the user has a problem! The file is gone - something (not PBS!) deleted it.
  • Something very bad happened, beyond the scope of the index and PBS anyway.

Conclusion: An out-of-sync index is never a problem! We just do "one more" fstat(2), or we get read(2) errors we would have gotten anyway.

(I could draw a diagram - but this is so dirt-cheap simple to implement that I think it's quite clear.)
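The self-healing behaviour for Problems B and C can be sketched in a few lines (Python, illustrative only - the ChunkIndex name is my own, PBS would do this in Rust inside chunk_store.rs; the bucket subdirectories are flattened here for brevity):

```python
import os

class ChunkIndex:
    """In-memory 'L2 cache' over a chunk directory that heals itself:
    an index miss falls back to stat(2) (Problem B), and a failed read
    evicts the stale entry (Problem C)."""

    def __init__(self, chunk_dir):
        self.chunk_dir = chunk_dir
        self.known = set()

    def _path(self, digest):
        return os.path.join(self.chunk_dir, digest)

    def exists(self, digest):
        if digest in self.known:                 # fast path: no syscall at all
            return True
        if os.path.exists(self._path(digest)):   # Problem B: on disk, not in index
            self.known.add(digest)               # -> one extra stat, then cached
            return True
        return False

    def read(self, digest):
        try:
            with open(self._path(digest), "rb") as f:
                return f.read()
        except FileNotFoundError:                # Problem C: in index, not on disk
            self.known.discard(digest)           # evict; caller sees the real error
            raise
```

The worst case of an out-of-sync index is thus exactly one extra stat (Problem B) or a read error that would have occurred anyway (Problem C).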
 
yes, pbs is not the fastest architecture. on the other side, you might consider a file based database (like sqlite) a solution. it is not: when sqlite, or even real database servers like mysql/mariadb, postgresql, mssql, oracle etc. are set to write synchronously (ignoring caches, flushing every write query to disk), they get very slow even with just some thousands of queries. and believe me, you don't want pbs to cache in ram if it used a database. zfs is a database: synchronous writes, copy on write.
PBS has been based on ZFS from the very beginning, and i had my issues with the slowness back then (year 2020 iirc, when the first public beta was released; i ran a ceph cluster with cephfs for pbs first, then zfs over rbd to support the features... a lot of work went into testing)
if you really need to use nfs/smb as a backup store, consider using something like veeam backup, or maybe some custom scripts with borg/bup/git or something.
PBS is nearly perfect, but it relies on ZFS kernel features. to change that, the whole pbs concept would need reworking, and there are more important things on the todo list.
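For what it's worth, the synchronous-write cost described above is largely per transaction, not per row: sqlite issues one fsync per commit, so batching many inserts into a single transaction changes the picture considerably. A small illustrative sketch (table name is my own):

```python
import sqlite3

def insert_digests(con, digests):
    """Insert many digests in ONE transaction: sqlite syncs to disk once
    per commit, not once per row, avoiding thousands of fsyncs."""
    with con:  # opens a transaction; commits (one sync) on clean exit
        con.executemany(
            "INSERT OR IGNORE INTO chunks (digest) VALUES (?)",
            ((d,) for d in digests),
        )

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chunks (digest TEXT PRIMARY KEY)")
insert_digests(con, (format(i, "064x") for i in range(1000)))
```

Whether that is fast enough for a chunk index on real hardware would of course need measuring, as noted earlier in the thread.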
 
PBS is nearly perfect, but it relies on ZFS kernel features. to change that, the whole pbs concept would need reworking, and there are more important things on the todo list.

Why does it make sense - on any filesystem - to run millions of readdir(3) and fstat(2) calls?

Please explain.
 
one thing i just noticed:

We are not talking about garbage collection.

what are we talking about then? the *only* operation that needs to access just the chunk metadata is garbage collection; all other operations need to write or read the chunks anyway, so a metadata cache would not help at all
 
