Questions on S3 Tech Preview Datastore

stevenwh

Hello all,
I just recently configured a new Proxmox Backup Server for my homelab. I noticed while configuring it that there are now options to set S3 Endpoints and create an S3 (Tech Preview) datastore, and I was thinking about playing around with this. However, I'm a little confused about the Local Cache Dir. If this is a common S3 thing, I don't know about it because I have very limited experience with S3. I've just read that it is a good storage method for backups and figured it could be worth learning about.

So, can someone give a better explanation of what this local cache is and what it is for? I understand the performance benefits of having a local cache. But what I don't understand is how the space consumption works. One thing I read said that PBS will just consume as much space as is available on the cache before going to the S3 endpoint. Surely that isn't what it sounds like; I would assume it would only keep a copy on the local cache but still store it on the S3 endpoint as well? Is performance the only benefit of this local cache? It doesn't seem there is any way to choose not to have this local cache, but I don't want to eat up a ton of space with it.

The way I have this server configured, there is not a ton of local space. This is basically just a node and I'm mounting storage from my NAS to store the actual backups. I know this isn't the ideal configuration, but for a homelab I don't really want to have to have a dedicated storage array just for proxmox backups. My NAS already has redundancy and additional backups in place to ensure I minimize my risk of losing data, so I'd prefer to just continue using it as my primary storage. I in fact planned to just run an S3 service on my NAS for this setup too.

I don't need crazy performance for this config. I don't have so many backups going on that transferring the data to my NAS over the network is going to cause me any bottlenecks. But since there doesn't seem to be a way to say I don't want a local cache, I'm trying to understand what would happen if I just give it a very small partition. Say, for example, I just made a 1 MB partition and used that as my local cache: would everything still function correctly, just pulling / sending everything directly from the S3 endpoint?

With the pruning I do, right now all of my proxmox backups only take up around 500 GB on average. And while sure, I could put a 500 GB drive on the PBS node, it seems silly to me to waste that 500 GB as a local cache in my setup.

And then I guess the other question is, and I could see this being a lot more appealing benefit if it works this way. If for some reason the S3 endpoint was inaccessible, would it still utilize the local cache for writes / reads and just sync the data to the S3 endpoint when it is available again?
 
Hi,
So, can someone give a better explanation of what this local cache is and what it is for? I understand the performance benefits of having a local cache. But what I don't understand is how the space consumption works. One thing I read said that PBS will just consume as much space as is available on the cache before going to the S3 endpoint. Surely that isn't what it sounds like; I would assume it would only keep a copy on the local cache but still store it on the S3 endpoint as well? Is performance the only benefit of this local cache? It doesn't seem there is any way to choose not to have this local cache, but I don't want to eat up a ton of space with it.
The local datastore cache for S3 datastores serves the purpose of avoiding unnecessary API requests where possible. It stores all the metadata related to namespaces, backup groups and backup snapshots, and keeps track of all the already seen and known chunks stored on the S3 backend (to avoid re-uploading them).
Further, it holds recently used chunks so they do not have to be re-downloaded via the API, which allows for potentially faster and more cost-effective restores when the chunks are cached. Data stored in the cache is also persisted to the S3 backend, but once the available cache slots are full, the least recently used chunk is evicted to free space for newly used chunks. The cache does not, however, act as a write buffer: chunks written to the cache are also written to the S3 backend immediately.
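To make the eviction and write-through behaviour described above more concrete, here is a minimal Python sketch. It is only a toy model and not how PBS is implemented: the `ChunkCache` class, its method names, the slot-count capacity and the plain dict standing in for the S3 bucket are all assumptions made for illustration.

```python
from collections import OrderedDict


class ChunkCache:
    """Toy write-through LRU cache in front of a dict standing in for S3."""

    def __init__(self, capacity, backend):
        self.capacity = capacity      # number of cache slots (hypothetical unit)
        self.backend = backend        # dict digest -> bytes, stands in for the bucket
        self.cached = OrderedDict()   # locally cached chunks, least recently used first
        self.known = set()            # digests already stored on the backend

    def insert(self, digest, data):
        # Write-through: a chunk not seen before goes to the backend right away.
        if digest not in self.known:
            self.backend[digest] = data
            self.known.add(digest)
        self._remember(digest, data)

    def get(self, digest):
        # Cache hit: serve locally and mark the chunk as recently used.
        if digest in self.cached:
            self.cached.move_to_end(digest)
            return self.cached[digest]
        # Cache miss: fetch from the backend, then keep a local copy.
        data = self.backend[digest]
        self._remember(digest, data)
        return data

    def _remember(self, digest, data):
        self.cached[digest] = data
        self.cached.move_to_end(digest)
        # Evict least recently used chunks once all slots are taken.
        while len(self.cached) > self.capacity:
            self.cached.popitem(last=False)


s3 = {}
cache = ChunkCache(capacity=2, backend=s3)
for digest, data in [("a", b"1"), ("b", b"2"), ("c", b"3")]:
    cache.insert(digest, data)

print(list(cache.cached))  # local cache holds only the two most recent chunks
print(sorted(s3))          # the backend still holds every chunk
```

The point of the sketch is that eviction only drops the local copy: the set of known digests and the backend contents are unaffected, so a small cache costs extra downloads on restore but does not lose data.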

The way I have this server configured, there is not a ton of local space. This is basically just a node and I'm mounting storage from my NAS to store the actual backups. I know this isn't the ideal configuration, but for a homelab I don't really want to have to have a dedicated storage array just for proxmox backups. My NAS already has redundancy and additional backups in place to ensure I minimize my risk of losing data, so I'd prefer to just continue using it as my primary storage. I in fact planned to just run an S3 service on my NAS for this setup too.
If I understood you correctly, you want to expose your NAS storage via an S3 API? Note that using NFS will most likely be more performant in that case.

I don't need crazy performance for this config. I don't have so many backups going on that transferring the data to my NAS over the network is going to cause me any bottlenecks. But since there doesn't seem to be a way to say I don't want a local cache, I'm trying to understand what would happen if I just give it a very small partition. Say, for example, I just made a 1 MB partition and used that as my local cache: would everything still function correctly, just pulling / sending everything directly from the S3 endpoint?
1 MB will not be enough, as the cache needs to store at least the metadata and the seen-chunk inodes to work properly. It is also expected to be able to hold at least some chunks. It is therefore recommended to use something like 64 GB of storage, see https://pbs.proxmox.com/docs/storage.html#datastores-with-s3-backend
Less might work without issues as well, but defeats the purpose of the cache a bit.

Further, it must be placed on persistent storage, since the backup metadata and the chunk inodes used as markers are expected to outlive, e.g., system reboots.

With the pruning I do, right now all of my proxmox backups only take up around 500 GB on average. And while sure, I could put a 500 GB drive on the PBS node, it seems silly to me to waste that 500 GB as a local cache in my setup.

And then I guess the other question is, and I could see this being a lot more appealing benefit if it works this way. If for some reason the S3 endpoint was inaccessible, would it still utilize the local cache for writes / reads and just sync the data to the S3 endpoint when it is available again?
No. If the S3 endpoint is not available, uploads of new (not already seen) chunks will fail, and the backups will therefore fail as well. The same goes for restores where the chunks are not present in the local datastore cache.
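As a rough illustration of the restore case, the rule can be written as a small Python check. This is a simplified model under stated assumptions, not PBS code: the function names and the use of plain sets of chunk digests are hypothetical, and the sketch only covers restores, since the outcome of a backup that contains no new chunks is not covered here.

```python
def chunks_requiring_s3(needed, cached):
    """Return the chunk digests that cannot be served from the local cache."""
    return [digest for digest in needed if digest not in cached]


def restore_can_proceed(needed, cached, s3_available):
    """A restore works offline only if every needed chunk is cached locally."""
    return s3_available or not chunks_requiring_s3(needed, cached)


cached_chunks = {"a", "b"}  # hypothetical digests currently in the local cache

# Every chunk is cached, so the endpoint is not needed.
print(restore_can_proceed(["a", "b"], cached_chunks, s3_available=False))
# Chunk "c" is missing locally, so the restore depends on the endpoint.
print(restore_can_proceed(["a", "c"], cached_chunks, s3_available=False))
```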