Backup Threat Modelling - Pruning Sidechannel

quanten

Feb 5, 2024
I use both PBS and restic for backups on different systems, and I'm very happy with both tools. The restic documentation describes a possible threat, and I wonder how it is handled in PBS. I have not found any mention of the threat or of a mitigation in the PBS documentation.

Attack

I would name the attack "backup stuffing" or "pruning side channel". Usually you want to make sure that the backup source cannot delete backups on the destination, because the source system can not only die but could also be compromised by an intelligent adversary.
The usual way to prevent this is to give the backup source permission only to create new backups, not to delete existing ones.

To still be able to delete old backups and enforce a retention policy, the backup destination now has to do the deleting. A pruning operation on the destination uses the configured pruning settings to decide which backups remain and which are deleted.
This decision is based on the number of backups and on their timestamps, and that is exactly the information manipulated in this attack.

With a pruning setting of keep-last: X, the adversary can simply create X new trash backups (without usable data). The next pruning run on the destination then deletes the old good backups, and our data is gone.
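To make the keep-last case concrete, here is a minimal sketch in Python. It is a toy model of "keep the newest X snapshots", not PBS or restic code; the function name prune_keep_last and the snapshot dictionaries are made up for illustration.

```python
from datetime import datetime, timedelta

def prune_keep_last(snapshots, keep_last):
    """Toy model: keep the `keep_last` newest snapshots, remove the rest."""
    ordered = sorted(snapshots, key=lambda s: s["time"], reverse=True)
    return ordered[:keep_last], ordered[keep_last:]

base = datetime(2024, 2, 1)
# Three honest daily backups.
good = [{"time": base + timedelta(days=i), "label": "good"} for i in range(3)]
# Three trash backups created by the compromised source shortly afterwards.
trash = [{"time": base + timedelta(days=3, minutes=i), "label": "trash"} for i in range(3)]

kept, removed = prune_keep_last(good + trash, keep_last=3)
kept_labels = {s["label"] for s in kept}
removed_labels = {s["label"] for s in removed}
print("kept:", kept_labels, "removed:", removed_labels)
```

Note that no timestamp forgery is needed here: the trash backups only have to be newer than the good ones.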

Pruning settings such as keep-weekly (and similar) are also exploitable under the following assumption: the backup source can set arbitrary timestamps on its backups.
If the adversary on the backup source creates, for each good backup, a trash backup with a slightly newer timestamp, the pruning will delete the good backups.
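The keep-weekly variant can be sketched the same way. The model below is a simplification I am assuming for illustration (keep the newest snapshot of each of the latest N ISO weeks); the real pruning logic in PBS or restic may differ in detail, and prune_keep_weekly is a hypothetical name.

```python
from datetime import datetime, timedelta

def prune_keep_weekly(snapshots, keep_weekly):
    """Toy model: keep the newest snapshot of each of the latest `keep_weekly` ISO weeks."""
    newest_per_week = {}
    for snap in snapshots:
        week = tuple(snap["time"].isocalendar()[:2])  # (ISO year, ISO week)
        current = newest_per_week.get(week)
        if current is None or snap["time"] > current["time"]:
            newest_per_week[week] = snap
    latest_weeks = sorted(newest_per_week, reverse=True)[:keep_weekly]
    kept = [newest_per_week[w] for w in latest_weeks]
    removed = [s for s in snapshots if not any(s is k for k in kept)]
    return kept, removed

first = datetime(2024, 1, 3, 12, 0)  # a Wednesday
# One honest backup per week for four weeks.
good = [{"time": first + timedelta(weeks=i), "label": "good"} for i in range(4)]
# For each good backup, a trash backup with a forged, slightly newer timestamp.
trash = [{"time": g["time"] + timedelta(minutes=1), "label": "trash"} for g in good]

kept, removed = prune_keep_weekly(good + trash, keep_weekly=4)
kept_labels = {s["label"] for s in kept}
removed_labels = {s["label"] for s in removed}
print("kept:", kept_labels, "removed:", removed_labels)
```

All four weekly slots end up occupied by trash, so every good backup becomes a removal candidate.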

Restic's mitigation to this problem is the keep-within option.
It prevents the deletion of any backups that are younger than X days (compared to the current time of the pruning system, not the time of the "newest" backup, since that could also be faked to the future). So you have X days to notice the compromise and in that time, no backup is deleted.
See their docs also for their description of the threat: https://restic.readthedocs.io/en/v0.16.3/060_forget.html#security-considerations-in-append-only-mode
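As a rough sketch of the idea as I described it above (not restic's actual implementation), a keep-within style guard filters the removal candidates using the clock of the pruning system. The function name apply_keep_within and the data layout are assumptions for illustration.

```python
from datetime import datetime, timedelta

def apply_keep_within(removal_candidates, now, keep_within):
    """Only let snapshots be removed that are older than `keep_within`,
    measured against the pruning system's own clock `now`."""
    return [s for s in removal_candidates if now - s["time"] > keep_within]

now = datetime(2024, 2, 5)  # clock of the pruning system, not of any snapshot
candidates = [
    {"time": datetime(2024, 1, 1), "label": "old"},
    {"time": datetime(2024, 2, 1), "label": "good"},
    {"time": datetime(2024, 2, 3), "label": "good"},
]

removable = apply_keep_within(candidates, now, keep_within=timedelta(days=7))
removable_labels = [s["label"] for s in removable]
print(removable_labels)
```

Even if stuffing has pushed the recent good backups onto the removal list, they survive until they are older than the window.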

(It should also be noted that restic, despite sharing many great characteristics with PBS, is different. In restic the backup destination is just blob storage and has no way to remember when a backup was actually written to it. PBS could do that, since it runs a service at the destination.)

Exploitation scenarios

For PBS I see two possible source-destination scenarios where this could be exploited:

1. proxmox-backup-client to PBS

To perform the described attack, a malicious proxmox-backup-client must be able to create backups with arbitrary timestamps.
I found nothing in the documentation that says this isn't possible. However, the description of the backup protocol API suggests to me that it is not possible. But let's assume it is for this scenario.

(To some extent, the adversary may also need to know the pruning settings. However, in the spirit of Kerckhoffs's principle, that knowledge must be assumed.)

There is a proxmox-backup-client running on a machine, let's call it A. There is a separate machine running a PBS, called B.
A has access to B with the DatastoreBackup role in order to create daily backups. B has the prune setting "keep-weekly: 4" configured. This setup has been running for some time.

A is now completely compromised by an intelligent adversary, who therefore has DatastoreBackup access to B. B is otherwise not compromised.

A creates four trash backups with timestamps that are slightly newer than those of the existing backups. After the next prune and GC jobs, our good backups (from before the attack on A) are gone.

2. PBS to PBS via sync job

To perform the described attack, a malicious PBS must be able to create backups with arbitrary timestamps in its own datastore, and a pull job must pull those potentially old backups.
I see no reason why this should not be possible. The sync job is designed to sync two datastores and therefore also pulls older backups.

We have a backup source (say, again a proxmox-backup-client), called A, and a PBS (e.g. in the local network, maybe under the same SSO), called B. Machine A backs up its data to B.
We have another PBS, called C, that is e.g. located offsite and is isolated from B as much as possible. On C there is a sync job configured to pull the backups regularly (sensibly, without remove-vanished set). C also has pruning settings configured (e.g. similar to the former scenario).

Machines A and B are now both completely compromised (e.g. compromise of SSO), again by an intelligent adversary. C is not compromised.

B deletes the old backups and creates trash backups with slightly newer timestamps. C will pull those backups and later prune the good backups.

Questions

Is the assumption in scenario 1 correct/feasible?

How about scenario 2?

Are there mitigations for this attack?

Mitigation Ideas

I have two ideas for mitigating this attack:

Receiver-side generation of timestamps: The PBS generates the timestamp of a backup on its own when it receives the backup. These timestamps are then used to decide when to prune a backup. For a sync job, however, this is a bit trickier. (This is the mitigation restic can't use, as the destination is just dumb blob storage, as mentioned above.)

Sender-side definition of pruning/backup retention: The source of the backup stores in each backup the timestamp until which the backup has to be kept. Malicious backups can then no longer use the pruning side channel to delete other backups. This comes with other complications in management, configuration etc.
 
Thanks for the extensive writeup, highly appreciated (and sorry for taking so long to get back to you)!

For future reference (for you and other readers ;)) - if you think you find an issue that might have security implications, it would be appreciated to use the process outlined in Security Reporting (also linked on https://proxmox.com/en/about/contact).

A preliminary analysis of the scenario(s) you raised indicates that both a malicious client (via the backup API) and a malicious server (if used as pull/sync source) could "stuff" a particular backup group (provided they have access to create new snapshots, or are the source of that group for the target PBS). Since there are currently no checks on the server side that prevent creating backups "in the future", this could lead to pruning of all "real" backups even if the prune settings are conservative.

For the pull/sync case there is a mitigation that can be used right away - setting transfer-last to some sensible value, and neither syncing too often nor pruning too aggressively. Such a setup effectively rate-limits the stuffing attack to the transfer-last count (per sync), and requires the attack to continue over a longer time period without being detected before being successful (where success means all regular snapshots being pruned).
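As a back-of-the-envelope illustration of that rate limit (a simplified model, not something PBS computes): if an attacker needs a certain number of trash snapshots in a group to push out all real ones, and each sync pulls at most transfer-last snapshots per group, the attack needs at least this many syncs. The function name and the example numbers are made up.

```python
import math
from datetime import timedelta

def syncs_until_stuffed(trash_needed, transfer_last):
    """Minimum number of sync runs needed to deliver `trash_needed` snapshots
    when each run pulls at most `transfer_last` snapshots per group."""
    if transfer_last < 1:
        raise ValueError("transfer_last must be at least 1")
    return math.ceil(trash_needed / transfer_last)

# Hypothetical numbers: 30 trash snapshots needed, transfer-last 2, one sync per day.
syncs = syncs_until_stuffed(trash_needed=30, transfer_last=2)
detection_window = syncs * timedelta(days=1)
print(syncs, "syncs, at least", detection_window.days, "days to notice the attack")
```

How many trash snapshots are actually needed depends on the prune settings on the target, which is why less aggressive pruning lengthens the window further.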

We'll likely implement a threshold server-side that limits how far into the future snapshots can be created (a certain "window" must be allowed to account for time sync discrepancies between clients and servers). It might also make sense to implement rate-limiting (e.g., how many snapshots can be made per day per group). Of course, it would still be possible to exploit this issue if pruning is very aggressive, and/or an attacker can upload backups over a longer period of time.
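Purely as an illustration of what such checks could look like (this is not PBS code, and the limits, names and return values are all hypothetical), a server-side acceptance check might combine a future-timestamp window with a per-group rate limit based on when snapshots actually arrived:

```python
from datetime import datetime, timedelta

MAX_FUTURE_SKEW = timedelta(minutes=10)  # hypothetical allowance for clock drift
MAX_SNAPSHOTS_PER_DAY = 4                # hypothetical per-group limit

def accept_snapshot(claimed_time, server_now, recent_arrivals):
    """Decide whether to accept a new snapshot for a group.

    claimed_time:    timestamp sent by the client
    server_now:      the server's own clock
    recent_arrivals: server-side receive times of earlier snapshots in the group
    """
    if claimed_time > server_now + MAX_FUTURE_SKEW:
        return False, "timestamp too far in the future"
    day_ago = server_now - timedelta(days=1)
    arrivals_last_day = sum(1 for t in recent_arrivals if t > day_ago)
    if arrivals_last_day >= MAX_SNAPSHOTS_PER_DAY:
        return False, "per-group rate limit reached"
    return True, "ok"

now = datetime(2024, 2, 5, 12, 0)
print(accept_snapshot(now + timedelta(hours=1), now, []))
print(accept_snapshot(now, now, [now - timedelta(hours=h) for h in range(1, 5)]))
print(accept_snapshot(now, now, [now - timedelta(days=2)]))
```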

Using server-side timestamps for pruning is not an option really, that would break things like importing historic backup data from non-PBS sources.

Determining the expiry up front is also not really an option, since that is very inflexible (pruning schedules might require tuning after a while, when the real world usage of each client is determined). But a similar approach is possible - you can mark certain snapshots as protected, and could do so in a semi-automated fashion - these snapshots would then be preserved even in the face of a stuffing attack, and you could expire the protection yourself according to your own criteria... Similarly, you could implement all pruning yourself, and take things like snapshot size, snapshot frequency/time period, ... into account, or detect a stuffing attack and avoid pruning entirely and instead flag the group for manual review.
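A self-made "detect and flag" step could be as simple as the following sketch. It assumes you have a trustworthy record of when each snapshot arrived (for example from your own logs), since the timestamps inside the snapshots may be forged; the function name and thresholds are hypothetical.

```python
from datetime import datetime, timedelta

def looks_stuffed(arrival_times, window, max_in_window):
    """True if any sliding window of length `window` contains more than
    `max_in_window` snapshot arrivals."""
    times = sorted(arrival_times)
    start = 0
    for end, t in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > max_in_window:
            return True
    return False

base = datetime(2024, 2, 1)
daily = [base + timedelta(days=i) for i in range(10)]
burst = [base + timedelta(days=9, minutes=m) for m in range(1, 5)]

normal_flag = looks_stuffed(daily, timedelta(days=1), max_in_window=2)
attack_flag = looks_stuffed(daily + burst, timedelta(days=1), max_in_window=2)
print(normal_flag, attack_flag)
```

If the check fires, the group would be skipped by your pruning script and flagged for manual review instead.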
 
Have I understood correctly that the current recommendation is to use the largest possible storage systems on the PULL PBS to run GC/Prune as infrequently as possible?

I'm currently using a script at the file level that uses rsync to check how many files have been deleted before backing up data in order to prevent ransomware. Once a defined limit is reached, an emergency stop is triggered.

Isn't this principle particularly suitable for a PBS? For example, more than 10% new blocks and stop?
 
Have I understood correctly that the current recommendation is to use the largest possible storage systems on the PULL PBS to run GC/Prune as infrequently as possible?
I mean, it never hurts to have more headroom (/space). and yes, if you are worried about prematurely removing snapshots, then less aggressive pruning is the way to go, and that of course means using more space. running GC less often doesn't really help - once a snapshot is pruned, it's generally not recoverable just from the chunk store, unless you have another copy of it somewhere. protecting snapshots might be an interesting feature as well, provided you don't give client systems (like PVE) privileges to unprotect them.

I'm currently using a script at the file level that uses rsync to check how many files have been deleted before backing up data in order to prevent ransomware. Once a defined limit is reached, an emergency stop is triggered.

Isn't this principle particularly suitable for a PBS? For example, more than 10% new blocks and stop?
well, there definitely are legitimate backup chains where that can happen, but if you know your usual delta, you could implement such a heuristic.
 
I’m thinking about the second scenario: offsite PBS with a pull sync that isn’t compromised, since that’s my setup.

As I understand it, the typical way ransomware would poison the backups like this is to create a new encryption key. That new key is used to poison your backups, then it is deleted once they trigger the ransomware.

That first poisoned backup would mean no deduplication with the existing backup chain. That should mean an abnormal growth in storage before garbage collection can pull “older” backups, right? That should be a signal we can alert on.

I’d love some simple method to alert on that unexpected growth. Barring a simple method, I’d love for someone to suggest some way I could script it.
 
an alert if over X% of a snapshot are newly inserted chunks (either by count, or by size?) should be doable and might be an interesting feature, also to detect accidents, not just malicious activity.
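For anyone who wants to script such a heuristic themselves, the core of it (by chunk count) might look like the sketch below. How you obtain the chunk digests of two snapshots is left open here and is not shown; the function names and the 10% threshold are just examples, and legitimate backups can exceed such a threshold too.

```python
def new_chunk_ratio(previous_chunks, current_chunks):
    """Fraction of the current snapshot's chunks that were not in the previous one."""
    current = set(current_chunks)
    if not current:
        return 0.0
    return len(current - set(previous_chunks)) / len(current)

def should_alert(previous_chunks, current_chunks, threshold=0.10):
    """True if the share of new chunks exceeds `threshold`."""
    return new_chunk_ratio(previous_chunks, current_chunks) > threshold

# Placeholder digests; real ones would be the chunk hashes of two snapshots.
previous = {"a", "b", "c", "d"}
small_change = {"a", "b", "c", "e"}   # one of four chunks is new
reencrypted = {"w", "x", "y", "z"}    # nothing deduplicates, e.g. after a key change

print(new_chunk_ratio(previous, small_change), should_alert(previous, small_change))
print(new_chunk_ratio(previous, reencrypted), should_alert(previous, reencrypted))
```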