Garbage collection breaks backups on NFS Share

Taledo · Mar 16, 2023

Hey all,

Foremost, I am aware that backups on NFS shares aren't the way to go. This is experimental, but I'd still like to share this issue.

We're running a PBS that has an NFS Share on a Dell DD6400. As the PBS doesn't support the ddboost protocol, we're using the NFS feature to store backups to the DD6400.

We've had a previous issue where the verify job failed, but we weren't able to know what had broken the backups.

Today, after a week of successful backups with a verify OK, I launched Garbage collection on the datastore.

As a result, all the backups on that datastore are now failing the verify.

We're running this version of the PBS :

proxmox-backup-server 2.3.3-1 running version: 2.3.3

Here's the start of the failed verify log :

2023-03-16T12:00:00+01:00: Starting datastore verify job 'DD6400:v-77eb83f2-d34f'
2023-03-16T12:00:00+01:00: task triggered by schedule '*:0/30'
2023-03-16T12:00:00+01:00: verify datastore DD6400
2023-03-16T12:00:00+01:00: found 2 groups
2023-03-16T12:00:00+01:00: verify group DD6400:vm/20403 (14 snapshots)
2023-03-16T12:00:00+01:00: verify DD6400:vm/20403/2023-03-15T17:30:01Z
2023-03-16T12:00:00+01:00: check qemu-server.conf.blob
2023-03-16T12:00:00+01:00: check drive-scsi0.img.fidx
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk 'da0cd4e5c9756da3f46a9a14b016e37db227f34623107d2a42fa067cb5ee1eb2' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '756bfb77821e7f535767bb45d83454efaa9ac769f30e65004693a14c039b0fe3' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '02755e0ab8b3b43fe3c51a9f9a5f74d3b6d654a2abc8180e56b8a47edf6bc482' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '5ec9b9c7411deb2de1f2a6bf102105c748163f9fb78c9f156fd18e47f55227f1' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '4e8c0e02aa173841cad1044836ee2138158fd242260295c318312ed0e16f9d02' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '0dd918e52743b70536de78889e7e414174574dfb58e343b9844536bde2bb662e' - No such file or directory (os error 2)

Somehow, either the GC or the DD6400 deleted backup chunk files.

As always, I can provide additional information if needed.

Cheers,

Taledo

Dunuin · Mar 16, 2023

GC needs atime enabled. Does that work?

Taledo · Mar 16, 2023

This is the line if my fstab :

Code:

theIP:/data/col1/shareName /mnt/dd6400 nfs defaults,user,auto,_netdev,bg,local_lock=all

Here's the mount output for that share :

Code:

(rw,nosuid,nodev,noexec,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=alsoTheIp,mountvers=3,mountport=2052,mountproto=udp,local_lock=all,addr=stillTheIP,user,_netdev)

I can try forcing atime to see if it makes a difference.

Dunuin · Mar 16, 2023

Relatime should be fine too. But does the underlaying storage support atime/relatime? This is sometimes disabled for better performance. GC will access all indexed chunks and then delete all chunks that doesn't got accessed for at least 24 hours and 5 minutes.

mow · Mar 16, 2023

Dunuin said:
Relatime should be fine too. But does the underlaying storage support atime/relatime? This is sometimes disabled for better performance. GC will access all indexed chunks and then delete all chunks that doesn't got accessed for t least 24 hours and 5 minutes.

From the mount manual:

relatime
Update inode access times relative to modify or change time. Access time is only updated if the previ‐
ous access time was earlier than the current modify or change time. (Similar to noatime, but it
doesn't break mutt or other applications that need to know if a file has been read since the last time
it was modified.)

Since Linux 2.6.30, the kernel defaults to the behavior provided by this option (unless noatime was
specified), and the strictatime option is required to obtain traditional semantics. In addition, since
Linux 2.6.30, the file's last access time is always updated if it is more than 1 day old.

So the last sentence is the reason for "24 hours and 5 minutes"?
I see a problem when the GC indexing runs longer than 5 minutes and some files have been accessed (and atime updated) just slightly less than 24h ago when the indexer gets to them.

Dunuin · Mar 16, 2023

mow said:
So the last sentence is the reason for "24 hours and 5 minutes"?

Jup, this is what makes relatime possible. Would otherwise be really bad if GC would delete those chunks because the default relatime of many filesystems was used.

Taledo · Mar 16, 2023

Dunuin said:
Relatime should be fine too. But does the underlaying storage support atime/relatime? This is sometimes disabled for better performance. GC will access all indexed chunks and then delete all chunks that doesn't got accessed for at least 24 hours and 5 minutes.

Documentation of the DD6400 mention using atime for its "locking" system, which basically protects files from being deleted (this feature isn't enabled on that specific NFS share)

I'll go and check if I see anthing on the web interface, and I'll try switching the from relatime to atime on the PBS.

Cheers

fabian · Mar 16, 2023

mow said:
From the mount manual:

So the last sentence is the reason for "24 hours and 5 minutes"?
I see a problem when the GC indexing runs longer than 5 minutes and some files have been accessed (and atime updated) just slightly less than 24h ago when the indexer gets to them.

the cutoff is determined at start of GC, not fresh for each chunk, so that is not an issue. it also takes running backup workers into account, so if you have a long running backup task started earlier than a day ago, the cutoff will actually be the start time of that task to ensure that all the chunks stored by that task which are not yet referenced or not referenced anymore by other snapshots are left alone and not treated as garbage.

mow · Mar 16, 2023

Taledo said:
I'll go and check if I see anthing on the web interface, and I'll try switching the from relatime to atime on the PBS.

That won't change anything. As I quoted, the default for atime has been relatime since 2.6.30.
For immediate update on access, you would need strictatime.

Dunuin said:
Jup, this is what makes relatime possible. Would otherwise be really bad if GC would delete those chunks because the default relatime of many filesystems was used.

It will still delete chunks that were just barely not updated when the indexer accessed them, but are older than 24h+5m by the time the gc runs.
On slow disks or with a lot of overhead like nfs, it can totally happen that the indexer runs longer than 5 minutes.
I surely hope it really is 24h + runtime of indexer + 5m.

fabian · Mar 16, 2023

like I said:

https://git.proxmox.com/?p=proxmox-...35ca78acd61b746cba709e84df39932;hb=HEAD#l1036 (the two relevant timestamps are retrieved before marking starts)

https://git.proxmox.com/?p=proxmox-...64f8ff1bdcbd2ad9d988c1762c658407;hb=HEAD#l359 (sweeping will determine the cutoff based on those two values)

Taledo · Mar 16, 2023

I didn't have time to check yet on if the DD6400 supports atime.

On a general note, what would be the consequences of disabling GC on a PBS?

Cheers

Dunuin · Mar 16, 2023

Taledo said:
On a general note, what would be the consequences of disabling GC on a PBS?

Nothing will ever be deleted and your datastore will grow and grow until it runs out of space.

Search

Search

Garbage collection breaks backups on NFS Share

Taledo

Well-Known Member

Dunuin

Distinguished Member

Taledo

Well-Known Member

Dunuin

Distinguished Member

mow

Active Member

Dunuin

Distinguished Member

Taledo

Well-Known Member

fabian

Proxmox Staff Member

mow

Active Member

fabian

Proxmox Staff Member

Taledo

Well-Known Member

Dunuin

Distinguished Member

We value your privacy