Garbage collection breaks backups on NFS Share

Taledo

Member
Nov 20, 2020
Hey all,

First of all, I'm aware that backups on NFS shares aren't the way to go. This is experimental, but I'd still like to share this issue.

We're running a PBS that stores its backups on an NFS share exported by a Dell DD6400. As PBS doesn't support the DD Boost protocol, we're using the appliance's NFS feature to store backups on the DD6400.

We've had a previous issue where the verify job failed, but we couldn't determine what had broken the backups.

Today, after a week of successful backups with a verify OK, I launched Garbage collection on the datastore.

As a result, all the backups on that datastore are now failing the verify.

We're running this version of PBS:

proxmox-backup-server 2.3.3-1 running version: 2.3.3

Here's the start of the failed verify log:

2023-03-16T12:00:00+01:00: Starting datastore verify job 'DD6400:v-77eb83f2-d34f'
2023-03-16T12:00:00+01:00: task triggered by schedule '*:0/30'
2023-03-16T12:00:00+01:00: verify datastore DD6400
2023-03-16T12:00:00+01:00: found 2 groups
2023-03-16T12:00:00+01:00: verify group DD6400:vm/20403 (14 snapshots)
2023-03-16T12:00:00+01:00: verify DD6400:vm/20403/2023-03-15T17:30:01Z
2023-03-16T12:00:00+01:00: check qemu-server.conf.blob
2023-03-16T12:00:00+01:00: check drive-scsi0.img.fidx
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk 'da0cd4e5c9756da3f46a9a14b016e37db227f34623107d2a42fa067cb5ee1eb2' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '756bfb77821e7f535767bb45d83454efaa9ac769f30e65004693a14c039b0fe3' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '02755e0ab8b3b43fe3c51a9f9a5f74d3b6d654a2abc8180e56b8a47edf6bc482' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '5ec9b9c7411deb2de1f2a6bf102105c748163f9fb78c9f156fd18e47f55227f1' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '4e8c0e02aa173841cad1044836ee2138158fd242260295c318312ed0e16f9d02' - No such file or directory (os error 2)
2023-03-16T12:00:08+01:00: can't verify chunk, load failed - store 'DD6400', unable to load chunk '0dd918e52743b70536de78889e7e414174574dfb58e343b9844536bde2bb662e' - No such file or directory (os error 2)

Somehow, either the GC or the DD6400 deleted backup chunk files.


As always, I can provide additional information if needed.


Cheers,


Taledo
 
This is the line in my fstab:

Code:
theIP:/data/col1/shareName /mnt/dd6400 nfs defaults,user,auto,_netdev,bg,local_lock=all

Here's the mount output for that share:

Code:
(rw,nosuid,nodev,noexec,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=alsoTheIp,mountvers=3,mountport=2052,mountproto=udp,local_lock=all,addr=stillTheIP,user,_netdev)

I can try forcing atime to see if it makes a difference.
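In case it helps anyone reproduce this, here's a minimal probe of whether the storage behind a mount point updates atime on read at all. The MNT variable and its /tmp fallback are just for illustration; point it at the real share, e.g. MNT=/mnt/dd6400.

```shell
#!/bin/sh
# Probe whether the filesystem behind $MNT updates atime when a file is read.
# MNT defaults to /tmp purely for demonstration; set it to the NFS mount.
MNT="${MNT:-/tmp}"
f="$MNT/atime-probe.$$"

echo probe > "$f"
# Backdate the access time so that even relatime is obliged to update it
# on the next read (relatime always updates an atime older than 1 day).
touch -a -d '2 days ago' "$f"
before=$(stat -c %X "$f")

cat "$f" > /dev/null
after=$(stat -c %X "$f")
rm -f "$f"

if [ "$after" -gt "$before" ]; then
    echo "atime updated ($before -> $after): GC should see chunk accesses"
else
    echo "atime NOT updated: GC may wrongly treat chunks as unused"
fi
```

If the second message shows up on the DD6400 share, that would explain GC deleting live chunks.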
 
Relatime should be fine too. But does the underlying storage support atime/relatime? This is sometimes disabled for better performance. GC will access all indexed chunks and then delete all chunks that haven't been accessed for at least 24 hours and 5 minutes.
 
Relatime should be fine too. But does the underlying storage support atime/relatime? This is sometimes disabled for better performance. GC will access all indexed chunks and then delete all chunks that haven't been accessed for at least 24 hours and 5 minutes.
From the mount manual:
relatime
    Update inode access times relative to modify or change time. Access time is only updated if the previous access time was earlier than the current modify or change time. (Similar to noatime, but it doesn't break mutt or other applications that need to know if a file has been read since the last time it was modified.)

    Since Linux 2.6.30, the kernel defaults to the behavior provided by this option (unless noatime was specified), and the strictatime option is required to obtain traditional semantics. In addition, since Linux 2.6.30, the file's last access time is always updated if it is more than 1 day old.
So the last sentence is the reason for "24 hours and 5 minutes"?
I see a problem when the GC indexing runs longer than 5 minutes and some files have been accessed (and atime updated) just slightly less than 24h ago when the indexer gets to them.
 
So the last sentence is the reason for "24 hours and 5 minutes"?
Yup, this is what makes relatime workable. It would otherwise be really bad if GC deleted those chunks just because the filesystem was using its default of relatime.
 
Relatime should be fine too. But does the underlying storage support atime/relatime? This is sometimes disabled for better performance. GC will access all indexed chunks and then delete all chunks that haven't been accessed for at least 24 hours and 5 minutes.

The DD6400 documentation mentions using atime for its "locking" system, which basically protects files from being deleted (that feature isn't enabled on this specific NFS share).

I'll go and check if I see anything in the web interface, and I'll try switching from relatime to atime on the PBS.


Cheers
 
From the mount manual:

So the last sentence is the reason for "24 hours and 5 minutes"?
I see a problem when the GC indexing runs longer than 5 minutes and some files have been accessed (and atime updated) just slightly less than 24h ago when the indexer gets to them.

The cutoff is determined at the start of GC, not freshly for each chunk, so that is not an issue. It also takes running backup workers into account: if a long-running backup task was started more than a day ago, the cutoff is moved back to the start time of that task, to ensure that chunks stored by that task which are not yet referenced, or no longer referenced, by other snapshots are left alone and not treated as garbage.
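As a rough sanity check (assuming the default on-disk layout, where chunks live under <datastore>/.chunks, and ignoring the running-worker adjustment described above), you can preview how many chunk files currently fall outside that cutoff:

```shell
#!/bin/sh
# Count chunk files whose atime is older than 24 hours and 5 minutes
# (1445 minutes), i.e. roughly what GC would consider garbage right now.
# DS defaults to the mount point used in this thread; adjust as needed.
DS="${DS:-/mnt/dd6400}"
if [ -d "$DS/.chunks" ]; then
    find "$DS/.chunks" -type f -amin +1445 | wc -l
else
    echo "no .chunks directory under $DS"
fi
```

Right after a completed backup plus verify, that count should be small; if it covers nearly all chunks, atime isn't being updated on the share.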
 
I'll go and check if I see anything in the web interface, and I'll try switching from relatime to atime on the PBS.
That won't change anything. As I quoted, the default for atime has been relatime since 2.6.30.
For immediate update on access, you would need strictatime.
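For example (assuming the share is mounted at /mnt/dd6400 and you have root; the fstab line below is the one from earlier in the thread with strictatime appended):

```shell
# One-off test: remount the share with strictatime so every read
# updates atime immediately, taking relatime out of the picture.
mount -o remount,strictatime /mnt/dd6400

# Permanent variant: add strictatime to the options in /etc/fstab, e.g.:
# theIP:/data/col1/shareName /mnt/dd6400 nfs defaults,user,auto,_netdev,bg,local_lock=all,strictatime
```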

Yup, this is what makes relatime workable. It would otherwise be really bad if GC deleted those chunks just because the filesystem was using its default of relatime.
It will still delete chunks whose atime was just barely not updated when the indexer accessed them, but which are older than 24h+5m by the time the GC deletion phase runs.
On slow disks, or with the overhead of something like NFS, the indexing phase can easily run longer than 5 minutes.
I surely hope it really is 24h + indexer runtime + 5m.
 
I haven't had time yet to check whether the DD6400 supports atime.

On a general note, what would be the consequences of disabling GC on a PBS?

Cheers
 
