Race between "Prune & GC" and new backups leads to corrupt dumps

NdK73

Hello.

It seems the prune & GC job does not account for data generated by concurrent backups.

In my case GC takes 17 days (on 24 16TB SAS disks configured as a single pool containing 3 RAIDZ3 vdevs - it's simply economically unfeasible to have the same amount of storage on SSD just for backups!), and all the backups run during that interval are corrupted (IIUC, GC removes some chunks that the backups still reference as available).
Even worse: backups run AFTER the GC completes are often corrupted too, unless I first delete the last corrupted backup.

Please consider adding an option to completely avoid chunking, leaving only ZFS block-level deduplication (if and only if the user keeps it enabled... I really don't like the idea of deduplicating something that exists precisely to provide redundancy!). Call it "pseudo-tape mode"? :)
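For context, a minimal sketch of how the ZFS block-level deduplication mentioned above is toggled (standard zfs syntax; the pool/dataset name "tank/backups" is a placeholder):

Code:
# per-dataset ZFS deduplication switch; "off" keeps full on-disk redundancy
zfs set dedup=off tank/backups
# show the current setting
zfs get dedup tank/backups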
 
Hi,
garbage collection takes the oldest active writer instance into account when calculating the cutoff time below which chunks become deletable.
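For illustration, a minimal sketch of that cutoff rule as I understand it from the documentation (the 24h + 5min atime grace window is documented; the writer start time below is a placeholder):

Code:
#!/bin/sh
# Sketch: a chunk is only swept in phase2 if its atime is older than BOTH
# "now - 24h - 5min" (the atime safety window) and the start time of the
# oldest still-running backup writer.
now=$(date +%s)
grace_cutoff=$((now - 24*3600 - 5*60))
oldest_writer_start=$(date -d '2024-09-07 02:00' +%s)  # placeholder value
if [ "$grace_cutoff" -lt "$oldest_writer_start" ]; then
    cutoff=$grace_cutoff
else
    cutoff=$oldest_writer_start
fi
echo "chunks with atime before $(date -d "@$cutoff") are sweep candidates"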

So please do share:
  • The output of proxmox-backup-manager version --verbose
  • The full task logs of the concurrent backups
  • The full task log of the garbage collection
  • Also, if using the fast incremental mode, make sure that the previous backup is fully verified before running a new one
(Example commands for gathering this information are sketched below.)
See also: https://bugzilla.proxmox.com/show_bug.cgi?id=5710
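A rough sketch of collecting that information on the PBS host (the UPID below is a placeholder to be copied from the task list):

Code:
proxmox-backup-manager version --verbose
# list recent tasks to find the UPIDs of the backup and GC tasks
proxmox-backup-manager task list
# dump the full log of one task (UPID is a placeholder)
proxmox-backup-manager task log 'UPID:pbs:...'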
 
Here it is. But packages got updated yesterday, after fixing the backups.

Code:
root@pbs:~# proxmox-backup-manager version --verbose
proxmox-backup                     3.2.0        running kernel: 5.15.131-2-pve
proxmox-backup-server              3.2.7-1      running version: 3.2.7       
proxmox-kernel-helper              8.1.0                                     
pve-kernel-5.15                    7.4-15                                     
proxmox-kernel-6.8                 6.8.12-2                                   
proxmox-kernel-6.8.12-2-pve-signed 6.8.12-2                                   
pve-kernel-5.15.158-2-pve          5.15.158-2                                 
pve-kernel-5.15.131-2-pve          5.15.131-3                                 
pve-kernel-5.15.126-1-pve          5.15.126-1                                 
pve-kernel-5.15.116-1-pve          5.15.116-1                                 
pve-kernel-5.15.108-1-pve          5.15.108-2                                 
pve-kernel-5.15.107-2-pve          5.15.107-2                                 
pve-kernel-5.15.83-1-pve           5.15.83-1                                 
pve-kernel-5.15.74-1-pve           5.15.74-1                                 
ifupdown2                          3.2.0-1+pmx9                               
libjs-extjs                        7.0.0-4                                   
proxmox-backup-docs                3.2.7-1                                   
proxmox-backup-client              3.2.7-1                                   
proxmox-mail-forward               0.2.3                                     
proxmox-mini-journalreader         1.4.0                                     
proxmox-offline-mirror-helper      unknown                                   
proxmox-widget-toolkit             4.2.3                                     
pve-xtermjs                        5.3.0-3                                   
smartmontools                      7.3-pve1                                   
zfsutils-linux                     2.2.6-pve1

This is the log of one of the backups that failed verification:
Code:
2024-09-29T01:01:37+02:00: starting new backup on datastore 'difa-bck': "ns/OPH/vm/156/2024-09-28T23:01:36Z"
2024-09-29T01:01:37+02:00: download 'index.json.blob' from previous backup.
2024-09-29T01:01:37+02:00: register chunks in 'drive-scsi0.img.fidx' from previous backup.
2024-09-29T01:01:37+02:00: download 'drive-scsi0.img.fidx' from previous backup.
2024-09-29T01:01:37+02:00: created new fixed index 1 ("ns/OPH/vm/156/2024-09-28T23:01:36Z/drive-scsi0.img.fidx")
2024-09-29T01:01:37+02:00: register chunks in 'drive-scsi1.img.fidx' from previous backup.
2024-09-29T01:01:37+02:00: download 'drive-scsi1.img.fidx' from previous backup.
2024-09-29T01:01:37+02:00: created new fixed index 2 ("ns/OPH/vm/156/2024-09-28T23:01:36Z/drive-scsi1.img.fidx")
2024-09-29T01:01:37+02:00: add blob "/mnt/datastore/difa-bck/ns/OPH/vm/156/2024-09-28T23:01:36Z/qemu-server.conf.blob" (568 bytes, comp: 568)
2024-09-29T01:01:37+02:00: add blob "/mnt/datastore/difa-bck/ns/OPH/vm/156/2024-09-28T23:01:36Z/fw.conf.blob" (42 bytes, comp: 42)
2024-09-29T01:01:40+02:00: Upload statistics for 'drive-scsi1.img.fidx'
2024-09-29T01:01:40+02:00: UUID: c20468f3429a431a897fed7ebf00f72a
2024-09-29T01:01:40+02:00: Checksum: 503f391ca4255b307dc7c42474effe92780e8425535f5385599b31c524c36274
2024-09-29T01:01:40+02:00: Size: 41943040
2024-09-29T01:01:40+02:00: Chunk count: 10
2024-09-29T01:01:40+02:00: Upload size: 46137344 (110%)
2024-09-29T01:01:40+02:00: Duplicates: 0+1 (10%)
2024-09-29T01:01:40+02:00: Compression: 70%
2024-09-29T01:01:40+02:00: successfully closed fixed index 2
2024-09-29T01:01:43+02:00: Upload statistics for 'drive-scsi0.img.fidx'
2024-09-29T01:01:43+02:00: UUID: 3cfad824cc124b2480d4856e37b45e78
2024-09-29T01:01:43+02:00: Checksum: d5eb49224c89d8f9d9e798734531ae43e710429dcdf2880297600dfb659f637b
2024-09-29T01:01:43+02:00: Size: 415236096
2024-09-29T01:01:43+02:00: Chunk count: 99
2024-09-29T01:01:43+02:00: Upload size: 406847488 (97%)
2024-09-29T01:01:43+02:00: Duplicates: 2+1 (3%)
2024-09-29T01:01:43+02:00: Compression: 25%
2024-09-29T01:01:43+02:00: successfully closed fixed index 1
2024-09-29T01:01:43+02:00: add blob "/mnt/datastore/difa-bck/ns/OPH/vm/156/2024-09-28T23:01:36Z/index.json.blob" (428 bytes, comp: 428)
2024-09-29T01:01:44+02:00: successfully finished backup
2024-09-29T01:01:44+02:00: backup finished successfully
2024-09-29T01:01:44+02:00: TASK OK

Last GC log is:
Code:
2024-09-07T02:00:00+02:00: starting garbage collection on store difa-bck
2024-09-07T02:00:00+02:00: task triggered by schedule 'sat *-1..7 02:00'
2024-09-07T02:00:00+02:00: Start GC phase1 (mark used chunks)
2024-09-08T08:52:34+02:00: marked 1% (7 of 643 index files)
2024-09-10T07:10:21+02:00: marked 2% (13 of 643 index files)
[...]
2024-09-24T11:06:36+02:00: marked 98% (631 of 643 index files)
2024-09-24T11:06:37+02:00: marked 99% (637 of 643 index files)
2024-09-24T11:06:39+02:00: marked 100% (643 of 643 index files)
2024-09-24T11:06:39+02:00: Start GC phase2 (sweep unused chunks)
2024-09-24T11:12:17+02:00: processed 1% (327924 chunks)
2024-09-24T11:17:04+02:00: processed 2% (654706 chunks)
2024-09-24T11:21:28+02:00: processed 3% (981548 chunks)
2024-09-24T11:25:28+02:00: processed 4% (1308214 chunks)
[...]
2024-09-24T16:20:15+02:00: processed 97% (31729290 chunks)
2024-09-24T16:23:26+02:00: processed 98% (32057492 chunks)
2024-09-24T16:26:36+02:00: processed 99% (32384733 chunks)
2024-09-24T16:29:46+02:00: Removed garbage: 602.015 GiB
2024-09-24T16:29:46+02:00: Removed chunks: 448780
2024-09-24T16:29:46+02:00: Pending removals: 15.978 GiB (in 12557 chunks)
2024-09-24T16:29:46+02:00: Original data usage: 824.648 TiB
2024-09-24T16:29:46+02:00: On-Disk usage: 74.832 TiB (9.07%)
2024-09-24T16:29:46+02:00: On-Disk chunks: 32250873
2024-09-24T16:29:46+02:00: Deduplication factor: 11.02
2024-09-24T16:29:46+02:00: Average chunk size: 2.433 MiB
2024-09-24T16:29:46+02:00: TASK OK

The log of the last verification is too big to be attached :(
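A possible way to cut it down to just the relevant lines before attaching (the UPID is a placeholder):

Code:
# keep only the error/corruption lines from the verification task log
proxmox-backup-manager task log 'UPID:pbs:...' | grep -Ei 'error|corrupt|failed'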
 
So the backup uses the fast incremental mode. Please do a full verification of at least the last backup snapshot of each backup group (better yet, of all of them, without the skip-verified flag), as I suspect that the previous backup used by the fast incremental mode is already corrupt.

If the previous snapshot is marked as corrupt after verification, the next backup run will not perform an incremental backup but rather re-upload all chunks to the server. That might help to "heal" some of the corrupted snapshots, which can be checked by a subsequent verification of the previously corrupt backup snapshots.
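For example, a full verification of the datastore can be started from the CLI (a sketch; "difa-bck" is the store name from the logs above, and per-group or per-snapshot selection may be more convenient via the GUI):

Code:
# verify all snapshots in the datastore; snapshots that fail get marked
# as corrupt, so the next backup of that group re-uploads all chunks
proxmox-backup-manager verify difa-bck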
 