Datastore synced with Rclone broken?

Zeash

I'm trying to set up a remote sync to a friend's Raspberry Pi for my Proxmox Backup Server. I've been using the following script, which runs Rclone against an encrypted remote (Crypt) on a schedule:
Bash:
#!/bin/bash

# A simple script to back up PBS to Crypt remote

_lock_file=/root/.tertiary_backup.lock

_schedule_time=1:30
_max_duration="7.5h"
_cutoff_mode="soft"

_datastore_dir_name=Primary_Backup
_datastore_dir_path=/mnt/datastore/$_datastore_dir_name
_config_dir=$_datastore_dir_path/host_backup

_remote_user=user
_remote_ip=tailscale-ip
_remote_port=22
_remote_name=Tertiary_Backup_Crypt

_test_file_src_path=/etc/proxmox-backup/test-file.txt
_test_file_dest_dir=Primary_Backup/host_backup/proxmox-backup-etc

_remote_disk="/dev/disk/by-id/ID"

# Check if the lock file exists; if not, create it and set a trap on exit
...

checkExitCode () { ... }

repeatTask () { ... }

# Copies entire /etc directory to the datastore dir (removed after backup is complete)
prepareBackup () { ... }

pushBackup () {
        echo "Pushing backup..."
        # Quote variables to avoid word-splitting if paths ever contain spaces
        rclone sync --max-duration="$_max_duration" --cutoff-mode="$_cutoff_mode" --create-empty-src-dirs --metadata "$_datastore_dir_path" "$_remote_name:$_datastore_dir_name"
        checkExitCode $?
        echo "Removing host_backup directory..."
        rm -rf "$_config_dir"
}

checkBackup(){ ... }

case $1 in
        backup)
                prepareBackup
                pushBackup
                ;;
        check)
                checkBackup
                ;;
        *)
                echo 'This script requires either "backup", "check" or "blank" as an argument!'
                ;;
esac

The only important part for this post is the pushBackup function, so I truncated the rest of it.

I recently filled up my datastore completely by accident, and since I hadn't set up a ZFS quota, after some hacking around I decided to just restore this backup ('Secondary_Backup' is no longer in use, in case you're wondering). While it took a while to restore, everything seemed to be fine. Prune ran successfully after a chown -R backup:backup * and chown -R backup:backup .*; the GC and verify jobs didn't, though. Here are the logs for those jobs (mind you, I ran several of each, but they all had pretty much the same results):

- GC:

Code:
2024-09-09T15:27:44+02:00: starting garbage collection on store Primary_Backup
2024-09-09T15:27:44+02:00: Start GC phase1 (mark used chunks)
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk f7b65b7c09618b23924c9d3a28ab2b8a5c4d306c1227a8c64a07742cc765a6eb, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk fe916225bf7782c070d14d096163ff71394e7cab71632e2160456590ba4b3b9a, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk f742f0ee329275aa568d344fed943495be43054154292dfdea5026d1c8d1b5f7, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk f1e3d152485f0a503b22c739b46beea3545d1fb8be004d6278f5d1ec65cd131b, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk eeedc06ff9761feda923f963f52b19c0c2eaded30cd44206be4967d2acb78e60, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk ecb3f4f4eed7a66fdf653e404ded0e8ea94ff4727c45ea0f6f300a767b60a35c, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk ed288384e38cfbc67e01dea9abb39c254871e667180e45e2962b70af5d0d5b49, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk f174da4ef6fdba6bf14e6aff6367a8f723ad0e67981fadaf82a1ed9b1752782a, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk f5ab015e45cb5b0581ca08363bf7d911a4beb3b17ebb273573468032538e3d76, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
2024-09-09T15:27:44+02:00: WARN: warning: unable to access non-existent chunk ee5e382001833616284a2429623d318ade193e0ff6990d9dcd2eb1669131ca6d, required by "/mnt/datastore/Primary_Backup/ns/Production-Linux/ct/20003/2024-01-16T23:26:17Z/root.pxar.didx"
... ^^These lines continue for another 80MB worth of logs
2024-09-09T15:28:30+02:00: WARN: warning: unable to access non-existent chunk ff0fe820608b1251afcca617d5fdc0f781cae630e7fad5ba914f0bf549b77380, required by "/mnt/datastore/Primary_Backup/ns/Test/vm/1152/2024-08-10T23:00:04Z/drive-ide0.img.fidx"
2024-09-09T15:28:31+02:00: marked 100% (2801 of 2801 index files)
2024-09-09T15:28:31+02:00: Start GC phase2 (sweep unused chunks)
2024-09-09T15:28:31+02:00: processed 1% (3799 chunks)
2024-09-09T15:28:31+02:00: processed 2% (7523 chunks)
2024-09-09T15:28:31+02:00: processed 3% (11281 chunks)
...
2024-09-09T15:28:35+02:00: processed 90% (338432 chunks)
2024-09-09T15:28:35+02:00: processed 91% (342222 chunks)
2024-09-09T15:28:35+02:00: processed 92% (345883 chunks)
2024-09-09T15:28:35+02:00: Removed garbage: 0 B
2024-09-09T15:28:35+02:00: Removed chunks: 0
2024-09-09T15:28:35+02:00: Pending removals: 681.245 MiB (in 1852 chunks)
2024-09-09T15:28:35+02:00: Original data usage: 28.889 TiB
2024-09-09T15:28:35+02:00: On-Disk usage: 577.858 GiB (1.95%)
2024-09-09T15:28:35+02:00: On-Disk chunks: 345222
2024-09-09T15:28:35+02:00: Deduplication factor: 51.19
2024-09-09T15:28:35+02:00: Average chunk size: 1.714 MiB
2024-09-09T15:28:35+02:00: TASK WARNINGS: 340207

- Verify:

Code:
2024-09-09T21:34:53+02:00: Starting datastore verify job 'Primary_Backup:v-ce461b71-3984'
2024-09-09T21:34:53+02:00: verify datastore Primary_Backup
2024-09-09T21:34:53+02:00: found 37 groups
2024-09-09T21:34:53+02:00: verify group Primary_Backup:ct/2021 (2 snapshots)
2024-09-09T21:34:53+02:00: verify Primary_Backup:ct/2021/2023-10-14T00:01:27Z
2024-09-09T21:34:53+02:00:   check pct.conf.blob
2024-09-09T21:34:53+02:00:   check root.pxar.didx
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'fd8e98eea22852bb859e0eb48adee8ae16cbc696495a713c3cebfff575443e5e' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'ee660ebc5bd2f7e0bca32ec83a164cd7ba24b2d5db42b3ba67199c44cf20be50' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f65a537464235496a16e20e50e9a23decd9f0db4a0318c03d07276fc87349428' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f742f0ee329275aa568d344fed943495be43054154292dfdea5026d1c8d1b5f7' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f1e3d152485f0a503b22c739b46beea3545d1fb8be004d6278f5d1ec65cd131b' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f8efed4b90f1f41ff1f33856ab2755282f8610a43d3f15c0767fa400ed9598a4' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f5ab015e45cb5b0581ca08363bf7d911a4beb3b17ebb273573468032538e3d76' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f6361062fde0ee19dcb90bdcb1c5f04cfbe4892c1250eb3a5d11a3ade26dd264' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f5e6229e28f9788fc19f3ab13268a0751098a0cad75c5f20e423f71657074af1' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f1e5cb4986b085e04ab53857d28a4433ffdd4389df07182997e6a8debe6d8a49' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f174da4ef6fdba6bf14e6aff6367a8f723ad0e67981fadaf82a1ed9b1752782a' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'fdfc0f86b95ee66445479578721f67f98a4bde13f0bf105be317e55dc916d8c1' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f2a37b90dd20037068a6f4fc1cbe6af7307fd633db12dd0e639626c78528d0e9' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f34bb8d0f6f5eb7e05c3cf658e125978f902d40cd3106bf74e1560762d20dd11' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'ed288384e38cfbc67e01dea9abb39c254871e667180e45e2962b70af5d0d5b49' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'fe916225bf7782c070d14d096163ff71394e7cab71632e2160456590ba4b3b9a' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'fa4e7ac9d0c1e9af3dabdd44c4299ffcf38d218cba5845289fc3c91257bba3d1' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f866dacd5f84afb969388ed13cd1185f25beaec5ae76843c7621ac1146529cba' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'fdb3b70eb08c9b230b8a4730a2fb9b46a2c304ab804709c42a934a1dfe49a6c9' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00: can't verify chunk, load failed - store 'Primary_Backup', unable to load chunk 'f7b65b7c09618b23924c9d3a28ab2b8a5c4d306c1227a8c64a07742cc765a6eb' - No such file or directory (os error 2)
2024-09-09T21:34:55+02:00:   verified 246.09/709.55 MiB in 1.60 seconds, speed 153.57/442.78 MiB/s (20 errors)
2024-09-09T21:34:55+02:00: verify Primary_Backup:ct/2021/2023-10-14T00:01:27Z/root.pxar.didx failed: chunks could not be verified
... # ^^This is the result for pretty much all the backups
2024-09-09T22:26:14+02:00: percentage done: 100.00% (37/37 groups)
2024-09-09T22:26:14+02:00: Failed to verify the following snapshots/groups:
... a LOT of snapshots


I think the only reason some snapshots passed the verify jobs is the "Re-Verify After" option, which I have set to 30 days.

I've since deleted this datastore and started over, so if you need me to run some other tests on this, I'll have to restore it on another drive again. Otherwise my question is: how do I make this remote sync work and restore properly?
 
Hi,
did you verify that your script correctly synced the full datastore content to the remote, including the .chunks folder? If so, does the same apply in the reverse direction, i.e. for the restore back to the PBS?

I think the only reason some snapshots passed the verify jobs is the "Re-Verify After" option, which I have set to 30 days.
You can run a verify job without the skip verified flag set, that will cover all snapshots.
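For reference, that can also be triggered from the CLI. A rough sketch (the flag for not skipping already verified snapshots is an assumption on my part, please check proxmox-backup-manager verify --help):
Bash:
# Verify the whole datastore, including snapshots that already carry a verify
# state (--ignore-verified false is an assumption, see --help before relying on it)
proxmox-backup-manager verify Primary_Backup --ignore-verified false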

how do I make this remote sync work and restore properly?
First check the data has correctly been synced to the remote and that it was not corrupted there (what storage does your friend use on the Raspberry Pi, is that reliable?). Without the chunks, you cannot restore the contents.
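To compare the local datastore against the Crypt remote without downloading everything, something along these lines should work (a sketch using the remote name and paths from the script above; rclone cryptcheck compares checksums through the Crypt layer, whereas a plain rclone check against a Crypt remote can only compare sizes unless you add --download):
Bash:
# Compare the plain local datastore with its encrypted copy on the remote.
# --one-way only reports entries that are missing or differ on the destination.
rclone cryptcheck /mnt/datastore/Primary_Backup Tertiary_Backup_Crypt:Primary_Backup \
        --one-way --log-file=/root/cryptcheck.log --log-level=INFO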
 
missing chunks are possible if your manual sync processes files in the wrong order.. that's why we have a built-in sync that ensures indices and chunks are processed in a logical fashion ;)
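For reference, if the target machine were running PBS itself, it would look roughly like this (only a sketch with made-up names; the exact option spellings, e.g. --auth-id vs. the older --userid, should be checked against the docs or --help):
Bash:
# Register the other PBS instance as a remote (hypothetical host/credentials)
proxmox-backup-manager remote create friend-pbs \
        --host friend-pbs.example.com \
        --auth-id 'sync@pbs' \
        --password 'SECRET' \
        --fingerprint '64:d3:...'

# Sync jobs pull from the configured remote into the local --store, so this
# would normally be set up on the machine that receives the data
proxmox-backup-manager sync-job create tertiary-sync \
        --remote friend-pbs \
        --remote-store Primary_Backup \
        --store Tertiary_Backup \
        --schedule 'sun 01:30'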
 
missing chunks are possible if your manual sync processes files in the wrong order.. that's why we have a built-in sync that ensures indices and chunks are processed in a logical fashion ;)

True, that could happen if there were concurrent operations on the datastore (backup, prune, GC, ...) while the sync was ongoing. Can you exclude that, @Zeash?
 
You can run a verify job without the skip verified flag set, that will cover all snapshots.
Oh yeah, I failed to mention that I ran it afterwards without that flag, and everything was marked as corrupt.

First check the data has correctly been synced to the remote and that it was not corrupted there (what storage does your friend use on the Raspberry Pi, is that reliable?). Without the chunks, you cannot restore the contents.
It's definitely been synced successfully, at least according to Rclone. I wouldn't have been surprised by some corrupted chunks, since the synced data is stored on a single consumer-grade SSD (albeit a new one). Thing is, all chunks (or at least 90% of them) have been marked as corrupt, so I doubt it's the SSD.

True, that could happen if there were concurrent operations on the datastore (backup, prune, GC, ...) while the sync was ongoing. Can you exclude that, @Zeash?
Yep, I've scheduled backups, prunes, verifies, and GC so that they never happen simultaneously. Backups are the closest, occurring at midnight, with syncs starting at 1:30 (used to be once a month, now once a week). From what I could gather in PVE's task list, backups take up to half an hour to complete.

I could theoretically improve the reliability of the setup by mounting the remote as a datastore and running verify jobs on it, with occasional syncs, though I suspect those verify jobs would take 10+ hours given the available bandwidth between the PBS and the remote machine.
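Untested, but roughly what I have in mind (mount point and datastore name are made up; whether PBS accepts an already populated directory like this is exactly the part I'd have to test first):
Bash:
# Mount the Crypt remote read-only (names as in my script above)
rclone mount Tertiary_Backup_Crypt: /mnt/tertiary --read-only --daemon

# Then point a datastore at the synced directory tree and run verify jobs on it
proxmox-backup-manager datastore create Tertiary_Backup /mnt/tertiary/Primary_Backup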
 
I am not sure what more we can tell you, other than rclone seems to be "losing" chunks.. there's a reason PBS comes with its own sync feature that handles this correctly, even in the face of concurrent tasks. without a lot more logs it will be hard to find out what is actually going on..
 