Backup rpool/ROOT/pve-1 for Disaster Recovery strategy

gabrimox

Hi all,
I'm a new Proxmox user and I'm trying to work out the best strategy to back up the root PVE partition.
Proxmox is installed on an SSD, and the VM/LXC configs live there too, but external SSDs (passed through) are used for data storage.

I'd like your feedback on the best DR strategy in case of disk failure (RAID is used only for data storage, not for the root disk).
My combo is:
rsync /etc --> external FAT32 disk (slow recovery; sketch below)
zfs send/recv --> external disk (fast recovery: install Proxmox on a new disk, then restore the pool and mount it on /)

The external disks for both are attached to a TrueNAS VM inside Proxmox and shared via NFS.
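
A minimal sketch of the rsync leg (assuming the NFS share from the TrueNAS VM is mounted at /mnt/nfsbackup; path and filenames are examples, adjust to your setup):

Code:
#!/bin/bash
set -e

# destination on the NFS-shared external disk (example path)
DEST="/mnt/nfsbackup/pve-etc"
mkdir -p "$DEST"

# FAT32 cannot store Unix permissions/ownership, so pack /etc into a tarball
# instead of a plain file copy; one archive per day
tar czf "${DEST}/etc-$(date +%F).tar.gz" -C / etc

# alternative if the target filesystem preserves permissions:
# rsync -a --delete /etc/ "${DEST}/etc/"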

I asked ChatGPT for a script for the second point with the following requirements:
  • ✅ You want to backup rpool/ROOT/pve-1 recursively
  • ✅ You want to avoid mounting the replicated datasets (to prevent conflicts like a second / mount)
  • ✅ You still want daily backups at 3 AM, retention of 10 days, and basic verification
  • Snapshot exists on destination
  • ✅ Dataset is not mounted (as intended)
  • ✅ zpool status of backup pool is healthy
  • ✅ zfs scrub is recent or run manually after backup
  • ✅ Optionally: compare checksums of source vs. backup (if a file-level check is needed)
Code:
#!/bin/bash

set -e

# Variables
SRC="rpool/ROOT/pve-1"
DST="nfsbackuppool/backups/pve-1"
SNAP_PREFIX="autosnap"
DATE=$(date +%F)
SNAP_NAME="${SNAP_PREFIX}-${DATE}"
LOG_FILE="/var/log/zfs_nfs_backup.log"
BACKUP_POOL="nfsbackuppool"

log() {
  echo "$(date '+%F %T') $*" | tee -a "$LOG_FILE"
}

# 1. Create recursive snapshot
log "[INFO] Creating snapshot: ${SRC}@${SNAP_NAME}"
zfs snapshot -r "${SRC}@${SNAP_NAME}"

# 2. Send snapshot recursively, unmounted
log "[INFO] Sending snapshot recursively to ${DST}"
zfs send -R "${SRC}@${SNAP_NAME}" | zfs receive -uF "$DST"

# 3. Prune old snapshots on the source (keep the 10 most recent)
log "[INFO] Pruning old snapshots from source"
zfs list -H -t snapshot -o name -s creation "${SRC}" | \
  grep "@${SNAP_PREFIX}-" | \
  head -n -10 | xargs -r -n1 zfs destroy

# 4. Prune old snapshots on the destination (keep the 10 most recent)
log "[INFO] Pruning old snapshots from destination"
zfs list -H -t snapshot -o name -s creation "${DST}" | \
  grep "@${SNAP_PREFIX}-" | \
  head -n -10 | xargs -r -n1 zfs destroy

# 5. Verify snapshot exists on destination
log "[INFO] Verifying snapshot on destination"
if zfs list -t snapshot "${DST}@${SNAP_NAME}" >/dev/null 2>&1; then
  log "[OK] Snapshot exists: ${DST}@${SNAP_NAME}"
else
  log "[ERROR] Snapshot not found: ${DST}@${SNAP_NAME}"
  exit 1
fi

# 6. Check that the backup dataset is not mounted
MOUNTED=$(zfs get -H -o value mounted "${DST}")
if [ "$MOUNTED" = "no" ]; then
  log "[OK] Dataset is not mounted (as expected)"
else
  log "[WARNING] Dataset is mounted unexpectedly"
fi

# 7. Check zpool health
log "[INFO] Checking backup pool status"
# "zpool status -x <pool>" prints "pool '<pool>' is healthy" (or "all pools are healthy") when everything is fine
POOL_STATUS=$(zpool status -x "$BACKUP_POOL")
if [[ "$POOL_STATUS" == *"is healthy" || "$POOL_STATUS" == "all pools are healthy" ]]; then
  log "[OK] Backup zpool '$BACKUP_POOL' is healthy"
else
  log "[ERROR] zpool '$BACKUP_POOL' is not healthy"
  zpool status "$BACKUP_POOL" >> "$LOG_FILE"
  exit 1
fi

# 8. Run scrub and wait (optional - long running)
log "[INFO] Running scrub on backup pool: $BACKUP_POOL"
zpool scrub "$BACKUP_POOL"

# Optional: wait for scrub completion (skip if not wanted)
# log "[INFO] Waiting for scrub to complete..."
# while zpool status "$BACKUP_POOL" | grep -q "scrub in progress"; do sleep 10; done
# log "[OK] Scrub completed"

log "[SUCCESS] Backup completed and verified successfully."
exit 0


What do you think?
Do you have any suggestions?

An important point I found during my tests (and pointed out to ChatGPT): take the snapshot recursively so no dataset is missed, and avoid auto-mounting the replica. During my snapshot tests on the external disk I hit slow performance and other strange behaviour, and then noticed that / was mounted on both the SSD and the external disk!
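
For reference, the two things that prevent that double mount are receiving with -u and making sure the replica can never auto-mount itself over / (a sketch using the dataset names from the script above; the snapshot name is just an example):

Code:
# receive without mounting
zfs send -R rpool/ROOT/pve-1@autosnap-2025-01-01 | zfs receive -uF nfsbackuppool/backups/pve-1

# make sure the replica never mounts itself over / at import or boot
zfs set canmount=noauto nfsbackuppool/backups/pve-1
zfs set mountpoint=/backups/pve-1 nfsbackuppool/backups/pve-1

# check
zfs get -H -o value mounted,mountpoint nfsbackuppool/backups/pve-1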
 
Did you test rolling a snapshot back onto Proxmox on a new HDD?
No, but I made a new, much more reliable script.
I will post it as soon as I'm back from holidays.

Anyway, I'm thinking of using TrueNAS and its Replication Task feature; it looks promising...
 
rootfs or hdd (bootdisk) is broken...
For those scenarios, and as a first level of mitigation, I use
  • snapshots of rpool/ROOT/pve-1 in case a change renders the OS itself unusable, and
  • RAID 1 / ZFS mirror for hardware failures,
which covers both use cases (a minimal sketch follows below). Replication via zfs send/receive or proxmox-backup-client-based backups to PBS form the second level.
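
A sketch of that first level (device paths are examples; on Proxmox the second disk also needs its boot partitions prepared, e.g. with proxmox-boot-tool):

Code:
# before a risky change: local safety snapshot of the root dataset
zfs snapshot rpool/ROOT/pve-1@pre-upgrade

# if the change breaks the OS, roll back (from a rescue environment if it no longer boots)
zfs rollback rpool/ROOT/pve-1@pre-upgrade

# hardware-failure mitigation: attach a second disk to turn the single-disk rpool into a mirror
zpool attach rpool /dev/disk/by-id/ata-OLDDISK-part3 /dev/disk/by-id/ata-NEWDISK-part3
zpool status rpool   # wait for the resilver to finish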
 
OK, final script here, thanks ChatGPT:
- create a snapshot
- retention as you like, delete all older snapshots
- send to a USB disk with integrity/verification checks
- pruning job on both src/dst
- safe mountpoint (not / !) (in my case the pool is backup_usb and it mounts on /backup_usb, so replace with your own paths)
- send a message to Telegram on failure

Code:
#!/bin/bash

set -euo pipefail

# ===== PATH & COMMANDS =====
PATH=/usr/sbin:/usr/bin:/sbin:/bin

ZFS=/usr/sbin/zfs
ZSTREAMDUMP=/usr/sbin/zstreamdump
DATE=/bin/date
CAT=/bin/cat
CURL=/usr/bin/curl

# ===== VARIABLES =====
SRC_DATASET="rpool/ROOT/pve-1"
DST_POOL="backup_usb"
DST_DATASET="${DST_POOL}/${SRC_DATASET#*/}"
SNAP_PREFIX="snap"
DATE_FMT=$($DATE +%d_%m_%y_%H:%M)
SNAP_NAME="${SNAP_PREFIX}-${DATE_FMT}"

DST_MOUNTPOINT="/backup_usb/ROOT/pve-1"

TMP_STREAM="/var/tmp/sendstream-${SNAP_NAME}.dat"   # staged on the local disk; make sure /var/tmp has room for a full stream
LOGFILE="/var/log/zfs-backup.log"

TELEGRAM_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
TELEGRAM_CHAT_ID="xxxxxxxxxxxxxxxxxxxxxxxxxx"

RETENTION=30  # keep last 30 snapshots

# ===== LOGGING =====
exec > >(tee -a "$LOGFILE") 2>&1

echo "========== $(date) =========="
echo "[INFO] Backup started for ${SRC_DATASET} → ${DST_DATASET}"

# ===== FUNCTIONS =====
send_telegram() {
  local msg="$1"
  echo "[TELEGRAM] $msg"
  $CURL -s -X POST "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
    -d chat_id="${TELEGRAM_CHAT_ID}" \
    -d text="$msg" >/dev/null
}

error_handler() {
  local last_exit=$?
  local lineno=$1
  echo "[ERROR] Script failed at line $lineno with exit code $last_exit"
  send_telegram "❌ ZFS backup failed at line $lineno with exit code $last_exit.
Source: ${SRC_DATASET}
Destination: ${DST_DATASET}"
  exit $last_exit
}

trap 'error_handler $LINENO' ERR

prune_snapshots() {
    local dataset="$1"
    echo "[INFO] Pruning snapshots for dataset $dataset..."
    if ! $ZFS list -H -t snapshot -o name "$dataset" >/dev/null 2>&1; then
        echo "[INFO] Dataset $dataset does not exist, skipping pruning"
        return
    fi
    mapfile -t snaps < <($ZFS list -H -t snapshot -o name -s creation "$dataset" | grep "^${dataset}@${SNAP_PREFIX}-")
    echo "[INFO] Found ${#snaps[@]} snapshots with prefix $SNAP_PREFIX"
    if [ "${#snaps[@]}" -le "$RETENTION" ]; then
        echo "[INFO] No pruning needed, keeping all ${#snaps[@]} snapshots"
        return
    fi
    for old_snap in "${snaps[@]:0:${#snaps[@]}-$RETENTION}"; do
        echo "[INFO] Destroying old snapshot $old_snap"
        $ZFS destroy "$old_snap"
    done
    echo "[INFO] Pruning completed for $dataset"
}

# ===== SNAPSHOT CREATION =====
echo "[INFO] Creating snapshot ${SRC_DATASET}@${SNAP_NAME} ..."
$ZFS snapshot -r "${SRC_DATASET}@${SNAP_NAME}"
echo "[INFO] Snapshot created"

echo "[INFO] Listing source snapshots:"
$ZFS list -H -t snapshot -o name -s creation "${SRC_DATASET}" | grep "^${SRC_DATASET}@${SNAP_PREFIX}-" || echo "[INFO] No snapshots found"

echo "[INFO] Listing destination snapshots:"
if $ZFS list -H -t snapshot -o name "${DST_DATASET}" >/dev/null 2>&1; then
    $ZFS list -H -t snapshot -o name -s creation "${DST_DATASET}" | grep "^${DST_DATASET}@${SNAP_PREFIX}-" || echo "[INFO] No snapshots found"
else
    echo "[INFO] Destination dataset ${DST_DATASET} does not exist or has no snapshots yet"
fi

# ===== FIND LAST COMMON SNAPSHOT =====
echo "[INFO] Finding last common snapshot between source and destination"
mapfile -t src_snaps < <($ZFS list -H -t snapshot -o name -s creation "${SRC_DATASET}" | grep "^${SRC_DATASET}@${SNAP_PREFIX}-")
dst_snaps=()
if $ZFS list -H -t snapshot -o name "${DST_DATASET}" >/dev/null 2>&1; then
    mapfile -t dst_snaps < <($ZFS list -H -t snapshot -o name -s creation "${DST_DATASET}" | grep "^${DST_DATASET}@${SNAP_PREFIX}-")
fi

last_common=""
for (( i=${#src_snaps[@]}-1; i>=0; i-- )); do
  snap=${src_snaps[i]#${SRC_DATASET}@}
  for dst_snap_full in "${dst_snaps[@]}"; do
    dst_snap=${dst_snap_full#${DST_DATASET}@}
    if [[ "$snap" == "$dst_snap" ]]; then
      last_common="$snap"
      break 2
    fi
  done
done

if [[ -n "$last_common" ]]; then
  echo "[INFO] Last common snapshot: $last_common"
else
  echo "[INFO] No common snapshot found; first full backup will be sent"
fi

# ===== GENERATE SEND STREAM =====
if [[ -n "$last_common" ]]; then
  echo "[INFO] Generating incremental send stream from $last_common → $SNAP_NAME ..."
  # note: the incremental send is not recursive (-R); fine as long as pve-1 has no child datasets
  $ZFS send -v --large-block --compressed -I "${SRC_DATASET}@${last_common}" "${SRC_DATASET}@${SNAP_NAME}" > "$TMP_STREAM"
else
  echo "[INFO] Generating full recursive send stream ..."
  $ZFS send -v --large-block --compressed -R "${SRC_DATASET}@${SNAP_NAME}" > "$TMP_STREAM"
fi
echo "[INFO] Send stream created at $TMP_STREAM"

# ===== VERIFY & RECEIVE =====
echo "[INFO] Verifying send stream with zstreamdump ..."
if $ZSTREAMDUMP -v < "$TMP_STREAM"; then
  echo "[INFO] Stream verification passed, receiving dataset ..."
  $CAT "$TMP_STREAM" | $ZFS recv -uF "${DST_DATASET}"
else
  send_telegram "❌ ZFS send stream verification FAILED for snapshot ${SNAP_NAME}!"
  rm -f "$TMP_STREAM"
  exit 1
fi
echo "[INFO] Receive completed"

rm -f "$TMP_STREAM"
echo "[INFO] Temporary send stream deleted"

# ===== SNAPSHOT PRUNING =====
prune_snapshots "$SRC_DATASET"
prune_snapshots "$DST_DATASET"

# ===== SET MOUNTPOINT =====
echo "[INFO] Setting mountpoint ${DST_MOUNTPOINT} on ${DST_DATASET}"
$ZFS set mountpoint="${DST_MOUNTPOINT}" "${DST_DATASET}"

echo "[INFO] Backup job completed successfully."
echo "============================================="

It works like a charm.

Do you suggest exporting/importing the zpool at the end/start of the process?
Consider that the USB disk is always attached... but nothing writes to it outside of the replication script.
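
Something like this is what I have in mind (pool name as in the script above):

Code:
# at the start of the job: import the pool without mounting any of its datasets
zpool import -N backup_usb

# ... replication ...

# at the end: flush everything and detach the pool so nothing else can touch it
zpool export backup_usb

Since the disk is permanently attached and nothing else writes to it, I'm not sure the export/import adds much beyond protecting against accidental writes.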
 