Feature request - backup groups

kjj

Member
Mar 15, 2024
Right now, it is impossible to use Proxmox VE + Proxmox Backup Server to get a coherent snapshot of a multi-VM system. I'm thinking of a SharePoint farm where the database and the contents are on different servers, but the problem applies to many situations.

The problem is that each VM is backed up individually. The system tells the QEMU guest agent to ask the operating system to freeze writes to the disks, then it takes a snapshot, then it unfreezes the disks. When that VM is done, it moves on to the next VM, and so on. Because there is no coordination, the database backup might reference content on a different server that does not exist in that server's backup.
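To make the ordering concrete, here is a small simulation of today's per-VM sequence. The `freeze`/`snapshot`/`thaw` functions are stubs standing in for the real guest-agent and snapshot calls, so this runs without a cluster:

```shell
#!/usr/bin/env bash
# Simulation of today's uncoordinated per-VM backup ordering.
# freeze/snapshot/thaw are stubs for the real guest-agent and
# snapshot operations; VM IDs are made up.
EVENTS=()
freeze()   { EVENTS+=("freeze $1"); }
snapshot() { EVENTS+=("snapshot $1"); }
thaw()     { EVENTS+=("thaw $1"); }

for vmid in 101 102; do
  freeze "$vmid"
  snapshot "$vmid"
  thaw "$vmid"    # this VM resumes writes while the next VM is still live
done

printf '%s\n' "${EVENTS[@]}"
```

VM 102 keeps accepting writes during the whole time VM 101 is being captured, which is exactly the window where a cross-server transaction can slip in between the two backups.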

The solution appears to be fairly simple - define a backup group. When a backup is started for any VM in the group, send the freeze request to every server in the group, and while they are all frozen, snapshot them all. The system can then process the snapshots sequentially as usual, and the group of backups will be consistent with each other because they all reflect the state of the group of servers during a single freeze.
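The proposed group ordering, again with stubbed commands for illustration (no real cluster needed):

```shell
#!/usr/bin/env bash
# Simulation of the proposed backup-group ordering: freeze every
# member first, snapshot them all, then thaw them all. Stub
# functions only; VM IDs are made up.
GROUP=(101 102 103)
EVENTS=()
freeze()   { EVENTS+=("freeze $1"); }
snapshot() { EVENTS+=("snapshot $1"); }
thaw()     { EVENTS+=("thaw $1"); }

for vmid in "${GROUP[@]}"; do freeze "$vmid"; done     # all members frozen together
for vmid in "${GROUP[@]}"; do snapshot "$vmid"; done   # one consistent point in time
for vmid in "${GROUP[@]}"; do thaw "$vmid"; done       # resume writes everywhere

printf '%s\n' "${EVENTS[@]}"
```

Every snapshot lands inside the same freeze window, so no cross-server transaction can straddle the group of backups.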
 
You can put VMs and LXCs in pools and configure a backup job that targets a specific pool. So pools are basically backup groups, but they still don't cover the coordinated snapshot part. Since the developers don't read everything on the forum, you might want to file the feature request on bugzilla.proxmox.com and link it here for reference?

This would ensure that the developers notice it and other community members can chime in with their support.
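For reference, a pool and a pool-targeted backup run can be set up from the CLI roughly like this (the pool name `sp-farm`, the VM IDs, and the storage name are examples; run on a PVE node):

```shell
# Create a pool and add the farm's VMs to it (pool id is an example)
pvesh create /pools --poolid sp-farm
pvesh set /pools/sp-farm --vms 101,102,103

# Back up everything in the pool in one vzdump run
vzdump --pool sp-farm --storage pbs-backup --mode snapshot
```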
 
That smells like shifting the problems of a badly designed solution onto the backup job. In a multi-VM deployment it should always be possible to recover from partial outages, because those can have many different causes, not only backups.
I do a lot of async programming, and this is not much different: services can (and probably will) fail, and you need to recover from that gracefully. Everything else is flaky and bad design. I do NOT say your idea is not good - it just tells me there is a deeper underlying problem that should probably be resolved.
 
You don't understand the problem. The system recovers just fine from outages and crashes.

The problem is that while the system is running normally, the two (or more) servers are in communication. Content gets uploaded to one server and the database on the other server is updated to reflect that new content. If one server crashes, the system as a whole will flush out any uncommitted transactions as part of the crash recovery.

The problem is that the Proxmox backup currently can't be coordinated between the two servers. In the "Content gets uploaded to one server and the database on the other server is updated" transaction above, imagine if the backup for one server happens before that transaction, and then the other server gets backed up after it.

If you try to restore those two servers, the state of one server includes a fully verified and committed transaction that the other server knows nothing about.

My proposal is to coordinate the IO freeze for the two (or more) servers so that the backup for the entire multi-server system happens unambiguously either before or after that transaction instead of straddling it.
 
Ok, I understand, my bad. Even if your design is clean, there might be a problem on restore.
Why not just use vzdump hook scripts, or write your own script that freezes and backs up your VM group?

Probably you need to extend that script to stop some congestion jobs or whatever, to get a clean state (we don't want to freeze in the middle of a transaction), and then restart them after the freeze - that service downtime might not even be noticeable. But it's still a bit fishy: database servers can be clustered, and media servers too. And the data should not live on the VM anyway. That way you can always back up one of the clustered database/media servers without any downtime.
Then the problem shifts to database/media synchronisation - maybe something like Kafka, Redpanda, NATS JetStream, or Apache Pulsar would be a better architecture; it just depends on what you really need. Don't overengineer ....


Bash:
#!/usr/bin/env bash
set -Eeuo pipefail

# VMs to process
VMS=(101 102 103)

# Backup settings
STORAGE="pbs-backup"
MODE="snapshot"
COMPRESS="zstd"
NOTES_TEMPLATE="{{guestname}}"

# Optional vzdump extras
VZDUMP_ARGS=(
  --storage "$STORAGE"
  --mode "$MODE"
  --compress "$COMPRESS"
  --notes-template "$NOTES_TEMPLATE"
)

# Track which VMs were successfully frozen, so we only thaw those
FROZEN_VMS=()

log() {
  echo "[$(date '+%F %T')] $*"
}

has_guest_agent() {
  local vmid="$1"
  # The config line may read "agent: 1" or "agent: enabled=1" (possibly
  # with further options appended), so match both forms
  qm config "$vmid" | grep -Eq '^agent: (enabled=)?1'
}

freeze_vm() {
  local vmid="$1"

  log "Freezing VM $vmid"
  qm guest cmd "$vmid" fsfreeze-freeze >/dev/null

  local status
  status="$(qm guest cmd "$vmid" fsfreeze-status 2>/dev/null || true)"

  if [[ "$status" == *"frozen"* ]]; then
    log "VM $vmid frozen"
    FROZEN_VMS+=("$vmid")
  else
    log "WARNING: VM $vmid freeze status unclear: $status"
    FROZEN_VMS+=("$vmid")   # record it anyway so cleanup still thaws it
  fi
}

thaw_vm() {
  local vmid="$1"

  log "Thawing VM $vmid"
  qm guest cmd "$vmid" fsfreeze-thaw >/dev/null || \
    log "ERROR: Failed to thaw VM $vmid. Please check manually."
}

cleanup() {
  local exit_code=$?

  if ((${#FROZEN_VMS[@]} > 0)); then
    log "Running cleanup. Unfreezing VMs: ${FROZEN_VMS[*]}"
    for vmid in "${FROZEN_VMS[@]}"; do
      thaw_vm "$vmid"
    done
  fi

  exit "$exit_code"
}

trap cleanup EXIT

# Pre-checks
for vmid in "${VMS[@]}"; do
  if ! qm status "$vmid" >/dev/null 2>&1; then
    log "ERROR: VM $vmid does not exist"
    exit 1
  fi

  if ! has_guest_agent "$vmid"; then
    log "ERROR: VM $vmid does not have guest agent enabled in Proxmox config"
    exit 1
  fi
done

# Freeze all VMs
for vmid in "${VMS[@]}"; do
  freeze_vm "$vmid"
done

# Backup all VMs in one vzdump run.
# Note: in snapshot mode vzdump may attempt its own guest-agent freeze;
# verify on your setup how it behaves against already-frozen filesystems.
log "Starting backup for VMs: ${VMS[*]}"
vzdump "${VMS[@]}" "${VZDUMP_ARGS[@]}"

log "Backup completed successfully"
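Alternatively, the coordination could live in a vzdump hookscript (attached to a job via `vzdump ... --script /path/to/script`). A minimal dispatch skeleton, where `group_freeze`/`group_thaw` are hypothetical helpers you would implement with `qm guest cmd <vmid> fsfreeze-freeze` / `fsfreeze-thaw`:

```shell
#!/usr/bin/env bash
# Sketch of a vzdump hookscript. vzdump calls it with the phase name
# as $1; per-VM phases additionally get mode ($2) and vmid ($3).
# group_freeze/group_thaw are stubs for illustration only.
set -euo pipefail

group_freeze() { echo "freezing group"; }   # stub: freeze all group members
group_thaw()   { echo "thawing group"; }    # stub: thaw all group members

handle_phase() {
  case "$1" in
    job-start)         group_freeze ;;      # freeze before any VM is backed up
    job-end|job-abort) group_thaw ;;        # always thaw, even on failure
    *)                 : ;;                 # other phases: nothing to do
  esac
}

handle_phase "${1:-}"
```

Note that freezing at `job-start` keeps the guests frozen for the whole job; in practice you would want to thaw as soon as the snapshots exist, so this only shows the phase dispatch, not a finished solution.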
 