[Pathfinder4DIY] High-Availability solution for PBS (active/passive)

floh8

Occasion:
It's nice to have a backup solution with remote offsite sync, but when your onsite backup server goes offline, you have no backup for that day. A sysadmin could sleep better if there were a way to run a highly available onsite backup solution for PBS. My zfs-over-iscsi HA storage solution inspired me to build such an HA PBS.
If you are interested in getting such a solution directly from Proxmox, vote for this bugzilla enhancement!

IMPORTANT INFORMATION:
This solution is based on my zfs-over-iscsi HA storage solution, so the information there is also relevant here. I have tested it only in a test environment, not in production.

Use case for such a solution:
This solution is especially for companies that want to run a highly available Proxmox Backup Server.

Available functions:
It's a normal PBS installation, so you get what you installed: a fully functional PBS. Only tape backup is a bit more difficult, because you need a tape drive with two SAS ports to attach both nodes to the same drive. I know of only two vendors on the market that offer this, so for tape-backup users a single PBS node is the better choice.

Key configuration of this solution:
Three PBS folders must be made available to the active node over shared storage.

HW requirements:
  • 2 nodes, each with 2 high-bandwidth network ports in a bond and 1 management port
  • min. 1 dual-controller JBOD shelf
Operating system:
Originally I wanted to install PBS 8 on top of a normal Debian installation, but after installing the PBS packages the node hung while rebooting. So I changed the order and installed the additional packages needed for clustering on top of a standard PBS 8 ISO installation.

My test environment base setup:
The same as in the HA zfs-over-iscsi project. So have a look there.

Additional packages used:
pacemaker, corosync, pcs, network-manager, sbd, watchdog
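These can be installed on both nodes with apt, assuming the standard Debian repositories (a one-line sketch, package names as listed above):
# apt install pacemaker corosync pcs network-manager sbd watchdog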


Needed resource agents:
  • zfs
  • [Filesystem] -> if you want to go with BTRFS
  • IPaddr2
  • systemd

Fencing solution:
The same as in the HA zfs-over-iscsi project. So have a look there.

Tuning:
The same as in the HA zfs-over-iscsi project. So have a look there.

Configuration:
The configuration steps for the cluster build and the pacemaker configuration for ZFS and the cluster IP are all the same as in the HA zfs-over-iscsi project, so have a look there.
Only the resource agent for the iSCSI service and the resource group are not necessary. The additional steps for the PBS HA configuration follow here.

  • Copy the folders "/etc/proxmox-backup", "/var/lib/proxmox-backup" and "/var/log/proxmox-backup" from one node to the ZFS pool on the shared JBOD and delete the source folders on both nodes (see the copy sketch after the symlink commands below).
  • Then create symlinks for the source folders on both nodes pointing to your shared ZFS pool. Pay attention to the permissions! They must be exactly the same.
# ln -s /zpool1/pbs-etc/ /etc/proxmox-backup
# ln -s /zpool1/pbs-log/ /var/log/proxmox-backup
# ln -s /zpool1/pbs-lib/ /var/lib/proxmox-backup
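A minimal sketch of the copy step, assuming the shared pool is imported as zpool1 on the currently active node and that the datasets pbs-etc, pbs-lib and pbs-log already exist there (dataset names taken from the symlink targets above):
Code:
# on the active node: copy the PBS state, preserving ownership and permissions
cp -a /etc/proxmox-backup/.     /zpool1/pbs-etc/
cp -a /var/lib/proxmox-backup/. /zpool1/pbs-lib/
cp -a /var/log/proxmox-backup/. /zpool1/pbs-log/

# on both nodes: remove the original folders before creating the symlinks
rm -rf /etc/proxmox-backup /var/lib/proxmox-backup /var/log/proxmox-backup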


------> all of the following steps have to be done on both nodes
  • Disable the following PBS services (a sketch of the commands follows after this list):
    • proxmox-backup.service
    • proxmox-backup-proxy.service
    • proxmox-backup-banner.service
    • pbs-network-config-commit.service (network configuration is handled only via network-manager)
    • proxmox-backup-daily-update.timer
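A sketch of the disable step, using the unit names listed above:
Code:
systemctl disable proxmox-backup.service
systemctl disable proxmox-backup-proxy.service
systemctl disable proxmox-backup-banner.service
systemctl disable pbs-network-config-commit.service
systemctl disable proxmox-backup-daily-update.timer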

  • To make sure that neither of the first two services actually starts, you have to write your own systemd service.
  1. Create a script file with # nano /root/stop-pbs.sh and the content:
Code:
#!/bin/bash

    systemctl stop proxmox-backup
    systemctl stop proxmox-backup-proxy
  • Make the script executable with # chmod 710 /root/stop-pbs.sh
  • Create a new systemd service file with # nano /lib/systemd/system/proxmox-stop-PBS.service
  • Edit this file with:
Code:
[Unit]
    Description=For Stopping the PBS services
    Wants=proxmox-backup.service proxmox-backup-proxy.service
    Before=corosync.service pacemaker.service

    [Service]
    ExecStart=/root/stop-pbs.sh

    [Install]
    WantedBy=multi-user.target

  • Enable this new service (see the sketch after this list)
  • Configure watchdog to use the softdog module (see the sources section)
  • Comment out the line "After=multi-user.target" in the file /lib/systemd/system/watchdog.service
  • Add "watchdog.service" to the line "After=systemd-modules-load.service iscsi.service" in the file /lib/systemd/system/sbd.service, so that it reads "After=systemd-modules-load.service iscsi.service watchdog.service"
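A minimal sketch of these steps, assuming Debian's watchdog package reads the module name from /etc/default/watchdog and that the unit files live under /lib/systemd/system as shown above; the sed edits are just one way to apply the changes described:
Code:
# enable the stop service created above
systemctl daemon-reload
systemctl enable proxmox-stop-PBS.service

# let the watchdog daemon load the softdog module
# (assumption: the package reads watchdog_module from /etc/default/watchdog)
sed -i 's/^watchdog_module=.*/watchdog_module="softdog"/' /etc/default/watchdog

# comment out the conflicting dependency in watchdog.service
sed -i 's/^After=multi-user.target/#&/' /lib/systemd/system/watchdog.service

# make sbd wait for the watchdog service
sed -i 's/^After=systemd-modules-load.service iscsi.service$/& watchdog.service/' /lib/systemd/system/sbd.service
systemctl daemon-reload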
--> the next steps are done on one node only
  • Add the following to the pacemaker configuration:
Code:
# pcs resource create res_proxmox-backup systemd:proxmox-backup
# pcs resource create res_proxmox-backup-proxy systemd:proxmox-backup-proxy
# pcs resource create res_proxmox-backup-banner systemd:proxmox-backup-banner
# pcs resource create res_proxmox-backup-daily-update_timer systemd:proxmox-backup-daily-update.timer
# pcs resource group add grp_pbs_cluster res_zpool1 res_cluster-ip res_cluster-ip_MGMT res_proxmox-backup res_proxmox-backup-proxy res_proxmox-backup-banner res_proxmox-backup-daily-update_timer

How the cluster status could look:

PBS-Cluster-status.png
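The status shown in the screenshot can be queried on either node with the standard cluster tools, e.g.:
# pcs status
# crm_mon -1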

Errors and solutions:

  1. In contrast to a standard Debian installation, the softdog module is blacklisted in the PBS ISO installation. I didn't find where the entries for that configuration are located, so it was time for a workaround. The solution was to use the watchdog package.
  2. If you use the watchdog package in combination with the sbd package, you have to define a boot dependency for sbd so that the watchdog service starts before the sbd service. See above.
  3. If you make the change from point 2, your system runs into a problem on the next boot: it shows a dependency-cycle error and pacemaker etc. are not started. The reason is an unfortunate default dependency in watchdog.service. You have to comment out that line. See above.
  4. Although the services "proxmox-backup" and "proxmox-backup-proxy" were disabled, they still started at boot. I think they were triggered by another unit, but I don't know which one. So the workaround was to create my own systemd service that stops both services after booting. See above.

Tested failure scenario:
Of course I tried to perform a failover while a backup job was running, and of course that could not go well, because the whole HTTP/TCP stream breaks and the connection is reset. But that's not a problem, because it was not the goal of this project. The next VM backup after the failover runs without problems.

Cluster Web GUI:
As already mentioned, for Debian you have to compile the web files yourself; of course, there was nothing about it on the internet.
My compile tests showed a very buggy implementation for Debian 12 (PBS 8) that is not usable for production. The test for Debian 13 (PBS 9) worked. So here are the steps to follow for PBS 9 (an internet connection and 2 GB RAM are required, of course):

  • Install the additional packages needed for compiling
# apt install nodejs make automake npm autoconf pkgconf git
  • The pcs-web-ui version must match the pcs version; an overview can be found here. The installed pcs version can be checked with # apt show pcs. At the time of this post the pcs version in Debian 13 is 0.12.0, but the matching pcs-web-ui version 0.1.22 showed a makefile error at the start, so I used version 0.1.20.
  • Clone the branch with tag 0.1.20 with
# git clone -b 0.1.20 https://github.com/ClusterLabs/pcs-web-ui.git
  • change into the new directory with
# cd pcs-web-ui
  • initialize build with
# make init NEXUS_REPO=false
  • Then a question comes up asking to update the npm version --> accept this with "ENTER" (if you deny, the build on the older npm version stops with errors)
  • The warnings and errors can be ignored, or you can interrupt the batch process, clean up with the given commands and repeat from the beginning
  • Start the build process with
# make build
  • Copy the files in the subfolder "build" to the portal folder of pcsd
# mkdir /usr/share/pcsd/public/ui
# cp -r build/. /usr/share/pcsd/public/ui/
  • The cluster portal of pcsd on port 2224 is now ready (log in with the "hacluster" user)
  • Now create the build files for cockpit (delete the files in the "build" folder first)
# BUILD_FOR_COCKPIT=true make build
  • Copy the files to the cockpit app folder
# mkdir /usr/share/cockpit/hacluster
# cp -r build/. /usr/share/cockpit/hacluster/
  • Now the cluster GUI is also available in cockpit
Info: For newer pcs-web-ui versions the build process has changed completely, so always have a look at the GitHub page of pcs-web-ui.

Used information sources:
The same as in the HA zfs-over-iscsi project, so have a look there. In addition, the following:

You can also send a direct message to floh8 if you have questions about this solution.
 
Isn't this quite a hassle to set up correctly compared to just having a second PBS and setting up a sync job? That also has the benefit of an extra copy in case the backups on the first PBS get lost.
 
A stage-2 backup like an offsite PBS or a tape backup is of course always necessary. This HA PBS solution increases the resiliency of the stage-1 backup. Read the "Occasion" section above!
 
Extension for a Multi-Backup-SW solution on a HA PBS

Many admins learn to like the PBS solution from Proxmox, although some practical features are still missing compared to Veeam, like single-file restore to the source filesystem location, granular DB restore or single-object restore for LDAP/AD. Clever admins like @Falk R. therefore combine PBS with a second or third backup solution like Veeam, used only for these special features. For the implementation of this other backup solution there are different ways to save rack space or hardware cost. If you use a standalone PBS installation, the easiest solution would be to install PVE on the same hardware the PBS uses and virtualize the other backup solution there. Some want to save the extra PVE subscription cost and use the productive PVE cluster for that, but most would like to have it on separate hardware from the virtualization environment being backed up. Now the question comes up: is this also possible with the above HA PBS solution?

I suggest two possible solutions here. My requirements for such a virtualization solution would be:
  • web UI
  • snapshot capability
  • ZFS block storage for the repository vdisk
  • simple backup possibility for the OS vdisk

I was skeptical whether this is possible with PVE because of potential incompatibilities with the existing corosync/pacemaker stack of the HA PBS solution. So my idea was to achieve it with libvirt+cockpit. Here are my experience and the steps:

Test environment:

- HA PBS 8 (from above)

1. Virtualization with libvirt+cockpit

  • Additional packages to install:
cockpit cockpit-machines qemu-utils qemu-kvm libvirt-clients libvirt-daemon bridge-utils libvirt-daemon-driver-storage-zfs
  • Disable and stop the libvirtd service
  • Copy the libvirt folders /etc/libvirt and /var/lib/libvirt to the shared storage and create symlinks (same procedure as in the post above; see the sketch at the end of this section)
  • It is important not to use the user-session libvirt entry in the cockpit GUI but the system libvirt entry from now on
  • Create your own systemd service for starting and stopping the VM with:
Code:
    [Unit]
    Description=Starts libvirt VMs
    Wants=libvirtd.service
 
    [Service]
    ExecStart=virsh start [vm-name]
    ExecStop=virsh destroy [vm-name]
    RemainAfterExit=true

    [Install]
    WantedBy=multi-user.target

  • Create systemd resource agents for libvirtd and for your own libvirt VM start service (see the sketch at the end of this section)
  • Add both resources to the existing resource group "grp_pbs_cluster"
  • Now libvirtd should start
  • The ZFS pool storage has to be created via CLI (not possible in the GUI): # virsh pool-define-as --name zfspool --source-name zpool1/libvirt --type zfs
  • Create the network bridge in the GUI
  • Create the VM (for simple backup, don't use the ZFS pool for the OS vdisk)
  • Cockpit uses only internal snapshots for qcow2 files, but that's no problem because the OS vdisk of the backup solution does not need extreme performance
  • For backing up the backup VM's OS you can use this great GitHub project
  • For firewall configuration you can also use cockpit when using firewalld; a config could look like this:
PVE-PBS-Cluster-firewalld.png
Short description: vmbr0 = MGMT interface, enp0s10 = internet interface, vmbr1 = backup storage interface, enp0s9 = cluster network interface
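A minimal sketch of the package installation, relocation and cluster-resource steps above, assuming the shared pool zpool1, the example dataset names libvirt-etc and libvirt-lib, and the unit name libvirt-vm.service for the VM start/stop service created above (the dataset and unit names are assumptions, adjust them to your setup):
Code:
# on both nodes: install the packages and stop libvirt before relocating its state
apt install cockpit cockpit-machines qemu-utils qemu-kvm libvirt-clients libvirt-daemon bridge-utils libvirt-daemon-driver-storage-zfs
systemctl disable --now libvirtd.service

# on the active node only: copy the folders to the shared pool
cp -a /etc/libvirt/.     /zpool1/libvirt-etc/
cp -a /var/lib/libvirt/. /zpool1/libvirt-lib/

# on both nodes: replace the local folders with symlinks
rm -rf /etc/libvirt /var/lib/libvirt
ln -s /zpool1/libvirt-etc /etc/libvirt
ln -s /zpool1/libvirt-lib /var/lib/libvirt

# on one node only: cluster resources for libvirtd and the VM start/stop unit
pcs resource create res_libvirtd systemd:libvirtd
pcs resource create res_libvirt-vm systemd:libvirt-vm
pcs resource group add grp_pbs_cluster res_libvirtd res_libvirt-vm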


2. Virtualization with PVE

I was convinced that this wouldn't be an easy way to go, but challenge accepted. Note that this solution is not an implementation of the PVE cluster stack, but rather a failover PVE node built with a self-configured corosync/pacemaker. While testing, I noticed that PVE does not play well with network configuration done by NetworkManager, so you have to use the PVE or PBS GUI for network configuration. Nevertheless, you can use the firewalld GUI in cockpit (cockpit still needs the network-manager package installed). Using the integrated firewall of PVE is tempting, but as in the HA PBS solution the PVE config files are stored on the shared storage, so the firewall daemon on the passive node would miss its config files after a failover.
If you install PVE on top of a working corosync configuration, PVE adopts this configuration and the system runs into an error. So it is best to follow the installation order: 1. PVE, 2. the rest (PBS, pacemaker etc.). I didn't want to reinstall my cluster, so I used a workaround.

  • Disable and stop the following services (a sketch of the commands follows after the step list):
    • pve-cluster
    • pve-daily-update.timer
    • proxmox-firewall
    • pve-ha-crm
    • pve-ha-lrm
    • pve-sdn-commit
    • pvedaemon
    • pveproxy
    • pvescheduler
    • pvestatd
    • pvefw-logger
  • Mask and stop the following services, because they are otherwise triggered to start:
    • pve-firewall-commit
    • pve-firewall
  • Copy the folders /var/lib/pve-cluster and /var/log/pve to the shared storage and create symlinks (same procedure as in the post above); note that the folder "/etc/pve" is not relevant, because it is the cluster filesystem of PVE, created automatically by the pve-cluster service
  • Create your own systemd service for starting and stopping the VMs with:
Code:
        [Unit]
        Description=Start or Shutdown/Stop PVE VMs
        Wants=pvedaemon.service

        [Service]
        ExecStart=pvenode startall --force on
        ExecStop=pvenode stopall --timeout 30
        RemainAfterExit=true

        [Install]
        WantedBy=multi-user.target

  • Create systemd resource agents for your own VM start/stop service and for all of the PVE services above, except the 4 firewall-related ones
  • For the resource groups I changed to a new structure that groups the agents by service context for a better overview; the pbs_banner resource is missing now, because if PBS and PVE run on the same node the proxmox-backup-banner service terminates (so do not create this resource for the PBS services)
    • pcs resource group add grp_basis res_zpool1 res_cluster-ip res_cluster-ip_MGMT
    • pcs resource group add grp_pbs-services res_proxmox-backup res_proxmox-backup-proxy res_proxmox-backup-daily-update_timer
    • pcs resource group add grp_pve-services res_pve-cluster res_pvedaemon res_pvestatd res_pveproxy res_pve-ha-crm res_pve-ha-lrm res_pve-sdn-commit res_pvescheduler res_pve-daily-update-timer res_PVE-VMs-start_stop
  • create resource constraint for these resource groups
    • pcs constraint colocation set grp_basis grp_pbs-services grp_pve-services sequential=true role=Stopped
  • How the cluster status could look:
PVE-PBS-Cluster-status.png

  • I did not do failover tests, because it is the same solution as for PBS; maybe there are still small errors
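A minimal sketch of the disable/mask and relocation steps above, assuming the shared pool zpool1 and the example dataset names pve-cluster and pve-log (the dataset names are assumptions; the service names are the ones listed above):
Code:
# on both nodes: disable the PVE services listed above
systemctl disable --now pve-cluster pve-daily-update.timer proxmox-firewall \
    pve-ha-crm pve-ha-lrm pve-sdn-commit pvedaemon pveproxy pvescheduler \
    pvestatd pvefw-logger

# on both nodes: mask the units that are otherwise triggered to start
systemctl mask --now pve-firewall-commit pve-firewall

# on the active node only: copy the PVE state to the shared pool
cp -a /var/lib/pve-cluster/. /zpool1/pve-cluster/
cp -a /var/log/pve/.         /zpool1/pve-log/

# on both nodes: replace the local folders with symlinks
rm -rf /var/lib/pve-cluster /var/log/pve
ln -s /zpool1/pve-cluster /var/lib/pve-cluster
ln -s /zpool1/pve-log     /var/log/pve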

3. Comparison of the two solutions

 criterion                  | libvirt+cockpit         | PVE
----------------------------+-------------------------+--------------------------
 effort                     | low                     | higher
 flexible GUI configuration | no                      | yes
 RAM usage                  | ~ 250 MB (only libvirt) | ~ 1 GB
 network configuration      | cockpit                 | PVE
 firewall configuration     | cockpit                 | cockpit
 backup solution            | external                | integrated
 professional support       | no                      | no
 price                      | free                    | min. 2 PVE subscriptions
 