Backup to PBS of running OPNsense VM causes issues

rfox

I'm running the latest OPNsense version in a VM on a standalone host (N305-based) and back it up once a week to a PBS server on another machine. The backup runs at 3 am on Sunday mornings and takes about 30 minutes in snapshot mode (60G disk image). I get an e-mail report saying the backup was successful . . . BUT

Many times after that, although the VM is still running, several OPNsense services have stopped and internet access is down. The next morning, when I log into the OPNsense interface, I manually restart the dead services and everything works again, or a reboot is necessary.

Any suggestions? Should I not be using snapshot mode and use suspend or stop instead?

This does not happen every time, just sometimes. I thought it was a fluke, but the recurrence is annoying.

Thx in advance
 
Hi,
do you see any errors in the VM's system logs? Maybe the VM's I/O is starved because the backup target is not fast enough? In that case a fleecing image will help, see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_vm_backup_fleecing
 
Thanks for the tip! ;) What exactly am I looking for in the logs to identify this suspicion? Why would that affect the continuous operation of the VM when performing a "snapshot"? I thought the VM resumed very quickly and the backup happens independently?? Up until recently, this wasn't a problem, and it seems to be sporadic . .

As a workaround, I have switched the weekly backup of that particular VM to local storage on the same node, instead of going over the network to the PBS that currently lives on a QNAP NAS device . . . until I figure out what's happening . . .
 
What exactly am I looking for in the logs to identify this suspicion ?
Any I/O-related errors on the virtual disks of the VM around the time of the backup. How are your disks attached? Do you use a SATA or SCSI controller?
 
1751268573364.png

Just checked the logs: no I/O-related errors whatsoever, just "backup successful".
 
Which logs did you check? You should check the syslogs within the VM; I would not expect to see any errors on the host.
I thought the VM resumed very quickly and the backup happens independently ??
Yes, snapshot mode creates a consistent state of the VM for the backup, so the VM does not need to be powered down. But since the backup process is copy-before-write, a new guest write to a block that has not been backed up yet will block until the old data has been written to the backup target. A fleecing image reduces this delay by storing these old data chunks locally first, before they are written to the backup target.
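
To make the copy-before-write point a bit more concrete, here is a toy model in Python. It is only a sketch and not how QEMU or Proxmox implement it: the function name `guest_blocked_ms`, the write pattern, and the per-block timings are all made-up assumptions, and it ignores that the backup job itself keeps copying blocks in the background. It only shows why a slow backup target can stall guest writes and why a fast local fleecing image shortens that stall.

```python
def guest_blocked_ms(writes, target_ms, fleecing_ms, use_fleecing):
    """Toy model: total time guest writes wait on copy-before-write.

    writes       -- block numbers the guest writes while the backup runs
    target_ms    -- assumed time to copy one old block to the backup target
    fleecing_ms  -- assumed time to copy one old block to a local fleecing image
    use_fleecing -- whether old data goes to the fleecing image first
    """
    saved = set()  # blocks whose old contents are already preserved
    blocked = 0
    for block in writes:
        if block not in saved:
            # First write to this block: the old data must be copied
            # somewhere before the guest is allowed to overwrite it.
            blocked += fleecing_ms if use_fleecing else target_ms
            saved.add(block)
        # Later writes to the same block do not wait again.
    return blocked


writes = [1, 2, 2, 3]  # made-up write pattern (block 2 is written twice)
print(guest_blocked_ms(writes, 50, 1, use_fleecing=False))  # 150
print(guest_blocked_ms(writes, 50, 1, use_fleecing=True))   # 3
```

With these assumed numbers the guest waits 150 ms without fleecing and 3 ms with it. The real figures depend entirely on your storage and network, so treat this as an illustration of the mechanism only.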
 
I think I found something: every Sunday at 3 am (when the backup job starts) I get a series of processes being killed and failures to reclaim memory. But this is OPNsense running on BSD under the hood, so I'm not sure if this is relevant. All I can say is that this backup worked for many months prior. The only change I can think of that could possibly have an effect is updating the node to the Linux 6.14.5-1-bpo12-pve kernel from 6.11 ?!?

1751271918986.png