PBS breaking customer SQL backups. Backups without FS-Freeze?

ozdjh

Hi

After a couple of weeks of testing we're pretty impressed by PBS. We've started running it in parallel to our usual backup process on some production customer VMs. We've just started seeing a problem triggered by PBS that I believe is the interaction of the QEMU GA under Windows and MS SQL Server.

It appears that when an fs-freeze is requested, VSS under Windows sends a request to all registered VSS writers, one of which is the SQL Writer if SQL Server is installed. That in turn appears to initiate a full backup of the SQL databases to some internal GUID (no idea where the data is actually sent, as it's just doing an fs-freeze). The problem is that it is recorded as a full backup, which breaks the backup chain: if the customer is doing their own SQL backups that include differential backups, those backups become useless.

So if my understanding of the problem is correct, we have four options:
  1. Get any client running SQL Server to change some Windows settings so that VSS uses a copy-only backup
    1. (looks like the same problem exists in Azure - https://bit.ly/3ytDKGr )
  2. Disable the QEMU GA on Windows servers
  3. Not use PBS
  4. Find a way to do a backup through PBS that does not use fs-freeze (so it's like a SAN-level snapshot)
Surely others here host VMs running Windows and SQL Server for clients. Is this a known problem, and is there a solution? Can we tell vzdump not to freeze / thaw the filesystems? Can we get qemu-ga to interact in a better way with VSS and SQL Server? I'd like to move over to PBS, but this problem is a showstopper.


David
 
Does the customer use Veeam?
We have several MSSQL guests, and all of them are backed up by PBS and also "in-guest" by Veeam, with a transaction log backup every 60 minutes.
We have never had a problem here. The QEMU guest agent is enabled and fs-freeze/thaw does not produce any issues at all...
 
Hi

No, there's no Veeam involved. The customer is just running normal "Full / Differential / Translog" backups locally. It's happened twice over the last 4 days, and both times the SQL backup logs clearly show that a backup was triggered exactly when PBS started a backup of the VM.

The Translog backups still get taken, although I doubt they're complete anymore. The Differential fails because the last full backup isn't the one it's diffing against. Reading up on a similar issue with Azure, you may find that your Translogs don't work past the last time PBS ran. Someone posted that they had that exact problem on Azure after they snapshotted the VM. Have you tried restoring those Translogs?
 
Restore is no issue here. It works as expected. Maybe it's just a matter of timing and order?
After reading your link, I guess the best way is to modify the client registry to use copy-only mode.
This can go on your "recommendations" for SQL guests, so you are on the safe side...
 
The guest agent sends a VSS_BT_FULL when the freeze command is called. I can see why it does that:
VSS_BT_FULL
"Full backup: all files, regardless of whether they have been marked as backed up or not, are saved. This is the default backup type and schema, and all writers support it."

There is an issue to make this behavior configurable in the guest agent on windows.
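
To make the chain problem concrete, here is a toy Python model, not anything from the guest agent or SQL Server. It assumes that a snapshot requested as VSS_BT_FULL is recorded by the SQL writer as a regular full backup and so becomes the new differential base, while VSS_BT_COPY leaves the base alone. The enum values match the Windows VSS_BACKUP_TYPE enumeration; the `Database` class, its methods, and the backup labels are made up for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class VssBackupType(IntEnum):
    """Subset of the Windows VSS_BACKUP_TYPE enumeration."""
    VSS_BT_FULL = 1
    VSS_BT_COPY = 5


@dataclass
class Database:
    """Toy model: tracks which full backup the differentials are based on."""
    differential_base: str = "none"
    history: list = field(default_factory=list)

    def full_backup(self, label: str, copy_only: bool = False) -> None:
        self.history.append(label)
        if not copy_only:
            # A regular full backup becomes the new differential base.
            self.differential_base = label

    def vss_snapshot(self, label: str, backup_type: VssBackupType) -> None:
        # Assumption: the SQL writer records the snapshot as a backup
        # of whatever type the VSS requester asked for.
        copy_only = backup_type is VssBackupType.VSS_BT_COPY
        self.full_backup(label, copy_only=copy_only)


def differential_restorable(db: Database, customer_full: str) -> bool:
    """The customer's differentials only apply on top of their own full backup."""
    return db.differential_base == customer_full


for backup_type in (VssBackupType.VSS_BT_FULL, VssBackupType.VSS_BT_COPY):
    db = Database()
    db.full_backup("customer-full")
    db.vss_snapshot("fs-freeze", backup_type)
    print(backup_type.name, differential_restorable(db, "customer-full"))
```

In this model the VSS_BT_FULL run reports that the customer's differentials are no longer restorable against their own full backup, and the VSS_BT_COPY run reports that they still are.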
 
OK, thanks for the info. It looks like we can only really move to PBS once the Windows agent gets the new feature, our clients upgrade the agent on their Windows VMs, and the clients set a custom registry key. That's not going to happen quickly at all. Shame.
 
Note that this is not at all PBS-specific: any freeze action will trigger it, e.g. a VMA backup with the agent enabled, cloning a running VM with the agent enabled, or a snapshot without RAM state with the agent enabled. The best workaround for the time being would likely be to not use the guest agent for such VMs, since its main use case (freezing the disks for consistency purposes) is broken (or rather, has unwanted side effects).
 
Does the agent play any part in guest memory ballooning?
The ballooning is handled by the ballooning agent. The guest agent should not have anything to do with it.
 
Disabling the guest agent isn't ideal, as:
  1. We'll need to coordinate a VM stop and start with all customers to disable it
  2. We use the agent functionality for other things
We agree the fs-freeze has "unwanted side effects" on Windows, so can you provide us with an option that does a backup without calling fs-freeze? That is, a new backup mode: "Snapshot | Suspend | Stop | Basic". Doing a backup without calling fs-freeze would be the same as running a backup against a VM that wasn't running the guest agent.

Looking at the Perl code for vzdump, there appears to be a very simple solution to this problem. I appreciate that the problem isn't caused by PBS, but we can't use PBS because of it.
 
We've made a simple change to the QemuServer code in VZDump and have resolved the problem for us. It's not elegant but it'll do for us to continue testing PBS without breaking customer SQL backups.

Adding this properly would be super simple, either as a "no-freeze" backup mode or as a "Don't use guest agent freeze during snapshots" option on the VM. Hopefully you guys will add that to the feature request queue so people have a decent way to work around this issue, whether they use PBS or VZDump directly.
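
For anyone wondering what such an option amounts to, the decision can be sketched in a few lines of Python. This is a simplified model and not the actual VZDump code (which is Perl); `VmState`, `skip_fs_freeze`, and the "start backup job" step name are hypothetical names used only for illustration. Only `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw` are real guest agent commands.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    agent_enabled: bool           # guest agent enabled in the VM config
    agent_responding: bool        # agent inside the guest answers requests
    skip_fs_freeze: bool = False  # HYPOTHETICAL opt-out, for illustration only


def should_freeze(vm: VmState) -> bool:
    """Freeze only if the agent is usable and the opt-out is not set."""
    return vm.agent_enabled and vm.agent_responding and not vm.skip_fs_freeze


def backup_steps(vm: VmState) -> list:
    """Return the ordered steps a backup run would take for this VM."""
    freeze = should_freeze(vm)
    steps = []
    if freeze:
        steps.append("guest-fsfreeze-freeze")
    steps.append("start backup job")
    if freeze:
        steps.append("guest-fsfreeze-thaw")
    return steps


print(backup_steps(VmState(agent_enabled=True, agent_responding=True)))
print(backup_steps(VmState(agent_enabled=True, agent_responding=True,
                           skip_fs_freeze=True)))
```

With the opt-out set, the run looks exactly like a backup of a VM that has no guest agent: the agent stays available for everything else, but the freeze and thaw calls are skipped.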
 
Could you please open an enhancement request over at https://bugzilla.proxmox.com/? Then we can keep track of it and discuss the benefits vs the caveats. The main one, AFAICT, would be that the backup will not be consistent, as anything that the guest had kept only in RAM will not be part of the backup because it hasn't been flushed down to disk prior to the backup starting.
 
Hi. Sure, I'll open a ticket for this. As far as inconsistency is concerned, yes, it's not ideal, but in this case it's better than the issues caused by the fs-freeze. It's basically the same as backing up a VM that isn't running the agent, or grabbing a SAN snapshot, or a ceph export, or restarting a physical server after a power outage. If we can choose to enable that behaviour then it's just another option available to us to work around other problems.
 
As I read in the release notes for QEMU 8.0, the problem should be fixed in an upcoming release.
https://wiki.qemu.org/ChangeLog/8.0#Guest_agent

https://gitlab.com/qemu-project/qemu/-/commit/7dfce9bd0fb226debf03a9bc73eaa0b85e836bab
 
On bare-metal Windows, the MSSQL backup chain only breaks if a VSS shadow copy is created on a drive where database files are stored.
If I store the .mdf files on disk D and back up only C, there should be no problem on bare metal.
But in Proxmox, even if I don't back up D (the drive where the .mdf files are), the backup chain still breaks.
 
As I commented previously, the Proxmox backup calls guest-fsfreeze-freeze, which will "Sync and freeze all freezable, local guest filesystems".
This way, even the disks that are set to backup=0 are frozen. In Windows, this causes the VSS SQL Server writer to issue a "log" backup statement, breaking the backup chain.

This could be fixed by issuing guest-fsfreeze-freeze-list instead, which will "Sync and freeze specified guest filesystems."

Documentation:
https://www.qemu.org/docs/master/interop/qemu-ga-ref.html#qapidoc-86

I submitted a bug report: https://bugzilla.proxmox.com/show_bug.cgi?id=4887
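
As a rough illustration of the difference between the two calls, the following Python sketch builds the JSON requests that would be sent to the guest agent. The command names and the `mountpoints` argument come from the QEMU guest agent reference linked above; the helper functions, the `volumes` mapping, and the way mountpoints are written for a Windows guest are assumptions made for this example, and nothing here talks to a real agent.

```python
import json


def freeze_all_command() -> dict:
    """Request that freezes every freezable local filesystem in the guest."""
    return {"execute": "guest-fsfreeze-freeze"}


def freeze_list_command(mountpoints) -> dict:
    """Request that freezes only the listed guest mountpoints."""
    return {
        "execute": "guest-fsfreeze-freeze-list",
        "arguments": {"mountpoints": list(mountpoints)},
    }


def mountpoints_to_freeze(guest_volumes: dict) -> list:
    """Pick the mountpoints whose backing disk is part of the backup.

    guest_volumes maps a guest mountpoint to True/False (is the disk
    included in the backup?). How the host would learn this mapping is
    not covered here; the mapping is hypothetical input.
    """
    return sorted(mp for mp, included in guest_volumes.items() if included)


# Hypothetical layout: C: is backed up, D: holds the database files (backup=0).
volumes = {"C:\\": True, "D:\\": False}

print(json.dumps(freeze_all_command()))
print(json.dumps(freeze_list_command(mountpoints_to_freeze(volumes))))
```

Whether freezing only the backed-up volumes actually keeps the SQL writer out of the snapshot would still need to be verified on a real Windows guest.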
 
