PBS breaking customer SQL backups. Backups without FS-Freeze?

Hi

After a couple of weeks of testing we're pretty impressed by PBS. We've started running it in parallel with our usual backup process on some production customer VMs. We've just started seeing a problem triggered by PBS that I believe comes from the interaction between the QEMU guest agent (GA) under Windows and MS SQL Server.

It appears that when an fs-freeze is requested, VSS under Windows sends a request to all VSS writers, one of which is the SQL Writer if it's installed. That then appears to initiate a full backup of the SQL databases to some internal GUID (no idea where the data is actually sent, as it's just doing an fs-freeze). The problem is that it's a full backup, which breaks the backup chain, so if the customer is doing their own SQL backups that include Differential backups, this makes their backups useless.

So if my understanding of the problem is correct, we have 4 options:
  1. Get any client running SQL Server to change some Windows settings so that VSS uses a Copy Only backup
    1. (looks like the same problem exists in Azure - https://bit.ly/3ytDKGr )
  2. Disable QEMU GA on Windows servers
  3. Not use PBS
  4. Find a way to do a backup through PBS that does not use fs-freeze (so it's like a SAN level snapshot)
Surely others here host VMs running Windows and SQL for clients. Is this a known problem and is there a solution? Can we tell vzdump not to freeze / thaw the filesystems? Can we get qemu-ga to interact in a better way with VSS and SQL? I'd like to move over to PBS but this problem is a show stopper.


David

itNGO

Well-Known Member
Does the customer use VEEAM?
We have several MSSQL guests, and all are backed up by PBS and "inGuest" by VEEAM with TLOG-Backup every 60 minutes.
Never had a problem here. Qemu-Agent is enabled and fs-freeze/thaw does not produce any issues at all...
 
David
Hi

No, there's no VEEAM involved. The customer is just running normal "Full / Differential / Translog" backups locally. It's happened twice over the last 4 days and both times the SQL backup logs clearly show that a backup was triggered exactly when PBS started a backup of the VM.

The Translog backups still get taken, although I doubt they're complete anymore. The Differential fails because the most recent full backup isn't the one it expects to diff against. Reading up on a similar issue with Azure, you may find that your Translogs don't work past the last time PBS ran. Someone posted that they had that exact problem on Azure after they snapshotted the VM. Have you tried restoring those Translogs?
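To illustrate why the Differential fails, here is a minimal Python sketch of the rule involved. It is a toy model, not SQL Server code: the `Backup` class, the backup names, and the `differential_base` helper are all hypothetical. The one assumption it encodes is that a differential is based on the most recent full backup that is not copy-only, which is how SQL Server documents copy-only backups.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backup:
    name: str
    kind: str  # "full", "diff" or "log"
    copy_only: bool = False


def differential_base(history: List[Backup]) -> Optional[Backup]:
    """Return the full backup the next differential would be based on.

    Assumption of this sketch: the base is the most recent full backup
    that is not copy-only.
    """
    for backup in reversed(history):
        if backup.kind == "full" and not backup.copy_only:
            return backup
    return None


# The customer's own chain: one full backup, then a log backup.
history = [Backup("customer_full", "full"), Backup("customer_log_1", "log")]
print(differential_base(history).name)  # customer_full

# A freeze-triggered full backup (not copy-only) becomes the new base,
# so the next differential no longer matches the customer's full backup.
history.append(Backup("vss_freeze_full", "full"))
print(differential_base(history).name)  # vss_freeze_full

# Had the same backup been copy-only, the base would be unchanged.
history[-1].copy_only = True
print(differential_base(history).name)  # customer_full
```

The model only covers the differential base; it says nothing about whether the Translog chain is affected.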
 

itNGO

Well-Known Member
Restore is no issue here. Works as expected. Maybe it's just a matter of timing and order?
After reading your link, I guess the best way is to modify the client registry to copy-only mode.
This can be part of your "recommendation" for SQL guests, so you are on the safe side....
 

aaron

Proxmox Staff Member
The guest agent sends a VSS_BT_FULL when the freeze command is called. I can see why it does that:

VSS_BT_FULL
Full backup: all files, regardless of whether they have been marked as backed up or not, are saved. This is the default backup type and schema, and all writers support it.

There is an issue to make this behavior configurable in the guest agent on Windows.
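For reference, VSS_BT_FULL is one member of the `VSS_BACKUP_TYPE` enumeration in the Windows SDK. The Python mirror below is only a sketch for orientation: the member names and values follow the SDK header as far as I know, while the `VssBackupType` class and the `is_copy_only` helper are hypothetical names introduced here. The "Copy Only" workaround discussed earlier in the thread would presumably correspond to `VSS_BT_COPY` rather than `VSS_BT_FULL`.

```python
from enum import IntEnum


class VssBackupType(IntEnum):
    """Python mirror of the Windows SDK VSS_BACKUP_TYPE enumeration."""
    VSS_BT_UNDEFINED = 0
    VSS_BT_FULL = 1
    VSS_BT_INCREMENTAL = 2
    VSS_BT_DIFFERENTIAL = 3
    VSS_BT_LOG = 4
    VSS_BT_COPY = 5
    VSS_BT_OTHER = 6


def is_copy_only(backup_type: VssBackupType) -> bool:
    """Hypothetical helper: True only for the copy backup type."""
    return backup_type is VssBackupType.VSS_BT_COPY


# What the guest agent sends on freeze today, according to this thread.
sent_on_freeze = VssBackupType.VSS_BT_FULL
print(sent_on_freeze.name, is_copy_only(sent_on_freeze))

# What a configurable agent could send instead (assumption of this sketch).
wanted = VssBackupType.VSS_BT_COPY
print(wanted.name, is_copy_only(wanted))
```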
 
David
Ok, thanks for the info. Looks like we can only really move to PBS once the Windows agent gets the new feature, we get our clients to upgrade the agent on the Windows VMs, and we get the clients to set a custom reg key. That's not going to happen quickly at all. Shame.
 

fabian

Proxmox Staff Member
Note that this is not at all PBS specific - any freeze action will trigger it, e.g. VMA backup with agent enabled, cloning a running VM with agent enabled, snapshots without RAM state and agent enabled. The best workaround would likely be to not use the guest agent for such VMs for the time being, since the main use case for it (freezing the disks for consistency purposes) is broken (or rather, has unwanted side-effects).
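Since every freeze-triggering action depends on the agent being enabled for the VM, it can help to check which VMs have it switched on. The sketch below parses the text of a VM config and reports whether the `agent` option is enabled. It assumes the usual `key: value` config layout with an `agent` property string such as `1` or `enabled=1,fstrim_cloned_disks=1`; the function names are hypothetical, and reading the actual config file is left out.

```python
def parse_agent_option(value: str) -> dict:
    """Parse an 'agent' property string like '1' or 'enabled=1,key=val'."""
    options = {}
    for index, part in enumerate(value.strip().split(",")):
        if "=" in part:
            key, val = part.split("=", 1)
        elif index == 0:
            # A bare first value is shorthand for enabled=<value>.
            key, val = "enabled", part
        else:
            continue
        options[key.strip()] = val.strip()
    return options


def agent_enabled(config_text: str) -> bool:
    """Return True if the current VM config enables the guest agent."""
    for line in config_text.splitlines():
        if line.startswith("["):
            # Snapshot sections follow the current config; stop there.
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "agent":
            return parse_agent_option(value).get("enabled", "0") == "1"
    return False


example = "agent: enabled=1,fstrim_cloned_disks=1\nmemory: 4096\n"
print(agent_enabled(example))
print(agent_enabled("memory: 4096\n"))
```

A VM for which this returns `True` is one where the backup, clone and snapshot cases listed above would issue a freeze.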
 

aaron

Proxmox Staff Member
"Does the agent play any part in guest memory ballooning?"
The ballooning is handled by the ballooning agent. The guest agent should not have anything to do with it.
 
David
Disabling the guest agent isn't ideal as:
  1. We'll need to coordinate a VM stop and start with all customers to disable it
  2. We use the agent functionality for other things
We agree the fs-freeze has "unwanted side-effects" on Windows, so can you provide us an option that does a backup without calling fs-freeze? So a new backup mode: "Snapshot | Suspend | Stop | Basic". Doing a backup without calling fs-freeze would be the same as running a backup against a VM that wasn't running the guest agent.

Looking at the Perl code for vzdump, it looks like there's a very simple solution to this problem. I appreciate that the problem isn't caused by PBS, but for us we can't use PBS because of it.
 
David
We've made a simple change to the QemuServer code in VZDump and have resolved the problem for us. It's not elegant but it'll do for us to continue testing PBS without breaking customer SQL backups.

Adding this properly would be super simple, either as a "no-freeze" backup mode or a "Don't use guest agent freeze during snapshots" option on the VM. Hopefully you guys will add that to the feature request queue so people have a decent way to work around this issue if they use PBS or VZDump directly.
 

aaron

Proxmox Staff Member
Could you please open an enhancement request over at https://bugzilla.proxmox.com/? Then we can keep track of it and discuss the benefits vs the caveats. The main one, AFAICT, would be that the backup will not be consistent, as anything that the guest had kept only in RAM will not be part of the backup because it hasn't been flushed down to disk prior to the backup starting.
 
David
Hi. Sure, I'll open a ticket for this. As far as inconsistency is concerned, yes, it's not ideal, but in this case it's better than the issues caused by the fs-freeze. It's basically the same as backing up a VM that isn't running the agent, or grabbing a SAN snapshot, or a Ceph export, or restarting a physical server after a power outage. If we can choose to enable that behaviour then it's just another option available to us to work around other problems.
 
