PBS corrupts MS Exchange database

nejcsuhadolc · Feb 2, 2024

I believe a bug in the PBS / Proxmox system can corrupt the Exchange database.

Here are the steps to reproduce it:
Environment:

pve-manager/8.0.4/d258a813cfa6b390
VM disks are located on a raidz1 ZFS pool; the disks are Samsung Datacenter 860 DCT series. The pool stays healthy all the time.
Virtio driver 0.1.240.
The server is a dual-processor Supermicro; I can dig up the exact info if somebody finds it relevant.
PBS is connected over IPSec, a slower connection, about 100 Mbps. The reason for this is that we ran out of space at the location. I do not think the issue itself is related to this slower link since this has happened before on a local PBS server as well.

To reproduce the bug, you have to start the backup, wait until it starts sending the data, and then stop the backup.

Here is the backup log:
INFO: starting new backup job: vzdump 106 107 108 109 110 111 112 --quiet 1 --mode snapshot --mailto xxxxx --mailnotification always --notes-template '{{guestname}}' --storage sgn02
INFO: Starting Backup of VM 106 (qemu)
INFO: Backup started at 2024-02-01 02:30:04
INFO: status = running
INFO: VM Name: ex19-02
INFO: include disk 'virtio0' 'dctpool:vm-106-disk-2' 900G
INFO: include disk 'virtio1' 'local-zfs:vm-106-disk-0' 500G
INFO: include disk 'virtio2' 'local-zfs:vm-106-disk-1' 500G
INFO: include disk 'virtio3' 'local-zfs:vm-106-disk-2' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/106/2024-02-01T01:30:04Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 106 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '70085ea0-5066-47ab-88ce-bc25b6c56429'
INFO: resuming VM again
INFO: virtio0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio1: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio2: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio3: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 0% (916.0 MiB of 2.3 TiB) in 3s, read: 305.3 MiB/s, write: 268.0 MiB/s
INFO: 1% (24.0 GiB of 2.3 TiB) in 2h 38m 12s, read: 2.5 MiB/s, write: 2.5 MiB/s
INFO: 2% (48.0 GiB of 2.3 TiB) in 5h 40m 25s, read: 2.2 MiB/s, write: 2.2 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 106 failed - interrupted by signal
INFO: Failed at 2024-02-01 10:12:54
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal

The VM continues to run, but Windows also detects some disk corruption. This server has 3 Exchange databases; 2 of them got corrupted instantly.

We are still struggling to bring them back.

Any idea how to prevent this in the future? Otherwise, we'll have to switch to another virtualization platform.

Jenrej

sb-jw · Feb 2, 2024

I would recommend switching the disks to SCSI with discard.

More importantly, do you have iothread enabled? If so, deactivate it, stop the VM and start it again (a restart or reboot is not enough), and then trigger the backup. There is currently a bug with this flag, and it could be your problem.

But it can also be due to the long duration of the backups. While the backup runs, the old contents of any block the guest overwrites must first be sent to the PBS so that the backup stays consistent. It also seems to me that your line is clearly too slow for this.
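To see which disks have the iothread flag set, the VM config can be scanned for it. The following is a minimal sketch, assuming the config text comes from `qm config 106` on the PVE host and that the flag is written as `iothread=1`; the function name and the sample excerpt are hypothetical, since the actual config of VM 106 is not shown in this thread.

```python
import re

# Disk keys that can carry the iothread option in a Proxmox VM config.
DISK_KEY = re.compile(r"^(virtio|scsi)\d+$")


def disks_with_iothread(config_text):
    """Return the disk keys whose option list contains iothread=1."""
    flagged = []
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or not DISK_KEY.match(key):
            continue  # not a virtioN / scsiN disk line
        options = [opt.strip() for opt in value.split(",")]
        if "iothread=1" in options:
            flagged.append(key)
    return flagged


# Hypothetical excerpt for illustration only, not the real config of VM 106.
SAMPLE = """\
virtio0: dctpool:vm-106-disk-2,iothread=1,size=900G
virtio1: local-zfs:vm-106-disk-0,size=500G
scsihw: virtio-scsi-single
"""

print(disks_with_iothread(SAMPLE))
```

Any disk listed here would need the flag removed, followed by a full stop and start of the VM as described above.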

tom · Feb 2, 2024

nejcsuhadolc said:
ERROR: VM 106 qmp command 'guest-fsfreeze-thaw' failed - got timeout

Check your guest agent.

nejcsuhadolc · Feb 2, 2024

tom said:
Check your guest agent.

Hi, can you please be more specific?

I'm also wondering why the changes to the disk should be written to the backup first. Would it be possible to take a snapshot first, back up the snapshot, and then delete it? That would probably be less prone to such errors.

tom · Feb 2, 2024

Your QEMU guest agent is not working inside your VM, so I suggest you fix it and try again.
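One way to check whether the agent answers at all is to ping it from the PVE host. This is a rough sketch, assuming it runs as root on the host and that `qm guest cmd <vmid> ping` is available there; the helper names are made up for illustration. A successful ping only shows that the agent is reachable; it does not prove that fs-freeze and fs-thaw complete in time.

```python
import subprocess


def agent_ping_command(vmid):
    """Build the host-side command that pings the guest agent of a VM."""
    return ["qm", "guest", "cmd", str(vmid), "ping"]


def agent_responds(vmid, runner=subprocess.run, timeout=30):
    """Return True if the ping command exits with status 0."""
    try:
        result = runner(agent_ping_command(vmid), capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # qm is not installed here, or the agent did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("guest agent reachable:", agent_responds(106))
```

If this reports that the agent is not reachable, the QEMU guest agent service inside the Windows guest is the first thing to look at.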

RolandK · Feb 2, 2024

>PBS is connected over IPSec, a slower connection, about 100 Mbps

>INFO: 0% (916.0 MiB of 2.3 TiB) in 3s, read: 305.3 MiB/s, write: 268.0 MiB/s
>INFO: 1% (24.0 GiB of 2.3 TiB) in 2h 38m 12s, read: 2.5 MiB/s, write: 2.5 MiB/s
>INFO: 2% (48.0 GiB of 2.3 TiB) in 5h 40m 25s, read: 2.2 MiB/s, write: 2.2 MiB/s

A TB-sized VM backup over 100 Mbps IPsec?

You do know that the PBS link/backup speed limits the VM write speed, as we do not have backup fleecing yet?
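To put numbers on that, here is a back-of-the-envelope estimate using the figures from the log. It assumes the full 2.3 TiB has to cross the link (the dirty bitmaps were cleared, so this is not an incremental run) and ignores IPsec overhead and any chunks the PBS can skip, so treat it as a rough upper bound rather than a prediction.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def hours_to_transfer(size_bytes, rate_bytes_per_s):
    """Time in hours to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / 3600


total = 2.3 * TIB            # total size reported in the backup log
link_rate = 100e6 / 8        # 100 Mbps line rate in bytes per second
observed_rate = 2.2 * MIB    # last write rate reported in the backup log

print(f"at line rate:     {hours_to_transfer(total, link_rate):.0f} h")
print(f"at observed rate: {hours_to_transfer(total, observed_rate) / 24:.1f} days")
```

That works out to roughly 56 hours at the full line rate and about 12.7 days at the observed 2.2 MiB/s, and guest writes are throttled to the backup speed for that whole time.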
