I believe a bug in Proxmox VE / PBS can corrupt Exchange databases when a running backup is stopped.
Here are the steps to reproduce it:
Environment:
pve-manager/8.0.4/d258a813cfa6b390
The VM disks are located on a raidz1 ZFS pool; the disks are Samsung Datacenter 860 DCT series. The pool has stayed healthy the whole time.
Virtio driver 0.1.240.
The server is a dual-processor Supermicro; I can dig up the exact model if anybody finds this relevant.
PBS is connected over IPsec on a slower link, about 100 Mbps, because we ran out of space at this location. I do not think the issue is related to the slower link, since it has also happened before with a local PBS server.
To reproduce the bug, start a backup, wait until it begins sending data, and then stop the backup.
Here is the backup log:
INFO: starting new backup job: vzdump 106 107 108 109 110 111 112 --quiet 1 --mode snapshot --mailto xxxxx --mailnotification always --notes-template '{{guestname}}' --storage sgn02
INFO: Starting Backup of VM 106 (qemu)
INFO: Backup started at 2024-02-01 02:30:04
INFO: status = running
INFO: VM Name: ex19-02
INFO: include disk 'virtio0' 'dctpool:vm-106-disk-2' 900G
INFO: include disk 'virtio1' 'local-zfs:vm-106-disk-0' 500G
INFO: include disk 'virtio2' 'local-zfs:vm-106-disk-1' 500G
INFO: include disk 'virtio3' 'local-zfs:vm-106-disk-2' 500G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/106/2024-02-01T01:30:04Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 106 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: started backup task '70085ea0-5066-47ab-88ce-bc25b6c56429'
INFO: resuming VM again
INFO: virtio0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio1: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio2: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: virtio3: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 0% (916.0 MiB of 2.3 TiB) in 3s, read: 305.3 MiB/s, write: 268.0 MiB/s
INFO: 1% (24.0 GiB of 2.3 TiB) in 2h 38m 12s, read: 2.5 MiB/s, write: 2.5 MiB/s
INFO: 2% (48.0 GiB of 2.3 TiB) in 5h 40m 25s, read: 2.2 MiB/s, write: 2.2 MiB/s
ERROR: interrupted by signal
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 106 failed - interrupted by signal
INFO: Failed at 2024-02-01 10:12:54
ERROR: Backup job failed - interrupted by signal
TASK ERROR: interrupted by signal
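For anyone who wants to check their own task logs for the same pattern, here is a minimal Python sketch that flags the two things that stand out in the log above: the 'guest-fsfreeze-thaw' timeout and the interrupted backup. It also pulls out the last reported read rate. The function name scan_vzdump_log and the regular expressions are my own, written against the exact wording of the lines in this log; other versions may phrase these messages differently, and a match does not by itself prove that either event caused the corruption.

```python
import re

# Patterns copied from the wording of the log above (an assumption:
# other vzdump versions may phrase these messages differently).
THAW_TIMEOUT = re.compile(r"ERROR: .*'guest-fsfreeze-thaw' failed - got timeout")
INTERRUPTED = re.compile(r"ERROR: Backup of VM (\d+) failed - interrupted by signal")
PROGRESS = re.compile(r"INFO:\s+(\d+)% \(.*\) in .*read: ([\d.]+) MiB/s")


def scan_vzdump_log(text):
    """Summarize warning signs found in a vzdump task log (hypothetical helper)."""
    summary = {
        "thaw_timeout": False,     # guest agent did not confirm the thaw in time
        "interrupted_vms": [],     # VM IDs whose backup was aborted by a signal
        "last_read_mib_s": None,   # read rate from the last progress line seen
    }
    for line in text.splitlines():
        if THAW_TIMEOUT.search(line):
            summary["thaw_timeout"] = True
        match = INTERRUPTED.search(line)
        if match:
            summary["interrupted_vms"].append(int(match.group(1)))
        match = PROGRESS.search(line)
        if match:
            summary["last_read_mib_s"] = float(match.group(2))
    return summary


# A few lines taken verbatim from the log above.
SAMPLE = """\
INFO: issuing guest-agent 'fs-thaw' command
ERROR: VM 106 qmp command 'guest-fsfreeze-thaw' failed - got timeout
INFO: 2% (48.0 GiB of 2.3 TiB) in 5h 40m 25s, read: 2.2 MiB/s, write: 2.2 MiB/s
ERROR: Backup of VM 106 failed - interrupted by signal
"""

if __name__ == "__main__":
    print(scan_vzdump_log(SAMPLE))
```

Paste a full task log into the function in place of SAMPLE to check other backup runs for the same combination.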
The VM continues to run, but Windows detects disk corruption. This server hosts 3 Exchange databases; 2 of them were corrupted immediately.
We are still struggling to bring them back.
Any idea how to prevent this in the future? Otherwise, we'll have to switch to another virtualization platform.
Jenrej