VMs unresponsive during backup

phil.lavin

New Member
Apr 25, 2020
Hi Folks,

Running Proxmox 6.1-7 and using the auto backup feature. Main disk is ZFS:

Code:
root@London-proxmox-04:~# zfs list
NAME                          USED  AVAIL     REFER  MOUNTPOINT
local-zfs                    28.2G  3.48T       96K  /local-zfs
local-zfs/vm-111-virtio0     2.57G  3.48T     2.51G  -
local-zfs/vm-112-state-phil   129M  3.48T      129M  -
local-zfs/vm-112-virtio0     3.31G  3.48T     3.30G  -
local-zfs/vm-113-virtio0     14.2G  3.48T     14.2G  -
local-zfs/vm-114-virtio0     7.99G  3.48T     7.71G  -

Backups are stored on a remote CIFS mount, added via the Proxmox GUI. Here's the backup command it runs:

Code:
vzdump --quiet 1 --storage backups --all 1 --node proxmox-04 --compress 0 --mailto noc@***.com --mailnotification always --mode snapshot

Whenever a VM is being backed up, it is totally unresponsive to all TCP connections until the backup completes. Here's a log of a recent backup:

Code:
2020-04-25 08:10:14 INFO: Starting Backup of VM 101 (qemu)
2020-04-25 08:10:14 INFO: status = running
2020-04-25 08:10:15 INFO: update VM 101: -lock backup
2020-04-25 08:10:15 INFO: VM Name: derp-02
2020-04-25 08:10:15 INFO: include disk 'virtio0' 'local-zfs:vm-101-disk-0' 32G
2020-04-25 08:10:15 INFO: backup mode: snapshot
2020-04-25 08:10:15 INFO: ionice priority: 7
2020-04-25 08:10:15 INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-101-2020_04_25-08_10_14.vma'
2020-04-25 08:10:15 INFO: started backup task '257efad6-4009-4817-8ee5-037708959b89'
2020-04-25 08:10:18 INFO: status: 3% (1302855680/34359738368), sparse 2% (979111936), duration 3, read/write 434/107 MB/s
2020-04-25 08:10:21 INFO: status: 7% (2728525824/34359738368), sparse 6% (2317668352), duration 6, read/write 475/29 MB/s
2020-04-25 08:10:24 INFO: status: 11% (4069916672/34359738368), sparse 10% (3511357440), duration 9, read/write 447/49 MB/s
2020-04-25 08:10:27 INFO: status: 15% (5371920384/34359738368), sparse 13% (4547731456), duration 12, read/write 434/88 MB/s
2020-04-25 08:10:30 INFO: status: 19% (6640828416/34359738368), sparse 16% (5653327872), duration 15, read/write 422/54 MB/s
2020-04-25 08:10:33 INFO: status: 22% (7853572096/34359738368), sparse 18% (6270709760), duration 18, read/write 404/198 MB/s
2020-04-25 08:10:36 INFO: status: 26% (9062055936/34359738368), sparse 20% (6965997568), duration 21, read/write 402/171 MB/s
2020-04-25 08:10:39 INFO: status: 30% (10352787456/34359738368), sparse 23% (7993864192), duration 24, read/write 430/87 MB/s
2020-04-25 08:10:42 INFO: status: 34% (11684675584/34359738368), sparse 26% (9193730048), duration 27, read/write 443/44 MB/s
2020-04-25 08:10:45 INFO: status: 37% (12953387008/34359738368), sparse 29% (10197540864), duration 30, read/write 422/88 MB/s
2020-04-25 08:10:48 INFO: status: 41% (14298644480/34359738368), sparse 33% (11410817024), duration 33, read/write 448/43 MB/s
2020-04-25 08:10:51 INFO: status: 45% (15627124736/34359738368), sparse 36% (12692594688), duration 36, read/write 442/15 MB/s
2020-04-25 08:10:54 INFO: status: 48% (16739336192/34359738368), sparse 37% (13036523520), duration 39, read/write 370/256 MB/s
2020-04-25 08:10:57 INFO: status: 51% (17785946112/34359738368), sparse 37% (13053493248), duration 42, read/write 348/343 MB/s
2020-04-25 08:11:00 INFO: status: 54% (18858377216/34359738368), sparse 38% (13069144064), duration 45, read/write 357/352 MB/s
2020-04-25 08:11:03 INFO: status: 58% (20188626944/34359738368), sparse 41% (14302089216), duration 48, read/write 443/32 MB/s
2020-04-25 08:11:06 INFO: status: 62% (21569601536/34359738368), sparse 45% (15682842624), duration 51, read/write 460/0 MB/s
2020-04-25 08:11:09 INFO: status: 66% (22908960768/34359738368), sparse 49% (16887791616), duration 54, read/write 446/44 MB/s
2020-04-25 08:11:12 INFO: status: 70% (24303370240/34359738368), sparse 53% (18281979904), duration 57, read/write 464/0 MB/s
2020-04-25 08:11:15 INFO: status: 74% (25669402624/34359738368), sparse 57% (19647750144), duration 60, read/write 455/0 MB/s
2020-04-25 08:11:18 INFO: status: 78% (27097956352/34359738368), sparse 61% (21076070400), duration 63, read/write 476/0 MB/s
2020-04-25 08:11:21 INFO: status: 84% (28925493248/34359738368), sparse 66% (22903386112), duration 66, read/write 609/0 MB/s
2020-04-25 08:11:24 INFO: status: 88% (30545281024/34359738368), sparse 67% (23342604288), duration 69, read/write 539/393 MB/s
2020-04-25 08:11:27 INFO: status: 94% (32339918848/34359738368), sparse 73% (25092702208), duration 72, read/write 598/14 MB/s
2020-04-25 08:11:30 INFO: status: 99% (34076033024/34359738368), sparse 76% (26330677248), duration 75, read/write 578/166 MB/s
2020-04-25 08:17:41 INFO: status: 100% (34359738368/34359738368), sparse 76% (26380124160), duration 446, read/write 0/0 MB/s
2020-04-25 08:17:41 INFO: transferred 34359 MB in 446 seconds (77 MB/s)
2020-04-25 08:17:42 INFO: archive file size: 7.44GB
2020-04-25 08:17:42 INFO: Finished Backup of VM 101 (00:07:28)

As I understand it, in this mode Proxmox is supposed to create a ZFS snapshot of the VM's disk and then use zfs send to transfer the backup to the CIFS share. However, I see no snapshots when I run zfs list -t snapshot while the backup is in progress.

If I create a snapshot manually (via the GUI) and zfs send it manually (via the CLI) then this works fine.
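For reference, this is roughly the CLI equivalent of what I tested by hand (snapshot name and target file are just examples; the dataset is one from the zfs list above):

Code:
# create a snapshot of the VM disk, stream it to the CIFS mount, then clean up
zfs snapshot local-zfs/vm-113-virtio0@manual-test
zfs send local-zfs/vm-113-virtio0@manual-test > /mnt/pve/backups/vm-113-manual-test.zfs
zfs destroy local-zfs/vm-113-virtio0@manual-test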

Can someone advise what I'm doing wrong/if there's a bug?
 
Are you mixing up backups with replication? ZFS snapshots are only sent to another ZFS storage, not via CIFS. The "snapshot" in the context of backups refers to the QEMU volume, which gets snapshotted internally; that snapshot is then saved to disk.
 

Thanks for getting back to me. I wasn't mixing them up conceptually, but I had half read/assumed that a "snapshot" backup of a VM on ZFS used a ZFS snapshot rather than a bespoke mechanism. Either way, I presume a "snapshot" backup isn't designed to effectively suspend the VM for the duration of the backup? Or is this normal behavior?
 
I see. I wanted to test whether a snapshot-mode backup of a guest actually stalls it for the duration of the backup, but I ran into an issue that makes this impossible for now: when trying to perform a guest backup, every available target location ends up pointing to the local storage drive, where PVE is installed and which is only 4 GB in size. The "real" backup folder is on a ZFS pool, but it is also listed with the same amount of free space as the local storage.

This is probably a bug and I will check whether there is already an issue open for it. While the backup was running into the problem of the local volume filling up, PVE itself seemed to become unresponsive, and so did the guest.
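If it helps to reproduce, something like this should show what PVE reports per storage versus what the filesystems actually have (just a sketch; /var/lib/vz is the default path of the "local" directory storage, adjust if yours differs):

Code:
# what PVE thinks each storage has available
pvesm status
# what the filesystems themselves report
df -h /var/lib/vz
zfs list -o name,avail,mountpoint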
 
The problem is really apparent when using uncompressed backups. See the log below:

Code:
INFO: starting new backup job: vzdump 113 --mode snapshot --remove 0 --compress 0 --storage backups --node proxmox-04
INFO: Starting Backup of VM 113 (qemu)
INFO: Backup started at 2020-04-25 14:42:31
INFO: status = running
INFO: update VM 113: -lock backup
INFO: VM Name: bt-999-01
INFO: include disk 'virtio0' 'local-zfs:vm-113-virtio0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-113-2020_04_25-14_42_31.vma'
INFO: started backup task '155e190d-bef8-4da5-81d5-658a56ca1cfd'
INFO: status: 3% (1272315904/34359738368), sparse 0% (23207936), duration 3, read/write 424/416 MB/s
INFO: status: 8% (2751004672/34359738368), sparse 0% (45912064), duration 6, read/write 492/485 MB/s
INFO: status: 14% (4845862912/34359738368), sparse 4% (1432829952), duration 9, read/write 698/235 MB/s
INFO: status: 19% (6590627840/34359738368), sparse 6% (2245091328), duration 12, read/write 581/310 MB/s
INFO: status: 23% (8029863936/34359738368), sparse 6% (2262581248), duration 15, read/write 479/473 MB/s
INFO: status: 29% (10200219648/34359738368), sparse 12% (4125122560), duration 18, read/write 723/102 MB/s
INFO: status: 32% (11252924416/34359738368), sparse 12% (4258385920), duration 21, read/write 350/306 MB/s
INFO: status: 37% (12741836800/34359738368), sparse 13% (4501467136), duration 24, read/write 496/415 MB/s
INFO: status: 41% (14345437184/34359738368), sparse 14% (4840402944), duration 27, read/write 534/421 MB/s
INFO: status: 46% (15934947328/34359738368), sparse 14% (5135892480), duration 30, read/write 529/431 MB/s
INFO: status: 50% (17418878976/34359738368), sparse 15% (5279399936), duration 33, read/write 494/446 MB/s
INFO: status: 55% (18936758272/34359738368), sparse 15% (5446668288), duration 36, read/write 505/450 MB/s
INFO: status: 59% (20434059264/34359738368), sparse 16% (5568421888), duration 39, read/write 499/458 MB/s
INFO: status: 63% (21955805184/34359738368), sparse 16% (5664776192), duration 42, read/write 507/475 MB/s
INFO: status: 68% (23537057792/34359738368), sparse 17% (5854240768), duration 45, read/write 527/463 MB/s
INFO: status: 73% (25083445248/34359738368), sparse 17% (6098776064), duration 48, read/write 515/433 MB/s
INFO: status: 77% (26735083520/34359738368), sparse 19% (6555090944), duration 51, read/write 550/398 MB/s
INFO: status: 82% (28354543616/34359738368), sparse 19% (6853591040), duration 54, read/write 539/440 MB/s
INFO: status: 87% (29945692160/34359738368), sparse 20% (7175524352), duration 57, read/write 530/423 MB/s
INFO: status: 91% (31475171328/34359738368), sparse 21% (7326216192), duration 60, read/write 509/459 MB/s
INFO: status: 95% (32927252480/34359738368), sparse 21% (7476695040), duration 63, read/write 484/433 MB/s

When it gets to 95%, it has actually only copied a fraction of the total data. At this point the VM stops responding, while the size of the backup file on the CIFS server continues to grow (Proxmox is pushing a few hundred Mbit/s over the network to it). Once the file on the backup server is fully written, the VM starts responding again and the backup job finishes. The rough check I used to watch the file on the CIFS side is sketched after the log. This is the last bit of the log:

Code:
ERROR: VM 113 qmp command 'query-backup' failed - got timeout
INFO: aborting backup job
ERROR: Backup of VM 113 failed - VM 113 qmp command 'query-backup' failed - got timeout
INFO: Failed at 2020-04-25 15:00:41
INFO: Backup job finished with errors
TASK ERROR: job errors
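For what it's worth, this is roughly how I watched the dump grow on the CIFS side while the VM was unresponsive (paths as in the log above):

Code:
# is the .vma on the CIFS share still growing, and how full is the mount?
watch -n 5 'ls -lh /mnt/pve/backups/dump/ && df -h /mnt/pve/backups'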
 
So, after getting the issue with my zpool fixed, I was able to look into this, and for me at least it behaves the way I remember: while performing a backup in snapshot mode, all my guests stay connected and operational throughout the backup process. Not a single ping was lost when I ran a manual backup.

I still think that in your case something is wrong with the way the backup data gets handled. To me it looks as if the dump is written to your PVE local volume, and once that fills up, PVE stops responding until it has noticed the issue and cleaned up the data again. Then PVE becomes responsive again, as do your guests. Please check whether the PVE local volume fills up while the backup is running.
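Something like this should show it quickly while a backup runs (assuming the default "local" storage path of /var/lib/vz):

Code:
# watch the root filesystem and the default "local" storage path during a backup
watch -n 5 'df -h / /var/lib/vz'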

Mine has only a couple of hundred MB available, but I also don't use it for guest storage, which is located on another zpool.
 
Thanks for working with me on this :)

The backup is definitely written to the CIFS mount from the outset. I see the files appear on the storage server as soon as the backup starts and their file size grows throughout the backup.

Have you tried an uncompressed backup? In my case, it seems to think it's finished copying long before it actually has.
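If you want to compare directly, the two variants I've been testing look like this (same job as in my earlier post, only the compression flag changes):

Code:
# uncompressed, which triggers the hang for me
vzdump 113 --mode snapshot --compress 0 --storage backups
# lzo-compressed equivalent for comparison
vzdump 113 --mode snapshot --compress lzo --storage backups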
 
Yeah, I am only using uncompressed backups, since I have enough storage capacity to go without compression. But then, I am not saving my dumps to a network device but to a local ZFS folder. Maybe you can set up a local folder for testing and see whether the guest stays connected when you're not using a networked backup target.
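Roughly like this (storage name and directory are only examples; your pool is already mounted at /local-zfs):

Code:
# create a test directory on the zpool and register it as a directory storage
mkdir -p /local-zfs/dump
pvesm add dir localtest --path /local-zfs/dump --content backup
# run the same backup, but to the local test storage
vzdump 113 --mode snapshot --compress 0 --storage localtest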
 
