Backup job keeps failing - VM is shut down

Aug 5, 2018
12
0
1
30
I have a cluster consisting of three nodes. As it's a fairly small cluster I run ceph directly on the pve nodes. Currently I'm running 13 VMs on that cluster. I'm using HA and I run automatic backups of most VMs once per night.

A couple of days ago a VM (id 107) couldn't complete the backup job anymore. I keep getting the same symptoms: The backup job starts, then suddenly stops and I'm being presented with the information that the VM in question is not running:
Code:
INFO: starting new backup job: vzdump 107 --compress lzo --storage backups --remove 0 --mode snapshot --node pve01n02
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: update VM 107: -lock backup
INFO: VM Name: platinum
INFO: include disk 'scsi0' 'vms_normal:vm-107-disk-0' 64G
INFO: include disk 'scsi1' 'vms_slow:vm-107-disk-0' 1T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-107-2019_02_26-15_28_01.vma.lzo'
INFO: started backup task '1fb04261-556d-4bbc-9313-c5e8dadae209'
INFO: status: 0% (805306368/1168231104512), sparse 0% (199729152), duration 3, read/write 268/201 MB/s
INFO: status: 1% (11752439808/1168231104512), sparse 0% (344735744), duration 127, read/write 88/87 MB/s
INFO: status: 2% (23412604928/1168231104512), sparse 0% (443133952), duration 267, read/write 83/82 MB/s
INFO: status: 3% (35114713088/1168231104512), sparse 0% (474980352), duration 407, read/write 83/83 MB/s
INFO: status: 4% (46821015552/1168231104512), sparse 0% (475738112), duration 570, read/write 71/71 MB/s
INFO: status: 5% (58514735104/1168231104512), sparse 0% (489873408), duration 720, read/write 77/77 MB/s
INFO: status: 6% (70103597056/1168231104512), sparse 0% (737226752), duration 880, read/write 72/70 MB/s
INFO: status: 7% (81793122304/1168231104512), sparse 0% (821075968), duration 1105, read/write 51/51 MB/s
INFO: status: 8% (93499424768/1168231104512), sparse 0% (857800704), duration 1345, read/write 48/48 MB/s
INFO: status: 9% (105163784192/1168231104512), sparse 0% (861306880), duration 1594, read/write 46/46 MB/s
INFO: status: 10% (116849115136/1168231104512), sparse 0% (6407180288), duration 1791, read/write 59/31 MB/s
INFO: status: 11% (128576389120/1168231104512), sparse 1% (13693362176), duration 1969, read/write 65/24 MB/s
INFO: status: 12% (140207194112/1168231104512), sparse 1% (17593872384), duration 2185, read/write 53/35 MB/s
INFO: status: 13% (151909302272/1168231104512), sparse 2% (23897571328), duration 2377, read/write 60/28 MB/s
INFO: status: 14% (163561078784/1168231104512), sparse 2% (31240200192), duration 2554, read/write 65/24 MB/s
INFO: status: 15% (175250604032/1168231104512), sparse 3% (39167156224), duration 2737, read/write 63/20 MB/s
INFO: status: 16% (186948517888/1168231104512), sparse 4% (49785163776), duration 2907, read/write 68/6 MB/s
INFO: status: 17% (198600294400/1168231104512), sparse 4% (57666060288), duration 3091, read/write 63/20 MB/s
ERROR: VM 107 not running
INFO: aborting backup job
ERROR: VM 107 not running
ERROR: Backup of VM 107 failed - VM 107 not running
INFO: Backup job finished with errors
TASK ERROR: job errors

Information worthwhile to have:
  • This setup was running fine for a couple of months now
  • This only affects VM 107. All other VMs can run their backup jobs without any issues
  • This happens both with automatically scheduled backups and manually started ones
  • The VM is doing fine both before and after the backup job
  • I'm not out-of-space or anything
  • All nodes are always up-to-date
I'm pretty much clue-less on this one. Has anyone seen this behavior before? Does anyone have an idea how to track down this problem?
 
Thanks for your quick reply!

Yep, the VM has two .qcow2 disks.

I checked the log files of the backups that failed a couple of days once more and I found that I sometimes get something different from "
ERROR: VM 107 not running"
:
Code:
107: 2019-02-22 01:30:02 INFO: Starting Backup of VM 107 (qemu)
107: 2019-02-22 01:30:02 INFO: status = running
107: 2019-02-22 01:30:02 INFO: update VM 107: -lock backup
107: 2019-02-22 01:30:02 INFO: VM Name: platinum
107: 2019-02-22 01:30:02 INFO: include disk 'scsi0' 'vms_normal:vm-107-disk-0' 64G
107: 2019-02-22 01:30:02 INFO: include disk 'scsi1' 'vms_slow:vm-107-disk-0' 1T
107: 2019-02-22 01:30:02 INFO: backup mode: snapshot
107: 2019-02-22 01:30:02 INFO: ionice priority: 7
107: 2019-02-22 01:30:02 INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-107-2019_02_22-01_30_02.vma.lzo'
107: 2019-02-22 01:30:07 INFO: started backup task 'b98403d8-85cc-4620-8760-8d06efd30f1c'
107: 2019-02-22 01:30:10 INFO: status: 0% (798228480/1168231104512), sparse 0% (193150976), duration 3, read/write 266/201 MB/s
107: 2019-02-22 01:32:06 INFO: status: 1% (11726356480/1168231104512), sparse 0% (334245888), duration 119, read/write 94/92 MB/s
107: 2019-02-22 01:33:56 INFO: status: 2% (23416799232/1168231104512), sparse 0% (436097024), duration 229, read/write 106/105 MB/s
107: 2019-02-22 01:35:45 INFO: status: 3% (35175268352/1168231104512), sparse 0% (466325504), duration 338, read/write 107/107 MB/s
107: 2019-02-22 01:37:48 INFO: status: 4% (46892318720/1168231104512), sparse 0% (467083264), duration 461, read/write 95/95 MB/s
107: 2019-02-22 01:40:06 INFO: status: 5% (58532954112/1168231104512), sparse 0% (481218560), duration 599, read/write 84/84 MB/s
107: 2019-02-22 01:42:32 INFO: status: 6% (70111068160/1168231104512), sparse 0% (728723456), duration 745, read/write 79/77 MB/s
107: 2019-02-22 01:46:46 INFO: status: 7% (81793122304/1168231104512), sparse 0% (812474368), duration 999, read/write 45/45 MB/s
107: 2019-02-22 01:51:04 INFO: status: 8% (93486841856/1168231104512), sparse 0% (849178624), duration 1257, read/write 45/45 MB/s
107: 2019-02-22 01:55:08 INFO: status: 9% (105155395584/1168231104512), sparse 0% (852914176), duration 1501, read/write 47/47 MB/s
107: 2019-02-22 01:58:24 INFO: status: 10% (116830240768/1168231104512), sparse 0% (6398545920), duration 1697, read/write 59/31 MB/s
107: 2019-02-22 02:01:26 INFO: status: 11% (128517668864/1168231104512), sparse 1% (13663756288), duration 1879, read/write 64/24 MB/s
107: 2019-02-22 02:05:13 INFO: status: 12% (140228165632/1168231104512), sparse 1% (17591836672), duration 2106, read/write 51/34 MB/s
107: 2019-02-22 02:08:21 INFO: status: 13% (151880073216/1168231104512), sparse 2% (23888936960), duration 2294, read/write 61/28 MB/s
107: 2019-02-22 02:11:16 INFO: status: 14% (163611410432/1168231104512), sparse 2% (31323746304), duration 2469, read/write 67/24 MB/s
107: 2019-02-22 02:14:19 INFO: status: 15% (175246409728/1168231104512), sparse 3% (39204560896), duration 2652, read/write 63/20 MB/s
107: 2019-02-22 02:17:26 INFO: status: 16% (186944323584/1168231104512), sparse 4% (49822666752), duration 2839, read/write 62/5 MB/s
107: 2019-02-22 02:20:52 INFO: status: 17% (198612877312/1168231104512), sparse 4% (57720340480), duration 3045, read/write 56/18 MB/s
107: 2019-02-22 02:25:14 INFO: status: 18% (210344345600/1168231104512), sparse 5% (66151485440), duration 3307, read/write 44/12 MB/s
107: 2019-02-22 02:28:19 INFO: status: 19% (221987733504/1168231104512), sparse 6% (76558831616), duration 3492, read/write 62/6 MB/s
107: 2019-02-22 02:29:29 ERROR: VM 107 qmp command 'query-backup' failed - client closed connection
107: 2019-02-22 02:29:29 INFO: aborting backup job
107: 2019-02-22 02:29:29 ERROR: VM 107 qmp command 'backup-cancel' failed - unable to connect to VM 107 qmp socket - Connection refused
107: 2019-02-22 02:29:29 ERROR: Backup of VM 107 failed - VM 107 qmp command 'query-backup' failed - client closed connection
 
Here's the log from the backup job that started this morning at 01:00:
Code:
VMID     NAME    STATUS TIME      SIZE        FILENAME

107        platinum              FAILED  01:00:06              VM 107 not running

TOTAL   01:00:06              0KB       



Detailed backup logs:




vzdump 107 --quiet 1 --mode snapshot --compress lzo --mailnotification failure --storage backups --mailto xxx@xxx.xxx



107: 2019-02-27 02:30:52 INFO: Starting Backup of VM 107 (qemu)

107: 2019-02-27 02:30:52 INFO: status = running

107: 2019-02-27 02:30:52 INFO: update VM 107: -lock backup

107: 2019-02-27 02:30:52 INFO: VM Name: platinum

107: 2019-02-27 02:30:52 INFO: include disk 'scsi0' 'vms_normal:vm-107-disk-0' 64G

107: 2019-02-27 02:30:52 INFO: include disk 'scsi1' 'vms_slow:vm-107-disk-0' 1T

107: 2019-02-27 02:30:53 INFO: backup mode: snapshot

107: 2019-02-27 02:30:53 INFO: ionice priority: 7

107: 2019-02-27 02:30:53 INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-107-2019_02_27-02_30_52.vma.lzo'

107: 2019-02-27 02:30:58 INFO: started backup task '631cdccf-6a20-4200-bef6-f1824c46df3a'

107: 2019-02-27 02:31:01 INFO: status: 0% (478937088/1168231104512), sparse 0% (169451520), duration 3, read/write 159/103 MB/s

107: 2019-02-27 02:33:15 INFO: status: 1% (11777605632/1168231104512), sparse 0% (339558400), duration 137, read/write 84/83 MB/s

107: 2019-02-27 02:35:39 INFO: status: 2% (23433576448/1168231104512), sparse 0% (442363904), duration 281, read/write 80/80 MB/s

107: 2019-02-27 02:37:47 INFO: status: 3% (35103047680/1168231104512), sparse 0% (471175168), duration 409, read/write 91/90 MB/s

107: 2019-02-27 02:40:23 INFO: status: 4% (46758100992/1168231104512), sparse 0% (471932928), duration 565, read/write 74/74 MB/s

107: 2019-02-27 02:42:57 INFO: status: 5% (58430324736/1168231104512), sparse 0% (486068224), duration 719, read/write 75/75 MB/s

107: 2019-02-27 02:45:35 INFO: status: 6% (70107791360/1168231104512), sparse 0% (733573120), duration 877, read/write 73/72 MB/s

107: 2019-02-27 02:49:36 INFO: status: 7% (81814093824/1168231104512), sparse 0% (817262592), duration 1118, read/write 48/48 MB/s

107: 2019-02-27 02:53:38 INFO: status: 8% (93506764800/1168231104512), sparse 0% (853966848), duration 1360, read/write 48/48 MB/s

107: 2019-02-27 02:57:36 INFO: status: 9% (105193144320/1168231104512), sparse 0% (857767936), duration 1598, read/write 49/49 MB/s

107: 2019-02-27 03:00:49 INFO: status: 10% (116861698048/1168231104512), sparse 0% (6403346432), duration 1791, read/write 60/31 MB/s

107: 2019-02-27 03:03:46 INFO: status: 11% (128551223296/1168231104512), sparse 1% (13702111232), duration 1968, read/write 66/24 MB/s

107: 2019-02-27 03:07:20 INFO: status: 12% (140236554240/1168231104512), sparse 1% (17605025792), duration 2182, read/write 54/36 MB/s

107: 2019-02-27 03:10:33 INFO: status: 13% (151917690880/1168231104512), sparse 2% (23893737472), duration 2375, read/write 60/27 MB/s

107: 2019-02-27 03:13:33 INFO: status: 14% (163556884480/1168231104512), sparse 2% (31274115072), duration 2555, read/write 64/23 MB/s

107: 2019-02-27 03:16:39 INFO: status: 15% (175263186944/1168231104512), sparse 3% (39226138624), duration 2741, read/write 62/20 MB/s

107: 2019-02-27 03:19:30 INFO: status: 16% (186935934976/1168231104512), sparse 4% (49819078656), duration 2912, read/write 68/6 MB/s

107: 2019-02-27 03:22:34 INFO: status: 17% (198654820352/1168231104512), sparse 4% (57767084032), duration 3096, read/write 63/20 MB/s

107: 2019-02-27 03:25:36 INFO: status: 18% (210319179776/1168231104512), sparse 5% (66131120128), duration 3278, read/write 64/18 MB/s

107: 2019-02-27 03:28:28 INFO: status: 19% (221970956288/1168231104512), sparse 6% (76563615744), duration 3450, read/write 67/7 MB/s

107: 2019-02-27 03:30:58 ERROR: VM 107 not running

107: 2019-02-27 03:30:58 INFO: aborting backup job

107: 2019-02-27 03:30:58 ERROR: VM 107 not running

107: 2019-02-27 03:30:58 ERROR: Backup of VM 107 failed - VM 107 not running

A job that started at 00:00 resulted in the following:
Code:
can't aquire lock '/var/run/vzdump.lock' - got timeout

VMID     NAME    STATUS TIME      SIZE        FILENAME
TOTAL   00:00:00              0KB       


Detailed backup logs:
vzdump 112 111 109 --mailnotification failure --quiet 1 --storage backups --mode snapshot --compress lzo --mailto xxx@xxx.xxx

Any ideas?
 

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!