I have a cluster consisting of three nodes. As it's a fairly small cluster, I run Ceph directly on the PVE nodes. Currently I'm running 13 VMs on that cluster. I'm using HA, and I run automatic backups of most VMs once per night.
A couple of days ago a VM (id 107) couldn't complete the backup job anymore. I keep getting the same symptoms: The backup job starts, then suddenly stops and I'm being presented with the information that the VM in question is not running:
Code:
INFO: starting new backup job: vzdump 107 --compress lzo --storage backups --remove 0 --mode snapshot --node pve01n02
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: update VM 107: -lock backup
INFO: VM Name: platinum
INFO: include disk 'scsi0' 'vms_normal:vm-107-disk-0' 64G
INFO: include disk 'scsi1' 'vms_slow:vm-107-disk-0' 1T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/backups/dump/vzdump-qemu-107-2019_02_26-15_28_01.vma.lzo'
INFO: started backup task '1fb04261-556d-4bbc-9313-c5e8dadae209'
INFO: status: 0% (805306368/1168231104512), sparse 0% (199729152), duration 3, read/write 268/201 MB/s
INFO: status: 1% (11752439808/1168231104512), sparse 0% (344735744), duration 127, read/write 88/87 MB/s
INFO: status: 2% (23412604928/1168231104512), sparse 0% (443133952), duration 267, read/write 83/82 MB/s
INFO: status: 3% (35114713088/1168231104512), sparse 0% (474980352), duration 407, read/write 83/83 MB/s
INFO: status: 4% (46821015552/1168231104512), sparse 0% (475738112), duration 570, read/write 71/71 MB/s
INFO: status: 5% (58514735104/1168231104512), sparse 0% (489873408), duration 720, read/write 77/77 MB/s
INFO: status: 6% (70103597056/1168231104512), sparse 0% (737226752), duration 880, read/write 72/70 MB/s
INFO: status: 7% (81793122304/1168231104512), sparse 0% (821075968), duration 1105, read/write 51/51 MB/s
INFO: status: 8% (93499424768/1168231104512), sparse 0% (857800704), duration 1345, read/write 48/48 MB/s
INFO: status: 9% (105163784192/1168231104512), sparse 0% (861306880), duration 1594, read/write 46/46 MB/s
INFO: status: 10% (116849115136/1168231104512), sparse 0% (6407180288), duration 1791, read/write 59/31 MB/s
INFO: status: 11% (128576389120/1168231104512), sparse 1% (13693362176), duration 1969, read/write 65/24 MB/s
INFO: status: 12% (140207194112/1168231104512), sparse 1% (17593872384), duration 2185, read/write 53/35 MB/s
INFO: status: 13% (151909302272/1168231104512), sparse 2% (23897571328), duration 2377, read/write 60/28 MB/s
INFO: status: 14% (163561078784/1168231104512), sparse 2% (31240200192), duration 2554, read/write 65/24 MB/s
INFO: status: 15% (175250604032/1168231104512), sparse 3% (39167156224), duration 2737, read/write 63/20 MB/s
INFO: status: 16% (186948517888/1168231104512), sparse 4% (49785163776), duration 2907, read/write 68/6 MB/s
INFO: status: 17% (198600294400/1168231104512), sparse 4% (57666060288), duration 3091, read/write 63/20 MB/s
ERROR: VM 107 not running
INFO: aborting backup job
ERROR: VM 107 not running
ERROR: Backup of VM 107 failed - VM 107 not running
INFO: Backup job finished with errors
TASK ERROR: job errors
Information worthwhile to have:
- This setup had been running fine for a couple of months
- This only affects VM 107; all other VMs run their backup jobs without any issues
- This happens both with automatically scheduled backups and with manually started ones
- The VM is doing fine both before and after the backup job
- I'm not out of space or anything
- All nodes are always kept up to date
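One thing I noticed while staring at the log: the write throughput collapses shortly before the abort (201 MB/s at 0%, down to 6 MB/s at 16%). Here's a quick throwaway sketch I used to pull those figures out of the vzdump output (the `log_lines` list is just a few lines copied from my log; the regex assumes the status-line format shown above):

```python
import re

# A few status lines copied verbatim from the vzdump log above.
log_lines = [
    "INFO: status: 0% (805306368/1168231104512), sparse 0% (199729152), duration 3, read/write 268/201 MB/s",
    "INFO: status: 16% (186948517888/1168231104512), sparse 4% (49785163776), duration 2907, read/write 68/6 MB/s",
    "INFO: status: 17% (198600294400/1168231104512), sparse 4% (57666060288), duration 3091, read/write 63/20 MB/s",
]

# Matches the percentage and the read/write rates in each status line.
pattern = re.compile(r"status: (\d+)% .* read/write (\d+)/(\d+) MB/s")

rates = []
for line in log_lines:
    m = pattern.search(line)
    if m:
        pct, read_mb, write_mb = map(int, m.groups())
        rates.append((pct, read_mb, write_mb))

for pct, r, w in rates:
    print(f"{pct:3d}%  read {r} MB/s  write {w} MB/s")
```

Not sure whether the slowdown is cause or symptom, but it's the only anomaly I can see before the "VM 107 not running" error.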