Backup: ERROR: unable to find configuration file for VM

RobFantini

Twice in three weeks, at least one node has had failed backups.

This thread showed a similar situation: https://forum.proxmox.com/threads/proxmox-ve-vzdump-backup-of-vms.54291/post-250155

There are about 10 VMs on the node. The first KVM guest backed up OK:
Code:
INFO: status: 98% (21063794688/21474836480), sparse 68% (14780002304), duration 84, read/write 2364/1 MB/s
INFO: status: 100% (21474836480/21474836480), sparse 70% (15191044096), duration 85, read/write 411/0 MB/s
INFO: transferred 21474 MB in 85 seconds (252 MB/s)
INFO: stopping kvm after backup task
INFO: archive file size: 2.63GB
INFO: delete old backup '/mnt/pve/bkup-nfs/dump/vzdump-qemu-106-2019_08_03-00_00_02.vma.lzo'
INFO: Finished Backup of VM 106 (00:01:28)
INFO: Backup finished at 2019-08-24 00:01:30
INFO: Starting Backup of VM 10782 (qemu)

Then the remainder failed:
Code:
INFO: Backup finished at 2019-08-24 00:01:30
INFO: Starting Backup of VM 10782 (qemu)
INFO: Backup started at 2019-08-24 00:01:30
INFO: status = running
INFO: update VM 10782: -lock backup
INFO: VM Name: Kwiklok
INFO: include disk 'ide0' 'rbd_ssd:vm-10782-disk-0' 32G
INFO: backup mode: snapshot
INFO: ionice priority: 5
INFO: creating archive '/mnt/pve/bkup-nfs/dump/vzdump-qemu-10782-2019_08_24-00_01_30.vma.lzo'
INFO: started backup task 'b3f76c26-8da9-483d-ab3e-c62b3a9906ce'
INFO: status: 1% (509345792/34359738368), sparse 0% (186986496), duration 3, read/write 169/107 MB/s
INFO: status: 2% (851443712/34359738368), sparse 0% (206741504), duration 6, read/write 114/107 MB/s
INFO: status: 3% (1182793728/34359738368), sparse 0% (215281664), duration 9, read/write 110/107 MB/s
INFO: status: 4% (1606418432/34359738368), sparse 0% (230178816), duration 12, read/write 141/136 MB/s
INFO: status: 5% (1950351360/34359738368), sparse 0% (243765248), duration 15, read/write 114/110 MB/s
INFO: status: 6% (2326265856/34359738368), sparse 0% (275148800), duration 18, read/write 125/114 MB/s
INFO: status: 7% (2497314816/34359738368), sparse 0% (294576128), duration 21, read/write 57/50 MB/s
INFO: status: 8% (2862219264/34359738368), sparse 1% (357285888), duration 48, read/write 13/11 MB/s
INFO: status: 9% (3229614080/34359738368), sparse 1% (363343872), duration 51, read/write 122/120 MB/s
INFO: status: 10% (3607101440/34359738368), sparse 1% (377077760), duration 54, read/write 125/121 MB/s
INFO: status: 11% (3997171712/34359738368), sparse 1% (392585216), duration 57, read/write 130/124 MB/s
INFO: status: 12% (4356046848/34359738368), sparse 1% (408723456), duration 60, read/write 119/114 MB/s
INFO: status: 13% (4739956736/34359738368), sparse 1% (428412928), duration 63, read/write 127/121 MB/s
INFO: status: 14% (5051645952/34359738368), sparse 1% (456728576), duration 66, read/write 103/94 MB/s
INFO: status: 15% (5188354048/34359738368), sparse 1% (462778368), duration 74, read/write 17/16 MB/s
INFO: status: 16% (5570035712/34359738368), sparse 1% (528830464), duration 78, read/write 95/78 MB/s
INFO: status: 17% (5963907072/34359738368), sparse 1% (563154944), duration 81, read/write 131/119 MB/s
INFO: status: 18% (6302203904/34359738368), sparse 1% (607698944), duration 84, read/write 112/97 MB/s
INFO: status: 19% (6694109184/34359738368), sparse 1% (647401472), duration 87, read/write 130/117 MB/s
INFO: status: 20% (7032012800/34359738368), sparse 2% (688037888), duration 90, read/write 112/99 MB/s
INFO: status: 21% (7381975040/34359738368), sparse 2% (720273408), duration 93, read/write 116/105 MB/s
INFO: status: 22% (7742816256/34359738368), sparse 2% (764256256), duration 96, read/write 120/105 MB/s
INFO: status: 23% (8023703552/34359738368), sparse 2% (784474112), duration 99, read/write 93/86 MB/s
INFO: status: 24% (8430551040/34359738368), sparse 2% (835563520), duration 102, read/write 135/118 MB/s
INFO: status: 25% (8749318144/34359738368), sparse 2% (860991488), duration 105, read/write 106/97 MB/s
INFO: status: 26% (9092202496/34359738368), sparse 2% (900317184), duration 108, read/write 114/101 MB/s
INFO: status: 27% (9361686528/34359738368), sparse 2% (935059456), duration 111, read/write 89/78 MB/s
INFO: status: 28% (9639559168/34359738368), sparse 2% (1011642368), duration 114, read/write 92/67 MB/s
INFO: status: 29% (10078912512/34359738368), sparse 3% (1134518272), duration 117, read/write 146/105 MB/s
INFO: status: 30% (10433986560/34359738368), sparse 3% (1175810048), duration 121, read/write 88/78 MB/s
INFO: status: 31% (10715267072/34359738368), sparse 3% (1234259968), duration 124, read/write 93/74 MB/s
INFO: status: 32% (11072569344/34359738368), sparse 3% (1278296064), duration 127, read/write 119/104 MB/s
INFO: status: 33% (11492392960/34359738368), sparse 3% (1360261120), duration 132, read/write 83/67 MB/s
ERROR: unable to find configuration file for VM 10782 - no such machine
INFO: aborting backup job
ERROR: unable to find configuration file for VM 10782 - no such machine
INFO: ipcc_send_rec[1] failed: Connection refused
INFO: ipcc_send_rec[2] failed: Connection refused
INFO: ipcc_send_rec[3] failed: Connection refused
INFO: Unable to load access control list: Connection refused
command 'qm unlock 10782' failed: exit code 111
ERROR: Backup of VM 10782 failed - unable to find configuration file for VM 10782 - no such machine
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 108 (qemu)
INFO: Backup started at 2019-08-24 00:03:48
ERROR: Backup of VM 108 failed - unable to find configuration file for VM 108 - no such machine
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 111 (qemu)
INFO: Backup started at 2019-08-24 00:03:48
ERROR: Backup of VM 111 failed - unable to find configuration file for VM 111 - no such machine
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 114 (qemu)
INFO: Backup started at 2019-08-24 00:03:48
ERROR: Backup of VM 114 failed - unable to find configuration file for VM 114 - no such machine
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 445 (qemu)
INFO: Backup started at 2019-08-24 00:03:48
ERROR: Backup of VM 445 failed - unable to find configuration file for VM 445 - no such machine
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 900 (qemu)
INFO: Backup started at 2019-08-24 00:03:48
ERROR: Backup of VM 900 failed - unable to find configuration file for VM 900 - no such machine
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 102 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = running
ERROR: Backup of VM 102 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 104 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = stopped
ERROR: Backup of VM 104 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 110 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = stopped
ERROR: Backup of VM 110 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 113 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = running
ERROR: Backup of VM 113 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 115 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = stopped
ERROR: Backup of VM 115 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 160 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = running
ERROR: Backup of VM 160 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 707 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = running
ERROR: Backup of VM 707 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 7101 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = stopped
ERROR: Backup of VM 7101 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Starting Backup of VM 9136 (lxc)
INFO: Backup started at 2019-08-24 00:03:48
INFO: status = stopped
ERROR: Backup of VM 9136 failed - Connection refused
INFO: Failed at 2019-08-24 00:03:48
INFO: Backup job finished with errors
ERROR: Connection refused

TASK ERROR: job errors

The next three nodes completed their backups with no errors.


I schedule backups one hour apart so the network is not overloaded. I'll try to debug this later on.
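For reference, the jobs are staggered per node in /etc/pve/vzdump.cron. A sketch of what that looks like (only the storage name, lzo compression, and node names match the logs here; the times and other options are examples):

Code:
# one job per node, one hour apart (example times and options)
0 0 * * * root vzdump --all --node pve14 --mode snapshot --compress lzo --storage bkup-nfs --quiet 1
0 1 * * * root vzdump --all --node pve10 --mode snapshot --compress lzo --storage bkup-nfs --quiet 1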

Next, I'll schedule the node to do another backup in a few minutes.

Edit: the next backup worked.

Any clues on how to prevent or debug this?
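For the record, this is what I plan to check around the failure window (standard PVE and systemd tools):

Code:
# state of the cluster filesystem and corosync on the affected node
systemctl status pve-cluster corosync
pvecm status
# logs around the failure window
journalctl -u pve-cluster -u corosync --since "2019-08-24 00:00" --until "2019-08-24 00:10"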
 
It turns out /etc/pve was disconnecting during the backup. From syslog on that node:
Code:
Aug 24 00:01:30 pve14 vzdump[1989762]: INFO: Starting Backup of VM 10782 (qemu)
..

Aug 24 00:01:41 pve14 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Aug 24 00:01:41 pve14 systemd[1]: pve-cluster.service: Service RestartSec=100ms expired, scheduling restart.
Aug 24 00:01:41 pve14 systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 29.
Aug 24 00:01:41 pve14 systemd[1]: Stopped The Proxmox VE cluster filesystem.
Aug 24 00:01:41 pve14 systemd[1]: Starting The Proxmox VE cluster filesystem...
Aug 24 00:01:41 pve14 systemd[1]: etc-pve.mount: Succeeded.
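
The restart counter of 29 means pve-cluster (pmxcfs) has already been restarted that many times. To see when and why it is being killed (standard systemd tooling; the grep patterns are just examples):

Code:
# how often pmxcfs has been restarted, and by what signal
journalctl -u pve-cluster | grep -c 'Scheduled restart'
journalctl -u pve-cluster | grep -i 'signal'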

We have a site-wide rsyslog server. Another node also had the /etc/pve issue, but its backup did not fail:
Code:
# grep etc-pve.mount zz-sitewide.log
Aug 24 00:01:41 pve14 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 00:03:40 pve14 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 00:03:50 pve14 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 00:05:12 pve14 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 01:03:19 pve10 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 02:18:00 pve10 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 05:06:01 pve10 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 05:12:26 pve10 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 06:44:28 pve10 systemd[1]: etc-pve.mount: Succeeded.
Aug 24 06:56:15 pve10 systemd[1]: etc-pve.mount: Succeeded.

For the network we use a dual-ring network for the cluster, 10G for storage, and 1G for vmbr0.
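To verify that both cluster rings are actually up on a node (corosync's bundled status tool):

Code:
# show the status of each corosync ring on this node
corosync-cfgtool -s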

/etc/vzdump.conf:
Code:
# 2019-03-11 using default pve , adjusted with https://github.com/extremeshok/xshok-proxmox/blob/master/install-post.sh
#            also pigz set up using same link.

#tmpdir: DIR
#dumpdir: DIR
#storage: STORAGE_ID
#mode: snapshot|suspend|stop
bwlimit: 0
ionice: 5
#lockwait: MINUTES
#stopwait: MINUTES
#size: MB
#stdexcludes: BOOLEAN
#mailto: ADDRESSLIST
#maxfiles: N
#script: FILENAME
#exclude-path: PATHLIST
pigz: 1
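
Here bwlimit: 0 means no bandwidth cap, and pigz: 1 makes vzdump compress with pigz instead of gzip. A throttled variant would look like this (example values; bwlimit is in KB/s):

Code:
# cap vzdump throughput at ~100 MB/s and use 4 pigz threads (example values)
bwlimit: 102400
pigz: 4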


We back up to an NFS system connected to the 10G switch.

Is there anything we could try to further debug or prevent the backup issue?
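One thing we can check ourselves is the NFS target's write throughput outside the backup window (dd with conv=fdatasync so the timing includes the flush to disk; the path is our backup mount from the log):

Code:
# write 1 GiB to the backup share, forcing it to disk before dd reports
dd if=/dev/zero of=/mnt/pve/bkup-nfs/dump/ddtest bs=1M count=1024 conv=fdatasync
rm /mnt/pve/bkup-nfs/dump/ddtest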
 
We made the following changes:

1- changed backup storage to CephFS (it was NFS).
2- tried stop mode instead of suspend after the first failure.
3- changed /etc/vzdump.conf back to the developers' defaults after the first failure.

We still have backup failures.
Last night two different nodes had a failed backup due to 'unable to find configuration file for VM <vm id> - no such machine'.

There has been improvement: only one VM per node failed to back up, whereas last week all subsequent VMs on the node would fail.

So most backups are working, which is not good enough.

Does anyone have a suggestion for something else to try to fix this issue?
 
Next:
The system that had the majority of failures had two VMs with heavy network traffic. I moved one of those VMs to another node to balance the network load.
In addition, I will set a bandwidth limit on all the heavy-network VMs on all nodes, as sketched below.
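Something like this per NIC (a sketch; the VM ID, MAC, and rate are examples, and rate is in MB/s):

Code:
# cap net0 at ~50 MB/s; re-specify the NIC's current model/MAC/bridge so they are kept
qm set 10782 --net0 virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0,rate=50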