Backup randomly stops

m.witt

Member
Aug 31, 2020
9
0
6
30
Hi,
im trying to backup about 40 VMs with PBS.

The backup stops/fails everyday at a different VM. From the logs i can't really see the cause. Only the following error is printed :

Sep 2 09:00:26 prx002 vzdump[15610]: ERROR: Backup of VM 2026 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed Sep 2 09:00:26 prx002 vzdump[15610]: INFO: Starting Backup of VM 2027 (qemu)

Logs of the last two days are attached.

root@prx002:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-1
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-1
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-11
pve-cluster: 6.1-8
pve-container: 3.1-13
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-13
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

Any hints on how i can debug this ?
 

Attachments

Hi,

Is your PBS updated?
and can you send the version of "libproxmox-backup-qemu0" and "proxmox-backup-client" on the PVE host?
Does a Desktop environment is installed on this PVE host?
 
Is your PBS updated?
and can you send the version of "libproxmox-backup-qemu0" and "proxmox-backup-client" on the PVE host?
Does a Desktop environment is installed on this PVE host?

PBS is updated.
Code:
ii  proxmox-backup-server          0.8.13-1                     amd64        Proxmox Backup Server daemon with tools and GUI

Please see below for requested versions on the PVE Host:
Code:
ii  libproxmox-backup-qemu0              0.6.4-1                             amd64        Proxmox Backup Server client library for QEMU
ii  proxmox-backup-client                0.8.13-1                            amd64        Proxmox Backup Client tools

There is no desktop / GUI installed on either of the servers.
Both hosts where installed with the corresponding official ISO.

The Backup Job will work fine for several hours until it aborts. It never finished succesfully. Is there anything we can do to produce more debug output in either the client or the server side? Reproducing the issue takes a long time but is generally is not a problem.
 
The problem is DBus debugging is not funny if you don't know what you are looking for.
Have any extra installed software on this node(PVE)?
 
The problem is DBus debugging is not funny if you don't know what you are looking for.
Have any extra installed software on this node(PVE)?

Yes. For example: fail2ban, lldp, nrpe, ksmd, etc. I will disable all those services and start another backup run. Will report back.
 
We moved all 40 VMs to a new/vanilla pve host. The first couple backup runs where completing without errors. All VMs where correctly processed and it took roughly 15 hours to backup all the VMs.
However, recently the same error
Code:
ERROR: Backup of VM 2056 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed
started to show on the new pve host as well. We updated to 0.8.14 yesterday and started another backup run during the night. Got the same (dbus) error after roughly 4 hours.
Any pointers as to what to change, where to look etc would be appreciated.
 
Could it be that the server is overloaded. I mean in case of CPU?
Do you have any CPU workloads during the backup?
If not can you run atop and record the CPU workload during the next Backup?
 
Could it be that the server is overloaded. I mean in case of CPU?
Do you have any CPU workloads during the backup?
If not can you run atop and record the CPU workload during the next Backup?

I find it unlikely that CPU is experiencing issues but i can certainly make sure. Will create atop file for both the pve node as well as the pbs node and report back here.
 
I find it unlikely that CPU is experiencing issues but i can certainly make sure.
We never see this problem before. And I think 40 VM for a single node can be very much load for cpu scheduling.
We know also if the VM vcpus are not used the load on the host CPU is rising through scheduling the vcpus.
The error you are having means, in my opinion, that you cannot start the VM in time.
Maybe we could increase the timeout but this is not normal. And as this product, Proxmox Backup server is beta we try to understand things before we change.
 
I understand. But: This is a testing scenario and the 40 VMs are offline. There is almost nothing else consuming any resources - and there are plenty of resources.

Having said that: Here is the atop files and some context:
prx003 is the pve node currently home to ~40 offline VMs.
pbx-backup is the pbs receiving the data from the pve node
The backup ran for a little less than 24h and then aborted with the error we have seen before:
Code:
ERROR: Backup of VM 2056 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed

prx003 atop file here
pbx-backup atop file

Here is the interesting portion of the job log for the time when it aborted:
INFO: 99% (297.0 GiB of 300.0 GiB) in 20m 7s, read: 674.0 MiB/s, write: 0 B/s
INFO: 100% (300.0 GiB of 300.0 GiB) in 20m 21s, read: 216.6 MiB/s, write: 0 B/s
INFO: backup is sparse: 139.51 GiB (46%) total zero data
INFO: backup was done incrementally, reused 300.00 GiB (100%)
INFO: transferred 300.00 GiB in 1221 seconds (251.6 MiB/s)
INFO: stopping kvm after backup task
trying to acquire lock...
OK
INFO: Finished Backup of VM 2052 (00:20:44)
INFO: Backup finished at 2020-09-09 09:17:43
INFO: Starting Backup of VM 2053 (qemu)
INFO: Backup started at 2020-09-09 09:17:43
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 5
INFO: VM Name: d2053.x-mailer.de
INFO: include disk 'scsi0' 'data:vm-2053-disk-0' 100G
INFO: creating Proxmox Backup Server archive 'vm/2053/2020-09-09T07:17:43Z'
INFO: starting kvm to execute backup task
ERROR: Backup of VM 2053 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed
INFO: Failed at 2020-09-09 09:17:43

We think that during dev/testing, perhaps the backup of lots of offline VM was not in focus as it is arguably an edge case scenario.
We will re-try this test with all 40 VMs running but with disconnected network.
 
Thanks for the logs, I will report if I find something.

We think that during dev/testing, perhaps the backup of lots of offline VM was not in focus as it is arguably an edge case scenario.
We will re-try this test with all 40 VMs running but with disconnected network.
This works here fine without problems.
 
  • Like
Reactions: abien
Hi,

This is a vanilla PVE ;-)
I guess the problems comes from the "prometheus-node-exporter"
It uses DBUS.
and it runs at the same time with load.

Please try the following fix
edit line 46 of /usr/share/perl5/PVE/Systemd.pm
form
Code:
    my $bus = Net::DBus->system( );
to
Code:
    my $bus = Net::DBus->system( ("private", 1) );

restart all pve* services.
and please test again.
 
  • Like
Reactions: abien
I edited this file (on PVE nodes) as per request, rebooted PVE nodes to be safe and re-started the Backup-Job.
After 2 successfull VM's the 3rd and all following VMs recieved the following error:

Code:
INFO: Starting Backup of VM 2026 (qemu)
INFO: Backup started at 2020-09-10 14:44:01
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 5
INFO: VM Name: d297.x-mailer.de
INFO: include disk 'scsi0' 'data:vm-2026-disk-0' 500G
INFO: creating Proxmox Backup Server archive 'vm/2026/2020-09-10T12:44:01Z'
INFO: starting kvm to execute backup task
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tools.pm line 977, <GEN14790> chunk 1.
ERROR: Backup of VM 2026 failed - VM 2026 not running
INFO: Failed at 2020-09-10 14:44:01

To be sure i didnt mess up the change, i am pasting here:
Perl:
root@prx003:~# grep -3 private /usr/share/perl5/PVE/Systemd.pm
sub systemd_call($;$) {
    my ($code, $timeout) = @_;

    my $bus = Net::DBus->system( ("private", 1) );
    my $reactor = Net::DBus::Reactor->main();

    my $service = $bus->get_service('org.freedesktop.systemd1');
 
Your code is correct.
At my machine, I get this error also 1 on 10 runs.
I will analyze it and report back.
 
Please try again with

Code:
my $bus = Net::DBus->system( ("private", 1, "nomainloop", 1 );
 
Please try again with

Code:
my $bus = Net::DBus->system( ("private", 1, "nomainloop", 1 );

That exact line gives syntax errors. I will be testing with one less bracket
Perl:
my $bus = Net::DBus->system("private", 1, "nomainloop", 1 );
Will report back :)
 
Sorry, the correct line would

Code:
my $bus = Net::DBus->system( ("private", 1, "nomainloop", 1) );
 
Hi Wolfgang,
i suppose the last patch fixed the issue. We have done 8 full backup runs, ~14h/30vm's each - no error. All good.
Can you confirm the patch will make it into upstream release at some point?
 
i suppose the last patch fixed the issue. We have done 8 full backup runs, ~14h/30vm's each - no error. All good.
I have to correct this: While the patch fixes the backups - it kills migrations. The patched PVE node can no longer recieve online migrations.

Log output:
2020-09-22 15:33:04 use dedicated network address for sending migration traffic (192.168.X.X)
2020-09-22 15:33:04 starting migration of VM 2057 to node 'prx003' (192.168.X.X)
2020-09-22 15:33:04 found local disk 'data:vm-2057-disk-0' (in current VM config)
2020-09-22 15:33:04 copying local disk images
2020-09-22 15:33:04 starting VM 2057 on remote node 'prx003'
2020-09-22 15:33:16 [prx003] start failed: timeout waiting on systemd
2020-09-22 15:33:16 ERROR: online migrate failure - remote command failed with exit code 255
2020-09-22 15:33:16 aborting phase 2 - cleanup resources
2020-09-22 15:33:16 migrate_cancel
2020-09-22 15:33:17 ERROR: migration finished with problems (duration 00:00:13)
TASK ERROR: migration problems

Migrations are successfull again if we revert the patch and restart pve* services.
 
Hi,

sorry for the late response.
I guess the migration does not work because this was a dirty fix and the DBus connections are not closed.
So I will test this here and give you the full patch if this works.
 
  • Like
Reactions: m.witt

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!