Backup randomly stops

m.witt

New Member
Aug 31, 2020
2
0
1
26
Hi,
I'm trying to back up about 40 VMs with PBS.

The backup stops/fails every day at a different VM. From the logs I can't really see the cause. Only the following error is printed:


Sep 2 09:00:26 prx002 vzdump[15610]: ERROR: Backup of VM 2026 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed
Sep 2 09:00:26 prx002 vzdump[15610]: INFO: Starting Backup of VM 2027 (qemu)


Logs of the last two days are attached.
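
To see which VMs fail on which run, the error lines can be pulled out of the log. This is a minimal sketch, assuming the log lines keep the `ERROR: Backup of VM <id> failed - <reason>` shape shown above; `failed_vms` is a hypothetical helper name, not part of vzdump.

```shell
# Hypothetical helper (not part of vzdump): read log lines on stdin and
# print "<vmid> <reason>" for every failed backup.
failed_vms() {
    sed -n 's/.*ERROR: Backup of VM \([0-9][0-9]*\) failed - \(.*\)$/\1 \2/p'
}
```

For example, `grep vzdump /var/log/syslog | failed_vms`; the syslog path is an assumption and may differ per setup.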

root@prx002:~# pveversion -v
proxmox-ve: 6.2-1 (running kernel: 5.4.55-1-pve)
pve-manager: 6.2-11 (running version: 6.2-11/22fb4983)
pve-kernel-5.4: 6.2-6
pve-kernel-helper: 6.2-6
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.60-1-pve: 5.4.60-1
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksmtuned: 4.20150325+b1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.4
libpve-access-control: 6.1-2
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-1
libpve-guest-common-perl: 3.1-2
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.2-11
pve-cluster: 6.1-8
pve-container: 3.1-13
pve-docs: 6.2-5
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-2
pve-firmware: 3.1-3
pve-ha-manager: 3.0-9
pve-i18n: 2.1-3
pve-qemu-kvm: 5.0.0-13
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-14
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve1

Any hints on how I can debug this?
 

Attachments

  • vzdump0109-0209.txt

wolfgang

Proxmox Staff Member
Staff member
Oct 1, 2014
6,293
435
103
Hi,

Is your PBS up to date?
And can you send the versions of "libproxmox-backup-qemu0" and "proxmox-backup-client" on the PVE host?
Is a desktop environment installed on this PVE host?
 

abien

New Member
Apr 16, 2020
14
0
1
41
> Is your PBS up to date?
> And can you send the versions of "libproxmox-backup-qemu0" and "proxmox-backup-client" on the PVE host?
> Is a desktop environment installed on this PVE host?

PBS is updated.
Code:
ii  proxmox-backup-server          0.8.13-1                     amd64        Proxmox Backup Server daemon with tools and GUI

Please see below for requested versions on the PVE Host:
Code:
ii  libproxmox-backup-qemu0              0.6.4-1                             amd64        Proxmox Backup Server client library for QEMU
ii  proxmox-backup-client                0.8.13-1                            amd64        Proxmox Backup Client tools
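
The `ii` lines above have the shape of `dpkg -l` output. As a rough sketch, the name and version columns can be reduced to one `name=version` line per package; `pkg_versions` is a hypothetical helper, not a Proxmox tool.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print
# "name=version" for every installed ("ii") package line.
pkg_versions() {
    awk '$1 == "ii" { print $2 "=" $3 }'
}
```

Possible usage on the PVE host: `dpkg -l libproxmox-backup-qemu0 proxmox-backup-client | pkg_versions`.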

There is no desktop / GUI installed on either of the servers.
Both hosts were installed with the corresponding official ISO.

The backup job works fine for several hours until it aborts. It has never finished successfully. Is there anything we can do to produce more debug output on either the client or the server side? Reproducing the issue takes a long time but is generally not a problem.
 

wolfgang

The problem is that DBus debugging is no fun if you don't know what you are looking for.
Do you have any extra software installed on this node (PVE)?
 

abien

> The problem is that DBus debugging is no fun if you don't know what you are looking for.
> Do you have any extra software installed on this node (PVE)?

Yes. For example: fail2ban, lldp, nrpe, ksmd, etc. I will disable all those services and start another backup run. Will report back.
 

abien

We moved all 40 VMs to a new/vanilla PVE host. The first couple of backup runs completed without errors. All VMs were processed correctly and it took roughly 15 hours to back up all the VMs.
However, recently the same error
Code:
ERROR: Backup of VM 2056 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed
started to show on the new pve host as well. We updated to 0.8.14 yesterday and started another backup run during the night. Got the same (dbus) error after roughly 4 hours.
Any pointers as to what to change, where to look, etc. would be appreciated.
 

wolfgang

Could it be that the server is overloaded? I mean in terms of CPU.
Do you have any CPU workloads during the backup?
If not, can you run atop and record the CPU load during the next backup?
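
One way to record this is atop's raw-file mode. This is a minimal sketch, assuming `atop` is installed on the node; `record_atop`, the output path and the 10-second default interval are illustrative choices, and the `ATOP` variable exists only so the command can be swapped out for a dry run.

```shell
# Hypothetical wrapper: write atop samples to a raw file at a fixed interval.
#   $1 = output file, $2 = interval in seconds (defaults to 10)
# ATOP can be overridden (for example with "echo") to preview the command.
record_atop() {
    "${ATOP:-atop}" -w "$1" "${2:-10}"
}
```

For example, `record_atop /root/atop-backup.raw` runs until interrupted, and `atop -r /root/atop-backup.raw` replays the recording afterwards.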
 

abien

> Could it be that the server is overloaded? I mean in terms of CPU.
> Do you have any CPU workloads during the backup?
> If not, can you run atop and record the CPU load during the next backup?

I find it unlikely that the CPU is experiencing issues, but I can certainly make sure. I will create atop files for both the PVE node and the PBS node and report back here.
 

wolfgang

> I find it unlikely that the CPU is experiencing issues, but I can certainly make sure.

We have never seen this problem before. And I think 40 VMs on a single node can put a lot of load on CPU scheduling.
We also know that even if the VMs' vCPUs are idle, the load on the host CPU rises from scheduling the vCPUs.
The error you are seeing means, in my opinion, that the VM cannot be started in time.
Maybe we could increase the timeout, but this is not normal. And as this product, Proxmox Backup Server, is in beta, we try to understand things before we change them.
 

abien

I understand. But: This is a testing scenario and the 40 VMs are offline. There is almost nothing else consuming any resources - and there are plenty of resources.

Having said that: Here are the atop files and some context:
prx003 is the pve node currently home to ~40 offline VMs.
pbx-backup is the pbs receiving the data from the pve node
The backup ran for a little less than 24h and then aborted with the error we have seen before:
Code:
ERROR: Backup of VM 2056 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed

prx003 atop file here
pbx-backup atop file

Here is the interesting portion of the job log for the time when it aborted:
INFO: 99% (297.0 GiB of 300.0 GiB) in 20m 7s, read: 674.0 MiB/s, write: 0 B/s
INFO: 100% (300.0 GiB of 300.0 GiB) in 20m 21s, read: 216.6 MiB/s, write: 0 B/s
INFO: backup is sparse: 139.51 GiB (46%) total zero data
INFO: backup was done incrementally, reused 300.00 GiB (100%)
INFO: transferred 300.00 GiB in 1221 seconds (251.6 MiB/s)
INFO: stopping kvm after backup task
trying to acquire lock...
OK
INFO: Finished Backup of VM 2052 (00:20:44)
INFO: Backup finished at 2020-09-09 09:17:43
INFO: Starting Backup of VM 2053 (qemu)
INFO: Backup started at 2020-09-09 09:17:43
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 5
INFO: VM Name: d2053.x-mailer.de
INFO: include disk 'scsi0' 'data:vm-2053-disk-0' 100G
INFO: creating Proxmox Backup Server archive 'vm/2053/2020-09-09T07:17:43Z'
INFO: starting kvm to execute backup task
ERROR: Backup of VM 2053 failed - start failed: org.freedesktop.DBus.Error.Disconnected: Connection is closed
INFO: Failed at 2020-09-09 09:17:43

We think that during dev/testing, perhaps the backup of lots of offline VMs was not in focus, as it is arguably an edge case scenario.
We will re-try this test with all 40 VMs running but with disconnected network.
 

wolfgang

Thanks for the logs, I will report if I find something.

> We think that during dev/testing, perhaps the backup of lots of offline VMs was not in focus, as it is arguably an edge case scenario.
> We will re-try this test with all 40 VMs running but with disconnected network.

This works fine here without problems.
 

wolfgang

Hi,

This is a vanilla PVE ;-)
I guess the problem comes from the "prometheus-node-exporter".
It uses DBus,
and it runs at the same time under load.

Please try the following fix:
edit line 46 of /usr/share/perl5/PVE/Systemd.pm
from
Code:
    my $bus = Net::DBus->system( );
to
Code:
    my $bus = Net::DBus->system( ("private", 1) );

Restart all pve* services
and please test again.
 

abien

I edited this file (on the PVE nodes) as requested, rebooted the PVE nodes to be safe and restarted the backup job.
After 2 successful VMs, the 3rd and all following VMs received the following error:

Code:
INFO: Starting Backup of VM 2026 (qemu)
INFO: Backup started at 2020-09-10 14:44:01
INFO: status = stopped
INFO: backup mode: stop
INFO: ionice priority: 5
INFO: VM Name: d297.x-mailer.de
INFO: include disk 'scsi0' 'data:vm-2026-disk-0' 500G
INFO: creating Proxmox Backup Server archive 'vm/2026/2020-09-10T12:44:01Z'
INFO: starting kvm to execute backup task
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tools.pm line 977, <GEN14790> chunk 1.
ERROR: Backup of VM 2026 failed - VM 2026 not running
INFO: Failed at 2020-09-10 14:44:01

To be sure I didn't mess up the change, I am pasting it here:
Perl:
root@prx003:~# grep -3 private /usr/share/perl5/PVE/Systemd.pm
sub systemd_call($;$) {
    my ($code, $timeout) = @_;

    my $bus = Net::DBus->system( ("private", 1) );
    my $reactor = Net::DBus::Reactor->main();

    my $service = $bus->get_service('org.freedesktop.systemd1');
 

wolfgang

Your code is correct.
On my machine, I also get this error in about 1 in 10 runs.
I will analyze it and report back.
 

wolfgang

Please try again with

Code:
my $bus = Net::DBus->system( ("private", 1, "nomainloop", 1 );
 

abien

> Please try again with
>
> Code:
> my $bus = Net::DBus->system( ("private", 1, "nomainloop", 1 );

That exact line gives syntax errors. I will be testing with one less bracket
Perl:
my $bus = Net::DBus->system("private", 1, "nomainloop", 1 );
Will report back :)
 

wolfgang

Sorry, the correct line would be

Code:
my $bus = Net::DBus->system( ("private", 1, "nomainloop", 1) );
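
Since this is a hand edit to a packaged file, it helps to keep an untouched copy so the change can be undone. The following is only a sketch of one way to do that: `apply_dbus_patch` and `revert_dbus_patch` are hypothetical helper names, the `.orig` suffix is an arbitrary choice, and the pattern assumes the unpatched line reads `Net::DBus->system( )` as quoted earlier in this thread.

```shell
# Hypothetical helpers; the file path and both Perl lines come from this thread.
SYSTEMD_PM=${SYSTEMD_PM:-/usr/share/perl5/PVE/Systemd.pm}

apply_dbus_patch() {
    f=${1:-$SYSTEMD_PM}
    # Keep one untouched copy so the change can be undone later.
    [ -e "$f.orig" ] || cp -p "$f" "$f.orig" || return 1
    # Always patch from the saved copy, so running this twice is harmless.
    sed 's/Net::DBus->system([[:space:]]*)/Net::DBus->system( ("private", 1, "nomainloop", 1) )/' \
        "$f.orig" > "$f"
}

revert_dbus_patch() {
    f=${1:-$SYSTEMD_PM}
    # Restore the saved copy if there is one.
    [ -e "$f.orig" ] && cp -p "$f.orig" "$f"
}
```

Running `perl -c /usr/share/perl5/PVE/Systemd.pm` after the edit is one way to catch a bracket typo before restarting the pve* services.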
 

abien

Hi Wolfgang,
I suppose the last patch fixed the issue. We have done 8 full backup runs, ~14h/30 VMs each - no error. All good.
Can you confirm the patch will make it into an upstream release at some point?
 

abien

> I suppose the last patch fixed the issue. We have done 8 full backup runs, ~14h/30 VMs each - no error. All good.

I have to correct this: While the patch fixes the backups, it kills migrations. The patched PVE node can no longer receive online migrations.

Log output:
2020-09-22 15:33:04 use dedicated network address for sending migration traffic (192.168.X.X)
2020-09-22 15:33:04 starting migration of VM 2057 to node 'prx003' (192.168.X.X)
2020-09-22 15:33:04 found local disk 'data:vm-2057-disk-0' (in current VM config)
2020-09-22 15:33:04 copying local disk images
2020-09-22 15:33:04 starting VM 2057 on remote node 'prx003'
2020-09-22 15:33:16 [prx003] start failed: timeout waiting on systemd
2020-09-22 15:33:16 ERROR: online migrate failure - remote command failed with exit code 255
2020-09-22 15:33:16 aborting phase 2 - cleanup resources
2020-09-22 15:33:16 migrate_cancel
2020-09-22 15:33:17 ERROR: migration finished with problems (duration 00:00:13)
TASK ERROR: migration problems

Migrations are successful again if we revert the patch and restart the pve* services.
 

wolfgang

Hi,

Sorry for the late response.
I guess the migration does not work because this was a dirty fix and the DBus connections are not closed.
So I will test this here and give you the full patch if this works.
 
