vzdump timeout

kobuki

I have a server where a few KVM VMs are replicated with DRBD. Almost every night when the backup runs, it times out at exactly the same operation. A log excerpt follows:

Code:
105: May 27 02:15:04 INFO: Starting Backup of VM 105 (qemu)
105: May 27 02:15:04 INFO: status = running
105: May 27 02:15:04 INFO: update VM 105: -lock backup
105: May 27 02:15:05 INFO: backup mode: snapshot
105: May 27 02:15:05 INFO: ionice priority: 7
105: May 27 02:15:05 INFO: creating archive '/backup/vzdump/dump/vzdump-qemu-105-2014_05_27-02_15_04.vma.lzo'
105: May 27 02:15:08 ERROR: got timeout
105: May 27 02:15:08 INFO: aborting backup job
105: May 27 02:15:11 ERROR: Backup of VM 105 failed - got timeout
109: May 27 02:15:11 INFO: Starting Backup of VM 109 (qemu)
109: May 27 02:15:11 INFO: status = running
109: May 27 02:15:12 INFO: update VM 109: -lock backup
109: May 27 02:15:12 INFO: backup mode: snapshot
109: May 27 02:15:12 INFO: ionice priority: 7
109: May 27 02:15:12 INFO: creating archive '/backup/vzdump/dump/vzdump-qemu-109-2014_05_27-02_15_11.vma.lzo'
109: May 27 02:15:15 ERROR: got timeout
109: May 27 02:15:15 INFO: aborting backup job
109: May 27 02:15:17 ERROR: Backup of VM 109 failed - got timeout

The timeout always occurs right after the "creating archive" step, and always exactly 3 seconds later, which suggests a fixed 3-second timeout somewhere in the code. Is there a way to change this particular timeout? I tried changing the other timeouts in vzdump.conf, but that didn't help. Running the backup by hand always succeeds, BTW.
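For anyone who wants to check the same pattern in their own logs, here is a minimal Python sketch that measures the gap between the "creating archive" line and the "got timeout" error for each VM. The function name `timeout_delays` and the `year` parameter are made up for illustration; the log format is assumed to match the excerpt above, and month names are assumed to be parsed in an English/C locale.

```python
import re
from datetime import datetime

# Matches lines such as "105: May 27 02:15:05 INFO: creating archive '...'"
LINE_RE = re.compile(r"^(\d+): (\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) (INFO|ERROR): (.*)$")
# Forum colour markup that may be left over if the log was copied from a post
MARKUP_RE = re.compile(r"\[/?COLOR(?:=[^\]]*)?\]")


def timeout_delays(log_text, year=2014):
    """Return {vmid: seconds from 'creating archive' to 'got timeout'}."""
    started = {}
    delays = {}
    for raw in log_text.splitlines():
        line = MARKUP_RE.sub("", raw).strip()
        match = LINE_RE.match(line)
        if not match:
            continue
        vmid, stamp, level, msg = match.groups()
        # The log carries no year, so one has to be supplied (assumption).
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if msg.startswith("creating archive"):
            started[vmid] = when
        elif level == "ERROR" and msg == "got timeout" and vmid in started:
            delays[vmid] = int((when - started[vmid]).total_seconds())
    return delays


SAMPLE = """\
105: May 27 02:15:05 INFO: creating archive '/backup/vzdump/dump/vzdump-qemu-105-2014_05_27-02_15_04.vma.lzo'
105: May 27 02:15:08 ERROR: got timeout
109: May 27 02:15:12 INFO: creating archive '/backup/vzdump/dump/vzdump-qemu-109-2014_05_27-02_15_11.vma.lzo'
109: May 27 02:15:15 ERROR: got timeout
"""

print(timeout_delays(SAMPLE))  # {'105': 3, '109': 3}
```

Both VMs in the excerpt fail exactly 3 seconds after the archive is created, which is what points to a hard-coded value rather than a load-dependent one.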
 
So I take it that it's not possible without debugging and rewriting the code by hand?
 
No, I have "proxmox-ve-2.6.32: 3.1-113 (running kernel: 2.6.32-25-pve)" and the associated deps on this server. I'll try to schedule a time window for an upgrade if you say it solves this problem. In the meantime, though, I think it would be sufficient if I could modify the timeout.
 
I've found a resolution to this problem. In the QMPClient.pm module (https://git.proxmox.com/?p=qemu-server.git;a=blob;f=PVE/QMPClient.pm) at line 94, there's a default timeout value:

Code:
  94             $timeout = 3; # default

Changing this to 180 (smaller values would probably suffice too) allowed my previously failing backup jobs to proceed past the 3-second timeout, and they now always succeed. The backup now spends some 6-7 seconds waiting at the point where it used to time out after 3 seconds most of the time. This particular line is the same in my version and in the latest code linked above, so even a full upgrade probably wouldn't help.
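To illustrate why the larger value helps, here is a toy Python simulation. It is not PVE code: `pick_timeout`, `call_with_timeout` and the `scale` factor are hypothetical names invented for this sketch, and the 6.5-second reply time simply stands in for the 6-7 second wait described above.

```python
import threading

DEFAULT_TIMEOUT = 3  # seconds; mirrors the default value quoted above


def pick_timeout(requested=None, default=DEFAULT_TIMEOUT):
    """Use the caller's timeout if one is given, otherwise the default."""
    return requested if requested else default


def call_with_timeout(reply_after, timeout, scale=0.01):
    """Simulate a call whose reply arrives after `reply_after` seconds.

    `scale` shrinks all durations so the demo finishes quickly.
    Raises TimeoutError if the reply does not arrive within `timeout`.
    """
    done = threading.Event()
    timer = threading.Timer(reply_after * scale, done.set)
    timer.start()
    try:
        if not done.wait(timeout * scale):
            raise TimeoutError("got timeout")
        return "ok"
    finally:
        timer.cancel()


if __name__ == "__main__":
    # A reply that takes ~6.5 s is fine with a 180 s limit...
    print(call_with_timeout(6.5, pick_timeout(180)))
    # ...but fails with the 3 s default.
    try:
        call_with_timeout(6.5, pick_timeout())
    except TimeoutError as err:
        print("ERROR:", err)
```

Any limit comfortably above the observed wait would behave the same way in this model, which is why 180 is likely far more than needed.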

Do you think it's possible to fix it upstream by using a longer timeout? That would be great.
 
Dietmar, could you be a little more specific, please: what does not look correct? I have only reported what helped me solve my problem. I'd be glad if you told me how I should present my findings correctly.

I don't know what causes the delay. It must be a qemu client call to which the default timeout applies. What is the 3-second default timeout based on?

I already answered your question about the newest version. I can't update the host at just any time; I need to schedule downtime, and I haven't been able to do that yet. But again, as I noted previously, that line in the latest version in your Git repo doesn't differ from the 3.1 version I'm running.
 
There were changes recently, so maybe your issue could be solved just by upgrading. Hunting for a bug that may already be fixed is just a waste of time.

=> Upgrade to latest, test again and if there is still an issue we can go further.
 
Alright. My fix bought me some time; after an upgrade I'll test again with pristine PVE code and report back.
 
Since upgrading to 3.2-126 a week ago, the error hasn't appeared even once. The upgrade seems to have done the trick.
 
