Backup - ERROR: got timeout

e100

This problem is rare but annoying.
Happens once in about 1000 backups.
Very random, not specific to any server.

Sometimes I get an error like this:
Code:
132: Aug 17 01:19:47 INFO: creating archive '/backup/dump/vzdump-qemu-132-2014_08_17-01_19_44.vma.lzo'
132: Aug 17 01:19:50 ERROR: got timeout


It is always three seconds between "creating archive" and "got timeout".
This leads me to believe that the problem is related to the default timeout on line 94 of PVE/QMPClient.pm: https://git.proxmox.com/?p=qemu-server.git;a=blob;f=PVE/QMPClient.pm
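
For anyone who wants to check the same three-second gap in their own logs, here is a rough sketch in Python. It assumes the log lines look exactly like the excerpt above ("VMID: Mon DD HH:MM:SS LEVEL: message"); the function name and the `year` parameter are made up for illustration, since the log timestamps carry no year.

```python
import re
from datetime import datetime

# Assumed line format, taken from the log excerpt in this thread:
#   "132: Aug 17 01:19:47 INFO: creating archive '...'"
LINE_RE = re.compile(
    r"^(?P<vmid>\d+): (?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>INFO|ERROR): (?P<msg>.*)$"
)

def timeout_gaps(lines, year=2014):
    """Return (vmid, seconds) for each 'creating archive' -> 'got timeout' pair."""
    started = {}  # vmid -> timestamp of the last 'creating archive' line
    gaps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # The log has no year, so one is supplied by the caller (assumption).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        vmid = m["vmid"]
        if m["level"] == "INFO" and m["msg"].startswith("creating archive"):
            started[vmid] = ts
        elif m["level"] == "ERROR" and m["msg"] == "got timeout" and vmid in started:
            gaps.append((vmid, (ts - started.pop(vmid)).total_seconds()))
    return gaps

sample = [
    "132: Aug 17 01:19:47 INFO: creating archive "
    "'/backup/dump/vzdump-qemu-132-2014_08_17-01_19_44.vma.lzo'",
    "132: Aug 17 01:19:50 ERROR: got timeout",
]
print(timeout_gaps(sample))
```

If every failed backup reports the same gap, that would support the idea that a fixed timeout is being hit rather than something failing at a random point.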
From the code, it looks like only two monitor commands are sent between "creating archive" and "started backup task" on a normal backup run.

Those commands are 'backup' and 'getfd'.

My suspicion is one or both of those might need a slightly longer timeout.

Thoughts, other suggestions?

pve-manager/3.2-4/e24a91c1
 
Could getfd be affected if the file system is under heavy IO when the command is issued, so that disk latency is temporarily high?

If so, could the timeout be calculated dynamically from the measured latency?
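
To make the idea concrete, here is a minimal sketch of what "dynamic" could mean, written in Python rather than the Perl used by PVE. Nothing here is taken from the PVE code: `probe_latency`, `scaled_timeout`, the scaling factor and the 60 second cap are all hypothetical, and only the 3 second base comes from this thread.

```python
import os
import tempfile
import time

BASE_TIMEOUT = 3.0   # the 3 second default discussed in this thread
MAX_TIMEOUT = 60.0   # hypothetical upper bound, not from PVE

def probe_latency(directory):
    """Time one small synced write in `directory`; returns seconds."""
    start = time.monotonic()
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        f.write(b"x" * 4096)
        f.flush()
        os.fsync(f.fileno())  # force the write to reach the disk
    return time.monotonic() - start

def scaled_timeout(latency, base=BASE_TIMEOUT, factor=10.0, cap=MAX_TIMEOUT):
    """Grow the timeout linearly with observed latency, capped at `cap`."""
    return min(cap, base + factor * latency)

latency = probe_latency(tempfile.gettempdir())
print(f"probe took {latency:.4f}s, timeout would be {scaled_timeout(latency):.1f}s")
```

A single probe is a crude measure and could itself stall on a busy disk, so this only illustrates the shape of the calculation.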
 
Please test using latest code from git.
I hate to waste your time, but I did not see any changes since my version that seem relevant to a timeout condition.
If you feel there is one, I will be updating to the latest subscription repo shortly after the firewall feature is released to it.

FYI: http://forum.proxmox.com/threads/18617-vzdump-timeout

In my case the solution was a PVE upgrade to the newest version (non-subscription running there).

My version is not as old as yours was. Maybe there is some new fix, but looking over the git logs I do not see anything that looks relevant.

I am sure it has to do with some latency issue on either the backup disk or the source disk, which is why I was looking at the timeouts.
Making it configurable or simply increasing the 3 second default to something larger seems like a good idea.
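
As a rough illustration of what "configurable" might look like, here is a small Python sketch (again, not the actual Perl in QMPClient.pm). The per-command values, the `QMP_TIMEOUT_*` environment variable names and the `timeout_for` helper are all invented for this example; only the 3 second default and the two command names come from this thread.

```python
import os

DEFAULT_TIMEOUT = 3  # seconds; the default described in this thread

# Hypothetical per-command overrides; the values are illustrative only.
COMMAND_TIMEOUTS = {"backup": 30, "getfd": 30}

def timeout_for(command, env=os.environ):
    """Pick a timeout: environment override, then per-command value, then default."""
    # Hypothetical variable name, e.g. QMP_TIMEOUT_GETFD=10
    override = env.get("QMP_TIMEOUT_" + command.upper())
    if override is not None:
        return int(override)
    return COMMAND_TIMEOUTS.get(command, DEFAULT_TIMEOUT)

print(timeout_for("getfd", env={}))                           # per-command value
print(timeout_for("getfd", env={"QMP_TIMEOUT_GETFD": "10"}))  # override wins
print(timeout_for("query-status", env={}))                    # falls back to default
```

The point is only that a lookup with a fallback keeps the short default for cheap commands while letting the two suspect commands wait longer.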
 
Yeah, my train of thought was exactly the same as yours. I modified the code at the exact same place you pointed out. The code affecting this particular timeout did not change between the updated versions either, but the upgrade still helped. OTOH, it's kinda short-sighted to make this timeout a) unconfigurable and b) this short.