Backup of VM to NFS fails on 2.3

This server has 4 KVM machines:
8105, 8106, 8107 are Win XP machines
8200 is a Linux box (running CentOS/Asterisk)


I've reviewed all the logs and found the following:
I updated to 2.3.13 on March 20.
Backup is scheduled on Wed/Sat to NFS1 (Filer.20) and on Tue/Fri to NFS2 (Filer.21).
These are the results for all backups:
Code:
Date         NFS   Result
march 22    21     OK
march 23    20     OK
march 26    21     OK
march 27    20     OK
march 29    21     OK
march 30    20     OK
april  2    21     OK
april  3    20     8106, 8107 & 8200 failed      (NFS full  >>  NOT related to the problem)
april  5    21     8105 & 8200 failed as described; both machines (WinXP and CentOS) were stopped during backup
april  6    20     OK
april  9    21     Machine 8105 failed 
april 10    20     OK   (this time with gzip format)
So out of 10 backup jobs (4 machines each) I've had 2 failures on machine 8105 (XP) and one failure on machine 8200 (CentOS).

The log for April 5 (2 failures & 2 OK) is:
Code:
vzdump 8105 8106 8109 8200 8107 --quiet 1 --mode snapshot --mailto root --compress lzo --storage Filer.122.21.backups4  
8105: Apr 05 05:35:02 INFO: Starting Backup of VM 8105 (qemu)
8105: Apr 05 05:35:02 INFO: status = running
8105: Apr 05 05:35:02 INFO: backup mode: snapshot
8105: Apr 05 05:35:02 INFO: ionice priority: 7
8105: Apr 05 05:35:02 INFO: creating archive '/mnt/pve/Filer.122.21.backups4/dump/vzdump-qemu-8105-2013_04_05-05_35_02.vma.lzo'
8105: Apr 05 05:35:02 INFO: started backup task '29690223-368d-4382-ab84-ce1195b9e6eb'
8105: Apr 05 05:35:05 INFO: status: 1% (149880832/10737418240), sparse 0% (2740224), duration 3, 49/49 MB/s 
8105: Apr 05 05:35:08 INFO: status: 2% (277086208/10737418240), sparse 0% (3149824), duration 6, 42/42 MB/s 
8105: Apr 05 05:35:11 INFO: status: 3% (411893760/10737418240), sparse 0% (3633152), duration 9, 44/44 MB/s 
8105: Apr 05 05:35:14 INFO: status: 5% (554762240/10737418240), sparse 0% (4059136), duration 12, 47/47 MB/s 
8105: Apr 05 05:35:17 INFO: status: 6% (695664640/10737418240), sparse 0% (4411392), duration 15, 46/46 MB/s 
8105: Apr 05 05:35:20 INFO: status: 7% (820969472/10737418240), sparse 0% (5267456), duration 18, 41/41 MB/s 
8105: Apr 05 05:35:23 INFO: status: 8% (935067648/10737418240), sparse 0% (5734400), duration 21, 38/37 MB/s 
8105: Apr 05 05:35:49 ERROR: VM 8105 not running
8105: Apr 05 05:35:49 INFO: aborting backup job
8105: Apr 05 05:35:49 ERROR: VM 8105 not running
8105: Apr 05 05:35:50 ERROR: Backup of VM 8105 failed - VM 8105 not running

8106: Apr 05 05:35:50 INFO: Starting Backup of VM 8106 (qemu)
8106: Apr 05 05:35:50 INFO: status = stopped
8106: Apr 05 05:35:50 INFO: backup mode: stop
8106: Apr 05 05:35:50 INFO: ionice priority: 7
8106: Apr 05 05:35:50 INFO: creating archive '/mnt/pve/Filer.122.21.backups4/dump/vzdump-qemu-8106-2013_04_05-05_35_50.vma.lzo'
8106: Apr 05 05:35:50 INFO: starting kvm to execute backup task
8106: Apr 05 05:35:52 INFO: started backup task '8b3143df-ae55-4cb5-b3da-b5951b90cb75'
8106: Apr 05 05:35:55 INFO: status: 1% (151453696/10737418240), sparse 0% (2129920), duration 3, 50/49 MB/s
.......................................................
8106: Apr 05 05:45:47 INFO: status: 99% (10631643136/10737418240), sparse 0% (93331456), duration 595, 31/31 MB/s 
8106: Apr 05 05:46:26 INFO: status: 100% (10737418240/10737418240), sparse 0% (104988672), duration 634, 2/2 MB/s 
8106: Apr 05 05:46:26 INFO: transferred 10737 MB in 634 seconds (16 MB/s) 
8106: Apr 05 05:46:28 INFO: stopping kvm after backup task 
8106: Apr 05 05:46:29 INFO: archive file size: 6.42GB 
8106: Apr 05 05:46:29 INFO: delete old backup '/mnt/pve/Filer.122.21.backups4/dump/vzdump-qemu-8106-2013_03_22-05_43_45.vma.lzo'
8106: Apr 05 05:46:46 INFO: Finished Backup of VM 8106 (00:10:56)

8107: Apr 05 05:46:46 INFO: Starting Backup of VM 8107 (qemu) 
8107: Apr 05 05:46:46 INFO: status = running 
8107: Apr 05 05:46:47 INFO: backup mode: snapshot 
8107: Apr 05 05:46:47 INFO: ionice priority: 7 
8107: Apr 05 05:46:47 INFO: creating archive '/mnt/pve/Filer.122.21.backups4/dump/vzdump-qemu-8107-2013_04_05-05_46_46.vma.lzo'
8107: Apr 05 05:46:47 INFO: started backup task '90fa9070-a3b7-441e-b408-c8c51b525a00'
8107: Apr 05 05:46:50 INFO: status: 2% (159645696/6442450944), sparse 0% (708608), duration 3, 53/52 MB/s    
.......................................................
8107: Apr 05 05:51:22 INFO: status: 100% (6442450944/6442450944), sparse 34% (2240045056), duration 275, 155/29 MB/s 
8107: Apr 05 05:51:22 INFO: transferred 6442 MB in 275 seconds (23 MB/s) 
8107: Apr 05 05:51:33 INFO: archive file size: 2.65GB 
8107: Apr 05 05:51:33 INFO: Finished Backup of VM 8107 (00:04:47)  


8200: Apr 05 05:59:39 INFO: Starting Backup of VM 8200 (qemu) 
8200: Apr 05 05:59:39 INFO: status = running 
8200: Apr 05 05:59:40 INFO: backup mode: snapshot 
8200: Apr 05 05:59:40 INFO: ionice priority: 7 
8200: Apr 05 05:59:40 INFO: creating archive '/mnt/pve/Filer.122.21.backups4/dump/vzdump-qemu-8200-2013_04_05-05_59_39.vma.lzo'
8200: Apr 05 05:59:40 INFO: started backup task '133217e3-9894-49ce-8820-7d3f767a8d67'
8200: Apr 05 05:59:43 INFO: status: 4% (259850240/6442450944), sparse 1% (108974080), duration 3, 86/50 MB/s 
8200: Apr 05 05:59:46 INFO: status: 6% (402915328/6442450944), sparse 1% (113086464), duration 6, 47/46 MB/s 
8200: Apr 05 05:59:49 INFO: status: 8% (526516224/6442450944), sparse 1% (118534144), duration 9, 41/39 MB/s 
8200: Apr 05 05:59:52 INFO: status: 10% (659095552/6442450944), sparse 1% (122650624), duration 12, 44/42 MB/s 
8200: Apr 05 05:59:55 INFO: status: 11% (764018688/6442450944), sparse 1% (122650624), duration 15, 34/34 MB/s 
8200: Apr 05 05:59:58 INFO: status: 13% (888012800/6442450944), sparse 1% (128090112), duration 18, 41/39 MB/s 
8200: Apr 05 06:00:01 INFO: status: 15% (1003487232/6442450944), sparse 2% (132206592), duration 21, 38/37 MB/s 
8200: Apr 05 06:00:04 INFO: status: 16% (1033895936/6442450944), sparse 2% (132206592), duration 24, 10/10 MB/s 
8200: Apr 05 06:00:07 INFO: status: 17% (1109852160/6442450944), sparse 2% (137633792), duration 27, 25/23 MB/s 
8200: Apr 05 06:00:10 INFO: status: 19% (1246756864/6442450944), sparse 2% (141737984), duration 30, 45/44 MB/s 
8200: Apr 05 06:00:13 INFO: status: 20% (1299972096/6442450944), sparse 2% (141737984), duration 33, 17/17 MB/s 
8200: Apr 05 06:01:26 INFO: status: 23% (1493237760/6442450944), sparse 4% (297713664), duration 106, 2/0 MB/s 
8200: Apr 05 06:01:29 INFO: status: 25% (1631518720/6442450944), sparse 4% (309272576), duration 109, 46/42 MB/s 
8200: Apr 05 06:01:32 INFO: status: 26% (1729495040/6442450944), sparse 4% (313470976), duration 112, 32/31 MB/s 
8200: Apr 05 06:01:44 INFO: status: 27% (1779433472/6442450944), sparse 4% (313876480), duration 124, 4/4 MB/s 
8200: Apr 05 06:01:47 INFO: status: 30% (1943273472/6442450944), sparse 4% (317964288), duration 127, 54/53 MB/s 
8200: Apr 05 06:01:50 INFO: status: 32% (2107768832/6442450944), sparse 4% (321945600), duration 130, 54/53 MB/s 
8200: Apr 05 06:01:53 INFO: status: 35% (2268921856/6442450944), sparse 5% (329244672), duration 133, 53/51 MB/s 
8200: Apr 05 06:01:56 INFO: status: 37% (2428698624/6442450944), sparse 5% (333377536), duration 136, 53/51 MB/s 
8200: Apr 05 06:01:59 INFO: status: 40% (2591096832/6442450944), sparse 5% (336326656), duration 139, 54/53 MB/s 
8200: Apr 05 06:02:02 INFO: status: 42% (2753757184/6442450944), sparse 5% (339742720), duration 142, 54/53 MB/s 
8200: Apr 05 06:02:05 INFO: status: 45% (2901082112/6442450944), sparse 5% (342732800), duration 145, 49/48 MB/s 
8200: Apr 05 06:02:39 ERROR: VM 8200 not running
8200: Apr 05 06:02:39 INFO: aborting backup job
8200: Apr 05 06:02:39 ERROR: VM 8200 not running
8200: Apr 05 06:02:40 ERROR: Backup of VM 8200 failed - VM 8200 not running
So I have failures on two machines, both against the same NFS server...
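
For anyone comparing several runs like the one above, here is a minimal sketch of a helper that reduces a vzdump log to one result line per VM. The function name `summarize_vzdump_log` is made up for illustration, and it assumes the log uses the "Finished Backup of VM" / "Backup of VM ... failed -" wording shown in this thread; other versions may word these lines differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): print one line per VM with the
# final backup result, reading a vzdump log on stdin.
# Assumes the final line for each VM looks like one of:
#   ... INFO: Finished Backup of VM <vmid> (hh:mm:ss)
#   ... ERROR: Backup of VM <vmid> failed - <reason>
summarize_vzdump_log() {
    sed -n \
        -e 's/^.*INFO: Finished Backup of VM \([0-9][0-9]*\).*$/\1 OK/p' \
        -e 's/^.*ERROR: Backup of VM \([0-9][0-9]*\) failed - \(.*[^ ]\) *$/\1 FAILED: \2/p'
}

# Usage (the file name is only an example):
#   summarize_vzdump_log < backup-report.txt
```

Run against the April 5 log, this should list 8105 and 8200 as failed with the reason "VM ... not running" and 8106 and 8107 as OK, which makes it easier to line up failures with the NFS target used that day.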

Regards
Vicente
 
I have seen this kind of mysterious failure before. My conclusion was that if the NFS storage is slow, the backup fails with the combination vma+lzo, whereas the combination vma+gzip always works. Maybe this is because gzip is a slower algorithm than lzo, but nevertheless the backup routine or the new vma should be able to handle this.

One workaround for slow NFS is to enable sync on the mounts, which seems to prevent vma+lzo from breaking.
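
To check which NFS mounts are currently missing the sync option, something like the following sketch might help. The function name `nfs_mounts_without_sync` is invented for this example; it only reads mount lines in the /proc/mounts layout and changes nothing. As far as I know the mount options for a Proxmox NFS storage come from the `options` line of that storage's entry in /etc/pve/storage.cfg, but please verify that against your version before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: list mount points of NFS mounts that do NOT carry the
# "sync" option. Input is /proc/mounts-style text on stdin:
#   device mountpoint fstype options dump pass
nfs_mounts_without_sync() {
    awk '$3 ~ /^nfs/ {
        n = split($4, opts, ",")
        found = 0
        # Exact match only, so "async" is not mistaken for "sync".
        for (i = 1; i <= n; i++) if (opts[i] == "sync") found = 1
        if (!found) print $2
    }'
}

# Usage on a live host:
#   nfs_mounts_without_sync < /proc/mounts
```

Any mount point printed here is mounted without sync and would be a candidate for the workaround described above.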
 

I switched back to lzo mode and set (in vzdump.conf) a bandwidth limit of 20000 KB/s, just to see if it helps...
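
For reference, a small sketch of how to confirm which limit a vzdump.conf-style file sets. It assumes the `bwlimit: <KB/s>` line format and the usual /etc/vzdump.conf location; the function name `get_bwlimit` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the bwlimit value (KB/s) from a vzdump.conf-style
# file. Assumes lines of the form "bwlimit: 20000"; lines starting with "#"
# are comments and are ignored. If the option appears twice, the last wins.
get_bwlimit() {
    sed -n 's/^[[:space:]]*bwlimit:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Usage:
#   get_bwlimit /etc/vzdump.conf
```

If this prints nothing, no limit is set in the file, and whatever limit appears in the backup log comes from somewhere else (for example the command line).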

but nevertheless the backup routine or the new vma should be able to handle this.
Of course I agree. I cannot understand how a problem in the backup can stop or corrupt (as reported by other users) a VM...
 
After changing the backup compression to gzip, my VM no longer stops/restarts. Thanks for the solution to my question!
 
Are you sure this has any effect? AFAIK Proxmox sets the bwlimit on the command line.
It works. See the log output:

Code:
INFO: starting new backup job: vzdump 8105 --remove 0 --mode snapshot  --compress lzo --storage Filer.122.21.backups4 --node proxmox177
INFO: Starting Backup of VM 8105 (qemu)
INFO: status = running
INFO: backup mode: snapshot
INFO: bandwidth limit: 20000 KB/s      <<<<<<<<<<<<<<<<<<<< 
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/Filer.122.21.backups4/dump/vzdump-qemu-8105-2013_04_10-18_21_03.vma.lzo'
INFO: started backup task 'e42e0564-0329-4c23-9aec-6eff812d07f8'

Regards
 
One of my installations suffers from the same "VM stopping" problem during backups on a system recently upgraded to 2.3. It fails unpredictably and randomly (about 1 in 3 times) on a single VM, which runs Windows 2003 Server Standard Edition on IDE (it was a P2V conversion). All other Windows VMs are just fine. Maybe the problem has something to do with the guest OS, such as how it handles high I/O latency (while being backed up), or maybe it is a driver problem. Starting the backup by hand is always successful. I don't really want to set a bandwidth limit, since it would unnecessarily slow down all other backup processes, and it really should not be necessary. I will try to upgrade the VM to Virtio and will report back here if it helps.
 
