Proxmox hypervisor crash during backup

deludi

New Member
Oct 14, 2013
Hello all,

I have a serious problem with Proxmox 3.x.
When I back up a larger VM (> 50GB), the Proxmox hypervisor crashes about half the time.
I have tested this on proxmox 3.0 and 3.1.
The vm is on a napp-it nfs store with gigabit interconnect.
As a backup target I have tested both an NFS store and a CIFS store on the same storage box.
The nfs target causes relatively more crashes than the cifs target.
The hardware is premium hardware: supermicro cases, intel xeons, ecc memory, etc.
I have 3 Proxmox hypervisor boxes and can reproduce the error on all 3 boxes.
I have reinstalled proxmox multiple times without result.
When the hypervisor crashes, the screen is completely black in IPMI.
The only possible action is to reboot from ipmi.
I can provide logs, etc. if needed.
Thank you in advance.

Dirk Adamsky
 
Read the logs and post the relevant parts here. Also include a sample VM config, all the info about your NFS server, and the output of pveversion -v.

just to note, this is NOT expected.
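
For anyone following along, the details requested above can be gathered in one go. This is only a rough sketch, assuming Python 3 is available on the node; the `collect` helper and the VM ID 101 are placeholders made up for illustration, while `pveversion -v` and `qm config <vmid>` are the standard Proxmox commands:

```python
import shutil
import subprocess


def collect(vmid: int) -> dict[str, str]:
    """Run the diagnostic commands and return their output keyed by name."""
    commands = {
        "pveversion": ["pveversion", "-v"],
        "vm_config": ["qm", "config", str(vmid)],
    }
    report = {}
    for name, argv in commands.items():
        # Skip gracefully when run on a machine that is not a Proxmox node.
        if shutil.which(argv[0]) is None:
            report[name] = f"{argv[0]}: not found on this host"
            continue
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        report[name] = result.stdout + result.stderr
    return report


# 101 is a placeholder VM ID; substitute the VM that fails to back up.
for name, output in collect(101).items():
    print(f"=== {name} ===\n{output}")
```

The NFS server details still have to be collected on the storage box itself.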
 
Hi Tom,

I will post them asap.
The behaviour is indeed strange.
I have 5 or 6 HP MicroServer boxes running Proxmox 2.3 and Proxmox 3; each has one 100GB Windows 7 VM
and backs up to Netgear NAS boxes (NFS async) without problems.
 
Hi Tom,

Here is the messages log and a screenshot of the napp-it NFS settings.
All further nappit settings are default, no dedup, encryption, etc.
There is a dedicated zil ssd on the vm storage.
Please let me know if you need more logs and/or screenshots.
 

Attachments

  • nfs-settings-nappit.jpg (39.3 KB)
  • messages.zip (19.1 KB)

I have upgraded one of the three Proxmox nodes to the latest kernel: 2.6.32-23-pve
I then started a backup at 17:00.
Between 17:04 and 17:08 the node restarted.
Here is the daemon log:

Oct 23 17:00:01 proxmox3 vzdump[9943]: <root@pam> starting task UPID:proxmox3:000026D9:00051E5D:5267E471:vzdump::root@pam:
Oct 23 17:00:01 proxmox3 vzdump[9945]: INFO: starting new backup job: vzdump 101 --quiet 1 --mode snapshot --compress lzo --storage storage02-backupstorage01
Oct 23 17:00:01 proxmox3 vzdump[9945]: INFO: Starting Backup of VM 101 (qemu)
Oct 23 17:00:02 proxmox3 qm[9950]: <root@pam> update VM 101: -lock backup
Oct 23 17:00:06 proxmox3 ntpd[2353]: Listen normally on 12 tap101i0 fe80::547f:98ff:fede:ebcb UDP 123
Oct 23 17:00:06 proxmox3 ntpd[2353]: peers refreshed
Oct 23 17:04:16 proxmox3 rrdcached[2397]: flushing old values
Oct 23 17:04:16 proxmox3 rrdcached[2397]: rotating journals
Oct 23 17:04:16 proxmox3 rrdcached[2397]: started new journal /var/lib/rrdcached/journal/rrd.journal.1382540656.349880
Oct 23 17:04:17 proxmox3 pveproxy[3305]: worker 7454 finished
Oct 23 17:04:17 proxmox3 pveproxy[3305]: starting 1 worker(s)
Oct 23 17:04:17 proxmox3 pveproxy[3305]: worker 10397 started
Oct 23 17:08:39 proxmox3 ntpd[2338]: ntpd 4.2.6p5@1.2349-o Sat May 12 09:54:55 UTC 2012 (1)
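
A quick way to narrow down the crash window in a log like this is to look for quiet periods between consecutive entries. The following is a minimal sketch, not a tool that ships with Proxmox: the `find_gaps` helper, the 120-second threshold, and the year 2013 (syslog timestamps omit the year) are all assumptions, and the sample lines are copied from the log above.

```python
from datetime import datetime

LOG = """\
Oct 23 17:00:06 proxmox3 ntpd[2353]: peers refreshed
Oct 23 17:04:16 proxmox3 rrdcached[2397]: flushing old values
Oct 23 17:04:17 proxmox3 pveproxy[3305]: worker 10397 started
Oct 23 17:08:39 proxmox3 ntpd[2338]: ntpd 4.2.6p5@1.2349-o Sat May 12 09:54:55 UTC 2012 (1)
"""


def find_gaps(text, min_gap_s=120, year=2013):
    """Return (previous_time, next_time, seconds) for each quiet period."""
    stamps = []
    for line in text.splitlines():
        try:
            # The first 15 characters hold the syslog timestamp, e.g. "Oct 23 17:00:06".
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue  # skip lines that do not start with a timestamp
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = int((cur - prev).total_seconds())
        if delta >= min_gap_s:
            gaps.append((prev, cur, delta))
    return gaps


for prev, cur, delta in find_gaps(LOG):
    print(f"{prev:%H:%M:%S} -> {cur:%H:%M:%S}: {delta} s without log entries")
```

A gap on its own does not prove a crash. Here the second gap ends with ntpd printing its version banner under a new PID (2338 instead of 2353), which is what marks the reboot, so the node most likely went down between 17:04:17 and 17:08:39.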


I also (like in the other thread) suspect that this is a kernel problem.
I am working around the problem by not using Proxmox backup and instead copying the VM at the storage level.
kinda sucks though...

Regards,

Dirk Adamsky
 
Did you try to do a backup on local storage and see if it creates error during backup?

 
Thank you for the tip.
Unfortunately the 3 Proxmox nodes each have only one SSD boot disk (60GB, 60GB and 40GB).
The backup problem is with large VMs (~100GB).
I will try to connect an extra HDD to one node, but that will be next week when I am in the datacenter.
 
I did another test this morning:
I ran a backup from the command line with the bwlimit argument added:

vzdump 110 --remove 0 --mode snapshot --compress lzo --storage storage02-backupstorage01 --node proxmox1 --bwlimit 30000

Unfortunately the proxmox1 hypervisor crashed again (the backup of the VM was at 41%).
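
To put the bwlimit value in perspective: vzdump takes it in KiB/s, so 30000 is roughly 29 MiB/s, well below what a gigabit link can carry. A small back-of-the-envelope sketch follows; the 100 GiB size is an assumption based on the "~100GB" figure mentioned earlier, and the estimate ignores compression and skipped zero blocks, so it is an upper bound at best.

```python
def seconds_at_limit(size_gib: float, bwlimit_kib_s: int) -> float:
    """Time to move size_gib of data at a steady bwlimit_kib_s KiB/s."""
    return size_gib * 1024 * 1024 / bwlimit_kib_s


# Assumed VM size of about 100 GiB; the limit is the one used in the command above.
total = seconds_at_limit(100, 30000)
print(f"full pass: {total / 60:.0f} min, 41% mark: {0.41 * total / 60:.0f} min")
```

Since the node still went down while throttled to this rate, plain link saturation looks like an unlikely explanation.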
 
Hi symmcom,

Thank you for your input.
I ran this job through PuTTY on the first of our 3 Proxmox nodes. The VM is on an NFS share (OmniOS + napp-it) and the backup target is another NFS share on the same storage box.
My colleague and I have decided to use napp-it (ZFS) for backups. We have ~500GB of VMs and made a napp-it local replication task from the VM volume to a local backup volume (on another pool).
The task ran in about 22 minutes (360MB/s). This task will be scheduled weekly.
Once we have rebuilt our second storage box from FreeNAS to napp-it, the replication task will target the other machine instead of making a local copy.
We will not use Proxmox backup because of the above problems.
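
As a sanity check on the replication figures above, the reported rate and duration can be multiplied back into a data volume. This is a trivial sketch using decimal units (1 GB = 1000 MB); the helper name is made up for illustration.

```python
def transferred_gb(rate_mb_s: float, minutes: float) -> float:
    """Data volume in GB moved at rate_mb_s MB/s over the given minutes."""
    return rate_mb_s * minutes * 60 / 1000


print(f"{transferred_gb(360, 22):.1f} GB")
```

That works out to about 475 GB, which is consistent with the "~500GB" of VMs.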

Regards,

Dirk Adamsky
 
@deludi, I think what everybody else is trying to tell you is: "if you could try to back up locally, you could determine whether the NFS backup is involved in the crash or not".
Basically, you could take the network (and perhaps remote host issues) out of the backup job in order to find out what is causing your issue...
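
A local test of this kind could look roughly like the sketch below. It only builds and prints the command rather than running it; `--dumpdir` is the vzdump option for writing to a plain directory instead of a configured storage, while the helper name and the `/mnt/localtest` mount point are hypothetical.

```python
import shlex


def local_backup_cmd(vmid: int, dumpdir: str) -> str:
    """Build a vzdump command line that writes the backup to a local directory."""
    argv = [
        "vzdump", str(vmid),
        "--mode", "snapshot",
        "--compress", "lzo",
        "--dumpdir", dumpdir,  # hypothetical mount point of a local disk
    ]
    return shlex.join(argv)


print(local_backup_cmd(110, "/mnt/localtest"))
```

Note that the VM's disk would still be read over NFS in this test; only the write side becomes local, so a crash here would point at the read path rather than the backup target.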

Marco
 
Hi Marco,

The 3 Proxmox hypervisors only have small SSD boot disks (60GB, 60GB and 40GB).
They are in a datacenter 60km away....
I do have several other single-node Proxmox setups with local storage that back up large VMs without problems.
The issue is that I need my VMs on shared storage (in this setup an NFS share) because it's a 3-node Proxmox cluster.
The current Proxmox kernel has a problem with vzdumps of large VMs when both the VM and the backup target are on an NFS share.
I have also tested with the VMs on an NFS share and the backup target on a CIFS share: same result.
I have tested this multiple times: the backup halts halfway, the Proxmox node crashes, and the other VMs are failed over to the other nodes.
The cluster works OK as long as I do not start vzdump on a large VM.
As stated above, I will now make my backups at the storage level with ZFS replication jobs.
I will try the Proxmox backup again with newer kernels (proxmox 3.5), but for now I have given up and opted for ZFS/napp-it backups.

Regards,

Dirk