KVM backup host crash on Proxmox 4.0

dirk.nilius

Member
Nov 5, 2015
Berlin, Germany
Hi,

after upgrading a cluster from 3.4 -> 4.0 we see roughly a 60% chance of a complete system crash of a cluster node (possibly a reset after a kernel panic). This happens only while a KVM backup is running, and it affects more than one machine. The last log entry before the crash is the VM locking info; the next messages after that are the first kernel logs from the reboot. I had to disable backups temporarily.

Specs:

- 3 cluster nodes
- Backup via NFS to a FreeNAS
- version 4.0-57

Any ideas, or known issue?
 
do you use a separate network for your storage, cluster, etc.?

pls post:

> pveversion -v
> cat /etc/network/interfaces
> cat /etc/pve/storage.cfg
> qm config VMID (of the VM in question)
> cat /etc/hosts
 
Yep, the cluster is running on 192.168.13.x and the backup storage on 192.168.100.x.
 
proxmox-ve: 4.0-19 (running kernel: 4.2.3-2-pve)
pve-manager: 4.0-57 (running version: 4.0-57/cc7c2b53)
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-19
lvm2: 2.02.116-pve1
corosync-pve: 2.3.5-1
libqb0: 0.17.2-1
pve-cluster: 4.0-24
qemu-server: 4.0-35
pve-firmware: 1.1-7
libpve-common-perl: 4.0-36
libpve-access-control: 4.0-9
libpve-storage-perl: 4.0-29
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.4-12
pve-container: 1.0-20
pve-firewall: 2.0-13
pve-ha-manager: 1.0-13
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.4-3
lxcfs: 0.10-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve4~jessie

----

# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage part of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

iface eth0 inet manual

auto eth1
iface eth1 inet static
        address 192.168.100.2
        netmask 255.255.255.0

auto vmbr0
iface vmbr0 inet static
        address 192.168.13.12
        netmask 255.255.255.0
        gateway 192.168.13.253
        bridge_ports eth0
        bridge_stp off
        bridge_fd 0

----

dir: local
        path /var/lib/vz
        maxfiles 1
        content backup,iso,vztmpl

nfs: FreeNAS
        server 192.168.100.1
        export /mnt/zfsvol/Proxmox
        path /mnt/pve/FreeNAS
        content vztmpl,backup,iso
        options vers=3
        maxfiles 3

zfspool: zfs
        pool rpool
        content rootdir,images

----

balloon: 2048
boot: dcn
bootdisk: virtio0
cores: 4
ide2: none,media=cdrom
memory: 8192
name: ckc-b-kdvdb
net0: virtio=52:54:00:E4:2A:6F,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
sockets: 1
virtio0: zfs:vm-100-disk-1,size=100G

----

127.0.0.1 localhost.localdomain localhost
192.168.13.12 ckc-b-p0003.ckc.de ckc-b-p0003 pvelocalhost

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
 
please try without using ballooning in your VM.
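
A minimal way to test that (VMID 100 is only a placeholder taken from the config posted above): setting the balloon target to 0 disables the balloon driver for that VM; a full stop/start of the VM is probably needed for the change to take effect.

qm set 100 -balloon 0
qm config 100 | grep balloon   # verify the setting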
 
Alright, I'll try that. Is there a general recommendation not to use ballooning? Or will this specific issue be fixed in the future?

I remember some problems with ballooning (depends on the OS of the VM). Try it.

What OS do you run?
 
I have had this problem backing up a FreeBSD 10 VM that was using ZFS send inside the VM. I ended up using ZFS send on the PVE host instead. I could never get vzdump to work on this one, as there was 2 TB of disk in use and it took more than 24 hours.
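
In case anyone wants to try the same workaround, here is a rough sketch of a host-side backup via ZFS send (dataset, snapshot and host names are only placeholders based on the zfspool storage shown earlier in this thread):

zfs snapshot rpool/vm-100-disk-1@nightly
zfs send rpool/vm-100-disk-1@nightly | ssh backuphost zfs receive tank/backup/vm-100-disk-1
# later runs: zfs send -i rpool/vm-100-disk-1@previous rpool/vm-100-disk-1@nightly | ... for incremental transfers
# the snapshot is only crash-consistent; quiesce or shut down the guest if you need more than that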
 
Poking in the dark here:
The common denominator seems to be:
1) the source file system
2) the destination file system (as in your FreeNAS)

I vaguely remember reading something on this support forum during the last 3 weeks about a guy who had issues while restoring VMs from a NAS, causing the node to hang and then crash. I can't seem to find the thread ... I might be mixing multiple topics in my brain (so take it with a grain of salt).
 
@dirk:

when you say the host crashes, do you mean it hangs and does not respond to commands, or do you mean it rebooted by itself?

If the node reboots by itself there is usually not much we can do: either a hardware error detected by a fencing device, or a CPU overheating.

I am also doing backups over NFS to a FreeNAS instance, and I have not noticed any problems so far.
 
It does a reboot. But I don't think this has anything to do with the hardware. I had Proxmox 3.x running for nearly two years without problems, and after upgrading to 4.0 I see this on ALL cluster nodes (I have three of them). It is more likely that the kernel detected a fatal problem, probably a kernel panic, and triggered a reboot.
 
Indeed the kernel will reboot in case of a panic, but only if you configured it to do so.
What is the output of sysctl kernel.panic ?
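
For reference (the value shown is only an example), kernel.panic is the number of seconds the kernel waits before rebooting after a panic; 0 means it halts and the panic message stays on the console:

sysctl kernel.panic
# e.g. "kernel.panic = 0"  -> halt on panic, no automatic reboot
# a positive value N means: reboot N seconds after the panic

sysctl -w kernel.panic=0   # keep the panic on screen for debugging (not persistent across reboots)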
 
Hi,
Does the problem happen when you start the backup manually via the vzdump command line?

You can copy the backup line contained in /etc/cron.d/vzdump and run that; it should look like this:

vzdump 100 102 103 --storage freenas

vzdump output is quite verbose if you set --quiet 0

please post the output of the vzdump command here
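
For example, a single-VM test run with full output (VMID 100 is taken from the config posted above, adjust to your setup) would be:

vzdump 100 --storage FreeNAS --mode snapshot --compress lzo --quiet 0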
 
Hi,

So far the crash appears only at the nightly backup job. It looks like this:

vzdump --all 1 --mailto nilius@ckc.de --compress lzo --storage FreeNAS --quiet 1 --mailnotification failure --mode snapshot

I still think this is a kernel issue. I can't imagine any scenario where a user space application could lead to this behavior. I updated to the latest pvetest packages with a newer kernel yesterday. I'll try to find out whether that changes anything. If I have news I'll let you know.
 
So far the crash appears only at the nightly backup job. It looks like this:
vzdump --all 1 --mailto nilius@ckc.de --compress lzo --storage FreeNAS --quiet 1 --mailnotification failure --mode snapshot
[...]


  • When you execute the command above in a shell, does it crash as well?
  • Does it not send the failure mail?
  • What happens when you run the command with "--quiet 0" instead of "--quiet 1"? Any helpful hints given?

If it crashes in all of these cases, you could narrow it down even more (see the sketch below):
  • Try the backup modes snapshot|suspend|stop <-- if they all crash, it's probably not one of them.
  • Try with --compress (0 | 1 | gzip | lzo) <-- if they all crash, it's probably not one of them.
  • Try with a single <vmid> instead of "--all" <-- narrow down at which VM it crashes.

If all that fails, the only thing I can think of suggesting is to memtest your physical memory banks for errors. Maybe vzdump, while backing up, accesses an address space of your memory that is normally not touched, and that part is broken.
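
As a quick sketch of that narrowing-down loop (VMID 100 is only a placeholder), something like this walks through all mode/compression combinations one after another:

for mode in snapshot suspend stop; do
    for comp in 0 gzip lzo; do
        echo "=== mode=$mode compress=$comp ==="
        vzdump 100 --storage FreeNAS --mode $mode --compress $comp --quiet 0
    done
done
# if the node crashes, note which combination was running when it went down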
 
I have the same crashes after migrating to 4.0.
 
