Hi,
I am having problems backing up my guests. My setup consists of two servers; on one of them runs a virtual guest with a pass-through controller, running Debian installed on a local drive and serving NFS. (Just a side note: in my benchmarks, the NFS performance difference between ZFS on OpenIndiana and ZFS on Debian was at most about 5% in my case.)
When I back up just a few machines (KVM + OpenVZ), there is no problem at all. But if I back up more than roughly 5 guests, or all 25 of them, the NFS Debian guest suddenly stops working - every time at a different ID, percentage, or technology (it happens with both KVM guests and OpenVZ containers). Nothing shows up in the logs on either the host or the guest. The guest simply hangs and it is no longer possible to connect to it via the console.
I applied the latest patch I found in a different thread here, but unfortunately nothing has changed.
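As a possible workaround I am considering throttling vzdump so the NFS guest is not overwhelmed; a minimal sketch of /etc/vzdump.conf (the bwlimit and ionice values are just guesses for my setup, not recommendations):

# /etc/vzdump.conf - global vzdump defaults
# limit backup bandwidth to ~50 MB/s (value is in KB/s)
bwlimit: 51200
# run backups with low best-effort I/O priority
ionice: 7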
The biggest problem is that whenever any shared storage becomes unavailable (I tested with NFS, Ceph, and Gluster), my cluster gets disconnected and the GUI becomes barely usable, since no guest names or states are shown. That is really annoying, because I have not found any easy way to get the cluster connected again. It does not help to unmount and remove the faulty storages, delete them from storage.cfg, or restart pvedaemon or pvestatd. I have to reboot one of my two servers, or bring the storage back online, to get the cluster to rejoin. It also happens with external file servers, so I believe it is a bug in PVE.
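For reference, the NFS storage entry I end up deleting from /etc/pve/storage.cfg looks roughly like this (server address and export path are placeholders, the storage name matches the mount point seen in the command below):

nfs: skladka_virtualy
        path /mnt/pve/skladka_virtualy
        server 192.168.1.10
        export /tank/virtualy
        content images,iso,backup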
After such an outage it is not possible to start the NFS-backed guests again; I get:
start failed: command '/usr/bin/kvm -id 100 -chardev 'socket,id=qmp,path=/var/run/qemu-server/100.qmp,server,nowait' -mon 'chardev=qmp,mode=control' -vnc unix:/var/run/qemu-server/100.vnc,x509,password -pidfile /var/run/qemu-server/100.pid -daemonize -name indian -smp 'sockets=2,cores=2' -nodefaults -boot 'menu=on' -vga cirrus -cpu host,+x2apic -k en-us -m 16384 -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' -device 'pci-assign,host=03:00.0,id=hostpci0,bus=pci.0,addr=0x10' -device 'virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3' -drive 'file=/mnt/pve/skladka_virtualy/template/iso/debian-7.2.0-amd64-CD-1.iso,if=none,id=drive-ide2,media=cdrom,aio=native' -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=200' -drive 'file=/var/lib/vz/images/100/vm-100-disk-1.qcow2,if=none,id=drive-virtio0,format=qcow2,aio=native,cache=none' -device 'virtio-blk-pci,drive=drive-virtio0,id=virtio0,bus=pci.0,addr=0xa,bootindex=100' -netdev 'type=tap,id=net0,ifname=tap100i0,script=/var/lib/qemu-server/pve-bridge' -device 'e1000,mac=02:BF:56:9B:10:7C,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300' -netdev 'type=tap,id=net1,ifname=tap100i1,script=/var/lib/qemu-server/pve-bridge' -device 'vmxnet3,mac=62:93:03:92:18:48,netdev=net1,bus=pci.0,addr=0x13,id=net1,bootindex=301'' failed: got timeout
I have to bring the tap interfaces down manually (a small helper for this is sketched below the commands):
ip link set tap100i0 down
ip link set tap100i1 down
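Since this happens repeatedly, I put together a small helper to bring down all leftover tap interfaces of a VM in one go (this assumes the stale interfaces always follow PVE's tap<VMID>i<N> naming; adjust if yours differ):

#!/bin/sh
# cleanup-taps.sh - bring down leftover tap interfaces of a stuck VM
# Usage: sh cleanup-taps.sh 100
VMID="$1"
if [ -z "$VMID" ]; then
    echo "usage: $0 <vmid>" >&2
    exit 1
fi
# list all interfaces named tap<VMID>i* and bring each one down
for tap in $(ip -o link show | awk -F': ' '{print $2}' | grep "^tap${VMID}i"); do
    echo "bringing down ${tap}"
    ip link set "${tap}" down
done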
After this it is possible to start that guest again, and after some time the cluster gets connected again and I can continue my work. I am in the middle of migrating from ESXi to Proxmox and would be very glad to find a solution to this problem, because Proxmox is much better than ESXi in my experience. Thanks a lot for your work, and for your help as well.
pveversion -v
proxmox-ve-2.6.32: 3.2-124 (running kernel: 2.6.32-28-pve)
pve-manager: 3.2-2 (running version: 3.2-2/82599a65)
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-6
ksm-control-daemon: 1.1-1
glusterfs-client: 3.4.2-1