I have set up a two-node cluster using Proxmox 2.1 with DRBD. The cluster itself works fine. I have configured it to run snapshot backups to our company NFS servers.
Sometimes it works perfectly, but sometimes one of the cluster nodes stops working correctly: the node loses its cluster connection. It seems to me that this happens when the NFS server is under heavy load. The backup log shows the following messages:
Code:
INFO: starting new backup job: vzdump 107 --remove 0 --mode snapshot --compress lzo --storage ubuntu-nfs-1 --node proxmox1
INFO: Starting Backup of VM 107 (qemu)
INFO: status = running
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: suspend vm to make snapshot
INFO: Logical volume "vzsnap-proxmox1-0" created
INFO: Logical volume "vzsnap-proxmox1-1" created
INFO: resume vm
INFO: vm is online again after 1 seconds
INFO: creating archive '/mnt/pve/ubuntu-nfs-1/dump/vzdump-qemu-107-2012_06_07-16_44_20.tar.lzo'
INFO: adding '/mnt/pve/ubuntu-nfs-1/dump/vzdump-qemu-107-2012_06_07-16_44_20.tmp/qemu-server.conf' to archive ('qemu-server.conf')
INFO: adding '/dev/drbdvg/vzsnap-proxmox1-0' to archive ('vm-disk-ide0.raw')
INFO: adding '/dev/drbdvg/vzsnap-proxmox1-1' to archive ('vm-disk-virtio0.raw')
INFO: Total bytes written: 214756756480 (64.00 MiB/s)
INFO: archive file size: 69.01GB
INFO: unable to open file '/etc/pve/nodes/proxmox1/qemu-server/107.conf.tmp.314129' - Device or resource busy
trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 8 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 16 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 32 seconds
trying to aquire cfs lock 'storage-drbdvg' ...ERROR: got lock request timeout
trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 8 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 16 seconds
trying to aquire cfs lock 'storage-drbdvg' ...INFO: lvremove failed - trying again in 32 seconds
trying to aquire cfs lock 'storage-drbdvg' ...ERROR: got lock request timeout
INFO: Finished Backup of VM 107 (00:58:13)
INFO: Backup job finished successfully
TASK OK
After this the node turned red in the web interface and was no longer accessible from the remaining node.
The syslog of the crashed node was flooded with:
Code:
...
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
Jun 9 06:47:35 proxmox1 pmxcfs[586854]: [status] crit: cpg_send_message failed: 9
...
I could remove the snapshots with lvremove, but I was not able to bring the node back to normal operation. Only a reboot of the node brought it back into operation.
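For reference, the two leftover snapshot LVs are the ones named in the backup log above; the exact invocation is from memory, so take it as approximate:
Code:
# remove the stale vzdump snapshot LVs left behind on the DRBD volume group
lvremove -f /dev/drbdvg/vzsnap-proxmox1-0
lvremove -f /dev/drbdvg/vzsnap-proxmox1-1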
The NFS storage is connected with these options in storage.cfg:
Code:
nfs: ubuntu-nfs-1
        path /mnt/pve/ubuntu-nfs-1
        server 195.50.143.109
        export /backup-ext3-sdc/proxmox
        options vers=3,rw,rsize=32768,wsize=32768
        content images,iso,vztmpl,rootdir,backup
        maxfiles 4
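For clarity, that entry should correspond to roughly the following manual mount (server, export, and options taken from the storage.cfg entry above; the mount point is the one Proxmox uses):
Code:
# sketch of the equivalent manual mount, not copied from the node
mount -t nfs -o vers=3,rw,rsize=32768,wsize=32768 \
    195.50.143.109:/backup-ext3-sdc/proxmox /mnt/pve/ubuntu-nfs-1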
The output of pveversion is:
Code:
#pveversion --verbose
pve-manager: 2.1-1 (pve-manager/2.1/f9b0f63a)
running kernel: 2.6.32-12-pve
proxmox-ve-2.6.32: 2.1-68
pve-kernel-2.6.32-12-pve: 2.6.32-68
pve-kernel-2.6.32-7-pve: 2.6.32-60
lvm2: 2.02.95-1pve2
clvm: 2.02.95-1pve2
corosync-pve: 1.4.3-1
openais-pve: 1.1.4-2
libqb: 0.10.1-2
redhat-cluster-pve: 3.1.8-3
resource-agents-pve: 3.9.2-3
fence-agents-pve: 3.1.7-2
pve-cluster: 1.0-26
qemu-server: 2.0-39
pve-firmware: 1.0-16
libpve-common-perl: 1.0-27
libpve-access-control: 1.0-21
libpve-storage-perl: 2.0-18
vncterm: 1.0-2
vzctl: 3.0.30-2pve5
vzprocps: 2.0.11-2
vzquota: 3.0.12-3
pve-qemu-kvm: 1.0-9
ksm-control-daemon: 1.1-1
My questions are: Is there a way to reintegrate the failed node into the cluster without rebooting after such a crash occurs? And, much more importantly: how can I avoid such a crash?
Currently, snapshot backups are not usable for us.
Any help is appreciated.