[SOLVED] restarted nfs - web frontend stopped working

liska_

Member
Nov 19, 2013
Hi,
I currently have a three-node cluster: one node on 3.3 and two on 3.2.
It was running without any problems, but today I had to restart one NFS server.
I received some error logs about vanished connections, and the nodes stopped "seeing" each other in the web UI.
I experienced this a few months ago, but back then it was solved by turning the NFS server back on and restarting pvedaemon, pveproxy and pvestatd.
Not this time. I get no messages in syslog or anywhere else.
All shared storages are available on all nodes, but each node shows the other nodes marked red in its web frontend.
The directory /etc/pve is mounted, and both pvecm nodes and pvecm status show correct values. The problem is only in the UI.
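For reference, the service restart that fixed it last time was roughly this (a sketch for PVE 3.x on Debian Wheezy, which uses sysvinit-style service scripts; run on the affected node after the NFS server is back up):

```shell
# Restart the Proxmox daemons that render status in the web UI.
service pvedaemon restart    # API daemon
service pveproxy restart     # web frontend proxy
service pvestatd restart     # status collector (storage/node state)
```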

Thanks a lot for your help
 
OK, I tried restarting cman and pve-cluster on every node, and the console reported everything as OK.
But now I can see these errors in the logs of all nodes:
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Unloading all Corosync service engines.
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync extended virtual synchrony service
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync configuration service
Sep 23 10:43:20 cluster pmxcfs[10731]: [status] crit: cpg_dispatch failed: 2
Sep 23 10:43:20 cluster pmxcfs[10731]: [status] crit: cpg_leave failed: 2
Sep 23 10:43:20 cluster pmxcfs[10731]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Sep 23 10:43:20 cluster pmxcfs[10731]: [status] crit: cpg_dispatch failed: 2
Sep 23 10:43:20 cluster pmxcfs[10731]: [status] crit: cpg_leave failed: 2
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Sep 23 10:43:20 cluster pmxcfs[10731]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Sep 23 10:43:20 cluster pmxcfs[10731]: [confdb] crit: confdb_dispatch failed: 2
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync cluster config database access v1.01
Sep 23 10:43:20 cluster pmxcfs[10731]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync profile loading service
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: openais cluster membership service B.01.01
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: openais checkpoint service B.01.01
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: openais event service B.01.01
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: openais distributed locking service B.03.01
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: openais message service B.03.01
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync CMAN membership service 2.90
Sep 23 10:43:20 cluster pmxcfs[10731]: [quorum] crit: quorum_dispatch failed: 2
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Sep 23 10:43:20 cluster pmxcfs[10731]: [libqb] warning: epoll_ctl(del): Bad file descriptor (9)
Sep 23 10:43:20 cluster corosync[2794]: [SERV ] Service engine unloaded: openais timer service A.01.01
Sep 23 10:43:20 cluster corosync[2794]: [MAIN ] Corosync Cluster Engine exiting with status 0 at main.c:1893.
Sep 23 10:43:22 cluster pmxcfs[10731]: [quorum] crit: quorum_initialize failed: 6
Sep 23 10:43:22 cluster pmxcfs[10731]: [quorum] crit: can't initialize service
Sep 23 10:43:22 cluster pmxcfs[10731]: [confdb] crit: confdb_initialize failed: 6
Sep 23 10:43:22 cluster pmxcfs[10731]: [quorum] crit: can't initialize service
Sep 23 10:43:22 cluster pmxcfs[10731]: [dcdb] notice: start cluster connection
Sep 23 10:43:22 cluster pmxcfs[10731]: [dcdb] crit: cpg_initialize failed: 6
Sep 23 10:43:22 cluster pmxcfs[10731]: [quorum] crit: can't initialize service
Sep 23 10:43:22 cluster pmxcfs[10731]: [dcdb] notice: start cluster connection
Sep 23 10:43:22 cluster pmxcfs[10731]: [dcdb] crit: cpg_initialize failed: 6
Sep 23 10:43:22 cluster pmxcfs[10731]: [quorum] crit: can't initialize service
Sep 23 10:43:22 cluster pmxcfs[10731]: [status] crit: cpg_send_message failed: 9
Sep 23 10:43:22 cluster pmxcfs[10731]: [status] crit: cpg_send_message failed: 9
...

The last two messages repeat continuously.

pvecm nodes still shows all three nodes and pvecm status looks fine, but I cannot write to /etc/pve.
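A quick way to reproduce that symptom (a sketch; the exact pvecm status output differs between cman-based PVE 3.x and later versions):

```shell
# Check cluster membership and quorum as reported by the cluster stack.
pvecm status

# /etc/pve is the pmxcfs FUSE mount; it goes read-only when pmxcfs
# loses its corosync connection, even if pvecm still lists all nodes.
touch /etc/pve/.writetest \
  && rm /etc/pve/.writetest \
  || echo "/etc/pve is read-only - pmxcfs has lost its cluster link"
```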
 
I finally found a solution here: http://forum.proxmox.com/archive/index.php/t-16196.html
I had to stop pve-cluster, manually kill the dlm_controld and fenced processes, and then start pve-cluster again.
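The workaround as a command sequence (a sketch of the steps described above; killall by name is an assumption — you can equally find the PIDs with ps or pgrep and kill them individually):

```shell
# Stop pmxcfs cleanly first.
service pve-cluster stop

# Kill the stale fence/lock daemons left over from the cman restart;
# they were blocking pmxcfs from rejoining the corosync CPG.
killall dlm_controld fenced

# Start pmxcfs again; /etc/pve should become writable once quorum is seen.
service pve-cluster start
```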
Everything is working now, but it is strange that all of this started just from restarting an external NFS server. In the last few months I have had only two bigger problems with Proxmox, and both were related to a failure of external NFS storage. Apart from that it is an amazing project, especially now with the new noVNC console. It is incredible how much this "small" UI change improves the overall experience.

Anyway, I am attaching the output of pveversion -v.
The two older nodes:
proxmox-ve-2.6.32: 3.2-124 (running kernel: 2.6.32-28-pve)
pve-manager: 3.2-2 (running version: 3.2-2/82599a65)
pve-kernel-2.6.32-28-pve: 2.6.32-124
pve-kernel-2.6.32-26-pve: 2.6.32-114
pve-kernel-2.6.32-23-pve: 2.6.32-109
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.5-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.5-1
pve-cluster: 3.0-12
qemu-server: 3.1-15
pve-firmware: 1.1-2
libpve-common-perl: 3.0-14
libpve-access-control: 3.0-11
libpve-storage-perl: 3.0-19
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-6
vzctl: 4.0-1pve5
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 1.7-6
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

And the new node:
proxmox-ve-2.6.32: 3.2-136 (running kernel: 2.6.32-32-pve)
pve-manager: 3.3-1 (running version: 3.3-1/a06c9f73)
pve-kernel-2.6.32-32-pve: 2.6.32-136
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-1
pve-cluster: 3.0-15
qemu-server: 3.1-34
pve-firmware: 1.1-3
libpve-common-perl: 3.0-19
libpve-access-control: 3.0-15
libpve-storage-perl: 3.0-23
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-5
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1
 
