Red machine icon

LnxBil

Hi,

I stumbled upon a problem I have never had before, and I could not find any useful information about it. I currently have a 6-node PVE 3.4 cluster (previously 7 nodes). One machine got fenced due to a ZFS-related stuck kernel problem, and since then half of my cluster is shown as offline (red machine icon). The VMs are still running, and I can start, stop, and migrate them via the command line, but the GUI is not working as expected. The GUI still updates hardware information such as used RAM and CPU, but it does not update the RRD graphs; these have been blank since the fencing of the other node.

There are no obvious entries in the log files I checked; it seems only the GUI is not working.

clustat shows a normal cluster:
Code:
Cluster Status for cluster @ Tue Nov  3 09:11:20 2015
Member Status: Quorate


 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 proxmox1                                                            1 Online, rgmanager
 proxmox2                                                            2 Online, rgmanager
 proxmox3                                                            3 Online, Local, rgmanager
 apu-01                                                              4 Online, rgmanager
 apu-02                                                              5 Online, rgmanager
 proxmox4                                                            7 Online, rgmanager

and so does pvecm status:

Code:
Version: 6.2.0
Config Version: 62
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 2988
Membership state: Cluster-Member
Nodes: 6
Expected votes: 6
Total votes: 6
Node votes: 1
Quorum: 4  
Active subsystems: 7
Flags: 
Ports Bound: 0 11 177  
Node name: proxmox3
Node ID: 3
Multicast addresses: 239.192.52.104 
Node addresses: 10.192.0.243

Here is my pveversion -v:

Code:
root@proxmox3 ~ > pveversion  -v
proxmox-ve-2.6.32: 3.4-160 (running kernel: 3.10.0-11-pve)
pve-manager: 3.4-9 (running version: 3.4-9/4b51d87a)
pve-kernel-2.6.32-40-pve: 2.6.32-160
pve-kernel-3.10.0-11-pve: 3.10.0-36
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-3
pve-cluster: 3.0-18
qemu-server: 3.4-6
pve-firmware: 1.1-4
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-33
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.2-11
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

I have already tried restarting some services, such as pve-manager, pveproxy, pvedaemon, and pvestatd.
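For completeness, PVE 3.x still uses sysvinit scripts, so those restarts look roughly like this (just a sketch; pvestatd is the daemon that collects the node/VM status and RRD data shown in the GUI):

Code:
# restart the status- and GUI-related daemons on the affected node (PVE 3.x, sysvinit style)
service pvestatd restart
service pvedaemon restart
service pveproxy restart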
 
Did you also try restarting pve-cluster?
 
No, I didn't try that; the fuse filesystem was working correctly.
I have tried it now, but nothing changed. Restarting the whole machine worked for one node, but I do not want to reboot every time a problem arises; it's not Windows.
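For reference, a quick way to check whether the pmxcfs fuse mount is still responsive would be something like this (a sketch, not taken from the thread; .members is assumed to be one of the pmxcfs status files):

Code:
# verify the cluster filesystem is mounted and answers without hanging
mount | grep /etc/pve
time ls /etc/pve/nodes
time cat /etc/pve/.members    # lists the cluster members as pmxcfs sees them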
 
Sometimes you have to restart pve-cluster on all nodes in the cluster.
 
I tried that as well, but nothing changed:

Code:
$ for i in proxmox1 proxmox2 proxmox3 proxmox4 apu-01 apu-02; do echo $i; ssh ${i} time service pve-cluster restart; done

proxmox1
Restarting pve cluster filesystem: pve-cluster.


real    0m12.066s
user    0m0.006s
sys     0m0.021s
proxmox2
Restarting pve cluster filesystem: pve-cluster.


real    0m2.047s
user    0m0.006s
sys     0m0.011s
proxmox3
Restarting pve cluster filesystem: pve-cluster.


real    0m12.058s
user    0m0.006s
sys     0m0.019s
proxmox4
Restarting pve cluster filesystem: pve-cluster.


real    0m1.770s
user    0m0.015s
sys     0m0.010s
apu-01
Restarting pve cluster filesystem: pve-cluster.


real    0m1.305s
user    0m0.028s
sys     0m0.045s
apu-02
Restarting pve cluster filesystem: pve-cluster.


real    0m12.212s
user    0m0.050s
sys     0m0.071s

The different runtimes are strange. The nodes go offline while the service is being restarted and come back online afterwards, but the ones that were already marked offline stay that way.
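One way to cross-check what the GUI is actually being fed, independent of the browser, is to query the API directly (a sketch using the standard pvesh tool; .rrd is assumed to be the pmxcfs status file holding the latest RRD updates):

Code:
# ask the API for the node status the web GUI displays
pvesh get /cluster/status
# look at the most recent RRD updates pmxcfs has received
cat /etc/pve/.rrd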
 
Can you please send me the /etc/pve/cluster.conf?
 
Hi Wolfgang,

The cluster is working perfectly and everything is fine as long as you do not use the GUI.

Here is the cluster.conf:

Code:
<?xml version="1.0"?>
<cluster config_version="62" name="cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_ilo_mp" identity_file="/root/.ssh/id_rsa" ipaddr="1.1.0.246" login="fence" login_timeout="30" name="fence_proxmox1" power_wait="10" secure="true"/>
    <fencedevice agent="fence_ilo_mp" identity_file="/root/.ssh/id_rsa" ipaddr="1.1.0.244" login="fence" login_timeout="30" name="fence_proxmox2" power_wait="10" secure="true"/>
    <fencedevice agent="fence_ilo_mp" identity_file="/root/.ssh/id_rsa" ipaddr="1.1.0.242" login="fence" login_timeout="30" name="fence_proxmox3" power_wait="10" secure="true"/>
    <fencedevice agent="fence_ilo_mp" identity_file="/root/.ssh/id_rsa" ipaddr="1.1.0.253" login="fence" login_timeout="30" name="fence_apu_01" power_wait="10" secure="true"/>
    <fencedevice agent="fence_ilo_mp" identity_file="/root/.ssh/id_rsa" ipaddr="1.1.0.252" login="fence" login_timeout="30" name="fence_apu_02" power_wait="10" secure="true"/>
    <fencedevice agent="fence_ipmilan" ipaddr="10.1.0.202" login="ADMIN" passwd="BLABLUBB" name="fence_proxmox4" power_wait="5" />
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox1" nodeid="1" votes="1">
      <fence>
        <method name="pve">
          <device action="reboot" name="fence_proxmox1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox2" nodeid="2" votes="1">
      <fence>
        <method name="pve">
          <device action="reboot" name="fence_proxmox2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox3" nodeid="3" votes="1">
      <fence>
        <method name="pve">
          <device action="reboot" name="fence_proxmox3"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="apu-01" votes="1" nodeid="4">
      <fence>
        <method name="pve">
          <device action="reboot" name="fence_apu_01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="apu-02" votes="1" nodeid="5">
      <fence>
        <method name="pve">
          <device action="reboot" name="fence_apu_02"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox4" votes="1" nodeid="7">
       <fence>
        <method name="pve">
          <device action="reboot" name="fence_proxmox4"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" domain="main-cluster-group-1" recovery="relocate" vmid="1000"/>
    <pvevm autostart="1" domain="main-cluster-group-1" recovery="relocate" vmid="1001"/>
    <pvevm autostart="1" domain="main-cluster-group-3" recovery="relocate" vmid="1003"/>
    <pvevm autostart="1" domain="main-cluster-group-2" recovery="relocate" vmid="1004"/>
    <pvevm autostart="1" domain="main-cluster-group-3" recovery="relocate" vmid="1005"/>
    <pvevm autostart="1" domain="main-cluster-group-2" recovery="relocate" vmid="1006"/>
    <pvevm autostart="1" domain="main-cluster-group-2" recovery="relocate" vmid="1007"/>
    <pvevm autostart="1" domain="main-cluster-group-3" recovery="relocate" vmid="1008"/>
    <pvevm autostart="1" domain="main-cluster-group-3" recovery="relocate" vmid="1009"/>
    <pvevm autostart="1" domain="main-cluster-group-1" recovery="relocate" vmid="1010"/>
    <pvevm autostart="1" domain="main-cluster-group-2" recovery="relocate" vmid="1011"/>
    <pvevm autostart="1" domain="main-cluster-group-1" recovery="relocate" vmid="1012"/>
    <pvevm autostart="1" domain="main-cluster-group-3" recovery="relocate" vmid="1013"/>
    <pvevm autostart="1" domain="main-cluster-group-2" recovery="relocate" vmid="1014"/>


    <pvevm autostart="1" domain="apu-cluster-group-2" recovery="relocate" vmid="1015"/>


    <failoverdomains>
      <failoverdomain name="main-cluster-group-1" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="proxmox1" priority="1"/>
        <failoverdomainnode name="proxmox2" priority="2"/>
        <failoverdomainnode name="proxmox3" priority="3"/>
      </failoverdomain>
      <failoverdomain name="main-cluster-group-2" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="proxmox2" priority="1"/>
        <failoverdomainnode name="proxmox3" priority="2"/>
        <failoverdomainnode name="proxmox1" priority="3"/>
      </failoverdomain>
      <failoverdomain name="main-cluster-group-3" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="proxmox3" priority="1"/>
        <failoverdomainnode name="proxmox1" priority="2"/>
        <failoverdomainnode name="proxmox2" priority="3"/>
      </failoverdomain>
      <failoverdomain name="apu-cluster-group-1" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="apu-01" priority="1"/>
        <failoverdomainnode name="apu-02" priority="2"/>
      </failoverdomain>
      <failoverdomain name="apu-cluster-group-2" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="apu-02" priority="1"/>
        <failoverdomainnode name="apu-01" priority="2"/>
      </failoverdomain>
    </failoverdomains>
  </rm>
</cluster>
 
Thanks for the link. I'll investigate when the problem arises again. I rebooted the node yesterday, and everything has been back online since then.
 
I just had another one of these problems, and it turned out to be a pvs run started by pvestatd that was hanging (even though it did not look hung at first). After killing it with -9 and restarting pvestatd, the node went back to green.
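In case someone else runs into this: the cleanup described above would look roughly like this (a sketch; the PID has to be identified and checked by hand first):

Code:
# find the pvs process that pvestatd is stuck waiting on
ps aux | grep '[p]vs'
# kill it only after confirming it is the hung LVM scan
kill -9 <pid-of-stuck-pvs>
service pvestatd restart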
 
