Proxmox with a node in "fence" mode since last update

Hi.

We have a 4-node cluster with Ceph and the subscription repositories, and since the last updates one of the nodes is always in "fence" mode in the cluster. As a consequence, I cannot migrate VMs with HA active to or from this node.

I have searched everywhere I can think of, and I can't find a reason or a solution for this problem. I have tried service restarts, system restarts, etc., and nothing fixes the issue. I also can't find anything relevant on Google. :(
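For reference, the restarts I tried on noded were roughly the following (a sketch only; the exact order varied, and the service names assume a stock PVE 5.x install):
Code:
# HA stack and watchdog multiplexer on the affected node
systemctl restart pve-ha-lrm pve-ha-crm
systemctl restart watchdog-mux
# cluster communication layer
systemctl restart corosync
systemctl restart pve-cluster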

Output from commands on the affected node (nodeD):

ha-manager status --verbose
Code:
quorum OK
master nodec (active, Sat Oct  6 15:21:22 2018)
lrm nodea (idle, Sat Oct  6 15:21:25 2018)
lrm nodeb (active, Sat Oct  6 15:21:19 2018)
lrm nodec (active, Sat Oct  6 15:21:25 2018)
lrm noded (idle, Sat Oct  6 15:21:25 2018)
full cluster state:
{
   "lrm_status" : {
      "nodea" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1538835685
      },
      "nodeb" : {
         "mode" : "active",
         "results" : {
            "WbYc19DhuDGOXgdSDoo7yA" : {
               "exit_code" : 7,
               "sid" : "vm:147",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1538835679
      },
      "nodec" : {
         "mode" : "active",
         "results" : {
            "gacBZedFAdXV2F0OstGfhA" : {
               "exit_code" : 0,
               "sid" : "vm:147",
               "state" : "migrate"
            }
         },
         "state" : "active",
         "timestamp" : 1538835685
      },
      "noded" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1538835685
      }
   },
   "manager_status" : {
      "master_node" : "nodec",
      "node_status" : {
         "nodea" : "online",
         "nodeb" : "online",
         "nodec" : "online",
         "noded" : "fence"
      },
      "service_status" : {},
      "timestamp" : 1538835682
   },
   "quorum" : {
      "node" : "noded",
      "quorate" : "1"
   }
}

vi /etc/corosync/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: nodea
    nodeid: 1
    quorum_votes: 1
    ring0_addr: nodea
  }
  node {
    name: nodeb
    nodeid: 2
    quorum_votes: 1
    ring0_addr: nodeb
  }
  node {
    name: nodec
    nodeid: 3
    quorum_votes: 1
    ring0_addr: nodec
  }
  node {
    name: noded
    nodeid: 4
    quorum_votes: 1
    ring0_addr: noded
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: zwame
  config_version: 4
  interface {
    bindnetaddr: 10.133.10.3
    ringnumber: 0

journalctl -u pve-ha-crm
Code:
-- Logs begin at Tue 2018-10-02 21:36:03 WEST, end at Sat 2018-10-06 15:31:01 WEST. --
Oct 02 21:36:12 noded systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Oct 02 21:36:13 noded pve-ha-crm[5197]: starting server
Oct 02 21:36:13 noded pve-ha-crm[5197]: status change startup => wait_for_quorum
Oct 02 21:36:13 noded systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Oct 02 21:44:38 noded pve-ha-crm[5197]: status change wait_for_quorum => slave
Oct 02 22:07:45 noded pve-ha-crm[5197]: status change slave => wait_for_quorum

journalctl -u corosync
Code:
Oct 02 22:26:06 noded corosync[94145]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] Using quorum provider corosync_votequorum
Oct 02 22:26:06 noded corosync[94145]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 02 22:26:06 noded corosync[94145]:  [QB    ] server name: votequorum
Oct 02 22:26:06 noded corosync[94145]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 02 22:26:06 noded corosync[94145]:  [QB    ] server name: quorum
Oct 02 22:26:06 noded corosync[94145]:  [TOTEM ] A new membership (10.133.10.6:4244) was formed. Members joined: 4
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] Members[1]: 4
Oct 02 22:26:06 noded corosync[94145]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 02 22:26:06 noded corosync[94145]: notice  [TOTEM ] A new membership (10.133.10.3:4248) was formed. Members joined: 1 2 3
Oct 02 22:26:06 noded corosync[94145]:  [TOTEM ] A new membership (10.133.10.3:4248) was formed. Members joined: 1 2 3
Oct 02 22:26:06 noded corosync[94145]: warning [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]: warning [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]: warning [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] This node is within the primary component and will provide service.
Oct 02 22:26:06 noded corosync[94145]: notice  [QUORUM] This node is within the primary component and will provide service.
Oct 02 22:26:06 noded corosync[94145]: notice  [QUORUM] Members[4]: 1 2 3 4
Oct 02 22:26:06 noded corosync[94145]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] Members[4]: 1 2 3 4
Oct 02 22:26:06 noded corosync[94145]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 04 09:30:06 noded corosync[94145]: notice  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c
Oct 04 09:30:06 noded corosync[94145]:  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c
Oct 04 09:30:06 noded corosync[94145]:  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c
Oct 04 09:30:06 noded corosync[94145]: notice  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c

Package versions (identical on all 4 nodes):
Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-35
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

If someone can help, I'd appreciate it. Thanks. :)
 
I'm trying several things to diagnose this problem, but I can't find a solution. :(

I tested multicast with omping for 10 minutes:
Code:
nodea :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.076/0.188/0.346/0.044
nodea : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.057/0.172/0.347/0.049
nodeb :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.071/0.184/0.289/0.043
nodeb : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.066/0.172/0.294/0.045
nodec :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.149/0.283/0.048
nodec : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.064/0.160/0.298/0.043
Everything is fine.
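For completeness, the test was run with roughly the following invocation on each node (my best recollection of the exact command; the node list is the hostnames from corosync.conf):
Code:
# run in parallel on all nodes: 600 packets at a 1s interval, about 10 minutes
omping -c 600 -i 1 -q nodea nodeb nodec noded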

I tried kernel 4.15.18-4-pve, and the problem persists.

I tried IPMI fencing (it's a Supermicro Twin), and that did not resolve it either.

There is one thing that I find strange. When I run /usr/sbin/watchdog-mux manually while the watchdog-mux service is running on nodeD, I get:
Code:
root@noded:~# /usr/sbin/watchdog-mux
watchdog open: Device or resource busy

When I run /usr/sbin/watchdog-mux the same way on the other nodes (nodeA, nodeB, nodeC), I get:
Code:
root@nodec:~# /usr/sbin/watchdog-mux
watchdog active - unable to restart watchdog-mux
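
So on nodeD something other than the watchdog-mux service seems to hold /dev/watchdog open. A few checks I'm using to see what is grabbing it (just a diagnostic sketch; fuser/lsof may need to be installed):
Code:
# which process has the watchdog device open?
fuser -v /dev/watchdog
lsof /dev/watchdog
# is a hardware watchdog module loaded (e.g. iTCO_wdt, ipmi_watchdog)?
lsmod | grep -i -e wdt -e watchdog
# and what does the service itself report?
systemctl status watchdog-mux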

Thanks.
I'm out of ideas. :(
 
I'm affected too. Same error after the update on 1 node out of 4: one node updated successfully without errors, one node failed.

I have no suggestions for solving this at the moment, but if I find one, I'll post it here.

Anybody else?
 
After removing VMs in "fencing" state from the HA-Manager and re-adding them, I was able to continue with the failed node.
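In case it helps, the remove/re-add was roughly this, per resource (vm:147 here is just an example SID taken from the status output earlier in the thread):
Code:
# drop the resource from HA management, then add it back
ha-manager remove vm:147
ha-manager add vm:147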

Maybe this will help @Nemesis11 too.

Business as usual...

Cheers, Knuuut
 