Proxmox with a node in "fence" mode since last update

Hi.

We have a 4-node cluster with Ceph and the subscription repositories, and since the last updates one of the nodes is always in "fence" mode in the cluster. As a consequence, I cannot migrate VMs with HA enabled to or from this node.

I have searched everywhere I can think of, and I can't find a reason for or a solution to this problem. I have tried service restarts, system restarts, etc., and nothing fixes the issue. I also can't find anything relevant on Google. :(
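For reference, the restarts I tried were roughly these (exact service list from memory, so treat it as approximate):
Code:
# restart the HA stack on the affected node
systemctl restart pve-ha-lrm pve-ha-crm
# restart the cluster stack
systemctl restart corosync pve-cluster
# plus full reboots of the node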

Output from commands on the affected node (nodeD):

ha-manager status --verbose
Code:
quorum OK
master nodec (active, Sat Oct  6 15:21:22 2018)
lrm nodea (idle, Sat Oct  6 15:21:25 2018)
lrm nodeb (active, Sat Oct  6 15:21:19 2018)
lrm nodec (active, Sat Oct  6 15:21:25 2018)
lrm noded (idle, Sat Oct  6 15:21:25 2018)
full cluster state:
{
   "lrm_status" : {
      "nodea" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1538835685
      },
      "nodeb" : {
         "mode" : "active",
         "results" : {
            "WbYc19DhuDGOXgdSDoo7yA" : {
               "exit_code" : 7,
               "sid" : "vm:147",
               "state" : "started"
            }
         },
         "state" : "active",
         "timestamp" : 1538835679
      },
      "nodec" : {
         "mode" : "active",
         "results" : {
            "gacBZedFAdXV2F0OstGfhA" : {
               "exit_code" : 0,
               "sid" : "vm:147",
               "state" : "migrate"
            }
         },
         "state" : "active",
         "timestamp" : 1538835685
      },
      "noded" : {
         "mode" : "active",
         "results" : {},
         "state" : "wait_for_agent_lock",
         "timestamp" : 1538835685
      }
   },
   "manager_status" : {
      "master_node" : "nodec",
      "node_status" : {
         "nodea" : "online",
         "nodeb" : "online",
         "nodec" : "online",
         "noded" : "fence"
      },
      "service_status" : {},
      "timestamp" : 1538835682
   },
   "quorum" : {
      "node" : "noded",
      "quorate" : "1"
   }
}
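For completeness, the same state can also be read straight out of the cluster filesystem (paths as I understand them on PVE 5.x):
Code:
# HA manager state kept in pmxcfs
cat /etc/pve/ha/manager_status
# per-node LRM state
cat /etc/pve/nodes/noded/lrm_status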

vi /etc/corosync/corosync.conf
Code:
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: nodea
    nodeid: 1
    quorum_votes: 1
    ring0_addr: nodea
  }
  node {
    name: nodeb
    nodeid: 2
    quorum_votes: 1
    ring0_addr: nodeb
  }
  node {
    name: nodec
    nodeid: 3
    quorum_votes: 1
    ring0_addr: nodec
  }
  node {
    name: noded
    nodeid: 4
    quorum_votes: 1
    ring0_addr: noded
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: zwame
  config_version: 4
  interface {
    bindnetaddr: 10.133.10.3
    ringnumber: 0
  }
}
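To double-check that every node is actually running the same config version, something like this should work on each node:
Code:
# confirm the running totem config version
corosync-cmapctl | grep totem.config_version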

journalctl -u pve-ha-crm
Code:
-- Logs begin at Tue 2018-10-02 21:36:03 WEST, end at Sat 2018-10-06 15:31:01 WEST. --
Oct 02 21:36:12 noded systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Oct 02 21:36:13 noded pve-ha-crm[5197]: starting server
Oct 02 21:36:13 noded pve-ha-crm[5197]: status change startup => wait_for_quorum
Oct 02 21:36:13 noded systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Oct 02 21:44:38 noded pve-ha-crm[5197]: status change wait_for_quorum => slave
Oct 02 22:07:45 noded pve-ha-crm[5197]: status change slave => wait_for_quorum

journalctl -u corosync
Code:
Oct 02 22:26:06 noded corosync[94145]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] Using quorum provider corosync_votequorum
Oct 02 22:26:06 noded corosync[94145]:  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 02 22:26:06 noded corosync[94145]:  [QB    ] server name: votequorum
Oct 02 22:26:06 noded corosync[94145]:  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 02 22:26:06 noded corosync[94145]:  [QB    ] server name: quorum
Oct 02 22:26:06 noded corosync[94145]:  [TOTEM ] A new membership (10.133.10.6:4244) was formed. Members joined: 4
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] Members[1]: 4
Oct 02 22:26:06 noded corosync[94145]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 02 22:26:06 noded corosync[94145]: notice  [TOTEM ] A new membership (10.133.10.3:4248) was formed. Members joined: 1 2 3
Oct 02 22:26:06 noded corosync[94145]:  [TOTEM ] A new membership (10.133.10.3:4248) was formed. Members joined: 1 2 3
Oct 02 22:26:06 noded corosync[94145]: warning [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]: warning [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]: warning [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [CPG   ] downlist left_list: 0 received
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] This node is within the primary component and will provide service.
Oct 02 22:26:06 noded corosync[94145]: notice  [QUORUM] This node is within the primary component and will provide service.
Oct 02 22:26:06 noded corosync[94145]: notice  [QUORUM] Members[4]: 1 2 3 4
Oct 02 22:26:06 noded corosync[94145]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 02 22:26:06 noded corosync[94145]:  [QUORUM] Members[4]: 1 2 3 4
Oct 02 22:26:06 noded corosync[94145]:  [MAIN  ] Completed service synchronization, ready to provide service.
Oct 04 09:30:06 noded corosync[94145]: notice  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c
Oct 04 09:30:06 noded corosync[94145]:  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c
Oct 04 09:30:06 noded corosync[94145]:  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c
Oct 04 09:30:06 noded corosync[94145]: notice  [TOTEM ] Retransmit List: e0f99 e0f9a e0f9b e0f9c

Package versions (identical on all 4 nodes):
Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-5-pve)
pve-manager: 5.2-9 (running version: 5.2-9/4b30e8f9)
pve-kernel-4.15: 5.2-8
pve-kernel-4.15.18-5-pve: 4.15.18-24
pve-kernel-4.15.18-4-pve: 4.15.18-23
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-38
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-29
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-20
pve-cluster: 5.0-30
pve-container: 2.0-27
pve-docs: 5.2-8
pve-firewall: 3.0-14
pve-firmware: 2.0-5
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.2-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-35
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.11-pve1~bpo1

If someone can help, I'd appreciate it. Thanks. :)
 
I'm trying several things to diagnose this problem, but I can't find a solution. :(

I tested multicast with omping for 10 minutes:
Code:
nodea :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.076/0.188/0.346/0.044
nodea : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.057/0.172/0.347/0.049
nodeb :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.071/0.184/0.289/0.043
nodeb : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.066/0.172/0.294/0.045
nodec :   unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.056/0.149/0.283/0.048
nodec : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.064/0.160/0.298/0.043
Everything is fine.
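For reference, the test was run along these lines on each node (assuming the usual omping invocation; 600 packets at 1-second intervals matches the 10 minutes above):
Code:
omping -c 600 -i 1 -q nodea nodeb nodec noded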

I tried kernel 4.15.18-4-pve, and the problem persists.

I tried IPMI fencing (it's a Supermicro Twin), and it did not resolve the issue.

There is one thing I find strange. When I run /usr/sbin/watchdog-mux on nodeD with the watchdog service up, I get:
Code:
root@noded:~# /usr/sbin/watchdog-mux
watchdog open: Device or resource busy

When I run /usr/sbin/watchdog-mux on the other nodes (nodeA, nodeB, nodeC) with the watchdog service up, I get:
Code:
root@nodec:~# /usr/sbin/watchdog-mux
watchdog active - unable to restart watchdog-mux
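The "Device or resource busy" on nodeD suggests something else already has the watchdog device open. One way to check (assuming the default /dev/watchdog node; adjust if yours differs):
Code:
# show which process currently holds the watchdog device
fuser -v /dev/watchdog
# and which watchdog-related kernel modules are loaded
lsmod | grep -i -e wdt -e ipmi -e softdog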

Thanks.
I'm out of ideas. :(
 
I'm affected too: the same error after an update, on 1 node of 4. One node updated successfully without errors, one node failed.

No suggestions for solving this at the moment, but if I find one, I'll post it here.

Anybody else?
 
After removing the VMs in "fence" state from the HA manager and re-adding them, I was able to continue with the failed node.
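Roughly, this per affected resource (using vm:147 from the status output above as an example; adjust the sid and options to your setup):
Code:
# remove the resource from HA management, then add it back
ha-manager remove vm:147
ha-manager add vm:147 --state started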

Maybe this will help @Nemesis11 too.

Business as usual...

Cheers Knuuut
 
