Cluster node reboot, VMs stuck in freeze mode

HFernandez

Member
May 20, 2019
Hello everyone, I have 3 nodes in a cluster: h1, h3, h6.
Virtual Environment 5.4-5

All VMs are on shared storage that all the nodes can access.
A few days ago node h1 was restarted but never came back.
I expected the VMs to be handed over to the other nodes, but that did not happen.

Node h1 kept the VMs and placed them in freeze state.

I could not take control of the VMs on that node.

When I went to the HA section I set them to Ignored, but when I tried to start them from the console of node h3, it told me that the VM did not exist and that there was no .conf file for the VM.

At the moment I have some VMs in fence state.

The syslog shows me this error:
May 20 09:26:34 h3 pve-ha-crm[2083]: recover service 'vm:117' from fenced node 'h1' to node 'h3'
May 20 09:26:34 h3 pve-ha-crm[2083]: got unexpected error - Configuration file 'nodes/h1/qemu-server/117.conf' does not exist
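
(For reference, that error points at the per-node config directory inside /etc/pve. A rough way to see under which node the .conf actually sits, assuming VMID 117 from the log above and that h1 is really down with the VM disks on shared storage:)

Code:
# On a quorate node (e.g. h3): find which node directory holds the config
ls -l /etc/pve/nodes/*/qemu-server/117.conf

# If it is still under the dead node h1, moving it re-registers the VM on h3.
# Only do this while h1 is down/fenced and the VM disks are on shared storage.
mv /etc/pve/nodes/h1/qemu-server/117.conf /etc/pve/nodes/h3/qemu-server/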

The omping result looks correct to me:
unicast, xmt/rcv/%loss = 9983/9983/0%, min/avg/max/std-dev = 0.045/0.084/0.281/0.018
multicast, xmt/rcv/%loss = 9983/9983/0%, min/avg/max/std-dev = 0.054/0.094/0.187/0.021
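
(For completeness, this is roughly the omping invocation behind those numbers, the ~10000-packet test from the Proxmox multicast notes, run in parallel on every node; the host names below stand in for h1/h3/h6:)

Code:
# Sends ~10000 unicast and multicast probes between the listed nodes
# and prints the xmt/rcv/%loss summary shown above
omping -c 10000 -i 0.001 -F -q h1 h3 h6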

I do not have IGMP enabled on my switches. Does it absolutely have to be enabled for HA to work?
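
(In case it is useful, the snooping/querier state of the Proxmox bridge itself can be read from sysfs; vmbr0 is only an assumption for the bridge name here:)

Code:
# 1 = IGMP snooping is active on the Linux bridge, 0 = disabled
cat /sys/class/net/vmbr0/bridge/multicast_snooping
# 1 = the bridge acts as IGMP querier on the segment
cat /sys/class/net/vmbr0/bridge/multicast_querier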

What could have happened? What do I need to configure or test?

(Screenshot attached: upload_2019-5-20_9-54-59.png)
 
Try restarting the HA LRM as follows:
Code:
systemctl restart pve-ha-lrm

If this does not help, post a pvereport from all of your nodes and check what

Code:
systemctl status pve-ha-lrm

reports
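
It may also help to look at the cluster-wide HA view while doing that, e.g.

Code:
# Shows quorum state, the current CRM master, each node's LRM state
# and the state of every HA-managed service
ha-manager status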
 
Hi

Same problem here. I restarted 1 of my 3 nodes. That node is now stuck in the status "wait_for_agent_lock".
Restarting pve-ha-lrm did not work.


Code:
root@anr1-a-pve02:/home/klowet# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2019-05-28 12:13:09 CEST; 10min ago
  Process: 41801 ExecStop=/usr/sbin/pve-ha-lrm stop (code=exited, status=0/SUCCESS)
  Process: 41850 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 41944 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   Memory: 80.7M
      CPU: 459ms
   CGroup: /system.slice/pve-ha-lrm.service
           └─41944 pve-ha-lrm

mei 28 12:13:08 anr1-a-pve02 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
mei 28 12:13:09 anr1-a-pve02 pve-ha-lrm[41944]: starting server
mei 28 12:13:09 anr1-a-pve02 pve-ha-lrm[41944]: status change startup => wait_for_agent_lock
mei 28 12:13:09 anr1-a-pve02 systemd[1]: Started PVE Local HA Ressource Manager Daemon.

Code:
# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 10.0.12.1
         2          1 10.0.12.2 (local)
         3          1 10.0.12.3

# pvecm status
Quorum information
------------------
Date:             Tue May 28 12:13:42 2019
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/25372
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2 
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 10.0.12.1
0x00000002          1 10.0.12.2 (local)
0x00000003          1 10.0.12.3

# cat /etc/pve/corosync.conf 2>/dev/null
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: anr1-a-pve01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.12.1
  }
  node {
    name: anr1-a-pve02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.12.2
  }
  node {
    name: anr1-a-pve03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.12.3
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: pveclu01
  config_version: 3
  interface {
    bindnetaddr: 10.0.12.1
    ringnumber: 0
  }
  ip_version: ipv4
  secauth: on
  version: 2
}
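
(A quick way to cross-check the totem ring from corosync's side, run on each node; it should report ring 0 as active with no faults if the cluster network is healthy:)

Code:
# Prints the local node ID and the status of each configured ring
corosync-cfgtool -s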

Code:
root@anr1-a-pve02:/home/klowet# pvereport 
Process hostname...OK
Process pveversion --verbose...OK
Process cat /etc/hosts...OK
Process top -b -n 1  | head -n 15...OK
Process pvesubscription get...OK
Process lscpu...OK
Process pvesh get /cluster/resources --type node --output-format=yaml...OK
Process cat /etc/pve/storage.cfg...OK
Process pvesm status...OK
Process cat /etc/fstab...OK
Process findmnt --ascii...OK
Process df --human...OK
Process qm list...OK
OK
Process pct list...OK
OK
Process ip -details -statistics address...OK
Process ip -details -4 route show...OK
Process ip -details -6 route show...OK
Process cat /etc/network/interfaces...OK
OK
Process cat /etc/pve/local/host.fw...OK
Process iptables-save...OK
Process pvecm nodes...OK
Process pvecm status...OK
Process cat /etc/pve/corosync.conf 2>/dev/null...OK
Process dmidecode -t bios...OK
Process lspci -nnk...OK
Process lsblk --ascii...OK
Process ls -l /dev/disk/by-*/...OK
Process iscsiadm -m node...OK
Process iscsiadm -m session...OK
Process pvs...OK
Process lvs...OK
Process vgs...OK
Process zpool status...OK
Process zpool list -v...OK
Process zfs list...OK
Process ceph status...OK
Process ceph osd status...OK
Process ceph df...OK
Process pveceph status...OK
Process pveceph lspools...OK
Process echo rbd-vms
rbd ls rbd-vms
...OK

Code:
==== general system info ====

# hostname
anr1-a-pve02

# pveversion --verbose
proxmox-ve: 5.4-1 (running kernel: 4.15.18-14-pve)
pve-manager: 5.4-6 (running version: 5.4-6/aa7856c5)
pve-kernel-4.15: 5.4-2
pve-kernel-4.15.18-14-pve: 4.15.18-39
pve-kernel-4.15.18-13-pve: 4.15.18-37
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.12-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-10
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-52
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-43
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-37
pve-container: 2.0-39
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-21
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-2
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-51
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Thanks
 
Hello, I rebooted the node that had the problems, then I started a VM on that node and the issue was solved. Good luck.
 
