HA VM always in "starting" status

Hi,
If I try to enable HA for a VM (128 or 123 in my case, see attachment) on one node (vs1, which belongs to a 3-node cluster), the VM always stays in status "starting". I read on https://pve.proxmox.com/wiki/High_Availability that this state means:
Pending start request. But the CRM has not got any confirmation from the LRM that the service is running.
but I don't know where to look to fix this problem.
I only get this problem when the VM is running on vs1. If the VM is on one of the other nodes, HA for VM 128 reaches the "started" state correctly.
Where could the problem be?
Thanks.
 

Attachments

  • Schermata 2018-06-11 alle 17.09.48.png
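A minimal starting point for inspecting the HA stack from the CLI (VM 128 is simply the ID from this thread; the output will of course vary per cluster):

Code:
# cluster-wide HA view: quorum, current master, per-node LRM state, per-resource state
ha-manager status
# the configured HA resources
ha-manager config
# recent messages from the local resource manager on vs1
journalctl -u pve-ha-lrm --since "1 hour ago"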
Is the LRM running on node vs1 (systemctl status pve-ha-lrm)?
 
Yes it's running:

Code:
# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-06-09 10:09:47 CEST; 2 days ago
  Process: 4036 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 4129 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   Memory: 28.3M
      CPU: 44.491s
   CGroup: /system.slice/pve-ha-lrm.service
           └─4129 pve-ha-lrm

giu 09 10:09:46 vs1 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
giu 09 10:09:47 vs1 pve-ha-lrm[4129]: starting server
giu 09 10:09:47 vs1 pve-ha-lrm[4129]: status change startup => wait_for_agent_lock
giu 09 10:09:47 vs1 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
giu 09 10:17:44 vs1 pve-ha-lrm[4129]: successfully acquired lock 'ha_agent_vs1_lock'
giu 09 10:17:44 vs1 pve-ha-lrm[4129]: watchdog active
giu 09 10:17:44 vs1 pve-ha-lrm[4129]: status change wait_for_agent_lock => active
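Besides the LRM, the CRM and the watchdog multiplexer also have to be healthy before the HA stack can confirm a start; a quick sketch of the checks, assuming the standard Proxmox service names:

Code:
# run on each node; all three services should be active (running)
systemctl status pve-ha-crm pve-ha-lrm watchdog-mux
# recent HA-related log entries
journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"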
 
What do the vmid.conf and the storage.cfg look like?
 
storage.cfg is:
Code:
root@vs1:~# cat /etc/pve/storage.cfg
dir: local
   path /var/lib/vz
   content vztmpl,backup,iso

lvmthin: local-lvm
   thinpool data
   vgname pve
   content rootdir,images

rbd: machines
   content images
   krbd 0
   monhost 10.10.10.1 10.10.10.2 10.10.10.3
   pool machines
   username admin

dir: usb2
   path /media/usb2
   content backup
   maxfiles 1
   nodes vs1
   shared 0

dir: usb3
   path /media/usb3
   content backup
   maxfiles 1
   nodes vs2
   shared 0

dir: usb1
   path /media/usb1
   content backup
   maxfiles 1
   nodes vs1
   shared 0

dir: usb4
   path /media/usb4
   content backup
   maxfiles 1
   nodes vs2
   shared 0

dir: usb5
   path /media/usb5
   content backup
   maxfiles 1
   nodes vs3
   shared 0

and, for example, 128.conf is:
Code:
root@vs1:~# cat /etc/pve/nodes/vs1/qemu-server/128.conf
boot: cdn
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
memory: 1024
name: Server
net0: virtio=F2:DD:92:96:CF:AA,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
sockets: 1
virtio0: machines:vm-128-disk-1,size=6148M
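For completeness, the HA resource definitions themselves live in /etc/pve/ha/resources.cfg; an entry for VM 128 typically looks roughly like the sketch below (the group name is a hypothetical example, and the actual file in this cluster may differ):

Code:
# /etc/pve/ha/resources.cfg (illustrative content only)
vm: 128
	state started
	group mygroup        # only present if an HA group is configured
	max_restart 1
	max_relocate 1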
 
Can you start the VM on vs1 without HA?
 
Yes. Furthermore, if virtual machines do not belong to HA I can start, stop and migrate them; if they belong to HA I can't start, stop or migrate them.
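For context, once a guest is under HA control, start/stop/migrate requests are normally handled by the HA stack rather than plain qm commands; a minimal sketch of the corresponding calls, using vm:128 and node vs2 from this thread as examples:

Code:
ha-manager set vm:128 --state started   # request that the CRM starts the service
ha-manager set vm:128 --state stopped   # request a stop
ha-manager migrate vm:128 vs2           # request a migration to another node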
On all machines or just vs1?

What is the 'pveversion -v' showing? Are all nodes on the latest packages?
 
On all machines or just vs1?
On all machines I can't migrate, start or stop VMs if they belong to HA. Only on vs1 do I always get the "starting" status when a VM belongs to HA.
What is the 'pveversion -v' showing? Are all nodes on the latest packages?
Yes they are.
Code:
root@vs1:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

root@vs2:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

root@vs3:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
What do syslog/journal say? Did you try to reboot vs1 to see if there is any change in behavior?
 
Yes, even if I reboot vs1 nothing changes. Which specific logs do you mean? In /var/log/messages or /var/log/syslog I can't see anything about the problem.
 
Check the watchdog on the machine, it might not be running. There should be messages in the logs (dmesg/syslog/journal).
 
I see logs about the watchdog only in the journal. From these logs the watchdog seems to be active on every node:

Code:
root@vs1:~# journalctl |grep watchdog
lug 12 09:35:36 vs1 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
lug 12 09:35:38 vs1 systemd[1]: Started Proxmox VE watchdog multiplexer.
lug 12 09:35:38 vs1 watchdog-mux[1165]: Watchdog driver 'Software Watchdog', version 0
lug 12 09:35:56 vs1 corosync[2432]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:35:56 vs1 corosync[2432]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:35:57 vs1 corosync[2432]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:35:57 vs1 corosync[2432]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:35:57 vs1 corosync[2432]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:35:57 vs1 corosync[2432]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:40:32 vs1 pve-ha-lrm[3896]: watchdog active

Code:
root@vs2:~# journalctl |grep watchdog
lug 12 09:25:30 vs2 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
lug 12 09:25:32 vs2 systemd[1]: Started Proxmox VE watchdog multiplexer.
lug 12 09:25:32 vs2 watchdog-mux[1328]: Watchdog driver 'Software Watchdog', version 0
lug 12 09:25:51 vs2 corosync[2611]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:25:51 vs2 corosync[2611]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:25:51 vs2 corosync[2611]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:25:51 vs2 corosync[2611]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:25:51 vs2 corosync[2611]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:25:51 vs2 corosync[2611]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:40:28 vs2 pve-ha-crm[3120]: watchdog active
lug 12 09:41:02 vs2 pve-ha-lrm[4143]: watchdog active

Code:
root@vs3:~# journalctl |grep watchdog
lug 12 09:09:47 vs3 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
lug 12 09:09:49 vs3 systemd[1]: Started Proxmox VE watchdog multiplexer.
lug 12 09:09:49 vs3 watchdog-mux[1241]: Watchdog driver 'Software Watchdog', version 0
lug 12 09:10:07 vs3 corosync[2487]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:10:07 vs3 corosync[2487]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:10:07 vs3 corosync[2487]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:10:07 vs3 corosync[2487]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:10:07 vs3 corosync[2487]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:10:07 vs3 corosync[2487]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:41:51 vs3 pve-ha-lrm[4156]: watchdog active
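The corosync warning "Watchdog /dev/watchdog exists but couldn't be opened" is expected on Proxmox, since watchdog-mux already holds the device; a rough way to confirm which process owns it:

Code:
systemctl status watchdog-mux
fuser -v /dev/watchdog    # the holder should be watchdog-mux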
 
Hi,

I have the same problem. It appeared after the last update:

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5..
...

I always get VMs stuck in "starting" on a single node (node01) out of the 3 nodes (the other two being node02 and node03). Migration in or out of this node does not work in HA mode.

If I migrate a VM (with HA active) to node01 (from node03 to node01), I get this error:

Code:
ha-manager migrate vm:102 node01

node02 pve-ha-crm[2561]: crm command error - node not online: migrate vm:102 node01

Quorum status is OK:

Code:
Quorum information
------------------
Date:             Fri Nov 16 22:09:21 2018
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.8.97
0x00000002          1 192.168.8.98 (local)
0x00000003          1 192.168.8.99
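A quorate corosync membership does not necessarily mean the HA manager considers node01 online; the CRM tracks its own per-node LRM state. Some checks that may help narrow down why node01 is reported as "not online" (paths are the standard pmxcfs ones):

Code:
ha-manager status                        # per-node LRM state and per-resource state
cat /etc/pve/.members                    # membership/online flags as seen by pmxcfs
systemctl status pve-ha-lrm pve-ha-crm   # especially on node01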

But migrating from node03 to node02:

Code:
ha-manager migrate vm:102 node02

=> All OK

Without HA, migration works to and from all nodes.

Code:
ha-manager migrate vm:102 node01

=> All OK

Do you have a solution?

Sincerely
 
What do the logs say (please more than the one line ;))? Can you please post the HA configuration?
 
Hi,

Yes, I have a Commercial Support Subscription. OK for the documentation. Given that the problem appeared during the last update, will I find the answer in this documentation?

Sincerely
 
Yes, I have a Commercial Support Subscription.
I don't understand; I know that you have a subscription, the banner under your member name (on the right) shows this.

OK for the documentation.
What do you mean by that?

Hm, do you maybe refer to my signature under my posts? The signature is added to all my posts with the same text.

What does it say in the logs? Can you please post the HA configuration? Is the 'pve-ha-crm.service' running on the node?
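For reference, the information requested above could be collected roughly like this (groups.cfg only exists if HA groups are configured):

Code:
systemctl status pve-ha-crm pve-ha-lrm
cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/groups.cfg
journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"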
 
