HA VM always in "starting" status

Hi,
If I try to enable HA for a VM (128 or 123 in my case, see attachment) on one node (vs1, which belongs to a 3-node cluster), the VM always stays in status "starting". I read on https://pve.proxmox.com/wiki/High_Availability that this state means:
Pending start request. But the CRM has not got any confirmation from the LRM that the service is running.
but I don't know where to look to fix this problem.
I only get this problem when the VM is running on vs1. If the VM is on one of the other nodes, HA for VM 128 reaches the "started" state correctly.
Where could the problem be?
Thanks.
 

Attachments

  • Schermata 2018-06-11 alle 17.09.48.png
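A minimal starting point for inspecting the HA stack from the CLI (VM 128 is simply the ID from this thread; the output will of course vary per cluster):

Code:
# cluster-wide HA view: quorum, current master, per-node LRM state, per-resource state
ha-manager status
# the configured HA resources
ha-manager config
# recent messages from the local resource manager on vs1
journalctl -u pve-ha-lrm --since "1 hour ago"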
Is the LRM running on node vs1 (systemctl status pve-ha-lrm)?
 
Yes it's running:

Code:
# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-06-09 10:09:47 CEST; 2 days ago
  Process: 4036 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 4129 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   Memory: 28.3M
      CPU: 44.491s
   CGroup: /system.slice/pve-ha-lrm.service
           └─4129 pve-ha-lrm

giu 09 10:09:46 vs1 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
giu 09 10:09:47 vs1 pve-ha-lrm[4129]: starting server
giu 09 10:09:47 vs1 pve-ha-lrm[4129]: status change startup => wait_for_agent_lock
giu 09 10:09:47 vs1 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
giu 09 10:17:44 vs1 pve-ha-lrm[4129]: successfully acquired lock 'ha_agent_vs1_lock'
giu 09 10:17:44 vs1 pve-ha-lrm[4129]: watchdog active
giu 09 10:17:44 vs1 pve-ha-lrm[4129]: status change wait_for_agent_lock => active
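Besides the LRM, the CRM and the watchdog multiplexer also have to be healthy before the HA stack can confirm a start; a quick sketch of the checks, assuming the standard Proxmox service names:

Code:
# run on each node; all three services should be active (running)
systemctl status pve-ha-crm pve-ha-lrm watchdog-mux
# recent HA-related log entries
journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"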
 
What do the vmid.conf and the storage.cfg look like?
 
storage.cfg is:
Code:
root@vs1:~# cat /etc/pve/storage.cfg
dir: local
   path /var/lib/vz
   content vztmpl,backup,iso

lvmthin: local-lvm
   thinpool data
   vgname pve
   content rootdir,images

rbd: machines
   content images
   krbd 0
   monhost 10.10.10.1 10.10.10.2 10.10.10.3
   pool machines
   username admin

dir: usb2
   path /media/usb2
   content backup
   maxfiles 1
   nodes vs1
   shared 0

dir: usb3
   path /media/usb3
   content backup
   maxfiles 1
   nodes vs2
   shared 0

dir: usb1
   path /media/usb1
   content backup
   maxfiles 1
   nodes vs1
   shared 0

dir: usb4
   path /media/usb4
   content backup
   maxfiles 1
   nodes vs2
   shared 0

dir: usb5
   path /media/usb5
   content backup
   maxfiles 1
   nodes vs3
   shared 0

and, for example, 128.conf is:
Code:
root@vs1:~# cat /etc/pve/nodes/vs1/qemu-server/128.conf
boot: cdn
bootdisk: virtio0
cores: 1
ide2: none,media=cdrom
memory: 1024
name: Server
net0: virtio=F2:DD:92:96:CF:AA,bridge=vmbr1
numa: 0
onboot: 1
ostype: l26
sockets: 1
virtio0: machines:vm-128-disk-1,size=6148M
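For completeness, the HA resource definitions themselves live in /etc/pve/ha/resources.cfg; an entry for VM 128 typically looks roughly like the sketch below (the group name is a hypothetical example, and the actual file in this cluster may differ):

Code:
# /etc/pve/ha/resources.cfg (illustrative content only)
vm: 128
	state started
	group mygroup        # only present if an HA group is configured
	max_restart 1
	max_relocate 1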
 
Can you start the VM on vs1 without HA?
 
Yes. Furthermore, if virtual machines do not belong to HA I can start, stop and migrate them; if they belong to HA I can't start, stop or migrate them.
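For context, once a guest is under HA control, start/stop/migrate requests are normally handled by the HA stack rather than plain qm commands; a minimal sketch of the corresponding calls, using vm:128 and node vs2 from this thread as examples:

Code:
ha-manager set vm:128 --state started   # request that the CRM starts the service
ha-manager set vm:128 --state stopped   # request a stop
ha-manager migrate vm:128 vs2           # request a migration to another node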
On all machines or just vs1?

What is the 'pveversion -v' showing? Are all nodes on the latest packages?
 
On all machines or just vs1?
On all machines I can't migrate, start or stop VMs if they belong to HA. Only on vs1 do I always get the "starting" status when a VM belongs to HA.
What is the 'pveversion -v' showing? Are all nodes on the latest packages?
Yes they are.
Code:
root@vs1:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

root@vs2:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9

root@vs3:~# pveversion -v
proxmox-ve: 5.2-2 (running kernel: 4.15.17-3-pve)
pve-manager: 5.2-5 (running version: 5.2-5/eb24855a)
pve-kernel-4.15: 5.2-3
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-2-pve: 4.15.17-10
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.5-pve1
corosync: 2.4.2-pve5
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.0-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-35
libpve-guest-common-perl: 2.0-17
libpve-http-server-perl: 2.0-9
libpve-storage-perl: 5.0-23
libqb0: 1.0.1-1
lvm2: 2.02.168-pve6
lxc-pve: 3.0.0-3
lxcfs: 3.0.0-1
novnc-pve: 1.0.0-1
proxmox-widget-toolkit: 1.0-19
pve-cluster: 5.0-27
pve-container: 2.0-24
pve-docs: 5.2-4
pve-firewall: 3.0-12
pve-firmware: 2.0-4
pve-ha-manager: 2.0-5
pve-i18n: 1.0-6
pve-libspice-server1: 0.12.8-3
pve-qemu-kvm: 2.11.1-5
pve-xtermjs: 1.0-5
qemu-server: 5.0-29
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.9-pve1~bpo9
 
What do syslog/journal say? Did you try to reboot vs1 to see if there is any change in behavior?
 
Yes, even if I reboot vs1 nothing changes. Which specific logs do you mean? In /var/log/messages or /var/log/syslog I can't see anything about the problem.
 
Check the watchdog on the machine, it might not be running. There should be messages in the logs (dmesg/syslog/journal).
 
I see logs about the watchdog only in the journal. From these logs the watchdog seems to be active on every node:

Code:
root@vs1:~# journalctl |grep watchdog
lug 12 09:35:36 vs1 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
lug 12 09:35:38 vs1 systemd[1]: Started Proxmox VE watchdog multiplexer.
lug 12 09:35:38 vs1 watchdog-mux[1165]: Watchdog driver 'Software Watchdog', version 0
lug 12 09:35:56 vs1 corosync[2432]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:35:56 vs1 corosync[2432]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:35:57 vs1 corosync[2432]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:35:57 vs1 corosync[2432]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:35:57 vs1 corosync[2432]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:35:57 vs1 corosync[2432]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:40:32 vs1 pve-ha-lrm[3896]: watchdog active

Code:
root@vs2:~# journalctl |grep watchdog
lug 12 09:25:30 vs2 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
lug 12 09:25:32 vs2 systemd[1]: Started Proxmox VE watchdog multiplexer.
lug 12 09:25:32 vs2 watchdog-mux[1328]: Watchdog driver 'Software Watchdog', version 0
lug 12 09:25:51 vs2 corosync[2611]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:25:51 vs2 corosync[2611]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:25:51 vs2 corosync[2611]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:25:51 vs2 corosync[2611]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:25:51 vs2 corosync[2611]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:25:51 vs2 corosync[2611]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:40:28 vs2 pve-ha-crm[3120]: watchdog active
lug 12 09:41:02 vs2 pve-ha-lrm[4143]: watchdog active

Code:
root@vs3:~# journalctl |grep watchdog
lug 12 09:09:47 vs3 kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
lug 12 09:09:49 vs3 systemd[1]: Started Proxmox VE watchdog multiplexer.
lug 12 09:09:49 vs3 watchdog-mux[1241]: Watchdog driver 'Software Watchdog', version 0
lug 12 09:10:07 vs3 corosync[2487]: info    [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:10:07 vs3 corosync[2487]:  [MAIN  ] Corosync built-in features: dbus rdma monitoring watchdog augeas systemd upstart xmlconf qdevices qnetd snmp pie relro bindnow
lug 12 09:10:07 vs3 corosync[2487]: warning [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:10:07 vs3 corosync[2487]: notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:10:07 vs3 corosync[2487]:  [WD    ] Watchdog /dev/watchdog exists but couldn't be opened.
lug 12 09:10:07 vs3 corosync[2487]:  [SERV  ] Service engine loaded: corosync watchdog service [7]
lug 12 09:41:51 vs3 pve-ha-lrm[4156]: watchdog active
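The corosync warning "Watchdog /dev/watchdog exists but couldn't be opened" is expected on Proxmox, since watchdog-mux already holds the device; a rough way to confirm which process owns it:

Code:
systemctl status watchdog-mux
fuser -v /dev/watchdog    # the holder should be watchdog-mux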
 
Hi,

I have the same problem. It appeared after the last update:

Code:
proxmox-ve: 5.2-2 (running kernel: 4.15.18-8-pve)
pve-manager: 5.2-10 (running version: 5.2-10/6f892b40)
pve-kernel-4.15: 5.2-11
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-1-pve: 4.15.17-9
ceph: 12.2.8-pve1
corosync: 2.4.2-pve5..
...

I always get VMs stuck in "starting" on a single node (node01) out of the 3 nodes (the other two being node02 and node03). Migration in or out of this node does not work in HA mode.

If I migrate a VM (with HA active) to node01 (from node03 to node01), I get this error:

Code:
ha-manager migrate vm:102 node01

node02 pve-ha-crm[2561]: crm command error - node not online: migrate vm:102 node01

Quorum status is OK:

Code:
Quorum information
------------------
Date:             Fri Nov 16 22:09:21 2018
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000002
Ring ID:          1/12
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.8.97
0x00000002          1 192.168.8.98 (local)
0x00000003          1 192.168.8.99
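A quorate corosync membership does not necessarily mean the HA manager considers node01 online; the CRM tracks its own per-node LRM state. Some checks that may help narrow down why node01 is reported as "not online" (paths are the standard pmxcfs ones):

Code:
ha-manager status                        # per-node LRM state and per-resource state
cat /etc/pve/.members                    # membership/online flags as seen by pmxcfs
systemctl status pve-ha-lrm pve-ha-crm   # especially on node01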

But migrating from node03 to node02:

Code:
ha-manager migrate vm:102 node02

=> All OK

Without HA, migration works to and from all nodes.

Code:
ha-manager migrate vm:102 node01

=> All OK

Do you have a solution?

Sincerely
 
What do the logs say (please more than the one line ;))? Can you please post the HA configuration?
 
Hi,

Yes, I have a Commercial Support Subscription. OK for the documentation. Given that the problem appeared during the last update, will I find the answer in this documentation?

Sincerely
 
Yes, I have a Commercial Support Subscription.
I don't understand; I know that you have a subscription, the banner under your member name (on the right) shows this.

OK for the documentation.
What do you mean by that?

Hm, do you maybe refer to my signature under my posts? The signature is added to all my posts with the same text.

What does it say in the logs? Can you please post the HA configuration? Is the 'pve-ha-crm.service' running on the node?
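For reference, the information requested above could be collected roughly like this (groups.cfg only exists if HA groups are configured):

Code:
systemctl status pve-ha-crm pve-ha-lrm
cat /etc/pve/ha/resources.cfg
cat /etc/pve/ha/groups.cfg
journalctl -u pve-ha-crm -u pve-ha-lrm --since "1 hour ago"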
 
