HA on Proxmox 5 with 3 nodes + 1 NFS server not working

borjamf · Dec 14, 2017

Hello

We're planning to migrate part of our production environment to Proxmox if possible. This week I set up a test environment with 3 Proxmox nodes (2 physical and 1 virtual) and 1 basic Debian NFS server, but I could not get HA to work in any way.

When I unplug the network cable or power off a node, the cluster just shows that node as unreachable, but the VMs running on it are not moved automatically to another available node.

I have also encountered some weird behaviour with HA-manager:

Current conf:

--- Cluster status

-- HA-Manager weird issue (old timestamp - dead?)

I set up the watchdog module in the file /etc/modprobe.d/ipmi_watchdog.conf with:
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10

After I disconnect the network cable, syslog shows this.

Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: successfully acquired lock 'ha_manager_lock'
Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: ERROR: unable to open watchdog socket - No such file or directory
Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: server received shutdown request
Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: server stopped
Dec 14 16:17:38 pmtest1 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/n/a
Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: successfully acquired lock 'ha_agent_pmtest1_lock'
Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: ERROR: unable to open watchdog socket - No such file or directory
Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: restart LRM, freeze all services
Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: server stopped
Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/n/a
Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-crm.service: Unit entered failed state.
Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'.
Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-lrm.service: Unit entered failed state.
Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'

Could anybody help me? I don't know what else to try.
Thanks.

wolfgang · Dec 15, 2017

Hi,

As your logs tell you

borjamf said:
Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: ERROR: unable to open watchdog socket - No such file or directory

I think your watchdog is not set up properly.
Please send the output of these two commands.

Code:

cat /etc/default/pve-ha-manager
journalctl -u watchdog-mux.service

dietmar · Dec 15, 2017

And where does that virtual node run?

borjamf · Dec 15, 2017

Thanks for the reply:

The virtual node is running on Oracle VirtualBox, but it does not host anything. I installed Proxmox on it only so that I could test HA with the minimum of 3 nodes.

output of /etc/default/pve-ha-manager:
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog

So there is something wrong with the watchdog.
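For anyone checking the same thing, here is a minimal sketch of a helper that reports which watchdog module the defaults file selects. It assumes the file uses a plain `WATCHDOG_MODULE=name` line as shown above, and it falls back to `softdog` when that line is missing or commented out, as the comment in the file suggests. The function name `configured_watchdog_module` is made up for this example.

```shell
# Print the watchdog module selected in a pve-ha-manager defaults file.
# Assumption: no uncommented WATCHDOG_MODULE line means the default, softdog.
configured_watchdog_module() {
    conf="${1:-/etc/default/pve-ha-manager}"
    module=""
    if [ -r "$conf" ]; then
        # Take the last uncommented assignment, drop the key and any quotes.
        module=$(sed -n 's/^[[:space:]]*WATCHDOG_MODULE=//p' "$conf" | tail -n 1 | tr -d "\"'")
    fi
    echo "${module:-softdog}"
}
```

Calling `configured_watchdog_module` with no argument reads /etc/default/pve-ha-manager. With the file shown above it should print `ipmi_watchdog`, which would only work on a machine that actually has IPMI hardware.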

borjamf · Jan 15, 2018

Hello,
It looks like the software watchdog (softdog) runs fine. The problem was that I did not have any kind of hardware watchdog, so I'll post an update once I have tested it on another machine.
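In case it helps someone with the same error, this is roughly how the switch back to softdog can be scripted. It is only a sketch: it assumes the module is selected by an uncommented `WATCHDOG_MODULE=` line in /etc/default/pve-ha-manager, and the function name `use_default_softdog` is made up for this example. It keeps a `.bak` copy of the file and comments the line out so that the default applies.

```shell
# Comment out any active WATCHDOG_MODULE line so the default (softdog) is used.
# A backup of the original file is kept next to it as <file>.bak.
use_default_softdog() {
    conf="${1:-/etc/default/pve-ha-manager}"
    cp -p "$conf" "$conf.bak" || return 1
    # "&" re-inserts the matched text, so the line is kept but prefixed with "#".
    sed 's/^[[:space:]]*WATCHDOG_MODULE=/#&/' "$conf.bak" > "$conf"
}
```

Run it as root with no argument to edit the real file. As far as I know the watchdog module is only chosen at boot, so a reboot of the node is probably needed afterwards, and the ipmi_watchdog options file in /etc/modprobe.d is no longer needed either.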
