HA Proxmox 5 3node + 1 NFS server not work

borjamf

New Member
Dec 14, 2017
5
0
1
79
  1. Hello

    We're planning to migrate part of our production environment to Proxmox if it posible. I have set up a testing env this week with 3 Proxmox nodes: 2 physical and 1 virtual and 1 basic NFS Debian Server, but I could not make HA work in any way.

    When I plug off the network cable or power some node off just show the node unreachable, but the VM allocated inside does not automove to another node avalaible

    . I have encountered some weird behaviour with HA-manager:

    Current conf:

    --- Cluster status

    ClusterOK.jpg

    -- HA-Manager weird issue (old timestamp - dead?)

    HA-manager Issue.jpg

    I set up watchdog module file etc/modprobe.d/ipmi_watchdog.conf with
    options ipmi_watchdog action=power_cycle panic_wdt_timeout=10


    After I disconnect the network cable, syslog shows this.

    Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: successfully acquired lock 'ha_manager_lock'
    Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: ERROR: unable to open watchdog socket - No such file or directory
    Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: server received shutdown request
    Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: server stopped
    Dec 14 16:17:38 pmtest1 systemd[1]: pve-ha-crm.service: Main process exited, code=exited, status=255/n/a
    Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: successfully acquired lock 'ha_agent_pmtest1_lock'
    Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: ERROR: unable to open watchdog socket - No such file or directory
    Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: restart LRM, freeze all services
    Dec 14 16:17:39 pmtest1 pve-ha-lrm[1083]: server stopped
    Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-lrm.service: Main process exited, code=exited, status=255/n/a
    Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-crm.service: Unit entered failed state.
    Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-crm.service: Failed with result 'exit-code'.
    Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-lrm.service: Unit entered failed state.
    Dec 14 16:17:39 pmtest1 systemd[1]: pve-ha-lrm.service: Failed with result 'exit-code'


Could anybody help me? I dont know what else to do.
Thanks.
 
Hi,

As your logs tell you
Dec 14 16:17:38 pmtest1 pve-ha-crm[1073]: ERROR: unable to open watchdog socket - No such file or directory

I think your watchdog is not proper setup.
Send the output of this two commnads.
Code:
cat /etc/default/pve-ha-manager
journalctl -u watchdog-mux.service
 
Thanks for the reply:

the virtual node is running on Oracle Virtual Box, but it does not contain anything. I have configured it with Proxmox just for test HA with 3 nodes min.

output of /etc/default/pve-ha-manager:
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog


upload_2017-12-15_9-15-10.png

So there is something wrong with the watchdog
 
Hello,
looks like watchdog software aka softdog runs ok. The problem was that I did not have any kind of hardware watchdog, so I'll update when I test it in another machine.