PVE 4 with HA

Hi Dietmar,

I haven't set any kernel option as far as I know.

How can I check whether any kernel option is active that disables reboot, or how can I enable this feature?

Thanks

Shafeek
 
Hi Dietmar / Mir,

I just tested it this morning on node3. The trick with echo 1 > /dev/watchdog works; it reboots the node once I connect the network cord.

But I want to ask whether the VMs should return to their original node once the reboot completes. Here they do not come back to their original node.

The HA configuration on the datacenter web UI is as follows:
1. All nodes are selected [Node1, 2 & 3]
2. restricted - NO [not selected]
3. nofailback - NO [not selected]

Is that ok?
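
For reference, such a group entry should end up in /etc/pve/ha/groups.cfg looking roughly like this (the group name is just an example):

Code:
group: ha-all-nodes
        nodes node1,node2,node3
        nofailback 0
        restricted 0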

Thanks for the support.

Shafeek
 
But I want to ask whether the VMs should return to their original node once the reboot completes. Here they do not come back to their original node.

Try to define a group with one node, add the VM to that group. Then the VM should move to that node as soon as the node is online.
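
From the command line that would be roughly the following (the group name and VM ID are just examples):

Code:
# create a group that contains only the preferred node
ha-manager groupadd prefer-node1 --nodes node1

# assign an already HA-managed VM to that group
ha-manager set vm:100 --group prefer-node1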
 
Hi Dietmar,

Try to define a group with one node, add the VM to that group. Then the VM should move to that node as soon as the node is online.

I created the respective groups for each node. I tested it and I just want to say that it works like a charm. :D I think the watchdog problem only affects 1 node in the cluster; it must be an issue with the motherboard (a Fujitsu Siemens PC).

I have 2 more questions:
1. When the VMs on a node that has a problem are migrated evenly to the other nodes, does the PVE cluster consider the available resources (CPU, RAM, etc.) on those nodes before migrating the VMs? How does the cluster handle this?
2. When the migration occurs between nodes, is there downtime for the VMs?

Thanks for the support

Shafeek
 
I think it must be an issue with the motherboard (a Fujitsu Siemens PC)

Maybe you can do a clean re-install and test again?

I have 2 more questions:
1. When the VMs on a node that has a problem are migrated evenly to the other nodes, does the PVE cluster consider the available resources (CPU, RAM, etc.) on those nodes before migrating the VMs? How does the cluster handle this?

We count the number of VMs, so that each node has about the same VM count.

2. When the migration occurs between nodes, is there downtime for the VMs?

Sure, because we only migrate after the original node is 'fenced'. So the average downtime is about 120 seconds with watchdog fencing.
 
Just before getting the broken pipe from the watchdog, I got this in the syslog:

Code:
Jul 24 14:07:56 node2 watchdog-mux[898]: client watchdog expired - disable watchdog updates
Jul 24 14:12:07 node2 watchdog-mux[898]: exit watchdog-mux with active connections
Jul 24 14:12:07 node2 kernel: [ 1441.792768] watchdog watchdog0: watchdog did not stop!
Jul 24 14:12:17 node2 pve-ha-lrm[1104]: watchdog update failed - Broken pipe
Jul 24 14:12:27 node2 pve-ha-lrm[1104]: watchdog update failed - Broken pipe
Jul 24 14:12:37 node2 pve-ha-lrm[1104]: watchdog update failed - Broken pipe

Those are the only watchdog-related errors I found.

Thanks for your help

Shafeek
We currently have similar issues after a network failure on one of our 7 nodes today, which caused brief network problems on various nodes (they recovered from them, though). This also caused brief issues with our iSCSI SAN :(

We've got two corosync rings, each on its own bonded NIC across dual hardware switches (ring0 -> 2x10Gb/s VM/iSCSI VLANs, ring1 -> 2x1Gb/s management):

Code:
root@n1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
        id      = 10.45.71.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 193.1xx.xxx.xxx
        status  = ring 1 active with no faults
root@n1:~# corosync-quorumtool 
Quorum information
------------------
Date:             Tue Dec  8 23:23:40 2015
Quorum provider:  corosync_votequorum
Nodes:            6
Node ID:          1
Ring ID:          4464
Quorate:          Yes


Votequorum information
----------------------
Expected votes:   7
Highest expected: 7
Total votes:      6
Quorum:           4  
Flags:            Quorate 


Membership information
----------------------
    Nodeid      Votes Name
         1          1 n1 (local)
         2          1 n2
         3          1 n3
         4          1 n4
         6          1 n6
         7          1 n7

PS: n5 has currently been taken down

We ended up with all nodes complaining about watchdog-mux, which had stopped. So after trying to recover the watchdogs and reading a bit more on HA, we wonder whether it might be an issue that we also run HP ASR on our DL360 Gen9 boxes.

Code:
root@n1:~# pveversion 
pve-manager/4.0-57/cc7c2b53 (running kernel: 4.2.3-2-pve)

root@n1:~# systemctl status hp-asrd.service
● hp-asrd.service - LSB: HP Advanced Server Recovery Daemon
   Loaded: loaded (/etc/init.d/hp-asrd)
   Active: active (running) since Tue 2015-12-08 14:47:23 CET; 8h ago
  Process: 4004 ExecStart=/etc/init.d/hp-asrd start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/hp-asrd.service
           ├─4014 /opt/hp/hp-health/bin/hp-asrd -p 1
           └─4016 /opt/hp/hp-health/bin/hp-asrd -p 1

Dec 08 14:47:23 n1 hp-asrd[4004]: Starting HP Advanced Server Recovery Daemon.
Dec 08 14:47:52 n1 hpasrd[4016]: Starting with poll 1 and timeout 600
Dec 08 14:47:52 n1 hpasrd[4016]: Setting the watchdog timer.
Dec 08 14:47:52 n1 hpasrd[4016]: Found iLO memory at 0x92a8d000.
Dec 08 14:47:52 n1 hpasrd[4016]: Successfully mapped device.

root@n1:~# systemctl status watchdog-mux.service 
● watchdog-mux.service - Proxmox VE watchdog multiplexer
   Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
   Active: active (running) since Tue 2015-12-08 14:47:23 CET; 8h ago
 Main PID: 3831 (watchdog-mux)
   CGroup: /system.slice/watchdog-mux.service
           └─3831 /usr/sbin/watchdog-mux

Dec 08 14:47:23 n1 watchdog-mux[3831]: Watchdog driver 'HP iLO2+ HW Watchdog Timer', version 0

root@n1:~# systemctl status watchdog-mux.socket 
● watchdog-mux.socket - Proxmox VE watchdog multiplexer socket
   Loaded: loaded (/lib/systemd/system/watchdog-mux.socket; enabled)
   Active: active (running) since Tue 2015-12-08 14:47:23 CET; 8h ago
   Listen: /run/watchdog-mux.sock (Stream)

root@n1:~# ha-manager status
quorum OK
master n2 (old timestamp - dead?, Tue Dec  8 22:24:00 2015)
lrm n1 (old timestamp - dead?, Tue Dec  8 22:20:04 2015)
lrm n2 (old timestamp - dead?, Tue Dec  8 22:24:05 2015)
lrm n3 (old timestamp - dead?, Tue Dec  8 22:24:05 2015)
lrm n4 (old timestamp - dead?, Tue Dec  8 22:24:00 2015)
lrm n5 (old timestamp - dead?, Tue Dec  8 22:24:00 2015)
lrm n6 (old timestamp - dead?, Tue Dec  8 22:23:56 2015)
lrm n7 (old timestamp - dead?, Tue Dec  8 22:24:00 2015)
service vm:203 (n7, started)
service vm:304 (n6, started)
service vm:307 (n5, started)
service vm:310 (n4, started)
service vm:328 (n7, started)

On n5 we tried to disable hp-asrd.service, but then a watchdog (the mux?) fenced/shut the box off after the default HP ASR 600-second timeout. What is best current practice for the watchdog? From this post it seems we should blacklist hpwdt, which we'll try next...
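
For anyone trying the same thing, blacklisting the hpwdt module should work roughly like this (the file name is just an example); it needs an initramfs rebuild and a reboot to take effect:

Code:
# keep the HP iLO hardware watchdog driver from loading,
# so watchdog-mux falls back to the software watchdog (softdog)
echo "blacklist hpwdt" > /etc/modprobe.d/blacklist-hpwdt.conf

# rebuild the initramfs so the blacklist also applies at early boot
update-initramfs -u -k all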
 
With hpwdt blacklisted it seems to use the software watchdog (it might be a good idea to blacklist this by default, as suggested elsewhere here); we'll see how this works out for us...
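
In case it is useful to others, we check which driver is actually in use roughly like this:

Code:
# confirm hpwdt is no longer loaded and softdog took over
lsmod | grep -E 'hpwdt|softdog'

# watchdog-mux logs which watchdog driver it opened at startup
journalctl -b -u watchdog-mux | grep -i 'watchdog driver'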
 
Sure, because we only migrate after the original node is 'fenced'. So the average downtime is about 120 seconds with watchdog fencing.

Dear Support!

I just set up a Proxmox VE environment in VMware Workstation, and I also have a question about the migrated VM.
In Workstation I created 3 VE nodes with 2 network adapters each (one host-only and one bridged), and I also created 1 Tiny Core Linux VM which is managed by HA.

Software fencing works great:
1 min to fence the VE node (I simulated the problem by deactivating the network adapters)
3 min to migrate the VM (great)

I do not understand the type of the migration. Does it happen by restarting the VM on another node, or is it a live migration that preserves the memory of the VM?

Another question is about HA.
The described workflow for an HA cluster in Proxmox 4 is to use 2 types of network: one for production traffic and one for HA traffic. In the installation GUI, is it preferred to specify the HA interface or the production one?

Best Regards:
Imre Szollosi
 
Ah, yes - it seems watchdog-mux is still running. Try:

Code:
# systemctl stop watchdog-mux.service
# echo 1 >/dev/watchdog

Thank you. My scenario is Debian 8.6 with Proxmox VE 4.3, and thanks to your response I solved it with a custom service unit:
Code:
# nano /etc/systemd/system/StopWatchdog.service

Code:
[Unit]
Description=Stop WatchDog
Wants=shutdown.target reboot.target poweroff.target halt.target
Before=shutdown.target reboot.target poweroff.target halt.target

[Service]
Type=oneshot
ExecStart=/root/stop_watchdog.sh

[Install]
WantedBy=reboot.target poweroff.target halt.target

Code:
# nano /root/stop_watchdog.sh

The script content:
Code:
#!/bin/bash
systemctl stop watchdog-mux.service

Make the script executable, then install the service:

Code:
# chmod +x /root/stop_watchdog.sh
# systemctl daemon-reload
# systemctl enable StopWatchdog.service

Hope this can help others in the same situation.
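
If it helps, a quick way to double-check that the unit is wired up as intended (assuming the names above):

Code:
# systemctl is-enabled StopWatchdog.service
# systemctl list-dependencies reboot.target | grep StopWatchdog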
 
