Sorry, I missed that it was for pve4 ;-) @mir: we do not use rgmanager anymore, but we have a similar feature (HA groups).
But I want to ask whether the VMs should return to their original node once the reboot finishes or not, because here they do not come back to their original node.
Try to define a group with one node and add the VM to that group. Then the VM should move to that node as soon as the node is online.
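Something along these lines should do it; a minimal sketch assuming the PVE 4 HA config files (the group name "prefer-n1", node "n1" and VMID 100 are just placeholders, and the same can be done via the GUI under Datacenter -> HA):

# /etc/pve/ha/groups.cfg
group: prefer-n1
        nodes n1
        nofailback 0

# /etc/pve/ha/resources.cfg
vm: 100
        group prefer-n1
        state enabled

With nofailback left at 0, the VM should also migrate back to n1 once that node is online again.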
I think it must be an issue with the motherboard (a Fujitsu Siemens PC).
I have 2 more questions:
1. When the VMs of a node that has problems are migrated evenly to the other nodes, does the PVE cluster consider the available resources (CPU, RAM, etc.) on those nodes before migrating the VMs? How does the cluster handle this?
2. When a migration occurs between nodes, is there any downtime for the VMs?
We currently have similar issues after a network failure on one of our 7 nodes today, which caused brief network problems on various nodes, although they recovered from them. This also caused brief issues with our iSCSI SAN. Just before getting the watchdog broken pipe, I got this in the syslog:
Jul 24 14:07:56 node2 watchdog-mux[898]: client watchdog expired - disable watchdog updates
Jul 24 14:12:07 node2 watchdog-mux[898]: exit watchdog-mux with active connections
Jul 24 14:12:07 node2 kernel: [ 1441.792768] watchdog watchdog0: watchdog did not stop!
Jul 24 14:12:17 node2 pve-ha-lrm[1104]: watchdog update failed - Broken pipe
Jul 24 14:12:27 node2 pve-ha-lrm[1104]: watchdog update failed - Broken pipe
Jul 24 14:12:37 node2 pve-ha-lrm[1104]: watchdog update failed - Broken pipe
These are the only errors I found for the watchdog.
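For reference, I searched for them roughly like this (assuming the default syslog/journald setup on the node):

# journalctl -u watchdog-mux.service -u pve-ha-lrm.service | grep -i watchdog
# grep -E 'watchdog-mux|watchdog update failed' /var/log/syslog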
Thanks for your help
Shafeek
root@n1:~# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 10.45.71.1
status = ring 0 active with no faults
RING ID 1
id = 193.1xx.xxx.xxx
status = ring 1 active with no faults
root@n1:~# corosync-quorumtool
Quorum information
------------------
Date: Tue Dec 8 23:23:40 2015
Quorum provider: corosync_votequorum
Nodes: 6
Node ID: 1
Ring ID: 4464
Quorate: Yes
Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 6
Quorum: 4
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
1 1 n1 (local)
2 1 n2
3 1 n3
4 1 n4
6 1 n6
7 1 n7
PS! Currently n5 has been taken down
root@n1:~# pveversion
pve-manager/4.0-57/cc7c2b53 (running kernel: 4.2.3-2-pve)
root@n1:~# systemctl status hp-asrd.service
● hp-asrd.service - LSB: HP Advanced Server Recovery Daemon
Loaded: loaded (/etc/init.d/hp-asrd)
Active: active (running) since Tue 2015-12-08 14:47:23 CET; 8h ago
Process: 4004 ExecStart=/etc/init.d/hp-asrd start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/hp-asrd.service
├─4014 /opt/hp/hp-health/bin/hp-asrd -p 1
└─4016 /opt/hp/hp-health/bin/hp-asrd -p 1
Dec 08 14:47:23 n1 hp-asrd[4004]: Starting HP Advanced Server Recovery Daemon.
Dec 08 14:47:52 n1 hpasrd[4016]: Starting with poll 1 and timeout 600
Dec 08 14:47:52 n1 hpasrd[4016]: Setting the watchdog timer.
Dec 08 14:47:52 n1 hpasrd[4016]: Found iLO memory at 0x92a8d000.
Dec 08 14:47:52 n1 hpasrd[4016]: Successfully mapped device.
root@n1:~# systemctl status watchdog-mux.service
● watchdog-mux.service - Proxmox VE watchdog multiplexer
Loaded: loaded (/lib/systemd/system/watchdog-mux.service; static)
Active: active (running) since Tue 2015-12-08 14:47:23 CET; 8h ago
Main PID: 3831 (watchdog-mux)
CGroup: /system.slice/watchdog-mux.service
└─3831 /usr/sbin/watchdog-mux
Dec 08 14:47:23 n1 watchdog-mux[3831]: Watchdog driver 'HP iLO2+ HW Watchdog Timer', version 0
root@n1:~# systemctl status watchdog-mux.socket
● watchdog-mux.socket - Proxmox VE watchdog multiplexer socket
Loaded: loaded (/lib/systemd/system/watchdog-mux.socket; enabled)
Active: active (running) since Tue 2015-12-08 14:47:23 CET; 8h ago
Listen: /run/watchdog-mux.sock (Stream)
root@n1:~# ha-manager status
quorum OK
master n2 (old timestamp - dead?, Tue Dec 8 22:24:00 2015)
lrm n1 (old timestamp - dead?, Tue Dec 8 22:20:04 2015)
lrm n2 (old timestamp - dead?, Tue Dec 8 22:24:05 2015)
lrm n3 (old timestamp - dead?, Tue Dec 8 22:24:05 2015)
lrm n4 (old timestamp - dead?, Tue Dec 8 22:24:00 2015)
lrm n5 (old timestamp - dead?, Tue Dec 8 22:24:00 2015)
lrm n6 (old timestamp - dead?, Tue Dec 8 22:23:56 2015)
lrm n7 (old timestamp - dead?, Tue Dec 8 22:24:00 2015)
service vm:203 (n7, started)
service vm:304 (n6, started)
service vm:307 (n5, started)
service vm:310 (n4, started)
service vm:328 (n7, started)
Ah, yes, it seems watchdog-mux is still running. Try:
# systemctl stop watchdog-mux.service
# echo 1 >/dev/watchdog
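To double-check afterwards that watchdog-mux is really stopped and see what, if anything, still holds the watchdog device (fuser is part of the psmisc package):

# systemctl status watchdog-mux.service
# fuser -v /dev/watchdog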
# nano /etc/systemd/system/StopWatchdog.service
[Unit]
Description=Stop WatchDog
Wants=shutdown.target reboot.target poweroff.target halt.target
Before=shutdown.target reboot.target poweroff.target halt.target

[Service]
Type=oneshot
ExecStart=/root/stop_watchdog.sh

[Install]
WantedBy=reboot.target poweroff.target halt.target
# nano /root/stop_watchdog.sh
#!/bin/bash
systemctl stop watchdog-mux.service
# chmod +x /root/stop_watchdog.sh
# systemctl daemon-reload
# systemctl enable StopWatchdog.service
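Once it is enabled, a quick sanity check that the unit is actually hooked into the shutdown targets (optional):

# systemctl is-enabled StopWatchdog.service
# systemctl list-dependencies reboot.target | grep StopWatchdog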