Node restart

shocker

Hello,
I have a question regarding the node restart behaviour. I currently don't know if this is a bug or a feature :)

I have a cluster with 3 nodes upgraded to the latest version (pve-manager/4.4-5/c43015a5, running kernel: 4.4.19-1-pve). Every time I reboot one node, all the other nodes reboot one by one before it gets synchronised, until everything is in sync. Is this normal? I'm asking because this causes a loss of service for the VMs for 5-10 minutes, until everything is synchronised.

Thanks,
Alex
 
Thank you for the feedback! Usually when this happens, the node was hard-restarted due to power loss or something similar.
Any idea why I'm encountering this behaviour?
Thanks,
Alex
 
HA on, right? I had the same problems with Proxmox 4.1 and the softdog watchdog :S We have HP ProLiant servers (the hpwdt module is buggy, of course).
 
Thank you for the feedback. Yes, I have HA on and I'm using Supermicro servers.
How did you solve your issue?

This is what I have received via email when the node crashed:
1st mail:
The node 'marge' failed and needs manual intervention.
The PVE HA manager tries to fence it and recover the
configured HA resources to a healthy node if possible.
Current fence status: FENCE
Try to fence node 'marge'

2nd mail:
The node 'marge' failed and needs manual intervention.
The PVE HA manager tries to fence it and recover the
configured HA resources to a healthy node if possible.
Current fence status: SUCCEED
fencing: acknowledged - got agent lock for node 'marge'

But if I power up the node "marge" again, I run into this issue where the nodes restart until they are in sync.

Thanks,
Alex
 
I switched from the soft watchdog to ipmi_watchdog and I'm facing the same issue. If I restart a node, all of them restart one after another.
This is the last part of the syslog before the restart:


Jan 29 17:31:40 homer watchdog-mux[3996]: client watchdog expired - disable watchdog updates
Jan 29 17:31:53 homer systemd[1]: Starting Synchronise Hardware Clock to System Clock...
Jan 29 17:31:53 homer systemd[1]: Stopping 102.scope.
Jan 29 17:31:53 homer systemd[1]: Stopped 102.scope.
Jan 29 17:31:53 homer systemd[1]: Stopping 104.scope.
Jan 29 17:31:53 homer systemd[1]: Stopped 104.scope.
Jan 29 17:31:53 homer systemd[1]: Stopping 106.scope.
Jan 29 17:31:53 homer systemd[1]: Stopped 106.scope.
Jan 29 17:31:53 homer systemd[1]: Stopping 101.scope.
Jan 29 17:31:53 homer systemd[1]: Stopped 101.scope.
Jan 29 17:31:53 homer systemd[1]: Stopping 110.scope.
Jan 29 17:31:53 homer systemd[1]: Stopped 110.scope.
Jan 29 17:31:53 homer systemd[1]: Stopping qemu.slice.
Jan 29 17:31:53 homer systemd[1]: Removed slice qemu.slice.
Jan 29 17:31:53 homer systemd[1]: Stopping Mail Transport Agent.
Jan 29 17:31:53 homer systemd[1]: Stopped target Mail Transport Agent.
Jan 29 17:31:53 homer systemd[1]: Stopping Graphical Interface.
Jan 29 17:31:53 homer systemd[1]: Stopped target Graphical Interface.
Jan 29 17:31:53 homer systemd[1]: Stopping Multi-User System.
Jan 29 17:31:53 homer systemd[1]: Stopped target Multi-User System.
Jan 29 17:31:53 homer systemd[1]: Stopping Kernel Samepage Merging (KSM) Tuning Daemon...
Jan 29 17:31:53 homer systemd[1]: Stopping Deferred execution scheduler...
Jan 29 17:31:53 homer systemd[1]: Stopping ZFS startup target.
Jan 29 17:31:53 homer systemd[1]: Stopped target ZFS startup target.
Jan 29 17:31:53 homer systemd[1]: Stopping ZFS Event Daemon (zed)...
Jan 29 17:31:53 homer systemd[1]: Stopping ZFS file system shares...
Jan 29 17:31:53 homer systemd[1]: Stopped ZFS file system shares.
Jan 29 17:31:53 homer systemd[1]: Stopping Regular background program processing daemon...
Jan 29 17:31:53 homer systemd[1]: Stopping PVE VM Manager...
Jan 29 17:31:53 homer systemd[1]: Stopping Self Monitoring and Reporting Technology (SMART) Daemon...
Jan 29 17:31:53 homer systemd[1]: Stopping Login Prompts.
Jan 29 17:31:53 homer systemd[1]: Stopped target Login Prompts.
Jan 29 17:31:53 homer systemd[1]: Stopping Getty on tty1...
Jan 29 17:31:53 homer systemd[1]: Stopping Login Service...
Jan 29 17:31:53 homer systemd[1]: Stopping D-Bus System Message Bus...
Jan 29 17:31:53 homer systemd[1]: Stopping LSB: Start and stop bmc-watchdog...
Jan 29 17:31:53 homer systemd[1]: Stopping LSB: Postfix Mail Transport Agent...
Jan 29 17:31:53 homer systemd[1]: Stopped Deferred execution scheduler.
Jan 29 17:31:53 homer systemd[1]: Stopped ZFS Event Daemon (zed).
Jan 29 17:31:53 homer systemd[1]: Stopped Self Monitoring and Reporting Technology (SMART) Daemon.
Jan 29 17:31:53 homer systemd[1]: Stopped D-Bus System Message Bus.
Jan 29 17:31:53 homer systemd[1]: Stopped Kernel Samepage Merging (KSM) Tuning Daemon.
Jan 29 17:31:53 homer smartd[3999]: smartd received signal 15: Terminated
Jan 29 17:31:53 homer smartd[3999]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.KINGSTON_SHFS37A240G-50026B776402D6AC.ata.state
Jan 29 17:31:53 homer smartd[3999]: Device: /dev/sdb [SAT], state written to /var/lib/smartmontools/smartd.KINGSTON_SHFS37A240G-50026B776402D759.ata.state
Jan 29 17:31:53 homer smartd[3999]: smartd is exiting (exit status 0)
Jan 29 17:31:53 homer zed[3981]: Exiting
Jan 29 17:31:53 homer bmc-watchdog[6816]: bmc-watchdog disabled, please adjust the configuration to your needs and then set RUN to 'yes' in /etc/default/bmc-watchdog to enable it. ... failed!
Jan 29 17:31:53 homer postfix/master[4186]: terminating on signal 15
Jan 29 17:31:53 homer systemd[1]: Stopped Getty on tty1.
Jan 29 17:31:53 homer systemd[1]: Stopped Regular background program processing daemon.
Jan 29 17:31:53 homer systemd[1]: Stopped LSB: Start and stop bmc-watchdog.
Jan 29 17:31:53 homer postfix[6819]: Stopping Postfix Mail Transport Agent: postfix.
Jan 29 17:31:53 homer systemd[1]: Stopped LSB: Postfix Mail Transport Agent.
Jan 29 17:31:53 homer systemd[1]: Stopping system-getty.slice.
Jan 29 17:31:53 homer systemd[1]: Removed slice system-getty.slice.
Jan 29 17:31:53 homer systemd[1]: Stopping /etc/rc.local Compatibility...
Jan 29 17:31:53 homer systemd[1]: Stopped /etc/rc.local Compatibility.
Jan 29 17:31:53 homer systemd[1]: Stopping Permit User Sessions...
Jan 29 17:31:53 homer systemd[1]: Stopped Permit User Sessions.
Jan 29 17:31:53 homer systemd[1]: Stopped Login Service.
Jan 29 17:31:54 homer hwclock[6808]: hwclock from util-linux 2.25.2
Jan 29 17:31:54 homer hwclock[6808]: Using the /dev interface to the clock.
Jan 29 17:31:54 homer hwclock[6808]: Last drift adjustment done at 1479970119 seconds after 1969
Jan 29 17:31:54 homer hwclock[6808]: Last calibration done at 1479970119 seconds after 1969
Jan 29 17:31:54 homer hwclock[6808]: Hardware clock is on UTC time
Jan 29 17:31:54 homer hwclock[6808]: Assuming hardware clock is kept in UTC time.
Jan 29 17:31:54 homer hwclock[6808]: Waiting for clock tick...
Jan 29 17:31:54 homer hwclock[6808]: ...got clock tick
Jan 29 17:31:54 homer hwclock[6808]: Time read from Hardware Clock: 2017/01/29 15:31:54
Jan 29 17:31:54 homer hwclock[6808]: Hw clock time : 2017/01/29 15:31:54 = 1485703914 seconds since 1969
Jan 29 17:31:54 homer hwclock[6808]: missed it - 1485703913.792965 is too far past 1485703913.500000 (0.292965 > 0.001000)
Jan 29 17:31:54 homer hwclock[6808]: 1485703914.500000 is close enough to 1485703914.500000 (0.000000 < 0.002000)
Jan 29 17:31:54 homer hwclock[6808]: Set RTC to 1485703914 (1485703913 + 1; refsystime = 1485703913.000000)
Jan 29 17:31:54 homer hwclock[6808]: Setting Hardware Clock to 15:31:54 = 1485703914 seconds since 1969
Jan 29 17:31:54 homer hwclock[6808]: ioctl(RTC_SET_TIME) was successful.
Jan 29 17:31:54 homer hwclock[6808]: Clock drifted -0.2 seconds in the past 5733794 seconds in spite of a drift factor of 0.000094 seconds/day.
Jan 29 17:31:54 homer hwclock[6808]: Adjusting drift factor by -0.003216 seconds/day
Jan 29 17:31:54 homer systemd[1]: Started Synchronise Hardware Clock to System Clock.
Jan 29 17:31:54 homer pve-manager[6861]: <root@pam> starting task UPID:homer:00001AD3:000105F4:588E0AEA:stopall::root@pam:
Jan 29 17:31:54 homer pve-manager[6868]: shutdown VM 110: UPID:homer:00001AD4:000105F7:588E0AEA:qmshutdown:110:root@pam:
Jan 29 17:31:54 homer pve-manager[6867]: <root@pam> starting task UPID:homer:00001AD4:000105F7:588E0AEA:qmshutdown:110:root@pam:
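
The first line of the log above, where watchdog-mux reports an expired client, looks like the one worth tracking. A minimal sketch for pulling the watchdog-related messages out of a syslog file; `watchdog_events` is a made-up helper name, not a Proxmox tool, and the log path is passed in rather than assumed:

```shell
#!/bin/sh
# watchdog_events is a hypothetical helper (not part of Proxmox).
# It prints every watchdog-mux line and every kernel "IPMI Watchdog" line
# found in the log file given as the first argument.
watchdog_events() {
    grep -E 'watchdog-mux\[[0-9]+\]:|IPMI Watchdog:' "$1"
}
```

It could be called as `watchdog_events /var/log/syslog` on each node to compare when the watchdog client expired relative to the reboots.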
 
Hi Alex,
this isn't the latest version!
You should check your apt sources list. I assume you have only the enterprise repository configured, without having a subscription key.
If you don't have a subscription, you should use pve-no-subscription - look here: https://pve.proxmox.com/wiki/Package_Repositories

The current kernel is pve-kernel-4.4.35-2-pve_4.4.35-78
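
A quick way to see which Proxmox repositories are active might look like the sketch below. It assumes the usual apt file locations on a Debian-based PVE install, and `list_pve_repos` is a hypothetical helper name. On PVE 4.x (Debian jessie) the no-subscription entry is `deb http://download.proxmox.com/debian jessie pve-no-subscription`.

```shell
#!/bin/sh
# list_pve_repos is a hypothetical helper: print the active (uncommented)
# "deb" lines that point at a proxmox.com host in the given apt source files.
list_pve_repos() {
    grep -hE '^[[:space:]]*deb[[:space:]].*proxmox\.com' "$@" 2>/dev/null
}
```

Usage would be something like `list_pve_repos /etc/apt/sources.list /etc/apt/sources.list.d/*.list`. If only an enterprise.proxmox.com line shows up and there is no subscription, updates will not be installed.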

Udo
 
Hello,
Actually it was the latest one, but I had held off rebooting the nodes into the latest kernel because of my issue.

Now, after I started the 3rd node, all of them restarted and pveversion shows: pve-manager/4.4-5/c43015a5 (running kernel: 4.4.35-2-pve)

Thanks,
Alex
 
When I need to reboot a node, I turn off HA first and then reboot.

Code:
rm -r /etc/pve/ha


I had to stop using HA because we had random reboots of nodes caused by softdog. I didn't test for these random reboots on Proxmox 4.4, but on Proxmox 4.1 we had these problems :S!
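
For a planned reboot, a less drastic option than removing /etc/pve/ha might be to disable the HA resources first. The sketch below only prints the commands instead of running them. It assumes the `ha-manager set <sid> --state disabled` syntax, which may differ between PVE releases (older 4.x releases also had `ha-manager disable <sid>`), so check `man ha-manager` on your version. `print_ha_disable` is a hypothetical helper and the service IDs are placeholders.

```shell
#!/bin/sh
# Dry run: print one "disable" command per HA service ID instead of executing it.
# print_ha_disable is a hypothetical helper; the command syntax is an assumption
# to be verified against the ha-manager man page of the installed release.
print_ha_disable() {
    for sid in "$@"; do
        printf 'ha-manager set %s --state disabled\n' "$sid"
    done
}
```

For example, `print_ha_disable vm:101 vm:102` prints two commands that can be reviewed before running them. This does not help with unplanned restarts such as power loss.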
 
Well, sometimes a node just restarts by itself due to power loss or other hardware issues, and I cannot control HA in that case.
So do I need to turn off HA permanently in order to avoid this?
 
I only had bad experiences with HA, like you, so I think that's the best option. The other way would be to disable the watchdog timer or stop the trigger, but I talked with the staff months ago and it's not possible, so I had to stop using HA. I know that people are working fine with other hardware watchdogs (Dell or Intel). I tried HP ProLiant with the hpwdt module and it's very buggy (I got kernel panics all the time), and I had bad experiences with softdog.
So maybe you can try the iTCO watchdog (module "iTCO_wdt"). What hardware do you have for Proxmox?

Please tell me if you test with the iTCO watchdog.

Hope this helps! And sorry for my bad English!
 
Thank you for this feedback!
I'm using Supermicro hardware with the ipmi_watchdog module. I don't have any kernel panics, just the node restart behaviour when the nodes come back.
I'll check iTCO_wdt to see what I can do with it; if that doesn't work, I'll go with removing HA.

Thanks,
Alex
 
I've checked the forum for similar issues, and it seems the only watchdog issues reported were cases where a node restarts by itself, not the one that I currently have.
Currently my nodes are stable. The only behaviour I'm encountering is that after one node comes up from a reboot, all of the others reboot in order to synchronise.

It seems that everything is OK on the watchdog side, except for some possible errors in the syslog. Is this normal?

~# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Power Cycle (0x03)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 10 sec
Present Countdown: 9 sec

~# dmesg | grep -i watchdog
[ 0.000000] Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.4.35-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet nmi_watchdog=0
[ 0.000000] Kernel command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-4.4.35-2-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet nmi_watchdog=0
[ 10.800217] IPMI Watchdog: driver initialized


~# cat /var/log/syslog* |grep Watch
Jan 29 16:12:31 homer kernel: [343157.496375] IPMI Watchdog: Unable to register misc device
Jan 29 16:12:31 homer kernel: [343157.496820] IPMI Watchdog: set timeout error: -22
Jan 29 16:12:31 homer kernel: [343157.496821] IPMI Watchdog: driver initialized
Jan 29 16:25:06 homer watchdog-mux[4003]: Watchdog driver 'IPMI', version 1
Jan 29 16:25:06 homer kernel: [ 10.954118] IPMI Watchdog: driver initialized
Jan 29 17:09:14 homer kernel: [ 2659.363383] IPMI Watchdog: Unexpected close, not stopping watchdog!
Jan 29 17:11:55 homer watchdog-mux[3993]: Watchdog driver 'IPMI', version 1
Jan 29 17:11:55 homer kernel: [ 10.644730] IPMI Watchdog: driver initialized
Jan 29 17:18:13 homer kernel: [ 389.658549] IPMI Watchdog: Unexpected close, not stopping watchdog!
Jan 29 17:20:54 homer watchdog-mux[3996]: Watchdog driver 'IPMI', version 1
Jan 29 17:20:54 homer kernel: [ 10.865642] IPMI Watchdog: driver initialized
Jan 29 17:34:18 homer watchdog-mux[3999]: Watchdog driver 'IPMI', version 1
Jan 29 17:34:18 homer kernel: [ 10.800217] IPMI Watchdog: driver initialized

~# cat /etc/default/pve-ha-manager
# select watchdog module (default is softdog)
WATCHDOG_MODULE=ipmi_watchdog

~# cat /etc/default/grub|grep LINUX_DEFAULT
GRUB_CMDLINE_LINUX_DEFAULT="quiet nmi_watchdog=0"

~# cat /etc/modprobe.d/ipmi_watchdog.conf
options ipmi_watchdog action=power_cycle panic_wdt_timeout=10

~# lsmod|grep -i watchdog
ipmi_watchdog 28672 1
ipmi_msghandler 49152 5 ipmi_ssif,ipmi_devintf,ipmi_poweroff,ipmi_watchdog,ipmi_si
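
To compare the watchdog settings across the three nodes, the countdown and action can be pulled out of the `ipmitool mc watchdog get` output shown above. A small sketch, assuming the output keeps the "Field: value" layout as pasted; `wd_field` is a hypothetical helper name:

```shell
#!/bin/sh
# wd_field is a hypothetical helper: read "ipmitool mc watchdog get" output on
# stdin and print the value of the field named in the first argument
# (the text after the colon and any following spaces).
wd_field() {
    awk -F': *' -v key="$1" '$1 == key { print $2 }'
}
```

For example, `ipmitool mc watchdog get | wd_field 'Initial Countdown'` can be run on each node and the results compared.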


Thanks,
Alex