Hi there,
We are having trouble with a three-node cluster: every node reboots simultaneously every few days.
We are on OVH servers, so only unicast is available; OVH vRacks cost extra and we don't need more than three nodes.
We use software RAID on SSD drives, and our VMs are in HA with remote NFS storage.
We also checked the OVH server status page, and there was no maintenance at that time.
This is the third time it has happened, and so far all we could do is send ourselves an email on boot to confirm that everything is back to normal.
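For context, our "email on boot" is nothing fancy, just a cron `@reboot` entry along these lines (a sketch; the recipient address is a placeholder):

```shell
# /etc/cron.d/reboot-mail -- notify us whenever a node comes back up
# (sketch; ops@example.com is a placeholder address)
@reboot root echo "$(hostname) rebooted at $(date)" | mail -s "node rebooted" ops@example.com
```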
A reboot record shows up in the output of the following command, so something must have requested it:
######
root@prdproxmox3:~# last reboot
reboot system boot 4.4.13-1-pve Thu Aug 4 15:21 - 18:43 (03:21)
#######
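To narrow down what asked for the reboot, we also grep the pre-crash syslog for quorum and watchdog events (plain grep, nothing Proxmox-specific). Run against the excerpt below, it pulls out the two lines that look most suspicious; here two of those quoted lines are inlined so the command is self-contained:

```shell
# Count quorum-loss and watchdog events around the crash.
# On the real host we run the grep against /var/log/syslog directly;
# here two of the syslog lines quoted below are inlined as sample input.
log='Aug  4 15:19:26 prdproxmox3 pmxcfs[1756]: [status] notice: node lost quorum
Aug  4 15:20:21 prdproxmox3 watchdog-mux[1552]: client watchdog expired - disable watchdog updates'
printf '%s\n' "$log" | grep -cE 'lost quorum|watchdog expired'
# prints 2: both events occur in the minute before the reboot
```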
What could have sent that reboot order? And how could it propagate to the other nodes?
Our best guess is a cluster problem; a hardware failure seems less likely.
If you need any other information, just ask.
Thanks.
Eliott G.
Below is a sample of our syslog (feel free to ask for other logs):
##### /var/log/syslog #####
Aug 4 15:17:01 prdproxmox3 CRON[30748]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:17:01 prdproxmox3 CRON[30750]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 4 15:18:01 prdproxmox3 CRON[31063]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:18:08 prdproxmox3 pvedaemon[7020]: worker exit
Aug 4 15:18:08 prdproxmox3 pvedaemon[1823]: worker 7020 finished
Aug 4 15:18:08 prdproxmox3 pvedaemon[1823]: starting 1 worker(s)
Aug 4 15:18:08 prdproxmox3 pvedaemon[1823]: worker 31129 started
Aug 4 15:18:16 prdproxmox3 corosync[1805]: [TOTEM ] A processor failed, forming new configuration.
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [TOTEM ] A new membership (3.3.3.3:3412) was formed. Members joined: 1 2 left: 1 2
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [QUORUM] Members[3]: 3 1 2
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 4 15:19:01 prdproxmox3 CRON[31369]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:19:02 prdproxmox3 pmxcfs[1756]: [status] notice: received log
Aug 4 15:19:17 prdproxmox3 pmxcfs[1756]: [status] notice: received log
Aug 4 15:19:22 prdproxmox3 corosync[1805]: [TOTEM ] A processor failed, forming new configuration.
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [TOTEM ] A new membership (3.3.3.3:3416) was formed. Members left: 1 2
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Aug 4 15:19:26 prdproxmox3 pmxcfs[1756]: [dcdb] notice: members: 3/1756
Aug 4 15:19:26 prdproxmox3 pmxcfs[1756]: [status] notice: members: 3/1756
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [QUORUM] Members[1]: 3
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 4 15:19:26 prdproxmox3 pmxcfs[1756]: [status] notice: node lost quorum
Aug 4 15:19:34 prdproxmox3 pve-ha-crm[1831]: status change master => lost_manager_lock
Aug 4 15:19:34 prdproxmox3 pve-ha-crm[1831]: watchdog closed (disabled)
Aug 4 15:19:34 prdproxmox3 pve-ha-crm[1831]: status change lost_manager_lock => wait_for_quorum
Aug 4 15:19:35 prdproxmox3 pve-ha-lrm[1840]: status change active => lost_agent_lock
Aug 4 15:19:48 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:01 prdproxmox3 CRON[31672]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:20:02 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:02 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:02 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:19 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:20 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:20 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:21 prdproxmox3 watchdog-mux[1552]: client watchdog expired - disable watchdog updates
Aug 4 15:20:23 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:24 prdproxmox3 pvedaemon[20348]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:25 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
==============================================================================
==============================================================================
CRASH
==============================================================================
==============================================================================
Aug 4 15:21:59 prdproxmox3 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1545" x-info="http://www.rsyslog.com"] start
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Module 'fuse' is builtin
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'ipmi_devintf'
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'ipmi_poweroff'
Aug 4 15:21:59 prdproxmox3 hdparm[360]: Setting parameters of disc: (none).
Aug 4 15:21:59 prdproxmox3 mdadm-raid[362]: Generating udev events for MD arrays...done.
Aug 4 15:21:59 prdproxmox3 keyboard-setup[359]: Setting preliminary keymap...done.
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'ipmi_si'
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Module 'ipmi_watchdog' is blacklisted
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'vhost_net'
Aug 4 15:21:59 prdproxmox3 lvm[643]: 1 logical volume(s) in volume group "pve" now active
Aug 4 15:21:59 prdproxmox3 systemd-fsck[688]: /var/lib/vz: recovering journal
Aug 4 15:21:59 prdproxmox3 zpool[753]: no pools available to import
Aug 4 15:21:59 prdproxmox3 systemd-fsck[688]: /var/lib/vz: clean, 28/25206784 files, 4514763/100796416 blocks
Aug 4 15:21:59 prdproxmox3 lvm[662]: 1 logical volume(s) in volume group "pve" now active
Aug 4 15:21:59 prdproxmox3 lvm[786]: 1 logical volume(s) in volume group "pve" monitored
Aug 4 15:21:59 prdproxmox3 mv[803]: /bin/mv: cannot stat '/etc/network/interfaces.new': No such file or directory
Aug 4 15:21:59 prdproxmox3 pvepw-logger[820]: starting pvefw logger
Aug 4 15:21:59 prdproxmox3 networking[795]: Configuring network interfaces...interface dummy0 does not exist!
Aug 4 15:21:59 prdproxmox3 networking[795]: Waiting for vmbr1 to get ready (MAXWAIT is 2 seconds).
Aug 4 15:21:59 prdproxmox3 networking[795]: /bin/sh: /etc/pve/kvm-networking.sh: No such file or directory
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Aug 4 15:21:59 prdproxmox3 networking[795]: Failed to bring up vmbr1.
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Initializing cgroup subsys cpu
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Linux version 4.4.13-1-pve (root@nora) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Tue Jun 28 10:16:33 CEST 2016 ()
Aug 4 15:21:59 prdproxmox3 networking[795]: Waiting for vmbr0 to get ready (MAXWAIT is 2 seconds).
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.13-1-pve root=/dev/md1 ro rootdelay=15 quiet
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] KERNEL supported cpus:
Aug 4 15:21:59 prdproxmox3 networking[795]: Waiting for DAD... Done
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Intel GenuineIntel
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] AMD AuthenticAMD
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Centaur CentaurHauls
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Aug 4 15:21:59 prdproxmox3 networking[795]: done.
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
Aug 4 15:21:59 prdproxmox3 rpcbind[1296]: Starting rpcbind daemon....
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Using 'eager' FPU context switches.
Aug 4 15:21:59 prdproxmox3 iscsid: iSCSI logger with pid=1333 started!
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009abff] usable
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009ac00-0x000000000009ffff] reserved
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Aug 4 15:21:59 prdproxmox3 open-iscsi[1298]: Starting iSCSI initiator service: iscsid.
#########