Eliott G.

Aug 4, 2016
Hi there,
We are having trouble with a 3-node cluster whose nodes all reboot simultaneously every few days.
We are using OVH servers, so only unicast is available: OVH vRacks are more expensive and we don't need more than 3 nodes.
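(For context, unicast in corosync 2.x is selected with transport: udpu; a minimal sketch of the relevant parts of /etc/pve/corosync.conf, with placeholder cluster name and addresses:)

#####
totem {
  version: 2
  # placeholder cluster name
  cluster_name: prdcluster
  config_version: 3
  # unicast UDP instead of multicast
  transport: udpu
  interface {
    ringnumber: 0
    # placeholder network address
    bindnetaddr: 10.0.0.0
  }
}

nodelist {
  node {
    name: prdproxmox1
    nodeid: 1
    quorum_votes: 1
    # placeholder address
    ring0_addr: 10.0.0.1
  }
  # prdproxmox2 and prdproxmox3 entries look the same
}
#####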

We are using software RAID with SSD drives, and our VMs are in HA with remote NFS storage.

We have also checked the OVH server status page and there was no maintenance at that moment.

This is the third time this has happened, and the only thing we could do was send ourselves an email on boot to confirm that everything was back to normal.
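(A boot notification like that can be a simple cron @reboot job; a minimal sketch, assuming a working local MTA with mailutils or bsd-mailx installed, and the address is a placeholder:)

#####
# /etc/cron.d/boot-mail  (hypothetical file; requires a configured MTA and the "mail" command)
@reboot root echo "$(hostname) rebooted at $(date)" | mail -s "$(hostname) rebooted" admin@example.com
#####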

A reboot shows up in the output of the following command, so something must have requested it:

######
root@prdproxmox3:~# last reboot
reboot system boot 4.4.13-1-pve Thu Aug 4 15:21 - 18:43 (03:21)

#######


What could have sent that reboot request? And how could it propagate to the other nodes?
Our best guess is a problem with the cluster; a hardware problem seems less likely.

If you need any other information, just ask.

Thanks.

Eliott G.


Below is a sample of our syslog (feel free to ask for other logs):

##### /var/log/syslog #####

Aug 4 15:17:01 prdproxmox3 CRON[30748]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:17:01 prdproxmox3 CRON[30750]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 4 15:18:01 prdproxmox3 CRON[31063]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:18:08 prdproxmox3 pvedaemon[7020]: worker exit
Aug 4 15:18:08 prdproxmox3 pvedaemon[1823]: worker 7020 finished
Aug 4 15:18:08 prdproxmox3 pvedaemon[1823]: starting 1 worker(s)
Aug 4 15:18:08 prdproxmox3 pvedaemon[1823]: worker 31129 started
Aug 4 15:18:16 prdproxmox3 corosync[1805]: [TOTEM ] A processor failed, forming new configuration.
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [TOTEM ] A new membership (3.3.3.3:3412) was formed. Members joined: 1 2 left: 1 2
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [QUORUM] Members[3]: 3 1 2
Aug 4 15:18:19 prdproxmox3 corosync[1805]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 4 15:19:01 prdproxmox3 CRON[31369]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:19:02 prdproxmox3 pmxcfs[1756]: [status] notice: received log
Aug 4 15:19:17 prdproxmox3 pmxcfs[1756]: [status] notice: received log
Aug 4 15:19:22 prdproxmox3 corosync[1805]: [TOTEM ] A processor failed, forming new configuration.
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [TOTEM ] A new membership (3.3.3.3:3416) was formed. Members left: 1 2
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [TOTEM ] Failed to receive the leave message. failed: 1 2
Aug 4 15:19:26 prdproxmox3 pmxcfs[1756]: [dcdb] notice: members: 3/1756
Aug 4 15:19:26 prdproxmox3 pmxcfs[1756]: [status] notice: members: 3/1756
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [QUORUM] Members[1]: 3
Aug 4 15:19:26 prdproxmox3 corosync[1805]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 4 15:19:26 prdproxmox3 pmxcfs[1756]: [status] notice: node lost quorum
Aug 4 15:19:34 prdproxmox3 pve-ha-crm[1831]: status change master => lost_manager_lock
Aug 4 15:19:34 prdproxmox3 pve-ha-crm[1831]: watchdog closed (disabled)
Aug 4 15:19:34 prdproxmox3 pve-ha-crm[1831]: status change lost_manager_lock => wait_for_quorum
Aug 4 15:19:35 prdproxmox3 pve-ha-lrm[1840]: status change active => lost_agent_lock
Aug 4 15:19:48 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:01 prdproxmox3 CRON[31672]: (root) CMD (/usr/local/rtm/bin/rtm 33 > /dev/null 2> /dev/null)
Aug 4 15:20:02 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:02 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:02 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:19 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:20 prdproxmox3 pvedaemon[29455]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:20 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:21 prdproxmox3 watchdog-mux[1552]: client watchdog expired - disable watchdog updates
Aug 4 15:20:23 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:24 prdproxmox3 pvedaemon[20348]: <root@pam> successful auth for user 'shinken@pve'
Aug 4 15:20:25 prdproxmox3 pvedaemon[31129]: <root@pam> successful auth for user 'shinken@pve'
==============================================================================
==============================================================================
CRASH
==============================================================================
==============================================================================
Aug 4 15:21:59 prdproxmox3 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="1545" x-info="http://www.rsyslog.com"] start
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Module 'fuse' is builtin
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'ipmi_devintf'
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'ipmi_poweroff'
Aug 4 15:21:59 prdproxmox3 hdparm[360]: Setting parameters of disc: (none).
Aug 4 15:21:59 prdproxmox3 mdadm-raid[362]: Generating udev events for MD arrays...done.
Aug 4 15:21:59 prdproxmox3 keyboard-setup[359]: Setting preliminary keymap...done.
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'ipmi_si'
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Module 'ipmi_watchdog' is blacklisted
Aug 4 15:21:59 prdproxmox3 systemd-modules-load[327]: Inserted module 'vhost_net'
Aug 4 15:21:59 prdproxmox3 lvm[643]: 1 logical volume(s) in volume group "pve" now active
Aug 4 15:21:59 prdproxmox3 systemd-fsck[688]: /var/lib/vz: recovering journal
Aug 4 15:21:59 prdproxmox3 zpool[753]: no pools available to import
Aug 4 15:21:59 prdproxmox3 systemd-fsck[688]: /var/lib/vz: clean, 28/25206784 files, 4514763/100796416 blocks
Aug 4 15:21:59 prdproxmox3 lvm[662]: 1 logical volume(s) in volume group "pve" now active
Aug 4 15:21:59 prdproxmox3 lvm[786]: 1 logical volume(s) in volume group "pve" monitored
Aug 4 15:21:59 prdproxmox3 mv[803]: /bin/mv: cannot stat '/etc/network/interfaces.new': No such file or directory
Aug 4 15:21:59 prdproxmox3 pvepw-logger[820]: starting pvefw logger
Aug 4 15:21:59 prdproxmox3 networking[795]: Configuring network interfaces...interface dummy0 does not exist!
Aug 4 15:21:59 prdproxmox3 networking[795]: Waiting for vmbr1 to get ready (MAXWAIT is 2 seconds).
Aug 4 15:21:59 prdproxmox3 networking[795]: /bin/sh: /etc/pve/kvm-networking.sh: No such file or directory
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Aug 4 15:21:59 prdproxmox3 networking[795]: Failed to bring up vmbr1.
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Initializing cgroup subsys cpu
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Linux version 4.4.13-1-pve (root@nora) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Tue Jun 28 10:16:33 CEST 2016 ()
Aug 4 15:21:59 prdproxmox3 networking[795]: Waiting for vmbr0 to get ready (MAXWAIT is 2 seconds).
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.13-1-pve root=/dev/md1 ro rootdelay=15 quiet
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] KERNEL supported cpus:
Aug 4 15:21:59 prdproxmox3 networking[795]: Waiting for DAD... Done
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Intel GenuineIntel
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] AMD AuthenticAMD
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] Centaur CentaurHauls
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Aug 4 15:21:59 prdproxmox3 networking[795]: done.
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
Aug 4 15:21:59 prdproxmox3 rpcbind[1296]: Starting rpcbind daemon....
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] x86/fpu: Using 'eager' FPU context switches.
Aug 4 15:21:59 prdproxmox3 iscsid: iSCSI logger with pid=1333 started!
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009abff] usable
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009ac00-0x000000000009ffff] reserved
Aug 4 15:21:59 prdproxmox3 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Aug 4 15:21:59 prdproxmox3 open-iscsi[1298]: Starting iSCSI initiator service: iscsid.


#########
 
HA and the watchdog are both enabled, so that is probably the cause of our reboots when a node loses quorum.
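(For anyone checking the same thing, the active watchdog and HA services can be inspected like this; softdog is the Proxmox VE default unless a hardware watchdog was configured in /etc/default/pve-ha-manager:)

#####
# which watchdog module is loaded (softdog is the PVE default)
lsmod | grep -E 'softdog|ipmi_watchdog'
cat /etc/default/pve-ha-manager

# HA/watchdog services and HA resource state
systemctl status watchdog-mux pve-ha-crm pve-ha-lrm
ha-manager status
#####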

There is this log line on each of our nodes right before it reboots:

# First node

Aug 4 15:19:11 prdproxmox1 watchdog-mux[1549]: client watchdog expired - disable watchdog updates
# Second node

Aug 4 15:19:15 prdproxmox2 watchdog-mux[1502]: client watchdog expired - disable watchdog updates
# Third node

Aug 4 15:20:21 prdproxmox3 watchdog-mux[1552]: client watchdog expired - disable watchdog updates

But we suppose the loss of quorum is caused by a short network interruption on the OVH side, which we can't do anything about.
Is there any way to prevent the watchdog daemon from rebooting the node? Or at least to extend the delay before it reboots? And would that even be a good idea?
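(As far as we know the HA watchdog timeout itself is fixed at around 60 seconds, but the corosync totem token timeout can be raised so that short network blips don't break cluster membership in the first place; a sketch of the relevant section of /etc/pve/corosync.conf, remembering to bump config_version — 10000 ms is only an example value:)

#####
totem {
  # ... existing settings ...
  # time in ms without traffic before a node is declared failed;
  # the corosync 2.x default is 1000 ms, 10000 here is just an example
  token: 10000
}
#####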

We are trying to have HA, but not if the cluster nodes fall like dominoes at the first interruption...
 

Take a vRack and use multicast (at least you'll have a more direct connection between your servers).
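If you do switch to multicast, it's worth verifying it first with omping (the hostnames below are the node names taken from your logs):

#####
# run simultaneously on all three nodes; ~10 seconds of rapid probes
omping -c 10000 -i 0.001 -F -q prdproxmox1 prdproxmox2 prdproxmox3

# longer ~10 minute run to catch IGMP snooping/querier problems
omping -c 600 -i 1 -q prdproxmox1 prdproxmox2 prdproxmox3
#####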
 
The thing is that we are using re-installed servers that previously hosted an old standalone Proxmox 3.x, and there are still 4 months left on them that are already paid. We could consider switching to vRacks, but for the moment we can't afford to buy more servers.

Is there any alternative solution to this problem?
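(One workaround sometimes mentioned, at the cost of losing automatic failover, is to take the VMs out of HA so the LRM goes idle and closes the watchdog; a sketch, with vm:100 as a placeholder resource ID:)

#####
# list HA-managed resources
ha-manager status

# remove a VM from HA management (vm:100 is a placeholder)
ha-manager remove vm:100
#####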
 
I installed Proxmox 2 weeks ago and it was running well. For the last 2 days my machine (the Proxmox box) has been restarting randomly. I'm a newbie with Proxmox, please help.


The syslog and messages logs are attached.

Thanks in advance.




pveversion -v
==================================================================
proxmox-ve: 4.2-48 (running kernel: 4.4.6-1-pve)
pve-manager: 4.2-2 (running version: 4.2-2/725d76f0)
pve-kernel-4.4.6-1-pve: 4.4.6-48
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-39
qemu-server: 4.0-72
pve-firmware: 1.1-8
libpve-common-perl: 4.0-59
libpve-access-control: 4.0-16
libpve-storage-perl: 4.0-50
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-14
pve-container: 1.0-62
pve-firewall: 2.0-25
pve-ha-manager: 1.0-28
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve9~jessie
 

Attachments

  • messages.txt (69.1 KB)
  • syslog.txt (105.2 KB)
