4th Node Nightly Rebooting

Ashley

Hello,

I have had a 3-node cluster running perfectly fine for months, and recently added a 4th node to this cluster (same hardware, DL160 G9, and same configuration).

For the last few nights, every night around 11:40-11:50 PM, the server reboots itself. Looking in the logs, I am struggling to see anything at this exact time apart from corosync retransmit notifications, but those appear throughout the day anyway.
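
A rough sketch of how the window around the reboot can be checked (assuming a persistent journal, i.e. /var/log/journal exists, so journalctl can also read the previous boot; adjust the date/times to your own window):

# all messages in the window just before the reboot
journalctl --since "2017-09-01 23:30:00" --until "2017-09-01 23:50:00"

# last messages of the previous boot, in case anything was written right before the reset
journalctl -b -1 -e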

pveversion -v
proxmox-ve: 5.0-20 (running kernel: 4.10.17-2-pve)
pve-manager: 5.0-30 (running version: 5.0-30/5ab26bc)
pve-kernel-4.10.17-2-pve: 4.10.17-20
libpve-http-server-perl: 2.0-6
lvm2: 2.02.168-pve3
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-12
qemu-server: 5.0-15
pve-firmware: 2.0-2
libpve-common-perl: 5.0-16
libpve-guest-common-perl: 2.0-11
libpve-access-control: 5.0-6
libpve-storage-perl: 5.0-14
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-2
pve-docs: 5.0-9
pve-qemu-kvm: 2.9.0-4
pve-container: 2.0-15
pve-firewall: 3.0-2
pve-ha-manager: 2.0-2
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.0.8-3
lxcfs: 2.0.7-pve4
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.6.5.11-pve17~bpo90
ceph: 12.2.0-1~bpo90+1

Syslog at the point of the reboot:


Sep 1 23:36:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:36:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:36:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:36:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:37:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:37:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:37:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:37:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:38:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:38:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:38:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:38:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:39:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:39:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:39:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:39:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:40:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:40:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:40:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:40:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:41:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:41:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:41:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:41:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:42:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:42:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:42:09 cn04 pmxcfs[1968]: [dcdb] notice: data verification successful
Sep 1 23:42:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:42:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:43:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:43:01 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:43:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:43:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:43:49 cn04 corosync[1985]: notice [TOTEM ] Retransmit List: 2a97c5
Sep 1 23:43:49 cn04 corosync[1985]: [TOTEM ] Retransmit List: 2a97c5
Sep 1 23:44:00 cn04 systemd[1]: Starting Proxmox VE replication runner...
Sep 1 23:44:03 cn04 systemd[1]: Started Proxmox VE replication runner.
Sep 1 23:44:11 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:44:16 cn04 pmxcfs[1968]: [status] notice: received log
Sep 1 23:47:00 cn04 systemd-modules-load[593]: Inserted module 'iscsi_tcp'
Sep 1 23:47:00 cn04 kernel: [ 0.000000] Linux version 4.10.17-2-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP PVE 4.10.17-20 (Mon, 14 Aug 2017 11:23:37 +0200) ()
Sep 1 23:47:00 cn04 systemd-modules-load[593]: Inserted module 'ib_iser'
Sep 1 23:47:00 cn04 kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.17-2-pve root=/dev/mapper/pve-root ro quiet
Sep 1 23:47:00 cn04 kernel: [ 0.000000] KERNEL supported cpus:
Sep 1 23:47:00 cn04 kernel: [ 0.000000] Intel GenuineIntel
Sep 1 23:47:00 cn04 kernel: [ 0.000000] AMD AuthenticAMD
Sep 1 23:47:00 cn04 systemd-modules-load[593]: Inserted module 'vhost_net'
Sep 1 23:47:00 cn04 kernel: [ 0.000000] Centaur CentaurHauls
Sep 1 23:47:00 cn04 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Sep 1 23:47:00 cn04 systemd-udevd[655]: Process '/bin/mount -t fusectl fusectl /sys/fs/fuse/connections' failed with exit code 32.
Sep 1 23:47:00 cn04 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Sep 1 23:47:00 cn04 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Sep 1 23:47:00 cn04 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Sep 1 23:47:00 cn04 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
Sep 1 23:47:00 cn04 kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Sep 1 23:47:00 cn04 keyboard-setup.sh[586]: cannot open file /tmp/tmpkbd.qZry3Q
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000092fff] usable
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000093000-0x0000000000093fff] reserved
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000094000-0x000000000009ffff] usable
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000062fe0fff] usable
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000062fe1000-0x000000006b5e0fff] reserved
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x000000006b5e1000-0x000000006b5e1fff] usable
Sep 1 23:47:00 cn04 systemd[1]: Starting Flush Journal to Persistent Storage...
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x000000006b5e2000-0x000000006b662fff] reserved
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x000000006b663000-0x00000000784fefff] usable
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000784ff000-0x00000000788fefff] reserved
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000788ff000-0x00000000790fefff] type 20
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000790ff000-0x00000000791fefff] reserved
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000791ff000-0x000000007b5fefff] ACPI NVS
Sep 1 23:47:00 cn04 systemd[1]: Started Flush Journal to Persistent Storage.
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x000000007b5ff000-0x000000007b7fefff] ACPI data
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x000000007b7ff000-0x000000007b7fffff] usable
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000080000000-0x000000008fffffff] reserved
Sep 1 23:47:00 cn04 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000407fffffff] usable
 
Do you have HA resources configured on this node?
Remember: if you have HA resources on a node and that node is not in the corosync quorum partition, the HA manager will force a reboot of the host.
You can track such behaviour by inspecting the pve-ha-lrm log with

journalctl -u pve-ha-lrm
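
If the LRM log is empty, the quorum and HA state around the reboot are also worth a look; a rough sketch using the standard PVE tools (nothing here is specific to your setup):

# cluster membership and quorum as seen by this node
pvecm status

# HA manager view; empty if no HA resources are configured
ha-manager status

# the CRM side of the HA stack and the watchdog multiplexer
journalctl -u pve-ha-crm
journalctl -u watchdog-mux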
 
No HA resources, just a 4-node cluster for management.

However, it has been fine for the past 2 nights; will continue to monitor.

"journalctl -u pve-ha-lrm" shows nothing since the last reboot:

Sep 01 23:47:04 cn04 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
Sep 01 23:47:04 cn04 pve-ha-lrm[2124]: starting server
Sep 01 23:47:04 cn04 pve-ha-lrm[2124]: status change startup => wait_for_agent_lock
Sep 01 23:47:04 cn04 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
 
I am having similar issues, but I only have a 2-node cluster. In your logs, the first line that shows up just before all the kernel boot messages is:
Sep 1 23:47:00 cn04 systemd-modules-load[593]: Inserted module 'iscsi_tcp'
which is exactly what I got:

Nov 15 03:11:00 ovirt systemd[1]: Started Proxmox VE replication runner.
Nov 15 03:11:11 ovirt pvedaemon[1277]: <root@pam> successful auth for user 'root@pam'
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Nov 15 08:56:28 ovirt systemd-modules-load[387]: Inserted module 'iscsi_tcp'
Nov 15 08:56:28 ovirt systemd-modules-load[387]: Inserted module 'ib_iser'
Nov 15 08:56:28 ovirt systemd-modules-load[387]: Inserted module 'vhost_net'
Nov 15 08:56:28 ovirt systemd[1]: Starting Flush Journal to Persistent Storage...
Nov 15 08:56:28 ovirt systemd[1]: Mounted Configuration File System.
Nov 15 08:56:28 ovirt systemd[1]: Mounted FUSE Control File System.
Nov 15 08:56:28 ovirt kernel: [ 0.000000] random: get_random_bytes called from start_kernel+0x42/0x4f3 with crng_init=0
Nov 15 08:56:28 ovirt kernel: [ 0.000000] Linux version 4.13.4-1-pve (root@nora) (gcc version 6.3.0 20170516 (Debian 6.3.0-18)) #1 SMP PVE 4.13.4-26 (Mon, 6 Nov 2017 11:23:55 +0100) ()
Nov 15 08:56:28 ovirt systemd[1]: Started Apply Kernel Variables.
Nov 15 08:56:28 ovirt kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.13.4-1-pve root=/dev/mapper/pve-root ro quiet
Nov 15 08:56:28 ovirt kernel: [ 0.000000] KERNEL supported cpus:
Nov 15 08:56:28 ovirt kernel: [ 0.000000] Intel GenuineIntel
Nov 15 08:56:28 ovirt kernel: [ 0.000000] AMD AuthenticAMD
Nov 15 08:56:28 ovirt kernel: [ 0.000000] Centaur CentaurHauls
Nov 15 08:56:28 ovirt kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'

I have 2 machines doing this on occasion with no other logs. Are you still having this problem, or did you find a fix?
 
