Node rebooting continually

CZappe

I recently migrated a number of VMs to another existing node on my network in order to perform hardware maintenance (see thread: Manually move VM to a different Node). The migrated VMs are running just fine; however, I've discovered that the new node is now rebooting on its own at shrinking intervals throughout the day. The task log on this node shows the error below, "storage '####' is not online", but that seems to occur during the reboot itself, when a VM attempts to start before a network interface has come online.

[Screenshot: ZeusReboot-Screenshot 2021-01-29 162208.png]

Unfortunately, judging from the output below, the frequency of these reboots appears to be increasing:
~# last reboot | less
reboot system boot 5.4.60-1-pve Fri Jan 29 15:58 still running
reboot system boot 5.4.60-1-pve Fri Jan 29 13:40 still running
reboot system boot 5.4.60-1-pve Fri Jan 29 13:13 still running
reboot system boot 5.4.60-1-pve Fri Jan 29 11:06 still running
reboot system boot 5.4.60-1-pve Fri Jan 29 11:02 still running
reboot system boot 5.4.60-1-pve Fri Jan 29 02:42 still running
reboot system boot 5.4.60-1-pve Thu Jan 28 18:42 still running
reboot system boot 5.4.60-1-pve Wed Jan 27 18:46 still running

I checked the output of /var/log/syslog and /var/log/kern.log and I see confirmation of these reboots but no other indications of abnormal behavior that precipitates them. One thought that occurred to me was perhaps I was overloading this node with too many concurrent VMs. A check of the node stats shows that it's nowhere near breaking a sweat and the graph data bears this out, as well:
[Screenshot: ZeusSpecsStatus-Screenshot 2021-01-29 162208.png]

Is there anything else I can check in the web GUI or CLI interfaces to get some clarity here?

Thanks,
Chris
 
can you post the syslog from before such a reboot? maybe we can see something you missed
 
Here's an excerpt from the syslog for a reboot that began at 04:13 today. I notice a run of ASCII NUL characters in the log at the time of the reboot, which I interpret as the log file being cut off mid-write when the machine went down. The most recent prior entry was about 8 minutes before the crash.

...
Feb 1 03:22:58 zeus systemd[1]: Starting Daily PVE download activities...
Feb 1 03:23:00 zeus pveupdate[43594]: <root@pam> starting task UPID:zeus:0000AA56:000EF5CF:6017D684:aptupdate::root@pam:
Feb 1 03:23:01 zeus pveupdate[43606]: update new package list: /var/lib/pve-manager/pkgupdates
Feb 1 03:23:02 zeus pveupdate[43594]: <root@pam> end task UPID:zeus:0000AA56:000EF5CF:6017D684:aptupdate::root@pam: OK
Feb 1 03:23:02 zeus systemd[1]: pve-daily-update.service: Succeeded.
Feb 1 03:23:02 zeus systemd[1]: Started Daily PVE download activities.
Feb 1 03:25:01 zeus CRON[44463]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:30:01 zeus CRON[45697]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:35:01 zeus CRON[46934]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:40:01 zeus CRON[48168]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:45:01 zeus CRON[49403]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:50:01 zeus CRON[50640]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:55:01 zeus CRON[51885]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 03:59:07 zeus pvestatd[2779]: got timeout
Feb 1 04:00:01 zeus CRON[53122]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Feb 1 04:00:01 zeus CRON[53123]: (root) CMD (vzdump 122 --mailto [redacted] --quiet 1 --mode snapshot --storage qnap1v1 --mailnotification failure --node zeus --compress lzo)
Feb 1 04:00:02 zeus vzdump[53124]: <root@pam> starting task UPID:zeus:0000CF85:001259DD:6017DF32:vzdump:122:root@pam:
Feb 1 04:00:02 zeus vzdump[53125]: INFO: starting new backup job: vzdump 122 --node zeus --mailto [redacted] --mailnotification failure --storage qnap1v1 --quiet 1 --mode snapshot --compress lzo
Feb 1 04:00:02 zeus vzdump[53125]: INFO: Starting Backup of VM 122 (qemu)
Feb 1 04:00:04 zeus prometheus[1333]: level=info ts=2021-02-01T11:00:04.016Z caller=compact.go:496 component=tsdb msg="write block" mint=1612166400000 maxt=1612173600000 ulid=01EXEKZRRKRD3YZKSH9BQD6FR2 duration=28.924489ms
Feb 1 04:00:04 zeus prometheus[1333]: level=info ts=2021-02-01T11:00:04.028Z caller=head.go:668 component=tsdb msg="head GC completed" duration=1.632841ms
Feb 1 04:05:01 zeus CRON[54373]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Feb 1 04:13:23 zeus systemd-modules-load[977]: Inserted module 'iscsi_tcp'
Feb 1 04:13:23 zeus kernel: [ 0.000000] Linux version 5.4.60-1-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.60-2 (Fri, 04 Sep 2020 10:24:50 +0200) ()
Feb 1 04:13:23 zeus systemd-modules-load[977]: Inserted module 'ib_iser'
Feb 1 04:13:23 zeus systemd-modules-load[977]: Inserted module 'vhost_net'
Feb 1 04:13:23 zeus kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.60-1-pve root=/dev/mapper/pve-root ro quiet
Feb 1 04:13:23 zeus kernel: [ 0.000000] KERNEL supported cpus:
Feb 1 04:13:23 zeus kernel: [ 0.000000] Intel GenuineIntel
Feb 1 04:13:23 zeus kernel: [ 0.000000] AMD AuthenticAMD
Feb 1 04:13:23 zeus kernel: [ 0.000000] Hygon HygonGenuine
Feb 1 04:13:23 zeus kernel: [ 0.000000] Centaur CentaurHauls
Feb 1 04:13:23 zeus kernel: [ 0.000000] zhaoxin Shanghai
Feb 1 04:13:23 zeus lvm[984]: 1 logical volume(s) in volume group "pve" monitored
Feb 1 04:13:23 zeus kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 1 04:13:23 zeus kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 1 04:13:23 zeus kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 1 04:13:23 zeus kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 1 04:13:23 zeus kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-provided physical RAM map:
Feb 1 04:13:23 zeus systemd[1]: Starting Flush Journal to Persistent Storage...
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000099bff] usable
Feb 1 04:13:23 zeus systemd[1]: Started Flush Journal to Persistent Storage.
Feb 1 04:13:23 zeus systemd-udevd[1022]: Using default interface naming scheme 'v240'.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000099c00-0x000000000009ffff] reserved
Feb 1 04:13:23 zeus systemd-udevd[1063]: Using default interface naming scheme 'v240'.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000073ffffff] usable
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000074000000-0x0000000074041fff] ACPI NVS
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000074042000-0x0000000075daffff] usable
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000075db0000-0x0000000075ffffff] reserved
Feb 1 04:13:23 zeus systemd-udevd[1023]: Using default interface naming scheme 'v240'.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000076000000-0x00000000a5797fff] usable
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000a5798000-0x00000000a763cfff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000a763d000-0x00000000a7727fff] ACPI data
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000a7728000-0x00000000a7ba8fff] ACPI NVS
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000a7ba9000-0x00000000a9095fff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000a9096000-0x00000000abffffff] usable
Feb 1 04:13:23 zeus systemd-udevd[1024]: Using default interface naming scheme 'v240'.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000ac000000-0x00000000afffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000b4000000-0x00000000b5ffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000be000000-0x00000000bfffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000c8000000-0x00000000c9ffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000f4000000-0x00000000f5ffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000000fe000000-0x00000000ffffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000204f1fffff] usable
Feb 1 04:13:23 zeus systemd-udevd[1024]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x000000204f200000-0x000000204fffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000002050000000-0x000000404fdfffff] usable
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x000000404fe00000-0x000000404fffffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x000000fd00000000-0x000000fd00000fff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000010000000000-0x00000100201fffff] reserved
Feb 1 04:13:23 zeus systemd-udevd[1022]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000020030000000-0x00000200403fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000020060000000-0x00000200801fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000038090000000-0x00000380a03fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000380c0000000-0x00000380e01fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x00000480f0000000-0x00000481003fffff] reserved
Feb 1 04:13:23 zeus systemd-udevd[1063]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000048120000000-0x00000481401fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] BIOS-e820: [mem 0x0000058150000000-0x00000581603fffff] reserved
Feb 1 04:13:23 zeus kernel: [ 0.000000] NX (Execute Disable) protection: active
Feb 1 04:13:23 zeus kernel: [ 0.000000] SMBIOS 3.2.0 present.
Feb 1 04:13:23 zeus kernel: [ 0.000000] DMI: Supermicro AS -1023US-TR4/H11DSU-iN, BIOS 2.1 02/21/2020
Feb 1 04:13:23 zeus kernel: [ 0.000000] tsc: Fast TSC calibration using PIT
Feb 1 04:13:23 zeus systemd-udevd[1023]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 1 04:13:23 zeus kernel: [ 0.000000] tsc: Detected 3000.075 MHz processor
Feb 1 04:13:23 zeus kernel: [ 0.001252] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Feb 1 04:13:23 zeus kernel: [ 0.001253] e820: remove [mem 0x000a0000-0x000fffff] usable
Feb 1 04:13:23 zeus systemd-udevd[1071]: Using default interface naming scheme 'v240'.
Feb 1 04:13:23 zeus kernel: [ 0.001260] last_pfn = 0x404fe00 max_arch_pfn = 0x400000000
Feb 1 04:13:23 zeus kernel: [ 0.001264] MTRR default type: uncachable
Feb 1 04:13:23 zeus kernel: [ 0.001265] MTRR fixed ranges enabled:
Feb 1 04:13:23 zeus kernel: [ 0.001265] 00000-9FFFF write-back
Feb 1 04:13:23 zeus kernel: [ 0.001266] A0000-BFFFF write-through
Feb 1 04:13:23 zeus kernel: [ 0.001267] C0000-D2FFF write-protect
Feb 1 04:13:23 zeus kernel: [ 0.001267] D3000-D7FFF uncachable
Feb 1 04:13:23 zeus systemd-udevd[1074]: Using default interface naming scheme 'v240'.
...
 
ok it seems you run a backup a few minutes before it reboots...

can you check the task log of that backup to see if there is anything? try disabling that backup and see if it is the problem? maybe you can trigger it by manually backing up the VMs
meanwhile you can watch the syslog "live" with 'journalctl -f'
 
I'm not sure the backup is the culprit, looking at the correlation between the scheduled backups and the reboot times, which are sometimes less than 5 minutes apart. I manually ran the backup scheduled for 04:00 and got a task failure notification in the task log, which makes sense since I've migrated that task's VMs to another node. Still, the failed task does not seem to generally correlate with a system crash.

There are 4 scheduled daily backups between 00:00 and 04:00, but I see that the node crashed 21 times on February 2nd between 00:00 and 15:27:
...
reboot system boot 5.4.60-1-pve Mon Feb 1 04:22 - 14:30 (10:07)
reboot system boot 5.4.78-2-pve Mon Feb 1 14:33 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 19:27 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 19:31 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 20:32 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 20:36 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 20:54 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 22:31 still running
reboot system boot 5.4.78-2-pve Mon Feb 1 23:34 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 00:00 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 00:04 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 01:52 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 01:55 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 02:59 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 03:20 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 06:21 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 06:25 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 07:32 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 07:35 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 08:50 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 08:54 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 12:05 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 12:43 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 12:53 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 12:57 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 14:01 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 14:17 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 15:19 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 15:23 still running
reboot system boot 5.4.78-2-pve Tue Feb 2 15:27 still running
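For anyone who wants to tally these up themselves, a quick sketch of how to count reboots per calendar day from `last reboot` output (field positions assume the wtmp layout shown above: `reboot system boot <kernel> <Dow> <Mon> <Day> ...`):

```shell
# Tally reboot entries per calendar day, reading `last reboot` output
# on stdin. $6 is the month and $7 the day of month in the lines above.
count_reboots() {
  awk '/^reboot/ { print $6, $7 }' | sort | uniq -c
}

# usage: last reboot | count_reboots
```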

I checked journalctl but found that /etc/systemd/journald.conf was at its default of Storage=auto, so the journal wasn't being persisted across boots and I couldn't look any further back than the most recent crash at 15:27 on February 2nd. I've set Storage=persistent so that I can dig backward after any future crashes. I'll post back here when that happens.
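For reference, here's roughly what the change involves (a sketch; note that Storage=auto only persists the journal if /var/log/journal already exists, which it doesn't by default on a Debian/Proxmox install):

```shell
# Make the journal persistent so logs survive a crash/reboot.
mkdir -p /var/log/journal
sed -i 's/^#\?Storage=.*/Storage=persistent/' /etc/systemd/journald.conf
systemctl restart systemd-journald

# After the next crash, list recorded boots and read the tail of the
# previous one:
journalctl --list-boots
journalctl -b -1 -e
```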

In the meantime, I'm wondering if the issue may be stemming from either a bad memory module or a failing power supply and whether checking journalctl will provide any insights in those cases.
 
I'll post back here when that happens.
ok

In the meantime, I'm wondering if the issue may be stemming from either a bad memory module or a failing power supply and whether checking journalctl will provide any insights in those cases.
if it simply resets without warning, then no, the logs will not be helpful... if this is a server with an iLO/iDRAC/iKVM, are there any relevant logs/events?
 
if it simply resets without warning, then no, the logs will not be helpful... if this is a server with an iLO/iDRAC/iKVM, are there any relevant logs/events?
The server in question is a Supermicro 819UAS-10 and, from what I can tell remotely, it does not appear to have any configured out-of-band management. Supermicro does provide an offline diagnostics tool, run from a USB drive, that includes memory and PSU testing. It looks like I'm straying well out of Proxmox territory with this one, but I'll post back here with the diagnosis if I can confirm I've got some glitchy hardware that just needs a yank-and-replace.
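If the board's BMC turns out to be enabled after all (the H11DSU-iN in the DMI line above normally ships with IPMI), its event log might show a power or thermal event right at a reboot. Roughly, assuming ipmitool is installed and the BMC is reachable locally:

```shell
# Query the BMC event log and sensor readings (sketch; requires a
# working BMC and the ipmitool package).
ipmitool sel elist   # system event log: power, thermal, ECC events
ipmitool sensor      # current sensor readings and thresholds
```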

Thanks!
 
I ran Supermicro offline diagnostics on this node and everything passed, including memory and PSUs. It looks like the hardware angle may have been a dead end.

After running diagnostics, I rebooted the node back into Proxmox and powered down all VMs to see if the problem resurfaced. As of today, the node has been running without VMs for over 80 hours continuously. I've powered up one VM and will monitor the status of this node to see if anything changes.
 
I think I've narrowed this issue down. The reboots started after I migrated a VM from a different node and stopped after I removed that VM. The two nodes have different hardware: this Supermicro has an AMD EPYC CPU, while the other node runs an Intel Xeon. A mismatched hardware configuration seemed like a plausible way for things to go haywire. However, checking the configurations of both the migrated VM and a "native" VM that was previously running on this node, I see that both use the default kvm64 CPU type, so they should be hardware-agnostic.
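For reference, this is roughly how I compared the CPU settings (VMID 122 is taken from the backup log earlier in the thread; any other VMID works the same way):

```shell
# Show a VM's effective CPU setting (sketch; an absent "cpu:" line
# means the kvm64 default is in use).
qm config 122 | grep -i '^cpu'

# or read the config file directly:
grep -i '^cpu' /etc/pve/qemu-server/122.conf
```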

Still, I'd like to pin down exactly what caused the frequent rebooting. The reboot events were sudden, often interrupting syslog mid-write, and the logs on the VM itself were no help. I know there are a number of things that can go wrong when porting an existing VM to different hardware, but I'm not sure where to look next. The problematic node is temporarily out of production, so I have the freedom to poke around with a certain degree of impunity.

Thanks!
 
