Abrupt shutdown & reboot of PVE 5.1 earlier today

peterG · Jan 23, 2018

hi all

earlier today at 9:42am experienced a sudden abrupt shutdown & reboot, unexplained ..

any clues fr anyone?

the hardware is beefy, Dell T710 server grade chassis, Raid6, 2 Hexacores and 24GB of ram, it's always been reliable and ran great uninterrupted with PVE3.2 (4VM's, 2 kvm Win Server and 2 Linux, Ubuntu Server & Debian) since 2014, and since 1-1-2018 it was upgraded to PVE 5.1, clean fresh new install of 5.1 and fresh new Raid controller and new Hitachi HD's. It's been running great until earlier today.

here's the evidence gathered thus far:

pveversion -v

proxmox-ve: 5.1-32 (running kernel: 4.13.13-2-pve)
pve-manager: 5.1-41 (running version: 5.1-41/0b958203)
pve-kernel-4.13.13-2-pve: 4.13.13-32
libpve-http-server-perl: 2.0-8
lvm2: 2.02.168-pve6
corosync: 2.4.2-pve3
libqb0: 1.0.1-1
pve-cluster: 5.0-19
qemu-server: 5.0-18
pve-firmware: 2.0-3
libpve-common-perl: 5.0-25
libpve-guest-common-perl: 2.0-14
libpve-access-control: 5.0-7
libpve-storage-perl: 5.0-17
pve-libspice-server1: 0.12.8-3
vncterm: 1.5-3
pve-docs: 5.1-12
pve-qemu-kvm: 2.9.1-5
pve-container: 2.0-18
pve-firewall: 3.0-5
pve-ha-manager: 2.0-4
ksm-control-daemon: 1.2-2
glusterfs-client: 3.8.8-1
lxc-pve: 2.1.1-2
lxcfs: 2.0.8-1
criu: 2.11.1-1~bpo90
novnc-pve: 0.6-4
smartmontools: 6.5+svn4324-1
zfsutils-linux: 0.7.3-pve1~bpo9

The relevant portion extracted out of syslog at time of abrupt shutdown & reboot, at 9:42am

Jan 22 09:33:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:34:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:34:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:35:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:35:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:36:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:36:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:36:31 MainPVE-HypervisorChassis postfix/qmgr[1966]: 5E40E440027: from=<>, size=2685, nrcpt=1 (queue active)

Jan 22 09:36:31 MainPVE-HypervisorChassis postfix/local[14982]: error: open database /etc/aliases.db: No such file or directory

Jan 22 09:36:31 MainPVE-HypervisorChassis postfix/local[14982]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory

Jan 22 09:36:31 MainPVE-HypervisorChassis postfix/local[14982]: warning: hash:/etc/aliases: lookup of 'root' failed

Jan 22 09:36:31 MainPVE-HypervisorChassis postfix/local[14982]: 5E40E440027: to=<root@MainPVE-HypervisorChassis.Dyn.org>, relay=local, delay=25803, delays=25803/0.02/0/0.13, dsn=4.3.0, status=deferred (alias database unavailable)

Jan 22 09:36:31 MainPVE-HypervisorChassis postfix/local[14982]: using backwards-compatible default setting relay_domains=$mydestination to update fast-flush logfile for domain "MainPVE-HypervisorChassis
.Dyn.org"

Jan 22 09:37:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:37:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:38:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:38:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:39:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:39:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:40:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:40:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:41:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:41:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 09:42:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 09:42:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan 22 09:47:19 MainPVE-HypervisorChassis systemd-modules-load[490]: Inserted module 'iscsi_tcp'
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Mounted RPC Pipe File System.
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Starting pNFS block layout mapping daemon...
Jan 22 09:47:19 MainPVE-HypervisorChassis blkmapd[533]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory

Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Started pNFS block layout mapping daemon.
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Started Flush Journal to Persistent Storage.
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Started udev Coldplug all Devices.
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Starting udev Wait for Complete Device Initialization...

Jan 22 09:47:19 MainPVE-HypervisorChassis keyboard-setup.sh[498]: cannot open file /tmp/tmpkbd.sfGy9k
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd-modules-load[490]: Inserted module 'ib_iser'
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Started udev Kernel Device Manager.
Jan 22 09:47:19 MainPVE-HypervisorChassis systemd[1]: Mounted NFSD configuration filesystem.

More clues, Earlier in the day, we see this at end of syslog.1:

Jan 22 06:22:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:23:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:23:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:24:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:24:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:25:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:25:01 MainPVE-HypervisorChassis CRON[27979]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))

Jan 22 06:25:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:25:02 MainPVE-HypervisorChassis systemd[1]: Reloading PVE API Proxy Server.
Jan 22 06:25:04 MainPVE-HypervisorChassis pveproxy[28055]: send HUP to 2176
Jan 22 06:25:04 MainPVE-HypervisorChassis systemd[1]: Reloaded PVE API Proxy Server.
Jan 22 06:25:04 MainPVE-HypervisorChassis systemd[1]: Reloading PVE SPICE Proxy Server.
Jan 22 06:25:04 MainPVE-HypervisorChassis pveproxy[2176]: received signal HUP
Jan 22 06:25:04 MainPVE-HypervisorChassis pveproxy[2176]: server closing
Jan 22 06:25:04 MainPVE-HypervisorChassis pveproxy[2176]: server shutdown (restart)
Jan 22 06:25:05 MainPVE-HypervisorChassis spiceproxy[28066]: send HUP to 2205
Jan 22 06:25:05 MainPVE-HypervisorChassis systemd[1]: Reloaded PVE SPICE Proxy Server.
Jan 22 06:25:05 MainPVE-HypervisorChassis systemd[1]: Stopping Proxmox VE firewall logger...
Jan 22 06:25:05 MainPVE-HypervisorChassis pvefw-logger[8126]: received terminate request (signal)
Jan 22 06:25:05 MainPVE-HypervisorChassis pvefw-logger[8126]: stopping pvefw logger
Jan 22 06:25:05 MainPVE-HypervisorChassis spiceproxy[2205]: received signal HUP
Jan 22 06:25:05 MainPVE-HypervisorChassis systemd[1]: Stopped Proxmox VE firewall logger.
Jan 22 06:25:05 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE firewall logger...
Jan 22 06:25:05 MainPVE-HypervisorChassis pvefw-logger[28088]: starting pvefw logger
Jan 22 06:25:05 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE firewall logger.
Jan 22 06:25:05 MainPVE-HypervisorChassis spiceproxy[2205]: server closing

Then in begining of Syslog we have this:

Jan 22 06:25:05 MainPVE-HypervisorChassis liblogging-stdlog: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1328" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

Jan 22 06:25:05 MainPVE-HypervisorChassis spiceproxy[2205]: server shutdown (restart)
Jan 22 06:25:06 MainPVE-HypervisorChassis pveproxy[2176]: restarting server
Jan 22 06:25:06 MainPVE-HypervisorChassis pveproxy[2176]: starting 3 worker(s)
Jan 22 06:25:06 MainPVE-HypervisorChassis pveproxy[2176]: worker 28120 started
Jan 22 06:25:06 MainPVE-HypervisorChassis pveproxy[2176]: worker 28121 started
Jan 22 06:25:06 MainPVE-HypervisorChassis pveproxy[2176]: worker 28122 started
Jan 22 06:25:07 MainPVE-HypervisorChassis spiceproxy[2205]: restarting server
Jan 22 06:25:07 MainPVE-HypervisorChassis spiceproxy[2205]: starting 1 worker(s)
Jan 22 06:25:07 MainPVE-HypervisorChassis spiceproxy[2205]: worker 28124 started
Jan 22 06:25:07 MainPVE-HypervisorChassis liblogging-stdlog: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1328" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

Jan 22 06:25:11 MainPVE-HypervisorChassis pveproxy[27962]: worker exit
Jan 22 06:25:11 MainPVE-HypervisorChassis pveproxy[30704]: worker exit
Jan 22 06:25:12 MainPVE-HypervisorChassis pveproxy[28836]: worker exit
Jan 22 06:25:13 MainPVE-HypervisorChassis spiceproxy[31519]: worker exit
Jan 22 06:25:16 MainPVE-HypervisorChassis pveproxy[2176]: worker 27962 finished
Jan 22 06:25:16 MainPVE-HypervisorChassis pveproxy[2176]: worker 28836 finished
Jan 22 06:25:16 MainPVE-HypervisorChassis pveproxy[2176]: worker 30704 finished
Jan 22 06:25:17 MainPVE-HypervisorChassis spiceproxy[2205]: worker 31519 finished
Jan 22 06:26:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:26:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:27:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:27:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:28:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:28:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:28:27 MainPVE-HypervisorChassis rrdcached[1978]: flushing old values
Jan 22 06:28:27 MainPVE-HypervisorChassis rrdcached[1978]: rotating journals
Jan 22 06:28:27 MainPVE-HypervisorChassis rrdcached[1978]: started new journal /var/lib/rrdcached/journal/rrd.journal.1516620507.654896

Jan 22 06:28:27 MainPVE-HypervisorChassis rrdcached[1978]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1516613307.654922

Jan 22 06:28:38 MainPVE-HypervisorChassis smartd[1313]: Device: /dev/bus/2 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 153 to 157

Jan 22 06:28:38 MainPVE-HypervisorChassis smartd[1313]: Device: /dev/bus/2 [megaraid_disk_03] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 166 to 171

Jan 22 06:28:40 MainPVE-HypervisorChassis smartd[1313]: Device: /dev/bus/2 [megaraid_disk_05] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 153 to 157

Jan 22 06:29:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:29:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.
Jan 22 06:30:00 MainPVE-HypervisorChassis systemd[1]: Starting Proxmox VE replication runner...
Jan 22 06:30:01 MainPVE-HypervisorChassis systemd[1]: Started Proxmox VE replication runner.

I did notice that the disks (disk 3 is highest ) are being reported as running 171 deg Celsius (that's gotta be wrong that is as hot as a pizza oven) which is like 340 deg F.. But nothing like that as reported around the time of the abrupt shutdown and reboot..

anyone have any advice , suggestions, guidance..

TY all -pg

LATER on same day:
Got curious about the drive temps being reported by smartd as way too high and i think something is wrong with smartd reporting bcs i just checked syslog again and found the drive temps as follows, being reported by smartd:

Device: /dev/bus/2 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 150 to 153

Device: /dev/bus/2 [megaraid_disk_04] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 162 to 166

Device: /dev/bus/2 [megaraid_disk_06] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 153 to 157

Device: /dev/bus/2 [megaraid_disk_07] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 166 to 171

so i checked the 4 disks being reported here (disks 2, 4, 6 & 7) with

smartctl -x -d megaraid,7 /dev/sdb

(substituting the 7 with 2 , 4 and 6 as needed) and i got well within the specified params of like 34 to 37 degrees C, and at no time did ANY of the raid array Hitachi drives exceed their 60 deg C max specified operating temp according to smartctl stats, they stay within 34-37 deg C as one would expect ..

LnxBil · Jan 23, 2018

Without proper logging post mortem analysis is very often impossible. You need the logging output of the kernel at the crash time, everything else is total useless for debugging. I'd enable netconsole or serial logging and consider installing and activating kdump/crashkernel.

peterG · Feb 7, 2018

TY LnxBil,
Just installed & configed kdump-tools, crash, kexec-tools and makedumpfile, BCS it happened again this morning, meaning the abrupt shutdown & restart .. and after poring over the /var/log contents and the ProxMox web gui task log output it seems that pveupdate &/or apt-get update is playing a role here, as t'seems that updates came in and were partially successfully applied, a reboot occured thereafter (updates happened in middle of night and reboot occured in the morning several hours later, with out the VM's being notified nor being shut down properly, inspite of the agents being installed inside the VM's to allow for gracefull shutdowns),, so the main (single node) Dell chassis rebooted itself..

We shall see by now doing a proper "post mortem", thx again, IF ths happens again.

In the meantime, this is a production machine and i will do what i can to prevent another abrupt shurtdown-reboot and thus on a hunch I'd rather prevent another episode, so on to that.

Question, as we would rather apply updates or dist-upgrade manually periodically ourselves

HOW DO WE DISABLE pveupdate & also the cron activated nightly apt-get update

would the following work?
in /etc/cron.d
can i simply rename or remove the pveupdate file itself OR alternatively would a better way be that inside the /etc/cron.d/pveupdate simply comment out the line as thus
# 8 2 * * * * root /usr/bin/pveupdate

even tho the file does note NOT to edit it directly..

Is there a better way to do this, (doesnt seem that any choices to disable pveupdate exist in the gui)

ALSO these are some of the repeated messages daily

stopped: command 'apt-get update' failed: exit code 100
task type: aptupdate

TY for guidance/advice/thots etc

LATER same day:

what i did for now to disable both pveupdate and apt-get update from nightly cron script is this, ( pending a better solution )

in /etc/cron.daily/apt-compat
commented out last line as thus
# exec /usr/lib/apt/apt.systemd.daily

in /etc/cron.d
commented out the pveupdate entry as thus
# 8 2 * * * root /usr/bin/pveupdate

Abrupt shutdown & reboot of PVE 5.1 earlier today

peterG

Member

LnxBil

Distinguished Member

peterG

Member

We value your privacy