Proxmox host crashed? Found in a powered down state...

den · Mar 7, 2015

Hi guys, I found my Proxmox server turned off. It had powered down on its own, and I don't know how.

I suspect a crash, but if it had crashed I'd expect it to be hung or to have rebooted, not powered down.

how can i check what happened?
do you want the syslog?
can i check elsewhere?
what else do you need to check?

I'm running v3.4 with ZFS software RAID (RAID1, mirror), installed via your build process.

dietmar · Mar 7, 2015

den said:
do you want the syslog?

You just need to analyze/read the syslog...

den · Mar 7, 2015

Hi, thanks for the reply.

Below is the syslog. I can't find anything in it, though I don't really know how to read it. I'd really appreciate any help.

http://pastebin.com/raw.php?i=ui3J2Mpy

The last logged item before the issue was at 12:17:01
And the entry at 12:39:57 must be the start of the server coming back up.
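
The same check (last entry before the outage, first entry after it) can be scripted instead of read off by eye. Below is a minimal Python sketch, not a definitive tool. It assumes the classic `Mon DD HH:MM:SS` syslog prefix and assumes that an `imklog ... started` line marks a fresh boot, as in the excerpts posted in this thread. The function names and the `year` parameter are illustrative; syslog lines of this style carry no year, so it has to be supplied.

```python
from datetime import datetime

# Assumption: rsyslog's kernel log reader announcing itself marks a boot,
# as seen in the syslog excerpts in this thread.
BOOT_MARKER = "imklog"


def parse_ts(line, year=2015):
    """Parse the 15-character 'Mon DD HH:MM:SS' prefix of a syslog line.

    The year is not stored in the log, so it is passed in (an assumption).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def gaps_before_boot(lines, year=2015):
    """Yield (last_line_before_boot, boot_line, gap) for each boot marker."""
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        is_boot = BOOT_MARKER in line and "started" in line
        if is_boot and previous is not None:
            gap = parse_ts(line, year) - parse_ts(previous, year)
            yield previous, line, gap
        previous = line
```

Feeding it the lines of `/var/log/syslog` (for example via `open("/var/log/syslog")`) gives one tuple per boot; a long gap with nothing logged in between points at a hard freeze or power loss rather than a clean shutdown, which would normally leave shutdown messages behind.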

den · Mar 9, 2015

any help please...

steffen · Mar 9, 2015

Sorry, can't help. I had the exact same issue a few days ago and thought it was a sudden fluke in the hardware. RAM and disk checks brought nothing to light. My Proxmox logs show nothing either; it looks like a complete freeze with no error announcing the sudden death.

But this box is running the latest from testing. None of my stable / subscription machines have ever shown anything like it.

den · Mar 9, 2015

interesting.. thanks for the reply.
i'm using Proxmox VE 3.4. is that not the stable version?

steffen · Mar 9, 2015

Yes, 3.4 sounds right. However, you might want to read this thread about which repo you are using: http://forum.proxmox.com/threads/17875-Confused-about-repositories

den · Mar 10, 2015

thanks..
i'm going to put the answers below for others.. if they hit the same question

Proxmox dev git ----> pvetest repo -------> pve-no-subscription -----> pve-enterprise
http://pve.proxmox.com/wiki/Package_repositories

bean · Apr 24, 2015

Hi,
I have Proxmox VE 3.4 and exactly the same problem. I found nothing abnormal in my syslog, but my server crashes every 3-4 days. Can anyone help?

den · Apr 25, 2015

Try sticking to the original build from the ISO without upgrading or updating. The ISO build seems to be more stable. Only upgrade when a new image is out.
Otherwise get a subscription and upgrade.

longhair · Apr 25, 2015

How about some computer specs? (CPU, memory - ecc or non-ecc, motherboard, hard drives w/ age, etc.)

den · Apr 25, 2015

any help anyone...

bean · Apr 25, 2015

longhair said:
How about some computer specs? (CPU, memory - ecc or non-ecc, motherboard, hard drives w/ age, etc.)

Hi, I have a Dell PowerEdge R220 server with 1 Intel Xeon CPU (8 cores), 32 GB ECC memory, and 2x 1 TB SATA3 disks. It runs Proxmox VE 3.4 (upgraded from 3.3) with 4 KVM machines.
My syslog reports no problems. The server went offline at about 17:23, and I rebooted it at 17:42. Here is my syslog:
Apr 24 16:07:28 sv391 kernel: tg3 0000:01:00.1: eth1: Link is down
Apr 24 16:09:30 sv391 kernel: tg3 0000:01:00.1: eth1: Link is up at 100 Mbps, full duplex
Apr 24 16:09:30 sv391 kernel: tg3 0000:01:00.1: eth1: Flow control is on for TX and on for RX
Apr 24 16:09:30 sv391 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Apr 24 16:17:01 sv391 /USR/SBIN/CRON[439567]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 24 16:23:56 sv391 rrdcached[2665]: flushing old values
Apr 24 16:23:56 sv391 rrdcached[2665]: rotating journals
Apr 24 16:23:56 sv391 rrdcached[2665]: started new journal /var/lib/rrdcached/journal/rrd.journal.1429885436.244686
Apr 24 16:23:56 sv391 rrdcached[2665]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1429878236.244697
Apr 24 16:33:50 sv391 kernel: tg3 0000:01:00.1: eth1: Link is down
Apr 24 16:35:52 sv391 kernel: tg3 0000:01:00.1: eth1: Link is up at 100 Mbps, full duplex
Apr 24 16:35:52 sv391 kernel: tg3 0000:01:00.1: eth1: Flow control is on for TX and on for RX
Apr 24 16:35:52 sv391 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Apr 24 16:50:47 sv391 kernel: tg3 0000:01:00.1: eth1: Link is down
Apr 24 16:52:48 sv391 kernel: tg3 0000:01:00.1: eth1: Link is up at 100 Mbps, full duplex
Apr 24 16:52:48 sv391 kernel: tg3 0000:01:00.1: eth1: Flow control is on for TX and on for RX
Apr 24 16:52:48 sv391 kernel: tg3 0000:01:00.1: eth1: EEE is disabled
Apr 24 17:17:01 sv391 /USR/SBIN/CRON[447538]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 24 17:23:56 sv391 rrdcached[2665]: flushing old values
Apr 24 17:23:56 sv391 rrdcached[2665]: rotating journals
Apr 24 17:23:56 sv391 rrdcached[2665]: started new journal /var/lib/rrdcached/journal/rrd.journal.1429889036.244711
Apr 24 17:23:56 sv391 rrdcached[2665]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1429881836.244704
Apr 24 17:42:29 sv391 kernel: imklog 5.8.11, log source = /proc/kmsg started.
Apr 24 17:42:29 sv391 rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2553" x-info="http://www.rsyslog.com"] start
Apr 24 17:42:29 sv391 kernel: Initializing cgroup subsys cpuset
Apr 24 17:42:29 sv391 kernel: Initializing cgroup subsys cpu
Apr 24 17:42:29 sv391 kernel: Linux version 2.6.32-37-pve (root@lola) (gcc version 4.7.2 (Debian 4.7.2-5) ) #1 SMP Wed Mar 18 08:19:56 CET 2015
Apr 24 17:42:29 sv391 kernel: Command line: BOOT_IMAGE=/vmlinuz-2.6.32-37-pve root=UUID=83340bac-8016-4e54-be1a-4f6d73b7d5cf ro quiet
Apr 24 17:42:29 sv391 kernel: KERNEL supported cpus:
Apr 24 17:42:29 sv391 kernel: Intel GenuineIntel
Apr 24 17:42:29 sv391 kernel: AMD AuthenticAMD
Apr 24 17:42:29 sv391 kernel: Centaur CentaurHauls
Apr 24 17:42:29 sv391 kernel: BIOS-provided physical RAM map:

Thanks

den · Apr 28, 2015

So it looks like I spoke too soon. It ran for months without a crash, and then all of a sudden it looks like it froze.
The server summary shows that monitoring stopped as well.

The last event before it was powered back on was at Apr 27 22:17:01.

Any help anyone?? I have the full log if anyone needs it.

SYSLOG

Code:

Apr 27 20:10:03 lion rrdcached[2848]: flushing old values
Apr 27 20:10:03 lion rrdcached[2848]: rotating journals
Apr 27 20:10:03 lion rrdcached[2848]: started new journal /var/lib/rrdcached/journal/rrd.journal.1430129403.008559
Apr 27 20:10:03 lion rrdcached[2848]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1430122203.008555
Apr 27 20:17:01 lion /USR/SBIN/CRON[480300]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 27 21:10:03 lion rrdcached[2848]: flushing old values
Apr 27 21:10:03 lion rrdcached[2848]: rotating journals
Apr 27 21:10:03 lion rrdcached[2848]: started new journal /var/lib/rrdcached/journal/rrd.journal.1430133003.008564
Apr 27 21:10:03 lion rrdcached[2848]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1430125803.008547
Apr 27 21:13:28 lion pvedaemon[438615]: <root@pam> successful auth for user 'dilan.jayawardana@pve'
Apr 27 21:13:55 lion pvedaemon[442108]: <dilan.jayawardana@pve> starting task UPID:lion:00077706:19E123A2:553E19F3:vncproxy:313:dilan.jayawardana@pve:
Apr 27 21:13:55 lion pvedaemon[489222]: starting vnc proxy UPID:lion:00077706:19E123A2:553E19F3:vncproxy:313:dilan.jayawardana@pve:
Apr 27 21:17:02 lion /USR/SBIN/CRON[489715]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 27 21:27:21 lion kernel: vmbr0: port 5(tap313i0) entering disabled state
Apr 27 21:27:21 lion halevt: Running: halevt-umount -u /org/freedesktop/Hal/devices/net_0a_89_d6_a1_bd_78; halevt-umount -s
Apr 27 21:27:22 lion pvedaemon[491374]: starting vnc proxy UPID:lion:00077F6E:19E25EC1:553E1D1A:vncproxy:313:dilan.jayawardana@pve:
Apr 27 21:27:22 lion pvedaemon[438615]: <dilan.jayawardana@pve> starting task UPID:lion:00077F6E:19E25EC1:553E1D1A:vncproxy:313:dilan.jayawardana@pve:
Apr 27 21:27:23 lion ntpd[2777]: Deleting interface #190 tap313i0, fe80::889:d6ff:fea1:bd78#123, interface stats: received=0, sent=0, dropped=0, active_time=44653 secs
Apr 27 21:27:23 lion ntpd[2777]: peers refreshed
Apr 27 21:27:23 lion qm[491376]: VM 313 qmp command failed - VM 313 not running
Apr 27 21:27:23 lion pvedaemon[491374]: command '/bin/nc -l -p 5902 -w 10 -c '/usr/sbin/qm vncproxy 313 2>/dev/null'' failed: exit code 255
Apr 27 22:10:03 lion rrdcached[2848]: flushing old values
Apr 27 22:10:03 lion rrdcached[2848]: rotating journals
Apr 27 22:10:03 lion rrdcached[2848]: started new journal /var/lib/rrdcached/journal/rrd.journal.1430136603.008637
Apr 27 22:10:03 lion rrdcached[2848]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1430129403.008559
Apr 27 22:17:01 lion /USR/SBIN/CRON[499119]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 28 09:27:48 lion kernel: imklog 5.8.11, log source = /proc/kmsg started.
Apr 28 09:27:48 lion rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2685" x-info="http://www.rsyslog.com"] start
Apr 28 09:27:48 lion kernel: Initializing cgroup subsys cpuset
Apr 28 09:27:48 lion kernel: Initializing cgroup subsys cpu
Apr 28 09:27:48 lion kernel: Linux version 2.6.32-37-pve (root@lola) (gcc version 4.7.2 (Debian 4.7.2-5) ) #1 SMP Wed Feb 11 10:00:27 CET 2015
Apr 28 09:27:48 lion kernel: Command line: BOOT_IMAGE=/ROOT/pve-1@/boot/vmlinuz-2.6.32-37-pve root=ZFS=rpool/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet
Apr 28 09:27:48 lion kernel: KERNEL supported cpus:
Apr 28 09:27:48 lion kernel:  Intel GenuineIntel
Apr 28 09:27:48 lion kernel:  AMD AuthenticAMD
Apr 28 09:27:48 lion kernel:  Centaur CentaurHauls
Apr 28 09:27:48 lion kernel: BIOS-provided physical RAM map:
...

VERSION

Code:

proxmox-ve-2.6.32: 3.3-147 (running kernel: 2.6.32-37-pve)
pve-manager: 3.4-1 (running version: 3.4-1/3f2d890e)
pve-kernel-2.6.32-37-pve: 2.6.32-147
lvm2: 2.02.98-pve4
clvm: 2.02.98-pve4
corosync-pve: 1.4.7-1
openais-pve: 1.1.4-3
libqb0: 0.11.1-2
redhat-cluster-pve: 3.2.0-2
resource-agents-pve: 3.9.2-4
fence-agents-pve: 4.0.10-2
pve-cluster: 3.0-16
qemu-server: 3.3-20
pve-firmware: 1.1-3
libpve-common-perl: 3.0-24
libpve-access-control: 3.0-16
libpve-storage-perl: 3.0-31
pve-libspice-server1: 0.12.4-3
vncterm: 1.1-8
vzctl: 4.0-1pve6
vzprocps: 2.0.11-2
vzquota: 3.1-2
pve-qemu-kvm: 2.1-12
ksm-control-daemon: 1.1-1
glusterfs-client: 3.5.2-1

den · Apr 28, 2015

Hmm... looks like these servers crashed after the same event.

bean said:
Apr 24 17:23:56 sv391 rrdcached[2665]: started new journal /var/lib/rrdcached/journal/rrd.journal.1429889036.244711
Apr 24 17:23:56 sv391 rrdcached[2665]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1429881836.244704

den said:
Apr 27 22:10:03 lion rrdcached[2848]: started new journal /var/lib/rrdcached/journal/rrd.journal.1430136603.008637
Apr 27 22:10:03 lion rrdcached[2848]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1430129403.008559

marhat · Apr 29, 2016

Hello, I have the same problem: the server restarts on its own. Has anyone solved it?

den · Apr 29, 2016

Not solved. Well, I've rebuilt mine on 4.2 now, so I hope the problem goes away.

adgenet · May 8, 2016

I'm seeing what I think are these same hangs/crashes every 24 to 48 hours or so too.
My machine stays powered on but everything is hung and unresponsive.
Neither syslog nor the BMC logs show anything interesting.
This is running 4.2...

I originally thought this was a memory issue, but I swapped to different sticks (ecc reg) and ran memtest for quite a few passes without issues or errors.

The other thing I thought could cause it was high IOH temperature (my board is an X8DTH-if, which has two IOH chips right next to each other that get really hot).
To combat this I stuck an 80mm fan directly on top, but the hang/crash still seems to occur.
I'm going to attempt a fresh install from the latest ISO, but if that doesn't fix it, I'm just going to move to something else.

The last entry in syslog always is the bit about

Code:

rrdcached[4234]: removing old journal /var/lib/rrdcached/journal/rrd.journal.1462685004.325075

I see this task runs every hour, so it might just be that there is nothing else that may be hitting syslog between this and the hang, but it's the only thing that I have to work with.
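
One way to test the "it is just the hourly job" reading is to measure how regularly that entry appears. The Python sketch below is a rough illustration under stated assumptions: the log uses the classic `Mon DD HH:MM:SS` prefix, the `year` argument is supplied by hand because the log does not store it, and `intervals` is a hypothetical helper name, not part of any Proxmox tooling.

```python
from datetime import datetime


def parse_ts(line, year=2016):
    """Parse the 15-character 'Mon DD HH:MM:SS' prefix of a syslog line.

    The year is an assumption passed in by the caller.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def intervals(lines, needle, year=2016):
    """Return seconds between consecutive log lines that contain `needle`."""
    stamps = [parse_ts(line, year) for line in lines if needle in line]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
```

If `intervals(lines, "removing old journal")` comes back as a steady run of 3600-second values, the entry is simply the most recent periodic message, and the hang could have happened at any point in the hour after it. In the syslog excerpt posted earlier in the thread, the two `removing old journal` entries (16:23:56 and 17:23:56) are exactly one hour apart.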

den · May 8, 2016

All the best. Let us know how you go. It's very annoying...

marhat · May 8, 2016

I have the same problem, and I have checked everything as well. The one thing I have not checked is the RAID. Do you use a hardware or a software solution?
