Proxmox node randomly restarting itself

MaverickZA

New Member
Jun 10, 2011
Hi All,

I recently added our second Proxmox server to a cluster. The problem is that the new slave keeps randomly rebooting itself. It will sometimes stay up for 7+ days and other times only a few hours.

The box in question used to be a Windows server with no problems whatsoever. I am hesitant to go to my employer and tell them that it's hardware, since it ran for months with the aforementioned Windows Server 2003. I am currently running Proxmox v1.8 on both master and slave.

Is there any sort of "feature" or the like in Proxmox that would trigger this sort of behaviour?

I am currently running a hardware stress test and it has been going fine for 20 minutes now.

What is also interesting to note is that the behaviour started when I moved a CentOS KVM VM from the master to this node via the migration tool. The VM is running fine on the box, barring the random reboots of course, but I am not sure if this is actually relevant to the issue.

I have been looking through the logs trying to find any sort of error message before it reboots, but there is none that I can see. If someone can give me a pointer as to what I should be looking for, that would be great.

Thanks for any help!
 
Go through the logs (/var/log/syslog) and see if you find any errors.

Also make sure you run the latest version. To verify, post the output of 'pveversion -v'.

And just a reminder: faulty hardware almost always works fine before it becomes faulty ...
 
Hi Tom,

Thanks for your prompt reply -

SISA9:~# pveversion -v
pve-manager: 1.8-18 (pve-manager/1.8/6070)
running kernel: 2.6.32-4-pve
proxmox-ve-2.6.32: 1.8-33
pve-kernel-2.6.32-4-pve: 2.6.32-33
qemu-server: 1.1-30
pve-firmware: 1.0-11
libpve-storage-perl: 1.0-17
vncterm: 0.9-2
vzctl: 3.0.27-1pve1
vzdump: 1.2-13
vzprocps: 2.0.11-2
vzquota: 3.0.11-1
pve-qemu-kvm: 0.14.1-1
ksm-control-daemon: 1.0-6

I grepped the syslog for error, failure, failed and fail: nothing.
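For anyone repeating this search, a broader case-insensitive scan can be sketched as below. The helper name scan_log and the keyword list are assumptions for illustration, not anything Proxmox ships; extend the list to suit your hardware.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox): scan a log file for
# common trouble keywords, case-insensitively, with line numbers.
# The keyword list is an assumption; "fail" also matches
# "failed" and "failure".
scan_log() {
    logfile="${1:-/var/log/syslog}"
    grep -inE 'error|fail|panic|oops|machine check|thermal|watchdog' "$logfile"
}

# Usage: scan_log /var/log/syslog
# grep exits with status 1 when nothing matches.
```

An empty result does not rule out a hardware fault: on a hard reset the kernel may never get the chance to write anything to disk.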

The last reported messages before the system restarted at 9:54 am were:

Oct 13 09:52:02 SISA9 snmpd[2790]: Connection from UDP: [192.168.106.17]:2160
Oct 13 09:52:02 SISA9 snmpd[2790]: Connection from UDP: [192.168.106.17]:2160
Oct 13 09:52:02 SISA9 snmpd[2790]: Connection from UDP: [192.168.106.17]:2160
Oct 13 09:52:02 SISA9 snmpd[2790]: Connection from UDP: [192.168.106.17]:2160
Oct 13 09:52:02 SISA9 snmpd[2790]: Connection from UDP: [192.168.106.17]:2160
Oct 13 09:52:06 SISA9 pvemirror[2956]: starting cluster syncronization
Oct 13 09:52:06 SISA9 pvemirror[2956]: syncing master configuration from '192.168.106.28'
Oct 13 09:52:06 SISA9 pvemirror[2956]: syncing templates
Oct 13 09:52:06 SISA9 pvemirror[2956]: cluster syncronization finished (0.27 seconds (files 0.00, config 0.12))

Then there is a 2-minute gap in the log (I assume this is when it restarted), after which I see this:

Oct 13 09:54:49 SISA9 kernel: imklog 3.18.6, log source = /proc/kmsg started.
Oct 13 09:54:49 SISA9 rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2685" x-info="http://www.rsyslog.com"] restart
Oct 13 09:54:49 SISA9 kernel: Linux version 2.6.32-4-pve (unknown) (root@oahu) (gcc version 4.3.2 (Debian 4.3.2-1.1) ) #1 SMP Mon May 9 12:59:57 CEST 2011
Oct 13 09:54:49 SISA9 kernel: Command line: root=/dev/mapper/pve-root ro
Oct 13 09:54:49 SISA9 kernel: KERNEL supported cpus:
Oct 13 09:54:49 SISA9 kernel: Intel GenuineIntel
Oct 13 09:54:49 SISA9 kernel: AMD AuthenticAMD
Oct 13 09:54:49 SISA9 kernel: Centaur CentaurHauls
Oct 13 09:54:49 SISA9 kernel: BIOS-provided physical RAM map:
Oct 13 09:54:49 SISA9 kernel: BIOS-e820: 0000000000000000 - 000000000009f400 (usable)
Oct 13 09:54:49 SISA9 kernel: BIOS-e820: 000000000009f400 - 00000000000a0000 (reserved)
Oct 13 09:54:49 SISA9 kernel: BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
Oct 13 09:54:49 SISA9 kernel: BIOS-e820: 0000000000100000 - 00000000df62f000 (usable)
Oct 13 09:54:49 SISA9 kernel: BIOS-e820: 00000000df62f000 - 00000000df63c000 (ACPI data)
Oct 13 09:54:49 SISA9 kernel: BIOS-e820: 00000000df63c000 - 00000000df63d000 (usable)
Oct 13 09:54:49 SISA9 kernel: DMI 2.6 present.
Oct 13 09:54:49 SISA9 kernel: last_pfn = 0x11ffff max_arch_pfn = 0x400000000
Oct 13 09:54:49 SISA9 kernel: MTRR default type: write-back
Oct 13 09:54:49 SISA9 kernel: MTRR fixed ranges enabled:
Oct 13 09:54:49 SISA9 kernel: 00000-9FFFF write-back
Oct 13 09:54:49 SISA9 kernel: A0000-BFFFF uncachable
Oct 13 09:54:49 SISA9 kernel: C0000-FFFFF write-protect
Oct 13 09:54:49 SISA9 kernel: MTRR variable ranges enabled:
Oct 13 09:54:49 SISA9 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
Oct 13 09:54:49 SISA9 kernel: 1 disabled
Oct 13 09:54:49 SISA9 kernel: 2 disabled
Oct 13 09:54:49 SISA9 kernel: 3 disabled
Oct 13 09:54:49 SISA9 kernel: 4 disabled
Oct 13 09:54:49 SISA9 kernel: 5 disabled
Oct 13 09:54:49 SISA9 kernel: 6 disabled
Oct 13 09:54:49 SISA9 kernel: 7 disabled
Oct 13 09:54:49 SISA9 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Oct 13 09:54:49 SISA9 kernel: last_pfn = 0xdf63d max_arch_pfn = 0x400000000
Oct 13 09:54:49 SISA9 kernel: initial memory mapped : 0 - 20000000
Oct 13 09:54:49 SISA9 kernel: init_memory_mapping: 0000000000000000-00000000df63d000
Oct 13 09:54:49 SISA9 kernel: 0000000000 - 00df600000 page 2M
Oct 13 09:54:49 SISA9 kernel: 00df600000 - 00df63d000 page 4k
Oct 13 09:54:49 SISA9 kernel: kernel direct mapping tables up to df63d000 @ 8000-e000
Oct 13 09:54:49 SISA9 kernel: init_memory_mapping: 0000000100000000-000000011ffff000
Oct 13 09:54:49 SISA9 kernel: 0100000000 - 011fe00000 page 2M
Oct 13 09:54:49 SISA9 kernel: 011fe00000 - 011ffff000 page 4k
Oct 13 09:54:49 SISA9 kernel: kernel direct mapping tables up to 11ffff000 @ c000-13000
Oct 13 09:54:49 SISA9 kernel: RAMDISK: 37572000 - 37fefd87
Oct 13 09:54:49 SISA9 kernel: ACPI: RSDP 00000000000f4f00 00024 (v02 HP )
Oct 13 09:54:49 SISA9 kernel: ACPI: XSDT 00000000df630040 000B4 (v01 HP ProLiant 00000002 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: FACP 00000000df630140 000F4 (v03 HP ProLiant 00000002 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI Warning: Invalid length for Pm1aControlBlock: 32, using default 16 (20090903/tbfadt-607)
Oct 13 09:54:49 SISA9 kernel: ACPI: DSDT 00000000df630240 02005 (v01 HP DSDT 00000001 INTL 20030228)
Oct 13 09:54:49 SISA9 kernel: ACPI: FACS 00000000df62f100 00040
Oct 13 09:54:49 SISA9 kernel: ACPI: SPCR 00000000df62f140 00050 (v01 HP SPCRRBSU 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: MCFG 00000000df62f1c0 0003C (v01 HP ProLiant 00000001 00000000)
Oct 13 09:54:49 SISA9 kernel: ACPI: HPET 00000000df62f200 00038 (v01 HP ProLiant 00000002 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: FFFF 00000000df62f240 00064 (v02 HP ProLiant 00000002 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: SPMI 00000000df62f2c0 00040 (v05 HP ProLiant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: ERST 00000000df62f300 001D0 (v01 HP ProLiant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: APIC 00000000df62f500 0015E (v01 HP ProLiant 00000002 00000000)
Oct 13 09:54:49 SISA9 kernel: ACPI: SRAT 00000000df62f680 00570 (v01 HP Proliant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: FFFF 00000000df62fc00 00176 (v01 HP ProLiant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: BERT 00000000df62fd80 00030 (v01 HP ProLiant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: HEST 00000000df62fdc0 000BC (v01 HP ProLiant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: DMAR 00000000df62fe80 0011C (v01 HP ProLiant 00000001 Ò? 0000162E)
Oct 13 09:54:49 SISA9 kernel: ACPI: SSDT 00000000df632280 00125 (v03 HP CRSPCI0 00000002 HP 00000001)
Oct 13 09:54:49 SISA9 kernel: ACPI: SSDT 00000000df6323c0 00255 (v03 HP riser1a 00000002 INTL 20061109)

To me it looks just like a normal boot log, although I am no expert.

The stress test has been running for 2 hours and still no issues.
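To compare what was logged just before each restart, the last few lines ahead of every boot marker can be pulled out with a small script. This is only a sketch: the function name last_before_boot is made up, and it assumes each boot begins with the rsyslog "imklog ... started" line seen in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print the last N lines that precede each
# boot marker in a syslog file. Assumes every boot begins with an
# rsyslog "imklog ... started" line, as in the log above.
# N must be 1 or greater (default 5).
last_before_boot() {
    logfile="${1:-/var/log/syslog}"
    count="${2:-5}"
    awk -v n="$count" '
        /imklog .*started/ {
            print "--- boot detected at line " NR " ---"
            start = (NR - n > 1) ? NR - n : 1
            # Replay the buffered lines that came before the marker.
            for (i = start; i < NR; i++) print buf[i % n]
        }
        # Ring buffer holding the most recent n lines.
        { buf[NR % n] = $0 }
    ' "$logfile"
}

# Usage: last_before_boot /var/log/syslog 10
```

If the lines before every reboot are as unremarkable as the snmpd and pvemirror entries here, that points towards a hard reset rather than an orderly shutdown, since a clean reboot would normally log services stopping.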
 
Hi,

You should send your syslog to a second server; maybe the last entries aren't stored to disk. If you are using a RAID controller, also take a look at its logs.
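As a rough sketch of the remote logging idea: rsyslog's classic forwarding syntax is "*.* @host:port" for UDP and "*.* @@host:port" for TCP. The helper name forward_rule and the address 192.0.2.10 are placeholders, and the receiving server must separately be configured to accept remote messages.

```shell
#!/bin/sh
# Hypothetical helper: print an rsyslog forwarding rule for a
# remote log host. One "@" forwards over UDP, "@@" over TCP.
forward_rule() {
    host="$1"
    port="${2:-514}"
    proto="${3:-udp}"
    case "$proto" in
        tcp) prefix='@@' ;;
        *)   prefix='@' ;;
    esac
    printf '*.* %s%s:%s\n' "$prefix" "$host" "$port"
}

# Example, as root (192.0.2.10 is a placeholder address):
#   forward_rule 192.0.2.10 >> /etc/rsyslog.conf
# then restart rsyslog so the rule takes effect.
```

UDP forwarding does not wait for the receiver, so it has a better chance of getting a final message off the box before a crash, at the cost of possible message loss.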


Sven
 
Hi,

Thanks for the reply. Yep, it is running hardware RAID 5. No logs for it though. I assume you think there is an issue with the disks, hence the suggestion to log to an external server rather than local disk? I would think this unlikely because of RAID 5, and I would expect to see I/O errors in the log before it crashes. I have seen plenty of disks fail on Linux servers, RAID and non-RAID, and there are always warning signs.

I am really baffled on this one gents, any other ideas?

Regards
 
As already mentioned, always use the latest stable packages and follow the upgrade instructions.
 
If updating to the latest version does not help, run memtest for one complete pass.

I have had a few machines that randomly reboot leaving no log information.
Each one of them was traced to a RAM issue.
When I ran memtest, it too caused the server to reboot.

memtest is now the first diagnostic I run when a machine restarts for no reason.
 
Hi,
Another possibility is the PSU. If you have the chance to swap the power supply, you should do that.

Udo