Server boot problem after shutdown

Emanuele Rizzo

Jun 19, 2018
Hello, I installed Proxmox VE 5.1 on three nodes (SunFire x4140).
They worked until I shut them down. After the shutdown they don't boot and reboot continuously.
I reinstalled Proxmox VE on one node. If I reboot it, the node boots correctly, but if I shut it down, the boot problem recurs.

I tried to start the node with a live Linux system, but there wasn't any boot log file in /var/log to check.
Can someone help me, please?
 
Syslog doesn't record anything. The last line refers to the last shutdown, after which the server no longer boots.
 
Do you have a way to check the hardware for malfunctions, such as iLO, iDRAC, or IPMI?
After the shutdown they don't boot and reboot continuously.
When this happens, does the server boot, start Proxmox, and then reboot again once it is up, or does it already reboot during the BIOS stage? Have you tested the newer PVE version 5.2, which has kernel 4.15?
 
Do you have a way to check the hardware for malfunctions, such as iLO, iDRAC, or IPMI?
I tried installing Ubuntu on these nodes and they work correctly, so I ruled out hardware faults.

The server boots (I can see the GRUB menu) and starts Proxmox. Just before the login console appears, it restarts.
 
Strange, Proxmox uses the Ubuntu kernel. Please attach the syslog, kern.log, etc. for the specific time, and also dmesg.
 
Strange, Proxmox uses the Ubuntu kernel. Please attach the syslog, kern.log, etc. for the specific time, and also dmesg.

As I wrote in the first post, there isn't any record in the log files (syslog, kern.log), and there are no bootlog or dmesg files in the /var/log directory.

The last record in syslog is:
"Jun 21 11:40:30 pvenode8 systemd[1]: Stopped PVE Cluster Ressource Manager Daemon."

The last record in kern.log is:
"Jun 21 11:40:26 pvenode8 pve-guests[2673]: <root@pam> end task UPID:pvenode8:00000A79:0000E484:5B2B728A:stopall::root@pam: OK"

I know that it is very strange, but that is how it is, and I do not know what to do.
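
One possible reason for the missing records: with the default journald setting (Storage=auto), the systemd journal is only kept on disk if /var/log/journal exists. Otherwise it lives under /run and is lost at every reboot. Below is a minimal sketch, assuming the node's root filesystem is mounted at /mnt from the live system. The mount point and both helper names are my own assumptions, not something from this thread.

```shell
# Hypothetical helper: create the directory that makes journald (Storage=auto)
# keep its logs on disk. $1 is the root of the target system, default "/".
enable_persistent_journal() {
    root="${1:-/}"
    mkdir -p "${root%/}/var/log/journal"
}

# Hypothetical helper: print the last $2 (default 100) journal lines stored
# under the target root $1, for example after the next failed boot.
show_stored_journal() {
    root="${1:-/}"
    journalctl -D "${root%/}/var/log/journal" --no-pager -n "${2:-100}"
}
```

From the live system this would be `enable_persistent_journal /mnt`, then let the node boot until it resets, boot the live system again, and run `show_stored_journal /mnt`. Whether anything reaches the disk before the reset is not guaranteed.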
 
I installed Proxmox VE 5.2 and ran the commands you suggested after the first boot:


root@pvenode8:~# dmesg -l warn
[ 0.052103] #2 #3 #4 #5
[ 0.172791] mtrr: your CPUs had inconsistent variable MTRR settings
[ 0.310070] ACPI: PCI Interrupt Link [LUB0] enabled at IRQ 23
[ 0.884531] ACPI: PCI Interrupt Link [LUB2] enabled at IRQ 22
[ 1.547231] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[ 1.768600] ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 21
[ 1.769997] ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 20
[ 1.772070] ACPI: PCI Interrupt Link [LMAC] enabled at IRQ 23
[ 1.772479] ACPI: PCI Interrupt Link [LNEB] enabled at IRQ 19
[ 1.773024] aacraid 0000:07:00.0: can't disable ASPM; OS doesn't have ASPM control
[ 1.773744] aacraid: Comm Interface enabled
[ 1.778118] ACPI: PCI Interrupt Link [LE3B] enabled at IRQ 43
[ 1.937066] ACPI: PCI Interrupt Link [LNEA] enabled at IRQ 18
[ 2.097267] ACPI: PCI Interrupt Link [LNED] enabled at IRQ 17
[ 2.257192] ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 16
[ 2.320858] ACPI: PCI Interrupt Link [LMAD] enabled at IRQ 22
[ 2.864919] ACPI: PCI Interrupt Link [LSA2] enabled at IRQ 21
[ 3.374308] mlx4_core 0000:83:00.0: Requested number of MACs is too much for port 1, reducing to 1
[ 3.374311] mlx4_core 0000:83:00.0: Requested number of VLANs is too much for port 1, reducing to 1
[ 4.099758] ACPI: PCI Interrupt Link [IIM0] enabled at IRQ 47
[ 4.656916] ACPI: PCI Interrupt Link [IIM1] enabled at IRQ 46
[ 5.200881] ACPI: PCI Interrupt Link [ISI0] enabled at IRQ 45
[ 5.202480] ACPI: PCI Interrupt Link [ISI1] enabled at IRQ 44
[ 5.203432] ACPI: PCI Interrupt Link [ISI2] enabled at IRQ 47
[ 14.116531] spl: loading out-of-tree module taints kernel.
[ 14.121195] znvpair: module license 'CDDL' taints kernel.
[ 14.121197] Disabling lock debugging due to kernel taint
[ 14.771197] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 19
[ 16.534663] device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
[ 17.716772] new mount options do not match the existing superblock, will be ignored

-------------
root@pvenode8:~# dmesg -l err
[ 1.547131] ERST: Failed to get Error Log Address Range.
[ 1.663332] APEI: Can not request [mem 0xd7fb8830-0xd7fb8883] for APEI BERT registers
[ 3.374321] mlx4_core 0000:83:00.0: command 0x34 failed: fw status = 0x2
[ 3.374387] mlx4_core 0000:83:00.0: Fail to get port 1 uplink guid
[ 3.374452] mlx4_core 0000:83:00.0: command 0x34 failed: fw status = 0x2
[ 3.374513] mlx4_core 0000:83:00.0: Fail to get port 2 uplink guid
[ 3.374571] mlx4_core 0000:83:00.0: Fail to get physical port id
[ 8.730955] print_req_error: I/O error, dev sr0, sector 1252960
[ 10.362073] print_req_error: I/O error, dev sr0, sector 1252964
[ 10.362132] Buffer I/O error on dev sr0, logical block 313241, async page read
[ 10.998074] print_req_error: I/O error, dev sr0, sector 248
[ 11.288948] print_req_error: I/O error, dev sr0, sector 248
[ 11.289004] Buffer I/O error on dev sr0, logical block 62, async page read
[ 11.295197] print_req_error: I/O error, dev sr0, sector 252
[ 11.295254] Buffer I/O error on dev sr0, logical block 63, async page read
[ 11.581948] print_req_error: I/O error, dev sr0, sector 512
[ 11.593198] print_req_error: I/O error, dev sr0, sector 512
[ 11.593280] Buffer I/O error on dev sr0, logical block 128, async page read
[ 12.436948] print_req_error: I/O error, dev sr0, sector 516
[ 12.437030] Buffer I/O error on dev sr0, logical block 129, async page read
[ 12.444073] print_req_error: I/O error, dev sr0, sector 4096
[ 12.444155] Buffer I/O error on dev sr0, logical block 1024, async page read
[ 12.465948] print_req_error: I/O error, dev sr0, sector 4100
[ 12.466029] Buffer I/O error on dev sr0, logical block 1025, async page read
[ 12.488064] Buffer I/O error on dev sr0, logical block 313238, async page read
[ 12.508938] Buffer I/O error on dev sr0, logical block 313239, async page read
[ 12.530938] Buffer I/O error on dev sr0, logical block 313238, async page read
[ 14.661600] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
[ 14.661773] k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
[ 14.807763] Error: Driver 'pcspkr' is already registered, aborting...
 
I'm not a hardware specialist, but that doesn't look good.
mtrr: your CPUs had inconsistent variable MTRR settings
Looks like there is some misconfiguration in the BIOS.

znvpair: module license 'CDDL' taints kernel.
What filesystem do you use?

APEI: Can not request [mem 0xd7fb8830-0xd7fb8883] for APEI BERT registers
Looks like there is some misconfiguration in the BIOS.

[ 3.374321] mlx4_core 0000:83:00.0: command 0x34 failed: fw status = 0x2
Problem with your network card; please update its firmware.

[ 10.362073] print_req_error: I/O error, dev sr0, sector 1252964
[ 10.362132] Buffer I/O error on dev sr0, logical block 313241, async page read
[ 10.998074] print_req_error: I/O error, dev sr0, sector 248
[ 11.288948] print_req_error: I/O error, dev sr0, sector 248
[ 11.289004] Buffer I/O error on dev sr0, logical block 62, async page read
[ 11.295197] print_req_error: I/O error, dev sr0, sector 252
Is a damaged CD-ROM inserted?

Is your server firmware up to date? If not, update all of the firmware. Contact support.
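
To see how old the firmware actually is, the kernel exposes the BIOS vendor, version, and date as files under /sys/class/dmi/id. Here is a small sketch that prints them. The function name is made up for illustration, and the optional directory argument exists only so it can be pointed at a copy of those files.

```shell
# Hypothetical helper: print BIOS vendor, version and date from sysfs.
# $1 is the DMI directory, default /sys/class/dmi/id on the running system.
show_bios_info() {
    dmi_dir="${1:-/sys/class/dmi/id}"
    for field in bios_vendor bios_version bios_date; do
        # Skip fields the platform does not expose.
        if [ -r "$dmi_dir/$field" ]; then
            printf '%s: %s\n' "$field" "$(cat "$dmi_dir/$field")"
        fi
    done
}
```

Running `show_bios_info` without an argument reads the live system. For the Mellanox card, `ethtool -i <interface>` reports a firmware-version field. The interface name depends on the system, so it is left as a placeholder here.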
 
I installed Debian 9.4 on the same node.
dmesg returns the same errors and warnings but it works!


root@debiantest:~# dmesg -l warn
[ 0.422406] mtrr: your CPUs had inconsistent variable MTRR settings
[ 0.539405] ACPI: PCI Interrupt Link [LUB0] enabled at IRQ 23
[ 1.112527] ACPI: PCI Interrupt Link [LUB2] enabled at IRQ 22
[ 1.483315] [Firmware Warn]: GHES: Poll interval is 0 for generic hardware error source: 1, disabled.
[ 2.625012] ACPI: PCI Interrupt Link [LMAC] enabled at IRQ 21
[ 2.629656] ACPI: PCI Interrupt Link [LNEB] enabled at IRQ 19
[ 2.634881] ACPI: PCI Interrupt Link [LE3B] enabled at IRQ 43
[ 2.805082] ACPI: PCI Interrupt Link [LNEA] enabled at IRQ 18
[ 2.977047] ACPI: PCI Interrupt Link [LNED] enabled at IRQ 17
[ 3.157248] ACPI: PCI Interrupt Link [LNEC] enabled at IRQ 16
[ 3.168951] ACPI: PCI Interrupt Link [LMAD] enabled at IRQ 20
[ 3.713145] ACPI: PCI Interrupt Link [IIM0] enabled at IRQ 47
[ 3.729204] ACPI: PCI Interrupt Link [LSA0] enabled at IRQ 23
[ 3.730207] ACPI: PCI Interrupt Link [LSA1] enabled at IRQ 22
[ 3.731000] ACPI: PCI Interrupt Link [LSA2] enabled at IRQ 21
[ 4.222497] mlx4_core 0000:83:00.0: Requested number of MACs is too much for port 1, reducing to 1
[ 4.222499] mlx4_core 0000:83:00.0: Requested number of VLANs is too much for port 1, reducing to 1
[ 4.256930] ACPI: PCI Interrupt Link [IIM1] enabled at IRQ 46
[ 4.800928] ACPI: PCI Interrupt Link [ISI0] enabled at IRQ 45
[ 4.802669] ACPI: PCI Interrupt Link [ISI1] enabled at IRQ 44
[ 4.803490] ACPI: PCI Interrupt Link [ISI2] enabled at IRQ 47
[ 9.235647] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 19
root@debiantest:~#
root@debiantest:~#
root@debiantest:~# dmesg -l err
[ 1.483221] ERST: Failed to get Error Log Address Range.
[ 2.533838] BERT: Can't request iomem region <00000000d7fb8830-00000000d7fb8883>.
[ 4.222509] mlx4_core 0000:83:00.0: command 0x34 failed: fw status = 0x2
[ 4.222560] mlx4_core 0000:83:00.0: Fail to get port 1 uplink guid
[ 4.222611] mlx4_core 0000:83:00.0: command 0x34 failed: fw status = 0x2
[ 4.222658] mlx4_core 0000:83:00.0: Fail to get port 2 uplink guid
[ 4.222702] mlx4_core 0000:83:00.0: Fail to get physical port id
[ 9.092652] k10temp 0000:00:18.3: unreliable CPU thermal sensor; monitoring disabled
[ 9.092743] k10temp 0000:00:19.3: unreliable CPU thermal sensor; monitoring disabled
[ 395.137777] [drm:drm_edid_block_valid [drm]] *ERROR* EDID checksum is invalid, remainder is 198
[ 395.137785] Raw EDID:
[ 395.137791] 00 ff ff ff ff ff ff 00 1d a3 72 00 00 00 00 00
[ 395.137794] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 395.137797] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 395.137799] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 395.137802] 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 395.137804] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 395.137807] 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 395.137809] 00 00 00 00 00 00 00 00 9f ff ff ff ff ff ff ff
root@debiantest:~#
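
Since both systems print almost the same messages, it may be easier to look only at what differs. A rough sketch, assuming the output of `dmesg -l err` (or `-l warn`) was saved to a text file on each system. The file names and function names are hypothetical.

```shell
# Strip the leading "[ 12.345678] " timestamp so that identical messages
# compare equal even though they were logged at different times.
strip_dmesg_timestamps() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] ?//' "$1"
}

# Print messages that occur in file $1 but not in file $2.
dmesg_only_in() {
    a=$(mktemp)
    b=$(mktemp)
    strip_dmesg_timestamps "$1" | sort -u > "$a"
    strip_dmesg_timestamps "$2" | sort -u > "$b"
    comm -23 "$a" "$b"    # column 1 only: lines unique to the first file
    rm -f "$a" "$b"
}
```

For example, `dmesg_only_in pve-err.txt debian-err.txt` would list the errors seen only under Proxmox. Messages that differ only in wording between kernel versions will still show up as different.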
 
Strange. The server does not boot completely, only the kernel and some services. Maybe it works if you install Debian first? https://pve.proxmox.com/wiki/Install_Proxmox_VE_on_Debian_Stretch
Another option would be to boot with an older kernel such as 4.10. Your controller/BIOS looks like it is from 2008, maybe too old for the PVE kernel.
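
To see which kernels GRUB can currently boot, the menu titles can be read from the generated config. A minimal sketch, assuming the usual Debian location /boot/grub/grub.cfg. The function name is hypothetical.

```shell
# Hypothetical helper: list the titles of all menu entries and submenus
# in a GRUB config file ($1, default /boot/grub/grub.cfg).
list_grub_entries() {
    cfg="${1:-/boot/grub/grub.cfg}"
    # Titles are the first single-quoted string on each menuentry/submenu line.
    grep -E "^[[:space:]]*(menuentry|submenu) " "$cfg" | cut -d"'" -f2
}
```

With a title from that list, GRUB_DEFAULT in /etc/default/grub can be set to "submenu title>entry title", followed by `update-grub`. An older kernel only appears in the list if its pve-kernel package is installed, which may not be the case on a fresh 5.2 install.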
 
