Proxmox died and didn't reboot

TechHome

Proxmox died today and didn't reboot. The server was still powered on, but Proxmox was no longer active (black screen).

Here are the last syslog entries:
Code:
Dec 16 03:52:03 pve vzdump[16817]: INFO: Starting Backup of VM 140 (qemu)
Dec 16 03:52:48 pve pvestatd[1453]: status update time (5.363 seconds)
Dec 16 03:53:10 pve pvestatd[1453]: status update time (7.641 seconds)
Dec 16 03:53:28 pve pvestatd[1453]: status update time (5.830 seconds)
Dec 16 03:53:37 pve pvestatd[1453]: status update time (5.137 seconds)
Dec 16 03:54:29 pve pvestatd[1453]: status update time (6.871 seconds)
Dec 16 03:54:47 pve pvestatd[1453]: status update time (5.064 seconds)
Dec 16 03:55:33 pve kernel: [2630773.403678] nfs: server 192.168.1.46 not responding, still trying
Dec 16 03:55:33 pve kernel: [2630773.404322] nfs: server 192.168.1.46 not responding, still trying
Dec 16 03:55:33 pve kernel: [2630773.404650] nfs: server 192.168.1.46 not responding, still trying
Dec 16 03:55:33 pve kernel: [2630773.404962] nfs: server 192.168.1.46 not responding, still trying
Dec 16 03:55:33 pve kernel: [2630773.405273] nfs: server 192.168.1.46 not responding, still trying
Dec 16 03:55:33 pve kernel: [2630773.407307] nfs: server 192.168.1.46 OK
Dec 16 03:55:34 pve kernel: [2630773.854498] nfs: server 192.168.1.46 OK
Dec 16 03:55:34 pve kernel: [2630773.922401] nfs: server 192.168.1.46 OK
Dec 16 03:55:34 pve kernel: [2630773.990255] nfs: server 192.168.1.46 OK
Dec 16 03:55:39 pve kernel: [2630778.913894] call_decode: 175 callbacks suppressed
Dec 16 03:55:39 pve kernel: [2630778.913958] nfs: server 192.168.1.46 OK
Dec 16 03:55:39 pve kernel: [2630778.914201] nfs: server 192.168.1.46 OK
Dec 16 03:55:39 pve kernel: [2630778.914402] nfs: server 192.168.1.46 OK
Dec 16 03:55:39 pve kernel: [2630778.914512] nfs: server 192.168.1.46 OK
Dec 16 03:55:39 pve kernel: [2630778.914723] nfs: server 192.168.1.46 OK
Dec 16 03:55:39 pve pvestatd[1453]: status update time (6.832 seconds)
Dec 16 03:55:49 pve vzdump[16817]: INFO: Finished Backup of VM 140 (00:03:46)
Dec 16 03:55:49 pve vzdump[16817]: INFO: Backup job finished successfully
Dec 16 03:55:49 pve postfix/pickup[26553]: D2076811E4: uid=0 from=<root>
Dec 16 03:55:49 pve vzdump[16816]: <root@pam> end task UPID:pve:000041B1:0F9E64E8:5FD94E02:vzdump::root@pam: OK
Dec 16 03:55:49 pve postfix/cleanup[1392]: D2076811E4: message-id=<20201216025549.D2076811E4@lin.unifi>
Dec 16 03:55:49 pve postfix/qmgr[1435]: D2076811E4: from=<root@pannfi>, size=230226, nrcpt=1 (queue active)
Dec 16 03:55:51 pve postfix/smtp[1394]: D2076811E4:.tk>, relay=smtp.gmail.com[142.250.102.108]:587, delay=2, delays=0.02/0.02/0.34/1.6, dsn=2.0.0, status=sent (250 2.0.0 OK  1608087351 d14sm19902920edu.63 - gsmtp)
Dec 16 03:55:51 pve postfix/qmgr[1435]: D2076811E4: removed
Dec 16 04:05:39 pve systemd[1]: Starting Daily PVE download activities...
Dec 16 04:05:41 pve pveupdate[4248]: <root@pam> starting task UPID:pve:000010A3:0FAF640B:5FD97985:aptupdate::root@pam:
Dec 16 04:05:42 pve pveupdate[4259]: update new package list: /var/lib/pve-manager/pkgupdates
Dec 16 04:05:44 pve pveupdate[4248]: <root@pam> end task UPID:pve:000010A3:0FAF640B:5FD97985:aptupdate::root@pam: OK
Dec 16 04:05:44 pve systemd[1]: pve-daily-update.service: Succeeded.
Dec 16 04:05:44 pve systemd[1]: Started Daily PVE download activities.
Dec 16 04:08:26 pve kernel: [2631546.001506] mce: [Hardware Error]: Machine check events logged
Dec 16 04:09:01 pve kernel: [2631581.090561] audit: type=1400 audit(1608088141.686:1530): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-100_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=5648 comm="(ionclean)" srcname="/" flags="rw, rbind"
Dec 16 04:17:01 pve CRON[7872]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 16 04:40:35 pve kernel: [2633474.586335] mce: [Hardware Error]: Machine check events logged
Dec 16 04:41:37 pve kernel: [2633537.200253] nvme 0000:03:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec 16 04:42:30 pve kernel: [2633589.604433] mce: [Hardware Error]: Machine check events logged
Dec 16 05:08:44 pve kernel: [2635163.720854] mce: [Hardware Error]: Machine check events logged
Dec 16 05:09:01 pve kernel: [2635180.914441] audit: type=1400 audit(1608091741.713:1532): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-100_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=22051 comm="(ionclean)" srcname="/" flags="rw, rbind"
Dec 16 05:17:01 pve CRON[24276]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 16 05:39:01 pve kernel: [2636980.829000] audit: type=1400 audit(1608093541.731:1533): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-100_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=30195 comm="(ionclean)" srcname="/" flags="rw, rbind"
Dec 16 05:44:16 pve kernel: [2637295.225761] pcieport 0000:00:02.2: AER: Multiple Corrected error received: 0000:00:02.2
Dec 16 05:44:16 pve kernel: [2637295.228674] pcieport 0000:00:02.2: AER:   device [8086:6f06] error status/mask=00001100/00002000
Dec 16 05:44:16 pve kernel: [2637295.229440] pcieport 0000:00:02.2: AER:    [ 8] Rollover           
Dec 16 05:44:16 pve kernel: [2637295.230189] pcieport 0000:00:02.2: AER:    [12] Timeout             
Dec 16 05:44:16 pve kernel: [2637295.230965] pcieport 0000:00:02.2: AER:   Error of this Agent is reported first
Dec 16 05:44:16 pve kernel: [2637295.234157] pcieport 0000:00:02.2: AER: Multiple Corrected error received: 0000:00:02.2
Dec 16 06:15:34 pve systemd[1]: Starting Daily apt upgrade and clean activities...
Dec 16 06:15:35 pve systemd[1]: apt-daily-upgrade.service: Succeeded.
Dec 16 06:15:35 pve systemd[1]: Started Daily apt upgrade and clean activities.
Dec 16 06:17:01 pve CRON[8598]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 16 06:25:01 pve CRON[10768]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))


Code:
root@pve:~# pveversion -v
proxmox-ve: 6.2-2 (running kernel: 5.4.65-1-pve)
pve-manager: 6.2-12 (running version: 6.2-12/b287dd27)
pve-kernel-5.4: 6.2-7
pve-kernel-helper: 6.2-7
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.5
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.0-3
libpve-common-perl: 6.2-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.0-6
libpve-storage-perl: 6.2-8
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 0.9.0-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.3-1
pve-cluster: 6.2-1
pve-container: 3.2-2
pve-docs: 6.2-6
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-1
pve-qemu-kvm: 5.1.0-3
pve-xtermjs: 4.7.0-2
qemu-server: 6.2-15
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.4-pve2
 
It crashed again. These are the last syslog messages:

Code:
Dec 16 12:58:27 pve pvedaemon[1416]: <root@pam> successful auth for user 'root@pam'
Dec 16 13:11:30 pve systemd[1]: Starting Cleanup of Temporary Directories...
Dec 16 13:11:30 pve systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Dec 16 13:11:30 pve systemd[1]: Started Cleanup of Temporary Directories.
Dec 16 13:17:01 pve CRON[10020]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 16 14:17:01 pve CRON[26164]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
 
Hi,

Can you post the output of journalctl?
 
Please run mkdir -p /var/log/journal to enable persistent journals. That way you can check the output from the previous boot when the system freezes again.
 
I took a look at your logs. Before the reboot there are a bunch of these:
Code:
Dec 16 19:21:06 pve kernel: mce: [Hardware Error]: Machine check events logged
That is a machine check exception (MCE), so your server is likely freezing because of hardware errors.

https://wiki.archlinux.org/index.php/Machine-check_exception
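
Once the persistent journal is enabled, the kernel messages from the boot that froze can be checked after the reboot. A minimal sketch, assuming journalctl's -k (kernel messages) and -b -1 (previous boot) options; mce_lines is just a made-up helper name for illustration:

```shell
# Hypothetical helper: keep only the machine-check lines from a log on stdin.
mce_lines() {
    # -F treats the pattern as a fixed string, so the brackets need no escaping.
    grep -F 'mce: [Hardware Error]'
}

# Usage on the host, after the next freeze and reboot:
#   journalctl -k -b -1 | mce_lines
```

If this prints nothing for the previous boot, the freeze may have happened before the kernel could log anything, so an empty result does not rule out hardware errors.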
 
mcelog output:
Code:
mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
not finished?
CPU 4 BANK 0 TSC 2123decdcd77
RIP !INEXACT! 10:ffffffffac0ff117
TIME 1608143383 Wed Dec 16 19:29:43 2020
MCG status:RIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal parity error
STATUS f200000000010005 MCGSTATUS 5
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
not finished?
CPU 4 BANK 0 TSC 1b4a8d78c7e7
RIP !INEXACT! 10:ffffffff994fe7bb
TIME 1598731578 Sat Aug 29 22:06:18 2020
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal parity error
STATUS b200000000010005 MCGSTATUS 5
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
not finished?
CPU 4 BANK 0 TSC 1676a66f6b56b0
RIP !INEXACT! 10:ffffffff914bceab
TIME 1608097127 Wed Dec 16 06:38:47 2020
MCG status:RIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal parity error
STATUS f200000000010005 MCGSTATUS 5
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
not finished?
CPU 4 BANK 0 TSC 11a479c1216c
RIP !INEXACT! 10:ffffffff86c3b4a4
TIME 1608127696 Wed Dec 16 15:08:16 2020
MCG status:RIPV MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal parity error
STATUS f200000000010005 MCGSTATUS 5
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: failed to prefill DIMM database from DMI data
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 4a104b614e0
TIME 1608309894 Fri Dec 18 17:44:54 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 4a9516a49b8
TIME 1608309909 Fri Dec 18 17:45:09 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 61d048d38a8
TIME 1608310576 Fri Dec 18 17:56:16 2020
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCA: Internal parity error
STATUS d000008000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 6570e615890
TIME 1608310680 Fri Dec 18 17:58:00 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 67e8b83b2d0
TIME 1608310751 Fri Dec 18 17:59:11 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 6cb6d3f5cd8
TIME 1608310889 Fri Dec 18 18:01:29 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
mcelog: warning: 32 bytes ignored in each record
mcelog: consider an update
Hardware event. This is not a software error.
MCE 0
CPU 4 BANK 0 TSC 7064b01f978
TIME 1608310994 Fri Dec 18 18:03:14 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCA: Internal parity error
STATUS 9000004000010005 MCGSTATUS 0
MCGCAP 7000c16 APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 79
 
[Attached image 1608547011726.png: the figure from the Intel Software Developer's Manual referred to below]

Take the STATUS codes from the log and convert them to binary:
Code:
echo d000008000010005 | xxd -p -r | xxd -b -c1 | cut -d' ' -f2 | tr -d '\n'
1101000000000000000000001000000000000000000000010000000000000101

The rightmost bit is bit 0 and the leftmost is bit 63.

You can then interpret each bit by checking the figure (from the Intel Software Developer's Manual).
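
The most useful flags sit in the top bits, so they can also be pulled out directly instead of counting positions in the binary string. A minimal sketch, assuming the architectural IA32_MCi_STATUS layout as I read it in the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 15:0 MCA error code). decode_mci_status is a made-up helper name, not part of mcelog or rasdaemon:

```shell
# Hypothetical helper: decode the flag bits of one 64-bit STATUS value (hex).
# The bit positions are an assumption based on the Intel SDM's IA32_MCi_STATUS layout.
decode_mci_status() (
    # Strip an optional 0x prefix and left-pad to 16 hex digits.
    hex=$(printf '%16s' "${1#0x}" | tr ' ' '0')
    hi=$(printf '%s' "$hex" | cut -c1-8)    # bits 63..32
    lo=$(printf '%s' "$hex" | cut -c9-16)   # bits 31..0
    for entry in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57; do
        name=${entry%%:*}
        bit=${entry##*:}
        # Each flag lives in the high half, so shift by (bit - 32).
        printf '%s=%d ' "$name" $(( (0x$hi >> ($bit - 32)) & 1 ))
    done
    # The low 16 bits hold the MCA error code.
    printf 'MCACOD=0x%04x\n' $(( 0x$lo & 0xffff ))
)

decode_mci_status d000008000010005
decode_mci_status f200000000010005
```

The first value decodes with UC=0 and OVER=1, the second with UC=1 and PCC=1, which lines up with the "Corrected error" and "Uncorrected error / Processor context corrupt" records mcelog printed for those STATUS codes.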

Also notice that the errors always come from the same CPU and bank: CPU 4 BANK 0.
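
To check this over a longer log without reading every record, the CPU/BANK lines can be tallied. A rough sketch; count_mce_sources is a hypothetical helper, and the log path in the usage comment is an assumption (use wherever mcelog writes on your system):

```shell
# Hypothetical helper: count mcelog records per CPU/bank pair.
# Reads mcelog text from stdin, or from the files given as arguments.
count_mce_sources() {
    # Match only lines like "CPU 4 BANK 0 TSC ..."; "CPUID ..." lines have a different first field.
    awk '$1 == "CPU" && $3 == "BANK" { n[$1 " " $2 " " $3 " " $4]++ }
         END { for (k in n) print n[k], k }' "$@"
}

# Usage (path is an assumption):
#   count_mce_sources /var/log/mcelog
```

A single output line means every logged event came from one CPU and bank. Several lines would point to a wider problem than one core.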


Either way, mcelog has been deprecated in favor of rasdaemon. ras-mc-ctl --errors could give you more readable output.
 
I booted VMware ESXi, hoping for logs that are easier to understand. It turns out that only PCPU8 and PCPU9 are faulty.
VMware gives me these logs:
Code:
MCA: 150: UC Excp G5 B0 Sb20000000010005 A0 M0 P0/0 Internal Parity Error.
Now the question is whether I can disable or ignore cores 8 and 9 in Proxmox.
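
One possible approach, not confirmed anywhere in this thread: Linux can take individual logical CPUs offline at runtime by writing 0 to their online file in sysfs. A minimal sketch under that assumption; offline_cpus and the SYSFS_CPU_ROOT override are made-up names for illustration. Linux CPU numbers do not have to match the ESXi PCPU numbers (mcelog reported CPU 4 with APICID 8), so the right numbers need to be checked first:

```shell
# Hypothetical helper: take the given logical CPUs offline via sysfs (needs root).
# SYSFS_CPU_ROOT is an illustrative override; the default is the real sysfs path.
offline_cpus() (
    root=${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}
    for n in "$@"; do
        # Writing 0 asks the kernel to stop scheduling work on this CPU.
        echo 0 > "$root/cpu$n/online" || exit 1
    done
)

# Usage on the host (the CPU numbers here are placeholders):
#   cat /sys/devices/system/cpu/cpu4/topology/thread_siblings_list
#   offline_cpus 4 <sibling from the list above>
```

This does not survive a reboot, and it would only work around the symptom: a CPU reporting uncorrected internal parity errors is still faulty hardware.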
 