Intermittent crashing

urbanator

Member
Nov 29, 2019
Hi,

I'm having a problem with my Proxmox server crashing intermittently (no CLI is shown, just the kernel log), anywhere from 90 minutes to 24 hours after a reboot.

I've checked the logs in /var/log and have not found anything that stands out in the lead-up to a crash. The first crash happened after the server had been stable and running for 6 days; no updates were applied and no configuration was changed during this time.

I am going to run a memtest tomorrow to check the RAM for faults, and I've already checked that the CPU temps look normal. Other than that, does anybody have any ideas as to what could be causing this (or any other logs/files to look at that could help find the cause)?
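For anyone hunting through the same logs: one rough way to find what was written just before each crash is to look for the point where the kernel uptime stamp (the `[ 10.616896]` part of each kernel line) jumps back towards zero, which marks a new boot. The sketch below is only an illustration under that assumption; the function name `lines_before_reboots` and the `context` parameter are made up for this example, and it only understands the syslog line format shown in this thread.

```python
import re
from collections import deque

# Matches the uptime stamp in lines such as
# "Dec 23 07:29:52 charlie kernel: [ 10.616896] ..."
UPTIME_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def lines_before_reboots(lines, context=5):
    """Return, for each detected reboot, the last `context` lines logged before it.

    A reboot is assumed whenever the kernel uptime stamp is lower than the
    previous one seen. Non-kernel lines are kept as context but ignored for
    reboot detection.
    """
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    last_uptime = None
    results = []
    for line in lines:
        line = line.rstrip("\n")
        match = UPTIME_RE.search(line)
        if match:
            uptime = float(match.group(1))
            if last_uptime is not None and uptime < last_uptime:
                # Uptime went backwards: everything in `recent` predates this boot.
                results.append(list(recent))
            last_uptime = uptime
        recent.append(line)
    return results
```

It could be fed an open file, for example `lines_before_reboots(open("/var/log/kern.log", errors="replace"))`. If the machine hangs hard, the last lines may never reach the disk, so an empty or uninteresting result does not rule anything out.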

sys.log at the moment of a crash (the white section with symbols appears to be corruption in the log, perhaps written as it crashed?):
VirtualBoxVM_2019-12-21_18-06-27.png

kern.log (the only entries in this log before the crash are from when the scheduled daily backups ran at 5am. The server crashed at around 2:40pm later that day, and nobody could reboot it until around 5:15pm):
VirtualBoxVM_2019-12-21_18-05-54.png

The package versions being run:
proxmox-ve: 6.1-2 (running kernel: 5.3.13-1-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-1
pve-kernel-helper: 6.1-1
pve-kernel-5.0: 6.0-11
pve-kernel-5.3.13-1-pve: 5.3.13-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-3-pve: 5.0.21-7
pve-kernel-5.0.21-2-pve: 5.0.21-7
pve-kernel-5.0.15-1-pve: 5.0.15-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-15
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191127-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-4
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
Just an update on this: a memtest found no errors. I re-installed the latest kernel and headers to rule out any problems with that.
I have installed kdump-tools as suggested here, but it didn't create a log (I assume we need to give it longer to create the log and reboot).

The only noticeable error in the sys.log I can find is this, which is shown on startup:
Dec 23 07:29:52 charlie kernel: [ 10.616896] ------------[ cut here ]------------
Dec 23 07:29:52 charlie kernel: [ 10.616897] General protection fault in user access. Non-canonical address?
Dec 23 07:29:52 charlie kernel: [ 10.616904] WARNING: CPU: 7 PID: 864 at arch/x86/mm/extable.c:126 ex_handler_uaccess+0x52/0x60
Dec 23 07:29:52 charlie kernel: [ 10.616904] Modules linked in: edac_mce_amd kvm_amd kvm irqbypass zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio nouveau mxm_wmi video ttm snd_hda_intel drm_kms_helper snd_hda_codec snd_hda_core drm snd_hwdep joydev i2c_algo_bit input_leds snd_pcm fb_sys_fops pcspkr syscopyarea sysfillrect snd_timer sysimgblt wmi_bmof ccp snd soundcore k10temp mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc nct6775 hwmon_vid ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c usbmouse hid_generic usbkbd usbhid hid i2c_piix4 r8169 realtek ahci libahci wmi gpio_amdpt gpio_generic
Dec 23 07:29:52 charlie kernel: [ 10.616933] CPU: 7 PID: 864 Comm: kworker/u32:6 Tainted: P O 5.3.13-1-pve #1
Dec 23 07:29:52 charlie kernel: [ 10.616934] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A320M Pro4, BIOS P5.10 10/05/2018
Dec 23 07:29:52 charlie kernel: [ 10.616935] RIP: 0010:ex_handler_uaccess+0x52/0x60
Dec 23 07:29:52 charlie kernel: [ 10.616937] Code: c4 08 b8 01 00 00 00 5b 5d c3 80 3d 85 d6 78 01 00 75 db 48 c7 c7 58 10 34 8e 48 89 75 f0 c6 05 71 d6 78 01 01 e8 ff a1 01 00 <0f> 0b 48 8b 75 f0 eb bc 66 0f 1f 44 00 00 0f 1f 44 00 00 55 80 3d
Dec 23 07:29:52 charlie kernel: [ 10.616937] RSP: 0018:ffff9c0080fc7cc0 EFLAGS: 00010282
Dec 23 07:29:52 charlie kernel: [ 10.616938] RAX: 0000000000000000 RBX: ffffffff8de02448 RCX: 0000000000000000
Dec 23 07:29:52 charlie kernel: [ 10.616939] RDX: 0000000000000007 RSI: ffffffff8eb83f7f RDI: 0000000000000246
Dec 23 07:29:52 charlie kernel: [ 10.616939] RBP: ffff9c0080fc7cd0 R08: ffffffff8eb83f40 R09: 0000000000029fc0
Dec 23 07:29:52 charlie kernel: [ 10.616940] R10: 00000019c9e71dd8 R11: ffffffff8eb83f40 R12: 000000000000000d
Dec 23 07:29:52 charlie kernel: [ 10.616940] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Dec 23 07:29:52 charlie kernel: [ 10.616941] FS: 0000000000000000(0000) GS:ffff88fdd6bc0000(0000) knlGS:0000000000000000
Dec 23 07:29:52 charlie kernel: [ 10.616941] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 23 07:29:52 charlie kernel: [ 10.616942] CR2: 00005575fbb30bf0 CR3: 000000020edf6000 CR4: 00000000003406e0
Dec 23 07:29:52 charlie kernel: [ 10.616942] Call Trace:
Dec 23 07:29:52 charlie kernel: [ 10.616946] fixup_exception+0x4a/0x61
Dec 23 07:29:52 charlie kernel: [ 10.616948] do_general_protection+0x4e/0x150
Dec 23 07:29:52 charlie kernel: [ 10.616950] general_protection+0x28/0x30
Dec 23 07:29:52 charlie kernel: [ 10.616952] RIP: 0010:strnlen_user+0x4c/0x110
Dec 23 07:29:52 charlie kernel: [ 10.616953] Code: f8 0f 86 e1 00 00 00 48 29 f8 45 31 c9 0f 01 cb 0f ae e8 48 39 c6 49 89 fa 48 0f 46 c6 41 83 e2 07 48 83 e7 f8 31 c9 4c 01 d0 <4c> 8b 1f 85 c9 0f 85 96 00 00 00 42 8d 0c d5 00 00 00 00 41 b8 01
Dec 23 07:29:52 charlie kernel: [ 10.616953] RSP: 0018:ffff9c0080fc7de8 EFLAGS: 00050206
Dec 23 07:29:52 charlie kernel: [ 10.616954] RAX: 0000000000020000 RBX: a8a2ed5c956fbe00 RCX: 0000000000000000
Dec 23 07:29:52 charlie kernel: [ 10.616954] RDX: a8a2ed5c956fbe00 RSI: 0000000000020000 RDI: a8a2ed5c956fbe00
Dec 23 07:29:52 charlie kernel: [ 10.616955] RBP: ffff9c0080fc7df8 R08: 8080808080808080 R09: 0000000000000000
Dec 23 07:29:52 charlie kernel: [ 10.616955] R10: 0000000000000000 R11: 0000000000000000 R12: 00007fffffffefe6
Dec 23 07:29:52 charlie kernel: [ 10.616955] R13: ffff88fdd5caefe6 R14: 0000000000000000 R15: fffff52388572b80
Dec 23 07:29:52 charlie kernel: [ 10.616958] ? _copy_from_user+0x3e/0x60
Dec 23 07:29:52 charlie kernel: [ 10.616960] copy_strings.isra.35+0x92/0x380
Dec 23 07:29:52 charlie kernel: [ 10.616961] __do_execve_file.isra.42+0x5b5/0x9d0
Dec 23 07:29:52 charlie kernel: [ 10.616963] ? kmem_cache_alloc+0x110/0x220
Dec 23 07:29:52 charlie kernel: [ 10.616964] do_execve+0x25/0x30
Dec 23 07:29:52 charlie kernel: [ 10.616966] call_usermodehelper_exec_async+0x188/0x1b0
Dec 23 07:29:52 charlie kernel: [ 10.616967] ? call_usermodehelper+0xb0/0xb0
Dec 23 07:29:52 charlie kernel: [ 10.616968] ret_from_fork+0x22/0x40
Dec 23 07:29:52 charlie kernel: [ 10.616970] ---[ end trace 6a445521f65566dd ]---
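A trace like the one above is easier to compare between boots once it is boiled down to the warning line, the task name and the list of functions in the call trace. The following is a minimal sketch of that idea, not an official tool; `summarize_trace` and the regular expressions are assumptions written against the exact line format pasted here, and frames prefixed with `?` are flagged as unreliable, as the kernel itself marks them.

```python
import re

# A call-trace frame: "[ 10.616946] fixup_exception+0x4a/0x61", optionally
# prefixed with "? " when the kernel considers the frame unreliable.
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")
WARNING_RE = re.compile(r"WARNING: (.+)$")
COMM_RE = re.compile(r"Comm: (\S+)")


def summarize_trace(lines):
    """Condense one kernel WARNING/oops block into a small dictionary."""
    summary = {"warning": None, "comm": None, "frames": []}
    in_trace = False
    for line in lines:
        line = line.rstrip()
        if "---[ end trace" in line:
            break  # end of this block
        if summary["warning"] is None:
            match = WARNING_RE.search(line)
            if match:
                summary["warning"] = match.group(1)
        if summary["comm"] is None:
            match = COMM_RE.search(line)
            if match:
                summary["comm"] = match.group(1)
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if match:
                # (function name, True if the frame is reliable)
                summary["frames"].append((match.group(2), match.group(1) is None))
    return summary
```

Run on the block above, this should report the task as `kworker/u32:6` and a frame list running from `fixup_exception` down to `ret_from_fork`. Note that the block is logged as a WARNING at around 10 seconds of uptime, so on its own it does not show what happens at the moment of the later crashes.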
 
I had the same issue for months. I ended up disabling C-States and have not had a crash for a couple of months now.
 
Another update on this: I disabled the "Cool'n'quiet" setting in the BIOS (this is an AMD Ryzen setup) around 7 days ago and the system seems to be slightly more stable. It ran for just over 4 days before crashing again, and then for another 3 days before the next crash, which is an improvement over crashing almost every 24 hours. I'm going to try disabling "C6 mode" in the BIOS when I get the chance, as I think this could still be enabled.
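To double-check from the running system which idle states the kernel is actually using after a BIOS change, the cpuidle entries under sysfs can be read. This is a rough sketch assuming the standard Linux layout `/sys/devices/system/cpu/cpuN/cpuidle/stateM/` with `name`, `disable` and `usage` files; the helper name `read_cpuidle_states` is made up for this example, and the state names the kernel reports do not necessarily match the BIOS wording such as "C6 mode".

```python
from pathlib import Path


def read_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """List the idle states the kernel exposes for one CPU.

    Returns an empty list if the cpuidle directory does not exist,
    for example when no cpuidle driver is active.
    """
    states = []
    idle_dir = Path(cpu_dir) / "cpuidle"
    if not idle_dir.is_dir():
        return states
    # Sort numerically so that state10 comes after state9.
    state_dirs = sorted(idle_dir.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in state_dirs:
        def read(name):
            # Some attributes may be missing depending on the kernel/driver.
            path = state_dir / name
            return path.read_text().strip() if path.exists() else None

        states.append({
            "state": state_dir.name,
            "name": read("name"),
            "disabled": read("disable") == "1",
            "usage": read("usage"),
        })
    return states
```

Printing the result, for example with `for s in read_cpuidle_states(): print(s)`, shows each state with its usage counter; a deep state whose counter keeps climbing is still being entered. As far as I know, writing `1` to a state's `disable` file as root turns that state off until the next reboot, which may be a quicker test than a trip into the BIOS, but treat that as something to verify on your own kernel.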

There is still nothing in any of the logs that I can see immediately before it crashes, but I did see this in the kernel log at startup:
I'm unsure if this is related though, as the system can stay stable for days without issue after boot-up.
VirtualBoxVM_2020-01-09_15-57-53.png
 
