Issue!! Proxmox crashing randomly!

pawanosman

New Member
Apr 3, 2021
It has been more than six months since this issue started: the Proxmox server crashes randomly, and the only thing I can do is reboot it by pressing the reset button.

The output of "pveversion -v":
Code:
proxmox-ve: 6.4-1 (running kernel: 5.4.157-1-pve)
pve-manager: 6.4-13 (running version: 6.4-13/9f411e79)
pve-kernel-5.4: 6.4-11
pve-kernel-helper: 6.4-11
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.143-1-pve: 5.4.143-1
ceph-fuse: 15.2.15-pve1~bpo10
corosync: 3.1.5-pve2~bpo10+1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 3.0.0-1+pve4~bpo10
libjs-extjs: 6.0.1-10
libknet1: 1.22-pve2~bpo10+1
libproxmox-acme-perl: 1.1.0
libproxmox-backup-qemu0: 1.1.0-1
libpve-access-control: 6.4-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.4-4
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.2-3
libpve-storage-perl: 6.4-1
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.1.13-2
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.6-1
pve-cluster: 6.4-1
pve-container: 3.3-6
pve-docs: 6.4-2
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-4
pve-firmware: 3.3-2
pve-ha-manager: 3.1-1
pve-i18n: 2.3-1
pve-qemu-kvm: 5.2.0-6
pve-xtermjs: 4.7.0-3
qemu-server: 6.4-2
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2


After rebooting, I only see these entries in the system log:

Code:
Dec 18 13:06:37 Proxmox-VE kernel:  do_syscall_64+0x58/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? exit_to_user_mode_prepare+0x170/0x1c0
Dec 18 13:06:37 Proxmox-VE kernel:  ? syscall_exit_to_user_mode+0x1c/0x30
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? sysvec_apic_timer_interrupt+0x4b/0xa0
Dec 18 13:06:37 Proxmox-VE kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 18 13:06:37 Proxmox-VE kernel: RIP: 0033:0x7f6f3e4b6413
Dec 18 13:06:37 Proxmox-VE kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
Dec 18 13:06:37 Proxmox-VE kernel: RSP: 002b:00007f6ef54e49d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: RAX: ffffffffffffffda RBX: 00007f6ef54e6b30 RCX: 00007f6f3e4b6413
Dec 18 13:06:37 Proxmox-VE kernel: RDX: 0000000000000000 RSI: 00007f6ef54e4a60 RDI: 00007f6ef54e4a60
Dec 18 13:06:37 Proxmox-VE kernel: RBP: 00007f6ef8e1c790 R08: 0000000000000000 R09: 0000000000000000
Dec 18 13:06:37 Proxmox-VE kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: R13: 0000000000000002 R14: 00007f6f0a992820 R15: 00007f6ef8e43780
Dec 18 13:06:37 Proxmox-VE kernel: Call Trace:
Dec 18 13:06:37 Proxmox-VE kernel:  switch_to_sld+0x33/0x40
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to_xtra+0x120/0x510
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to+0x35a/0x430
Dec 18 13:06:37 Proxmox-VE kernel:  ? __switch_to_asm+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  __schedule+0xbd7/0x1250
Dec 18 13:06:37 Proxmox-VE kernel:  ? tick_program_event+0x44/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_reprogram+0x9a/0xa0
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_start_range_ns+0x121/0x300
Dec 18 13:06:37 Proxmox-VE kernel:  schedule+0x3e/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  do_nanosleep+0x90/0x170
Dec 18 13:06:37 Proxmox-VE kernel:  hrtimer_nanosleep+0x94/0x130
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_init_sleeper+0x80/0x80
Dec 18 13:06:37 Proxmox-VE kernel:  __x64_sys_nanosleep+0x99/0xd0
Dec 18 13:06:37 Proxmox-VE kernel:  do_syscall_64+0x58/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? syscall_exit_to_user_mode+0x1c/0x30
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? sysvec_apic_timer_interrupt+0x4b/0xa0
Dec 18 13:06:37 Proxmox-VE kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 18 13:06:37 Proxmox-VE kernel: RIP: 0033:0x7f6f3e4b6413
Dec 18 13:06:37 Proxmox-VE kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
Dec 18 13:06:37 Proxmox-VE kernel: RSP: 002b:00007f6ef54e49d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: RAX: ffffffffffffffda RBX: 00007f6ef54e6b30 RCX: 00007f6f3e4b6413
Dec 18 13:06:37 Proxmox-VE kernel: RDX: 0000000000000000 RSI: 00007f6ef54e4a60 RDI: 00007f6ef54e4a60
Dec 18 13:06:37 Proxmox-VE kernel: RBP: 00007f6ef8e1c790 R08: 0000000000000000 R09: 0000000000000000
Dec 18 13:06:37 Proxmox-VE kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: R13: 0000000000000002 R14: 00007f6f0a992820 R15: 00007f6ef8e43780
Dec 18 13:06:37 Proxmox-VE kernel: Call Trace:
Dec 18 13:06:37 Proxmox-VE kernel:  switch_to_sld+0x33/0x40
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to_xtra+0x120/0x510
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to+0x35a/0x430
Dec 18 13:06:37 Proxmox-VE kernel:  ? __switch_to_asm+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  __schedule+0xbd7/0x1250
Dec 18 13:06:37 Proxmox-VE kernel:  ? timerqueue_add+0x62/0x90
Dec 18 13:06:37 Proxmox-VE kernel:  ? enqueue_hrtimer+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_start_range_ns+0x121/0x300
Dec 18 13:06:37 Proxmox-VE kernel:  schedule+0x3e/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  do_nanosleep+0x90/0x170
Dec 18 13:06:37 Proxmox-VE kernel:  hrtimer_nanosleep+0x94/0x130
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_init_sleeper+0x80/0x80
Dec 18 13:06:37 Proxmox-VE kernel:  __x64_sys_nanosleep+0x99/0xd0
Dec 18 13:06:37 Proxmox-VE kernel:  do_syscall_64+0x58/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? switch_fpu_return+0x56/0xc0
Dec 18 13:06:37 Proxmox-VE kernel:  ? exit_to_user_mode_prepare+0x170/0x1c0
Dec 18 13:06:37 Proxmox-VE kernel:  ? syscall_exit_to_user_mode+0x1c/0x30
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? syscall_exit_to_user_mode+0x1c/0x30
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 18 13:06:37 Proxmox-VE kernel: RIP: 0033:0x7fc2958d0413
Dec 18 13:06:37 Proxmox-VE kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
Dec 18 13:06:37 Proxmox-VE kernel: RSP: 002b:00007fc278169a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: RAX: ffffffffffffffda RBX: 00007fc27816bb30 RCX: 00007fc2958d0413
Dec 18 13:06:37 Proxmox-VE kernel: RDX: 0000000000000000 RSI: 00007fc278169ab0 RDI: 00007fc278169ab0
Dec 18 13:06:37 Proxmox-VE kernel: RBP: 00007fc27ae18730 R08: 0000000000000000 R09: 0000000000000000
Dec 18 13:06:37 Proxmox-VE kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: R13: 0000000000000001 R14: 00007fc27b179b10 R15: 00007fc27ae3f720
Dec 18 13:06:37 Proxmox-VE kernel: Call Trace:
Dec 18 13:06:37 Proxmox-VE kernel:  switch_to_sld+0x33/0x40
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to_xtra+0x120/0x510
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to+0x35a/0x430
Dec 18 13:06:37 Proxmox-VE kernel:  ? __switch_to_asm+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  __schedule+0xbd7/0x1250
Dec 18 13:06:37 Proxmox-VE kernel:  ? __schedule+0xbdf/0x1250
Dec 18 13:06:37 Proxmox-VE kernel:  ? timerqueue_add+0x62/0x90
Dec 18 13:06:37 Proxmox-VE kernel:  ? enqueue_hrtimer+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_start_range_ns+0x121/0x300
Dec 18 13:06:37 Proxmox-VE kernel:  schedule+0x3e/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  do_nanosleep+0x90/0x170
Dec 18 13:06:37 Proxmox-VE kernel:  hrtimer_nanosleep+0x94/0x130
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_init_sleeper+0x80/0x80
Dec 18 13:06:37 Proxmox-VE kernel:  __x64_sys_nanosleep+0x99/0xd0
Dec 18 13:06:37 Proxmox-VE kernel:  do_syscall_64+0x58/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  ? do_syscall_64+0x67/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 18 13:06:37 Proxmox-VE kernel: RIP: 0033:0x7fc2958d0413
Dec 18 13:06:37 Proxmox-VE kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
Dec 18 13:06:37 Proxmox-VE kernel: RSP: 002b:00007fc277d64a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: RAX: ffffffffffffffda RBX: 00007fc277d66b30 RCX: 00007fc2958d0413
Dec 18 13:06:37 Proxmox-VE kernel: RDX: 0000000000000000 RSI: 00007fc277d64ab0 RDI: 00007fc277d64ab0
Dec 18 13:06:37 Proxmox-VE kernel: RBP: 00007fc27adf1740 R08: 0000000000000000 R09: 0000000000000000
Dec 18 13:06:37 Proxmox-VE kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000023
Dec 18 13:06:37 Proxmox-VE kernel: R13: 0000000000000001 R14: 00007fc27b179cb0 R15: 00007fc27ae18730
Dec 18 13:06:37 Proxmox-VE kernel: Call Trace:
Dec 18 13:06:37 Proxmox-VE kernel:  switch_to_sld+0x33/0x40
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to_xtra+0x120/0x510
Dec 18 13:06:37 Proxmox-VE kernel:  __switch_to+0x35a/0x430
Dec 18 13:06:37 Proxmox-VE kernel:  ? __switch_to_asm+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  __schedule+0xbd7/0x1250
Dec 18 13:06:37 Proxmox-VE kernel:  ? timerqueue_add+0x62/0x90
Dec 18 13:06:37 Proxmox-VE kernel:  ? enqueue_hrtimer+0x36/0x70
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_start_range_ns+0x121/0x300
Dec 18 13:06:37 Proxmox-VE kernel:  schedule+0x3e/0xb0
Dec 18 13:06:37 Proxmox-VE kernel:  do_nanosleep+0x90/0x170
Dec 18 13:06:37 Proxmox-VE kernel:  hrtimer_nanosleep+0x94/0x130
Dec 18 13:06:37 Proxmox-VE kernel:  ? hrtimer_init_sleeper+0x80/0x80
Dec 18 13:06:37 Proxmox-VE kernel:  __x64_sys_nanosleep+0x99/0xd0


I don't know what to do; I even reinstalled the system to try to fix this problem!

please help!
 
I'm having a very similar issue on PVE 7.1 (proxmox-ve: 7.1-1, running kernel 5.13.19-2-pve).

Sadly, I can only grab crash traces on-screen (and they scroll off).
I have some modules that need updating... but the host tends to crash during any remote-controlled operation through the console.

Will attempt "real" console now to see if that helps ;)
 
I'm having a very similar issue on PVE 7.1 (proxmox-ve: 7.1-1, running kernel 5.13.19-2-pve).
could you try the 5.15 kernel too? [0]
please report back whether it fixes the issue for you :)

[0]: https://forum.proxmox.com/threads/opt-in-linux-kernel-5-15-for-proxmox-ve-7-x-available.100936/
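for reference, a minimal sketch of installing the opt-in kernel from the linked announcement (assuming the standard PVE 7.x repositories are configured):

Code:
apt update
apt install pve-kernel-5.15
# reboot afterwards so the new kernel is actually running, then confirm with: uname -r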
 
@oguz, I installed 5.15 on all. Not fixed: 12GB into a 32GB VM migration, got another hard crash. :(
okay, please post the trace from the crash here if you can and we'll take a look
 
All I have is photos from the console. Is there a way to better capture a crash trace?
 
*ping*... Anybody have a pointer on how to capture a hard host crash? I'm getting reliable crashes *every* time I attempt a backup on this host... or when Proxmox attempts to update (presumably due to it trying to back up).

Right now I can watch a huge amount of console log stuff spin by, but it is not captured anywhere.

Hints and pointers appreciated.
 
It may be a long shot, but have you run a memory test? I've had some strange issues because either RAM wasn't seated properly or I had bad RAM.

Cheers!!!
 
All I have is photos from the console. Is there a way to better capture a crash trace?
you can set up kdump-tools and send us the crashdump file.

alternatively you can try to use a serial console for the output.
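as a rough sketch for a systemd-boot / ZFS-root install (the serial port, baud rate, and capturing device below are just placeholders; GRUB-based installs would set GRUB_CMDLINE_LINUX in /etc/default/grub instead):

Code:
# append to the single line in /etc/kernel/cmdline:
console=tty0 console=ttyS0,115200n8
# write the updated command line into the boot entries:
proxmox-boot-tool refresh
# on the machine capturing the output, e.g. via a USB-serial adapter:
screen /dev/ttyUSB0 115200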
 
Hmmm... looks like kdump-tools and zfs-root are not quite happy together, although there are patches. Working on it.
 
Current instructions for enabling kdump-tools on Proxmox with ZFS
(NOTE: apt update etc. are NOT currently supported with kdump-tools installed. See below for uninstall instructions for when you are done testing!)


# apt install kdump-tools

Select no for kexec reboots
Select yes for enabling kdump-tools

# $EDIT /usr/share/initramfs-tools/conf-hooks.d/zfs
add this line, as a bug workaround for now:
MODULES=most

# $EDIT /etc/kernel/cmdline
tack this onto the end of the line:
nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
  • the crashkernel RAM values are [size range]:[RAM reserved],... (It's not clear how much is actually required; reserving no more than 256M could well work in most cases. In one place, RHEL says 896M is the max that ever works. I don't have time to wait for a bunch of crashes to fail just to test it ;). See the kernel.org kdump documentation. An example of the resulting line follows below.)
  • You may want to keep nmi_watchdog=1 even without kdump. It stabilized my reboot sequence :)
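For illustration, the whole line might end up looking something like this afterwards (the root=/boot= and other options shown here are just what my system uses; keep whatever your file already contains):

Code:
root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M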
# apt install pve-headers-$(uname -r) # This is supposed to help make system updates/upgrades function normally. Not yet tested by me.

# kdump-config unload

# pve-efiboot-tool refresh (maybe not as good as the next one?)
# proxmox-boot-tool refresh

# kdump-config load

Now status should be a-ok, so you can reboot and be ready for the next crash:
# kdump-config status (brief)
# kdump-config show (detailed)

eg

Code:
DUMP_MODE:              kdump
USE_KDUMP:              1
KDUMP_COREDIR:          /var/crash
crashkernel addr: 0x9d000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.15.7-1-pve
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-5.15.7-1-pve
current state:    ready to kdump

kexec command:
  /sbin/kexec -p --command-line="initrd=\EFI\proxmox\5.15.7-1-pve\initrd.img-5.15.7-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs quiet intel_iommu=on iommu=pt nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

Untested instructions for ensuring system update/upgrade functions with kdump-tools installed and zfs on /
  • apt install pve-headers-$(uname -r) # do this before the update/upgrade. It's needed for each proxmox/linux version.
Current instructions for removing kdump-tools in proxmox with zfs (incomplete AFAIK)
  • apt purge kdump-tools (if you use apt remove, it leaves enough config junk around to cause trouble with other updates. :( )
  • apt autoremove
  • manually restore /etc/kernel/cmdline (see above)
  • verify that the temp workaround is still LOADED. Do not remove "MODULES=most" for now (/usr/share/initramfs-tools/conf-hooks.d/zfs)
  • apt autoremove again
Not sure on the following:
  • I had to edit /etc/default/kdump-tools to change to USE_KDUMP=0
  • A lot of kdump-tools scripts are still present. It appears that kdump-tools remove/purge is buggy.

 
you can set up kdump-tools and send us the crashdump file.
Done!
First, the dmesg.* lines...
A) Some early lines showing some kind of (minor?) issues
Code:
[  136.170773] audit: type=1400 audit(1643760906.256:20): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" na
me="lxc-402_</var/lib/lxc>" pid=7260 comm="apparmor_parser"
[  136.777523] vmbr0: port 2(veth402i0) entered blocking state
[  136.777528] vmbr0: port 2(veth402i0) entered disabled state
[  136.777588] device veth402i0 entered promiscuous mode
[  136.804265] eth0: renamed from vethX6iMzT
[  137.267730] audit: type=1400 audit(1643760907.353:21): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 pr
ofile="lxc-402_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=7482 comm="(networkd)" srcname="/" flags="rw, rbind"
[  137.319333] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  137.319369] vmbr0: port 2(veth402i0) entered blocking state
[  137.319371] vmbr0: port 2(veth402i0) entered forwarding state
[  137.399721] audit: type=1400 audit(1643760907.485:22): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 pr
ofile="lxc-402_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=7507 comm="(resolved)" srcname="/" flags="rw, rbind"
[  137.403272] audit: type=1400 audit(1643760907.489:23): apparmor="STATUS" operation="profile_replace" info="not policy admin" erro
r=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7502 comm="apparmor_parser"
[  137.480920] audit: type=1400 audit(1643760907.569:24): apparmor="STATUS" operation="profile_replace" info="not policy admin" erro
r=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7503 comm="apparmor_parser"
[  137.567776] audit: type=1400 audit(1643760907.653:25): apparmor="STATUS" operation="profile_replace" info="not policy admin" erro
r=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7511 comm="apparmor_parser"
[  137.625694] audit: type=1400 audit(1643760907.713:26): apparmor="STATUS" operation="profile_replace" info="not policy admin" erro
r=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7512 comm="apparmor_parser"
[  137.625939] audit: type=1400 audit(1643760907.713:27): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7512 comm="apparmor_parser"
[  137.652633] audit: type=1400 audit(1643760907.741:28): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7510 comm="apparmor_parser"
[  137.714574] audit: type=1400 audit(1643760907.801:29): apparmor="STATUS" operation="profile_replace" info="not policy admin" error=-13 label="lxc-402_</var/lib/lxc>//&:lxc-402_<-var-lib-lxc>:unconfined" pid=7532 comm="apparmor_parser"
[ 7667.571564] systemd-journald[1120]: Failed to set ACL on /var/log/journal/11a1412fbea44604ad5713fcd79c8892/user-1000.journal, ignoring: Operation not supported

Then the actual OOPS:
This was while doing a backup of a shut-down and stopped LXC container:
Code:
[14760.659446] vmbr0: port 2(veth402i0) entered disabled state
[14760.659679] device veth402i0 left promiscuous mode
[14760.659683] vmbr0: port 2(veth402i0) entered disabled state
[14760.959432] kauditd_printk_skb: 7 callbacks suppressed
[14760.959444] audit: type=1400 audit(1643775530.716:37): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-402_</var/lib/lxc>" pid=210503 comm="apparmor_parser"
[14800.931548] loop0: detected capacity change from 0 to 31457280
[14800.973622] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[14810.065455] blk_update_request: I/O error, dev loop0, sector 13938960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[14810.193600] blk_update_request: I/O error, dev loop0, sector 13939536 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[14810.195175] BUG: unable to handle page fault for address: 0000003900000000
[14810.195229] #PF: supervisor read access in kernel mode
[14810.195269] #PF: error_code(0x0000) - not-present page
[14810.195299] PGD 0 P4D 0
[14810.195320] Oops: 0000 [#1] SMP PTI
[14810.195344] CPU: 7 PID: 152 Comm: kworker/u17:0 Kdump: loaded Tainted: P           O      5.15.7-1-pve #1
[14810.195395] Hardware name: Dell Inc. OptiPlex 7010/0GY6Y8, BIOS A28 02/22/2018
[14810.195435] Workqueue: fsverity_read_queue verity_work
[14810.195470] RIP: 0010:fsverity_verify_bio+0x41/0x200
[14810.195503] Code: ec 20 f6 47 14 02 0f 85 c8 01 00 00 49 8b 46 78 be 40 0c 00 00 48 8b 00 48 8b 40 18 48 8b 00 4c 8b b8 68 02 00 00 48 89 45 d0 <49> 8b 3f e8 a7 e5 ff ff 49 89 c2 41 f6 46 12 08 0f 84 14 01 00 00
[14810.195572] RSP: 0018:ffffb655c06cfe10 EFLAGS: 00010246
[14810.195591] RAX: ffff9b5888170ee0 RBX: ffff9b5991110b40 RCX: ffffffffa2fcc495
[14810.195614] RDX: 0000000080550055 RSI: 0000000000000c40 RDI: ffff9b5b54186a00
[14810.195636] RBP: ffffb655c06cfe58 R08: 0000000000000001 R09: 0000000000000001
[14810.195659] R10: 0000000000000001 R11: ffff9b59857b92a0 R12: ffff9b5b54186a00
[14810.195681] R13: ffff9b5980207900 R14: ffff9b5b54186a00 R15: 0000003900000000
[14810.195703] FS:  0000000000000000(0000) GS:ffff9b5b941c0000(0000) knlGS:0000000000000000
[14810.195728] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14810.195748] CR2: 0000003900000000 CR3: 000000013f610004 CR4: 00000000001726e0
[14810.195770] Call Trace:
[14810.195783]  <TASK>
[14810.195795]  verity_work+0x2f/0x40
[14810.195811]  process_one_work+0x22b/0x3d0
[14810.195829]  worker_thread+0x53/0x420
[14810.195844]  ? process_one_work+0x3d0/0x3d0
[14810.195861]  kthread+0x12a/0x150
[14810.195877]  ? set_kthread_struct+0x50/0x50
[14810.195903]  ret_from_fork+0x22/0x30
[14810.195922]  </TASK>
[14810.195939] Modules linked in: binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel snd_hda_intel crypto_simd ttm snd_intel_dspcfg cryptd snd_intel_sdw_acpi snd_hda_codec snd_hda_core drm_kms_helper rapl snd_hwdep dell_wmi intel_cstate snd_pcm cec ledtrig_audio rc_core snd_timer dell_smbios snd fb_sys_fops syscopyarea dcdbas input_leds sysfillrect at24 soundcore sysimgblt serio_raw efi_pstore sparse_keymap pcspkr dell_wmi_descriptor wmi_bmof mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd irqbypass
[14810.196011]  vfio_iommu_type1 vfio drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress hid_generic usbkbd usbhid hid raid6_pq libcrc32c xhci_pci crc32_pclmul xhci_pci_renesas lpc_ich i2c_i801 psmouse i2c_smbus ahci e1000 igb libahci ehci_pci i2c_algo_bit e1000e xhci_hcd ehci_hcd dca wmi video
[14810.196460] CR2: 0000003900000000

The dump is 1.8GB... how shall I send it?

Great. Tried to Gzip... which caused another crash :(
 
thanks for posting the steps for kdump setup as well, might make sense to add this in our wiki (i'll look into that)

Then the actual OOPS:
hmmm...
to me it kind of looks like you might have a problem with your filesystem (or the disk holding it).

see here:
Code:
[14800.973622] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[14810.065455] blk_update_request: I/O error, dev loop0, sector 13938960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[14810.193600] blk_update_request: I/O error, dev loop0, sector 13939536 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[14810.195175] BUG: unable to handle page fault for address: 0000003900000000
[14810.195229] #PF: supervisor read access in kernel mode
[14810.195269] #PF: error_code(0x0000) - not-present page
[14810.195299] PGD 0 P4D 0
[14810.195320] Oops: 0000 [#1] SMP PTI
[14810.195344] CPU: 7 PID: 152 Comm: kworker/u17:0 Kdump: loaded Tainted: P           O      5.15.7-1-pve #1
[14810.195395] Hardware name: Dell Inc. OptiPlex 7010/0GY6Y8, BIOS A28 02/22/2018
[14810.195435] Workqueue: fsverity_read_queue verity_work
[14810.195470] RIP: 0010:fsverity_verify_bio+0x41/0x200

maybe you could do the regular checks with smartctl and/or fsck (but make sure the disk isn't mounted! might make sense to do it over a recovery ISO)
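something like this, as a rough sketch (device names are placeholders; for ZFS pools a scrub takes the place of fsck):

Code:
smartctl -a /dev/sda            # overall health, reallocated/pending sector counters
smartctl -t short /dev/sda      # quick self-test; check results with -l selftest once it finishes
fsck.ext4 -f /dev/sdb1          # only for unmounted ext4 volumes, e.g. from a recovery ISO
zpool scrub rpool               # the ZFS alternative: verifies checksums on the live pool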
 
Thanks!
(I updated the instructions with a more general crashkernel param, and a link to some documentation. Have not found Ubuntu-specific info, nor any details of what actually affects the reserved RAM size. This works for me...)

I know how to handle disk issues; not used to reading Oops traces ;)
Will be back...
(Most likely fsck - these are dual high-quality SSDs)
Doesn't Linux have auto-fsck at boot time as at least an option? I'm surprised those errors haven't triggered something. I'll look into it.
 
@oguz, The good news: fixing the ZFS filesystem error took care of the crash-on-gzip. (There is bad news in another message :confused:)
The following info is not currently in the Proxmox documentation as far as I can see... so I wrote it up :)

Current HowTo: Managing filesystem issues with ZFS on Proxmox

One advantage of ZFS is its ability to detect and correct bitrot and other filesystem issues on a live filesystem, without a reboot. Much nicer than fsck. Combined with smartctl for low-level analysis of storage failures, this provides powerful tools for detecting and correcting issues in a ZFS-based filesystem.

By default, ZFS-based storage pools are TRIM'd the first Sunday of each month, and are SCRUB'd the second Sunday.
(See /etc/cron.d/zfsutils-linux)
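On a stock Debian/Proxmox install that cron file looks roughly like this (approximate; exact times and paths may differ between versions):

Code:
# /etc/cron.d/zfsutils-linux (approximate)
PATH=/usr/bin:/bin:/usr/sbin:/sbin
# TRIM the first Sunday of every month.
24 0 1-7 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/trim ]; then /usr/lib/zfs-linux/trim; fi
# Scrub the second Sunday of every month.
24 0 8-14 * * root if [ $(date +\%w) -eq 0 ] && [ -x /usr/lib/zfs-linux/scrub ]; then /usr/lib/zfs-linux/scrub; fi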

TRIM is an SSD-related function; look elsewhere for more on that.
SCRUB is the ZFS equivalent of fsck, but it actually verifies the content checksum of every storage block and corrects any errors if possible.

To discover current ZFS pool names and status: zpool status
  • Includes a list of all hardware storage devices (use smartctl for low-level checking of devices; see below for a brief intro)
  • Includes current scrub status if any
  • Lists past errors detected (eg 3 checksum errors)
  • Also explains if the ZFS system has new features, not yet enabled
To initiate a manual scrub operation: zpool scrub rpool (where rpool is the name of a storage pool)

To clear error counts, assuming you've verified the hardware is OK: zpool clear (a combined example of these commands follows below)

To enable new ZFS zpool features (in general, enabling does not implement them in ZFS. They just become available):
  1. zpool upgrade (lists what is new and compatible)
  2. zpool upgrade -a (enable what is new and compatible)
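Putting the scrub-related commands together, a typical check-and-repair pass looks something like this (pool name "rpool" assumed; substitute your own):

Code:
zpool status rpool      # device health, error counters, last scrub result
zpool scrub rpool       # verify (and, where redundancy allows, repair) every block's checksum
zpool status rpool      # re-run to watch scrub progress and the final result
zpool clear rpool       # reset error counters once the hardware checks out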
Before scrubbing a filesystem, it is worth verifying there are no underlying hardware issues.
smartctl accesses the diagnostics built into virtually every hard drive and SSD made in the last few decades:

Discover available devices: smartctl --scan

Show relatively short device diagnostics: smartctl -a /dev/sda (where /dev/sda is the device name. Sometimes an additional switch will be needed to define the device type. This is rare except for USB and other unusual devices)

Show comprehensive device diagnostics: smartctl -x /dev/sda

Perform a two minute device self-test: smartctl -t short /dev/sda

Perform a full scan of every sector on device: smartctl -t long /dev/sda

List just the self-test capabilities and time estimates (also seen in -a and -x): smartctl -c /dev/sda
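And as a quick illustration of reviewing the results afterwards (SMART attribute names vary by drive model):

Code:
smartctl -l selftest /dev/sda                                  # self-test log: completed vs. failed tests
smartctl -a /dev/sda | grep -i -E 'reallocated|pending|crc'    # the usual early-failure indicators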
 
Unfortunately, fixing my filesystem did not fix the original crash-on-backup issue.

Not only that, I saw on the console some disturbing messages fly by, about kdump not supporting the current kernel, so the dump wasn't complete...

(...and another error saying something about a BIOS bug relating to PMU resources and "MSR 38d". I suspect that's a distraction. Looks like I need to disable power management in the BIOS, if possible. AFAIK that shouldn't impact normal system operation like this.)

Here's the dmesg dump. I could make the (3.6GB) dump file available if helpful.
Code:
[ 2294.645644] systemd-journald[1084]: Failed to set ACL on /var/log/journal/11a1412fbea44604ad5713fcd79c8892/user-1000.journal, ignoring: Operation not supported
[ 8102.147991] kauditd_printk_skb: 7 callbacks suppressed
[ 8102.147994] audit: type=1400 audit(1643785230.596:37): apparmor="DENIED" operation="mount" info="failed flags match" error=-13 profile="lxc-402_</var/lib/lxc>" name="/run/systemd/unit-root/" pid=123446 comm="(ogrotate)" srcname="/" flags="rw, rbind"
[28299.083020] perf: interrupt took too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
[40359.318528] perf: interrupt took too long (3147 > 3131), lowering kernel.perf_event_max_sample_rate to 63500
[41856.717246] BUG: unable to handle page fault for address: ffff96431920f050
[41856.717280] #PF: supervisor read access in kernel mode
[41856.717299] #PF: error_code(0x0000) - not-present page
[41856.717317] PGD 0 P4D 0
[41856.717330] Oops: 0000 [#1] SMP PTI
[41856.717346] CPU: 4 PID: 1028470 Comm: task UPID:pve2: Kdump: loaded Tainted: P           O      5.15.7-1-pve #1
[41856.717374] Hardware name: Dell Inc. OptiPlex 7010/0GY6Y8, BIOS A28 02/22/2018
[41856.717396] RIP: 0010:vfs_getattr_nosec+0x53/0xd0
[41856.717416] Code: 00 00 48 29 f9 48 89 e5 49 c7 81 88 00 00 00 00 00 00 00 81 c1 90 00 00 00 c1 e9 03 f3 48 ab 41 c7 01 ff 07 00 00 48 8b 46 28 <48> 8b 40 50 25 00 04 00 00 48 83 f8 01 19 c0 83 e0 20 05 df 07 00
[41856.717464] RSP: 0018:ffffa44272a63d10 EFLAGS: 00010206
[41856.717482] RAX: ffff96431920f000 RBX: ffffa44272a63d78 RCX: 0000000000000000
[41856.717504] RDX: 00000000000007ff RSI: ffff96800d772148 RDI: ffffa44272a63e08
[41856.717525] RBP: ffffa44272a63d10 R08: 0000000000000900 R09: ffffa44272a63d78
[41856.717546] R10: ffffa44272a63d28 R11: ffff968101aa9000 R12: 0000000000000000
[41856.717567] R13: 00000000ffffff9c R14: 000055a0af0ab4e0 R15: 0000000000000900
[41856.717589] FS:  00007fa83f3d0280(0000) GS:ffff968314100000(0000) knlGS:0000000000000000
[41856.717613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[41856.717631] CR2: ffff96431920f050 CR3: 000000010a24e003 CR4: 00000000001726e0
[41856.717652] Call Trace:
[41856.717664]  <TASK>
[41856.717675]  vfs_statx+0x9d/0x120
[41856.717691]  __do_sys_newlstat+0x3e/0x70
[41856.717707]  __x64_sys_newlstat+0x16/0x20
[41856.717722]  do_syscall_64+0x5c/0xc0
[41856.717739]  ? putname+0x55/0x60
[41856.717754]  ? do_unlinkat+0x83/0x2b0
[41856.717768]  ? exit_to_user_mode_prepare+0x37/0x1b0
[41856.717787]  ? syscall_exit_to_user_mode+0x27/0x50
[41856.717804]  ? do_syscall_64+0x69/0xc0
[41856.717821]  ? do_syscall_64+0x69/0xc0
[41856.717836]  ? do_syscall_64+0x69/0xc0
[41856.717852]  ? __x64_sys_newlstat+0x16/0x20
[41856.717868]  ? do_syscall_64+0x69/0xc0
[41856.717884]  ? do_syscall_64+0x69/0xc0
[41856.717899]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[41856.717917] RIP: 0033:0x7fa83f4fa446
[41856.717932] Code: fa 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 41 89 f8 48 89 f7 48 89 d6 41 83 f8 01 77 29 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 02 c3 90 48 8b 15 19 fa 0c 00 f7 d8 64 89 02
[41856.717988] RSP: 002b:00007ffdb7c8ece8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
[41856.718012] RAX: ffffffffffffffda RBX: 000055a0af1b7f70 RCX: 00007fa83f4fa446
[41856.718043] RDX: 000055a0a7f714b8 RSI: 000055a0a7f714b8 RDI: 000055a0af0ab4e0
[41856.718082] RBP: 000055a0a7f712a0 R08: 0000000000000001 R09: 000055a0a7f712a0
[41856.718103] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a0af16e540
[41856.718124] R13: 000055a0af0ab4e0 R14: 000055a0a6d606e6 R15: 0000000000000000
[41856.718146]  </TASK>
[41856.718156] Modules linked in: binfmt_misc veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel nf_tables bonding tls softdog nfnetlink_log nfnetlink snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec crct10dif_pclmul ghash_clmulni_intel aesni_intel snd_hda_core ttm crypto_simd snd_hwdep cryptd snd_pcm drm_kms_helper rapl intel_cstate snd_timer at24 input_leds cec snd rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt soundcore dell_wmi ledtrig_audio serio_raw dell_smbios dcdbas efi_pstore sparse_keymap pcspkr dell_wmi_descriptor wmi_bmof mac_hid vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfio_pci vfio_pci_core vfio_virqfd drm irqbypass
[41856.718199]  vfio_iommu_type1 vfio sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress hid_generic usbkbd usbhid hid raid6_pq libcrc32c crc32_pclmul xhci_pci e1000 xhci_pci_renesas psmouse ehci_pci ahci i2c_i801 igb i2c_smbus libahci lpc_ich e1000e i2c_algo_bit xhci_hcd ehci_hcd dca wmi video
[41856.723239] CR2: ffff96431920f050
 
thank you for the detailed steps and pointers, surely it'll help someone! :)

Unfortunately, fixing my filesystem did not fix the original crash-on-backup issue
that's unfortunate. it looks like you're getting a different call trace now, though still in the VFS code.

it's possible that there are different corrupted parts and the fsck helped with the first part...

i'd suggest trying a memtest on your server to rule out any memory issues (since the filesystem should be "fixed")

if no errors come out from the memtest, please upload the crash dumps to somewhere (maybe cloud provider or filesharing service) :)
 
Will do. FWIW, there are a number of new distracting error messages on boot or crash that appear to be just that: distractions not worth focusing on...
  • ...no irq handler for vector
  • ...BIOS has corrupted hw-PMU resources (MSR 38d...
  • ...DMA failed to allocate 128 KiB GFP_KERNEL [GFP_DMA pool for atomic allocation]
  • ...and quite possibly: kdump-tools...the kernel version is not supported.
Don't you love logfile errors that look scary but ought to be ignored ;)
 
SOLVED.... what a relief!
I had already run memtest, and the RAM had passed.
Yet... memtest86 has been radically strengthened since I last made my bootable copy.
Using the latest version? Bad RAM. Pulled that RAM (still have 8GB, enough for now) and all is wonderfully well.

H/T @kalasnikov for the "long shot" thought, and @oguz for the reminder. I highly recommend getting a current copy of the USB-bootable version from memtest86 dot com.

And, I got to learn and document some important aspects of managing my Proxmox cluster :cool:
 
