Error crashing kernel

ricardolanes · Sep 22, 2024

Hello, I have a problem. My PVE host crashes randomly. Everything works normally, and then at some point, without anyone touching it, this error message appears on the display and nothing else responds (GUI, shell...). The VMs/CTs freeze, and I have to force the server off with the power button.

Virtual Environment 8.2.6
Linux pve 6.8.12-2-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-2 (2024-09-05T10:03Z) x86_64

System Log (PVE)

Sep 22 15:33:00 pve postfix/qmgr[1018]: 7757F28033F: from=<>, size=41663, nrcpt=1 (queue active)
Sep 22 15:33:00 pve postfix/qmgr[1018]: 4CE9D2802DC: from=<>, size=9643, nrcpt=1 (queue active)
Sep 22 15:33:00 pve postfix/qmgr[1018]: BE2E4280372: from=<>, size=29077, nrcpt=1 (queue active)
Sep 22 15:33:00 pve postfix/local[72470]: error: open database /etc/aliases.db: No such file or directory
Sep 22 15:33:00 pve postfix/local[72470]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Sep 22 15:33:00 pve postfix/local[72470]: warning: hash:/etc/aliases: lookup of 'root' failed
Sep 22 15:33:00 pve postfix/local[72471]: error: open database /etc/aliases.db: No such file or directory
Sep 22 15:33:00 pve postfix/local[72471]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Sep 22 15:33:00 pve postfix/local[72471]: warning: hash:/etc/aliases: lookup of 'root' failed
Sep 22 15:33:00 pve postfix/local[72470]: 7757F28033F: to=<root@pve.intra>, relay=local, delay=46598, delays=46598/0.01/0/0.01, dsn=4.3.0, status=deferred (alias database unavailable)
Sep 22 15:33:00 pve postfix/local[72470]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Sep 22 15:33:00 pve postfix/local[72470]: warning: hash:/etc/aliases: lookup of 'root' failed
Sep 22 15:33:00 pve postfix/local[72471]: 4CE9D2802DC: to=<root@pve.intra>, relay=local, delay=63372, delays=63372/0.01/0/0, dsn=4.3.0, status=deferred (alias database unavailable)
Sep 22 15:33:00 pve postfix/local[72470]: BE2E4280372: to=<root@pve.intra>, relay=local, delay=538, delays=538/0.01/0/0.01, dsn=4.3.0, status=deferred (alias database unavailable)
Sep 22 15:43:00 pve postfix/qmgr[1018]: BE2E4280372: from=<>, size=29077, nrcpt=1 (queue active)
Sep 22 15:43:00 pve postfix/local[78706]: error: open database /etc/aliases.db: No such file or directory
Sep 22 15:43:00 pve postfix/local[78706]: warning: hash:/etc/aliases is unavailable. open database /etc/aliases.db: No such file or directory
Sep 22 15:43:00 pve postfix/local[78706]: warning: hash:/etc/aliases: lookup of 'root' failed
Sep 22 15:43:00 pve postfix/local[78706]: BE2E4280372: to=<root@pve.intra>, relay=local, delay=1139, delays=1139/0.01/0/0.01, dsn=4.3.0, status=deferred (alias database unavailable)
Sep 22 15:49:38 pve kernel: perf: interrupt took too long (3972 > 3913), lowering kernel.perf_event_max_sample_rate to 50000
Sep 22 15:51:50 pve kernel: general protection fault, probably for non-canonical address 0xffd09d6bc1e6ce08: 0000 [#1] PREEMPT SMP NOPTI
Sep 22 15:51:50 pve kernel: CPU: 1 PID: 76488 Comm: kworker/u8:5 Tainted: P O 6.8.12-2-pve #1
Sep 22 15:51:50 pve kernel: Hardware name: Default string Default string/Default string, BIOS GF1264NP126LV11R003 11/16/2023
Sep 22 15:51:50 pve kernel: Workqueue: dm-thin do_worker [dm_thin_pool]
Sep 22 15:51:50 pve kernel: RIP: 0010:do_worker+0x815/0xd60 [dm_thin_pool]
Sep 22 15:51:50 pve kernel: Code: 0e fb 4d 85 e4 75 e4 e8 e9 11 b5 fa 4d 8b 65 00 4c 39 a5 48 ff ff ff 0f 84 ad fe ff ff 49 8d bc 24 88 00 00 00 b8 01 00 00 00 <f0> 41 0f c1 84 24 88 00 00 00 85 c0 0f 84 8b 03 00 00 8d 50 01 09
Sep 22 15:51:50 pve kernel: RSP: 0018:ffffaf4a0fbd7d68 EFLAGS: 00010202
Sep 22 15:51:50 pve kernel: RAX: 0000000000000001 RBX: ffffaf4a0fbd7de8 RCX: 0000000000000000
Sep 22 15:51:50 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffd09d6bc1e6ce08
Sep 22 15:51:50 pve kernel: RBP: ffffaf4a0fbd7e40 R08: 0000000000000000 R09: 0000000000000000
Sep 22 15:51:50 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffd09d6bc1e6cd80
Sep 22 15:51:50 pve kernel: R13: ffff9d6bc3192c00 R14: ffff9d6bd1d75a05 R15: ffff9d6bc3192c4c
Sep 22 15:51:50 pve kernel: FS: 0000000000000000(0000) GS:ffff9d6f2fa80000(0000) knlGS:0000000000000000
Sep 22 15:51:50 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 22 15:51:50 pve kernel: CR2: 00002cac02b20000 CR3: 000000031e636002 CR4: 0000000000f72ef0
Sep 22 15:51:50 pve kernel: PKRU: 55555554
Sep 22 15:51:50 pve kernel: Call Trace:
Sep 22 15:51:50 pve kernel: <TASK>
Sep 22 15:51:50 pve kernel: ? show_regs+0x6d/0x80
Sep 22 15:51:50 pve kernel: ? die_addr+0x37/0xa0
Sep 22 15:51:50 pve kernel: ? exc_general_protection+0x1db/0x480
Sep 22 15:51:50 pve kernel: ? asm_exc_general_protection+0x27/0x30
Sep 22 15:51:50 pve kernel: ? do_worker+0x815/0xd60 [dm_thin_pool]
Sep 22 15:51:50 pve kernel: ? do_worker+0x7f7/0xd60 [dm_thin_pool]
Sep 22 15:51:50 pve kernel: ? finish_task_switch.isra.0+0x8c/0x310
Sep 22 15:51:50 pve kernel: process_one_work+0x16a/0x350
Sep 22 15:51:50 pve kernel: worker_thread+0x306/0x440
Sep 22 15:51:50 pve kernel: ? __pfx_worker_thread+0x10/0x10
Sep 22 15:51:50 pve kernel: kthread+0xef/0x120
Sep 22 15:51:50 pve kernel: ? __pfx_kthread+0x10/0x10
Sep 22 15:51:50 pve kernel: ret_from_fork+0x44/0x70
Sep 22 15:51:50 pve kernel: ? __pfx_kthread+0x10/0x10
Sep 22 15:51:50 pve kernel: ret_from_fork_asm+0x1b/0x30
Sep 22 15:51:50 pve kernel: </TASK>
Sep 22 15:51:50 pve kernel: Modules linked in: tcp_diag inet_diag nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat ip_set_hash_net bluetooth ecdh_generic ecc cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables bonding tls softdog sunrpc nfnetlink_log binfmt_misc nfnetlink xe drm_gpuvm drm_exec gpu_sched drm_suballoc_helper drm_ttm_helper snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda kvm snd_sof_pci snd_sof_xtensa_dsp snd_sof irqbypass snd_sof_utils snd_soc_hdac_hda crct10dif_pclmul polyval_clmulni snd_hda_ext_core polyval_generic ghash_clmulni_intel snd_soc_acpi_intel_match sha256_ssse3 snd_soc_acpi sha1_ssse3 soundwire_generic_allocation aesni_intel soundwire_bus crypto_simd snd_soc_core cryptd snd_compress ac97_bus
Sep 22 15:51:50 pve kernel: snd_pcm_dmaengine i915 snd_hda_intel snd_intel_dspcfg rapl snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm drm_buddy intel_cstate cmdlinepart ttm snd_timer pcspkr wmi_bmof spi_nor snd drm_display_helper mtd cec soundcore rc_core i2c_algo_bit igen6_edac serial_multi_instantiate intel_pmc_core intel_vsec pmt_telemetry pmt_class acpi_pad acpi_tad mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c uas usb_storage nvme spi_intel_pci i2c_i801 xhci_pci xhci_pci_renesas ahci i2c_smbus spi_intel libahci crc32_pclmul sdhci_pci xhci_hcd nvme_core cqhci nvme_auth sdhci igc video wmi
Sep 22 15:51:50 pve kernel: ---[ end trace 0000000000000000 ]---
Sep 22 15:51:52 pve kernel: pstore: backend (efi_pstore) writing error (-5)
Sep 22 15:51:52 pve kernel: RIP: 0010:do_worker+0x815/0xd60 [dm_thin_pool]
Sep 22 15:51:52 pve kernel: Code: 0e fb 4d 85 e4 75 e4 e8 e9 11 b5 fa 4d 8b 65 00 4c 39 a5 48 ff ff ff 0f 84 ad fe ff ff 49 8d bc 24 88 00 00 00 b8 01 00 00 00 <f0> 41 0f c1 84 24 88 00 00 00 85 c0 0f 84 8b 03 00 00 8d 50 01 09
Sep 22 15:51:52 pve kernel: RSP: 0018:ffffaf4a0fbd7d68 EFLAGS: 00010202
Sep 22 15:51:52 pve kernel: RAX: 0000000000000001 RBX: ffffaf4a0fbd7de8 RCX: 0000000000000000
Sep 22 15:51:52 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffd09d6bc1e6ce08
Sep 22 15:51:52 pve kernel: RBP: ffffaf4a0fbd7e40 R08: 0000000000000000 R09: 0000000000000000
Sep 22 15:51:52 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffd09d6bc1e6cd80
Sep 22 15:51:52 pve kernel: R13: ffff9d6bc3192c00 R14: ffff9d6bd1d75a05 R15: ffff9d6bc3192c4c
Sep 22 15:51:52 pve kernel: FS: 0000000000000000(0000) GS:ffff9d6f2fa80000(0000) knlGS:0000000000000000
Sep 22 15:51:52 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 22 15:51:52 pve kernel: CR2: 00002cac02b20000 CR3: 0000000325112005 CR4: 0000000000f72ef0
Sep 22 15:51:52 pve kernel: PKRU: 55555554

Can someone help me find out what it is and how to fix it?

Thanks!
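
For anyone triaging a similar log, here is a minimal sketch of how the crash markers could be pulled out of the surrounding postfix noise. The function name `crash_summary` is hypothetical, and the patterns are only the ones visible in the log above (general protection fault, `BUG:`, and the kernel-mode `RIP:` line that names the faulting module), so other crash types would need their own patterns.

```shell
# Hypothetical helper, not from the thread: reads journal/syslog text on
# stdin and prints each distinct kernel crash marker once, without the
# timestamp and host prefix.
crash_summary() {
    grep -E 'kernel: (general protection fault|BUG: |RIP: 0010:)' \
        | sed -E 's/^.*[[:space:]]kernel: //' \
        | awk '!seen[$0]++'
}
```

Assuming the journal is persistent, the kernel messages of the previous boot could be fed in with `journalctl -k -b -1 | crash_summary`. For the first trace above, the `RIP:` line points at `do_worker` in `dm_thin_pool`.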

GerhardK · Sep 23, 2024

Exactly the same behavior here. I have been investigating this for a week, and for now I have decided to pin 6.5.13-6-pve.

This thread might help you as well: https://forum.proxmox.com/threads/proxmox-freeze-nach-kernel-update-to-6-8-4-2-pve.145920/page-4
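
For readers who have not pinned a kernel before, a minimal sketch follows. It assumes the host boots through `proxmox-boot-tool` and that the 6.5.13-6-pve kernel package is still installed. The wrapper `pin_kernel` and the `DRY_RUN` variable are hypothetical names used only for illustration; by default the function just prints the command instead of running it.

```shell
# Hypothetical wrapper, not from the thread. Prints the pin command unless
# DRY_RUN=0 is set, in which case it runs proxmox-boot-tool for real.
pin_kernel() {
    version="$1"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "proxmox-boot-tool kernel pin $version"
    else
        proxmox-boot-tool kernel pin "$version"
    fi
}
```

Calling `pin_kernel 6.5.13-6-pve` shows what would be executed. `proxmox-boot-tool kernel list` should show which kernels are available to pin, and a reboot is needed before the pinned kernel is actually in use.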

ricardolanes · Sep 23, 2024

GerhardK said:
Exactly the same behavior here. I have been investigating this for a week, and for now I have decided to pin 6.5.13-6-pve.

This thread might help you as well: https://forum.proxmox.com/threads/proxmox-freeze-nach-kernel-update-to-6-8-4-2-pve.145920/page-4

Hey my friend, thank you very much for the thread. I went back just one version, to 6.8.12-1-pve. I'll watch how it behaves, and if it's still bad, I'll go back further.

Thank you very much!

Sorry for my English, I'm using Google Translate.

ricardolanes · Sep 23, 2024

Kernel 6.8.12-1-pve is crashing too:

Sep 22 22:36:36 pve kernel: get_swap_device: Bad swap offset entry 3bfffffffffff
Sep 22 22:36:36 pve kernel: BUG: Bad page map in process pvestatd pte:80000000000000 pmd:30430e067
Sep 22 22:36:36 pve kernel: addr:00007e1d0cbdc000 vm_flags:08000071 anon_vma:0000000000000000 mapping:ffff98f8022b99c0 index:1dc
Sep 22 22:36:36 pve kernel: file:locale-archive fault:filemap_fault mmap:ext4_file_mmap read_folio:ext4_read_folio
Sep 22 22:36:36 pve kernel: CPU: 2 PID: 82267 Comm: pvestatd Tainted: P O 6.8.12-1-pve #1
Sep 22 22:36:36 pve kernel: Hardware name: Default string Default string/Default string, BIOS GF1264NP126LV11R003 11/16/2023
Sep 22 22:36:36 pve kernel: Call Trace:
Sep 22 22:36:36 pve kernel: <TASK>
Sep 22 22:36:36 pve kernel: dump_stack_lvl+0x76/0xa0
Sep 22 22:36:36 pve kernel: dump_stack+0x10/0x20
Sep 22 22:36:36 pve kernel: print_bad_pte+0x1b2/0x270
Sep 22 22:36:36 pve kernel: unmap_page_range+0xfa4/0x1180
Sep 22 22:36:36 pve kernel: unmap_single_vma+0x89/0xf0
Sep 22 22:36:36 pve kernel: unmap_vmas+0xb5/0x190
Sep 22 22:36:36 pve kernel: exit_mmap+0x10a/0x3f0
Sep 22 22:36:36 pve kernel: __mmput+0x41/0x140
Sep 22 22:36:36 pve kernel: mmput+0x31/0x40
Sep 22 22:36:36 pve kernel: do_exit+0x324/0xae0
Sep 22 22:36:36 pve kernel: ? __count_memcg_events+0x6f/0xe0
Sep 22 22:36:36 pve kernel: do_group_exit+0x35/0x90
Sep 22 22:36:36 pve kernel: __x64_sys_exit_group+0x18/0x20
Sep 22 22:36:36 pve kernel: x64_sys_call+0x1822/0x24b0
Sep 22 22:36:36 pve kernel: do_syscall_64+0x81/0x170
Sep 22 22:36:36 pve kernel: ? irqentry_exit+0x43/0x50
Sep 22 22:36:36 pve kernel: ? exc_page_fault+0x94/0x1b0
Sep 22 22:36:36 pve kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
Sep 22 22:36:36 pve kernel: RIP: 0033:0x7e1d0ce62349
Sep 22 22:36:36 pve kernel: Code: Unable to access opcode bytes at 0x7e1d0ce6231f.
Sep 22 22:36:36 pve kernel: RSP: 002b:00007ffc2d9ec0b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
Sep 22 22:36:36 pve kernel: RAX: ffffffffffffffda RBX: 000059a0bd26f2a0 RCX: 00007e1d0ce62349
Sep 22 22:36:36 pve kernel: RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
Sep 22 22:36:36 pve kernel: RBP: 0000000000000001 R08: ffffffffffffff78 R09: 0000000000000000
Sep 22 22:36:36 pve kernel: R10: 00007e1d0cda0200 R11: 0000000000000206 R12: 0000000000000009
Sep 22 22:36:36 pve kernel: R13: 0000000000000000 R14: 000059a0c2e80f70 R15: 000059a0bd7ca7d8
Sep 22 22:36:36 pve kernel: </TASK>
Sep 22 22:36:36 pve kernel: BUG: Bad rss-counter state mm:0000000052279baa type:MM_SWAPENTS val:-1

GerhardK · Sep 24, 2024

It also crashed on 6.5 for me. I contacted my server provider's support and asked for a BIOS update. So far it is working. Before the update, I could trigger the error with WireGuard in an LXC container. As in the thread I referenced, the BIOS update really seems to help.

boxcee · Sep 24, 2024

Another thread here: https://forum.proxmox.com/threads/proxmox-kernel-6-8-12-2-freezes-again.154875/post-705478.

Mine is showing the same symptoms.

GerhardK · Sep 24, 2024

I am so desperate that I opened my own thread: https://forum.proxmox.com/threads/pveproxy-worker-pvestatd-pvedaemon-segfault.154908/

ricardolanes · Sep 25, 2024

Hello my friends, yesterday (09/24) I made some changes to Proxmox that seem to have solved the problem so far.

1. I turned off a crontab job that I used to read the temperature sensors.

2. I removed some VMs (Ubuntu Desktop and Kali shell), both of which used the host CPU type.

3. I enabled VLAN aware on the LAN; it had been causing some errors in the system log:
..entered blocking state
..entered forwarding state
..Link is down

I believe this could be a compatibility issue with the VMs/CTs. If possible, turn some of them off, or if you are using the host CPU type, switch to a virtual CPU type. It is just a suggestion.
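
To see which VMs would be affected by the CPU-type suggestion, the guest configs can be searched for a `cpu: host` line. This is only a sketch: the function name `vms_with_host_cpu` is hypothetical, and it assumes the VM configs of the local node are readable under `/etc/pve/qemu-server`.

```shell
# Hypothetical helper, not from the thread: lists the VM config files that
# set the CPU type to "host". The directory can be overridden for testing.
vms_with_host_cpu() {
    conf_dir="${1:-/etc/pve/qemu-server}"
    grep -l -E '^cpu:[[:space:]]*host' "$conf_dir"/*.conf 2>/dev/null
}
```

Each file name corresponds to a VM ID. The CPU type of such a VM could then be changed in the GUI under Hardware > Processors, or with `qm set <vmid> --cpu <type>`; which type is appropriate depends on the hardware and is not something this thread settles.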

Well, these were the changes I made. I have gone 12 hours without a crash (I was not online for 4 to 5 of those hours). It seems to be solved, and I will report back within the next few hours.

We are trying..

GerhardK · Sep 25, 2024

Thank you for reporting back. Indeed, I had the same messages in the journal regarding the network interfaces.
However, making the bridges VLAN aware did not have any effect.
I am not sure about the cron job you mentioned, since it seems useful to me... Let's see if the VLAN setting already fixed it.

ricardolanes · Sep 25, 2024

Uptime 19h and counting, very happy

ricardolanes · Sep 26, 2024

Uptime 24h!

It really seems to be solved. Now I will re-enable my crontab, wait another 24 hours, and then bring back my Ubuntu Desktop VM... all with patience and time to analyze the behavior.

I have faith!

DevUser · Sep 26, 2024

ricardolanes said:
Uptime 24h!

It really seems to be solved. Now I will re-enable my crontab, wait another 24 hours, and then bring back my Ubuntu Desktop VM... all with patience and time to analyze the behavior.

I have faith!

It does not work for me.

I have replied in the post below with some images and a description of how this error can be reproduced.
https://forum.proxmox.com/threads/proxmox-kernel-6-8-12-2-freezes-again.154875/post-706206

ricardolanes · Sep 26, 2024

DevUser said:
It does not work for me.

I have replied in the post below with some images and a description of how this error can be reproduced.
https://forum.proxmox.com/threads/proxmox-kernel-6-8-12-2-freezes-again.154875/post-706206

Hi, I saw that you still have the problem. I'm working hard to find the cause. Yesterday I turned my crontab back on, and less than 10 minutes later the problem started again.
So I checked whether there was anything related to the lm-sensors package I was using, and I found this:
https://community.frame.work/t/resp...ror-causes-missing-sensors-on-linux-6-7/47767

Now I have turned the crontab off again and my PVE is back to normal; I have 12 hours of uptime again.

This could be one of several problems, or it could be the only one.

Check whether you have this package (lm-sensors) installed; if so, remove it.

apt remove lm-sensors
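
Before removing anything, it may be worth confirming that the package is actually installed. The sketch below wraps the standard `dpkg-query` status check; the function name `is_installed` is hypothetical.

```shell
# Hypothetical helper, not from the thread: succeeds only if the named
# Debian package is fully installed on this host.
is_installed() {
    dpkg-query -W -f='${Status}' "$1" 2>/dev/null \
        | grep -q '^install ok installed'
}
```

A possible use is `is_installed lm-sensors && echo "lm-sensors is present"`. If the package is present, any cron job that calls `sensors` would also need to be disabled, as described above.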

DevUser · Sep 26, 2024

ricardolanes said:
Hi, I saw that you still have the problem. I'm working hard to find the cause. Yesterday I turned my crontab back on, and less than 10 minutes later the problem started again.
So I checked whether there was anything related to the lm-sensors package I was using, and I found this:
https://community.frame.work/t/resp...ror-causes-missing-sensors-on-linux-6-7/47767

Now I have turned the crontab off again and my PVE is back to normal; I have 12 hours of uptime again.

This could be one of several problems, or it could be the only one.

Check whether you have this package (lm-sensors) installed; if so, remove it.

apt remove lm-sensors

I don't have any crontab, and I can reproduce this error by just clicking 'Refresh' (pve web). ^^

I have even stopped the crontab completely, and it still happens.

ricardolanes · Sep 26, 2024

oh it got complicated

GerhardK · Sep 26, 2024

DevUser said:
I don't have any crontab, and I can reproduce this error by just clicking 'Refresh' (pve web). ^^

I have even stopped the crontab completely, and it still happens.

Me neither. Does your system crash completely? After a BIOS update my system has been running for 3 days without freezing completely. However, I still get random segmentation faults in pvestatd.
A simple

Code:

service pvestatd restart

makes the web interface available again.
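
To judge how often this happens, the segfault lines can be counted from the kernel log. This is a sketch only: the function name `count_segfaults` is hypothetical, and it assumes the kernel's usual `name[pid]: segfault at ...` message format.

```shell
# Hypothetical helper, not from the thread: counts kernel segfault messages
# for one process name (default: pvestatd) in log text read from stdin.
count_segfaults() {
    grep -c -E "${1:-pvestatd}\[[0-9]+\]: segfault at"
}
```

On a live host this could be fed with something like `journalctl -k | count_segfaults pvestatd`. On systemd-based installs, `systemctl restart pvestatd` is the equivalent of the restart command above.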

DevUser · Sep 26, 2024

GerhardK said:
Me neither. Does your system crash completely? After a BIOS update my system has been running for 3 days without freezing completely. However, I still get random segmentation faults in pvestatd.
A simple

Code:

service pvestatd restart

makes the web interface available again.

When I play with the PVE web manager, the problem appears. If I don't touch anything, everything works fine... but the error is still there.

And crashes can occur at any time.

Does your system completely crash?

If you look at the images in the other post, you can see that sometimes it crashes completely, and sometimes there is just the error but the kernel keeps working.

GerhardK · Sep 26, 2024

DevUser said:
When I play with the PVE web manager, the problem appears. If I don't touch anything, everything works fine... but the error is still there.

And crashes can occur at any time.

I understand that, but do you get CPU soft lockups/hard lockups so that you have to restart the machine, or is it sufficient to restart the service?

DevUser · Sep 26, 2024

GerhardK said:
I understand that, but do you get CPU soft lockups/hard lockups so that you have to restart the machine, or is it sufficient to restart the service?

Not always, but in most cases I have to manually restart the entire server.

- At this point I am going to reinstall Proxmox on the same hardware, on different disks, and without upgrading anything, to see what happens.

The server had been running for more than 7 months without any problem, so the fact that it is happening now seems very strange to me.

GerhardK · Sep 27, 2024

DevUser said:
Not always, but in most cases I have to manually restart the entire server.

- At this point I am going to reinstall Proxmox on the same hardware, on different disks, and without upgrading anything, to see what happens.

The server had been running for more than 7 months without any problem, so the fact that it is happening now seems very strange to me.

But it started after an upgrade, right? If so, I doubt it is your hardware. Do you have direct access to the server (e.g. a home server), and can you perform a BIOS update?
