Proxmox keeps freezing, out of options

Sessix

New Member
Aug 5, 2023
Hi guys,

My Proxmox host freezes and needs a reboot every 4 to 5 hours, even with low CPU load.

What I've tried so far:
- Swapped the NVMe drives (2x 2 TB in RAID 1; also tried without RAID, same issue)
- Swapped the RAM modules
- Reinstalled Proxmox 7.4
- Reinstalled Proxmox 8.0
- Memtest completed without any issues

Code:
Aug 05 15:36:24 versuvio systemd[1]: user@0.service: Succeeded.
Aug 05 15:36:24 versuvio systemd[1]: Stopped User Manager for UID 0.
Aug 05 15:36:24 versuvio systemd[1]: Stopping User Runtime Directory /run/user/0...
Aug 05 15:36:24 versuvio systemd[1]: run-user-0.mount: Succeeded.
Aug 05 15:36:24 versuvio systemd[1]: user-runtime-dir@0.service: Succeeded.
Aug 05 15:36:24 versuvio systemd[1]: Stopped User Runtime Directory /run/user/0.
Aug 05 15:36:24 versuvio systemd[1]: Removed slice User Slice of UID 0.
Aug 05 15:36:56 versuvio pveproxy[157020]: worker exit
Aug 05 15:36:56 versuvio pveproxy[1715]: worker 157020 finished
Aug 05 15:36:56 versuvio pveproxy[1715]: starting 1 worker(s)
Aug 05 15:36:56 versuvio pveproxy[1715]: worker 234914 started
Aug 05 15:40:16 versuvio pveproxy[177526]: worker exit
Aug 05 15:40:16 versuvio pveproxy[1715]: worker 177526 finished
Aug 05 15:40:16 versuvio pveproxy[1715]: starting 1 worker(s)
Aug 05 15:40:16 versuvio pveproxy[1715]: worker 237604 started
Aug 05 15:48:33 versuvio smartd[1154]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 76
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 74
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 100
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 59
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 41
Aug 05 16:11:00 versuvio pvedaemon[164508]: <root@pam> successful auth for user 'root@pam'
Aug 05 16:17:01 versuvio CRON[267083]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 05 16:17:01 versuvio CRON[267084]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 05 16:17:01 versuvio CRON[267083]: pam_unix(cron:session): session closed for user root
Aug 05 17:15:18 versuvio pvedaemon[160535]: <root@pam> successful auth for user 'root@pam'
Aug 05 17:17:01 versuvio CRON[305614]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 05 17:17:01 versuvio CRON[305615]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 05 17:17:01 versuvio CRON[305614]: pam_unix(cron:session): session closed for user root
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 64
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 54
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 46
Aug 05 17:48:34 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 53
Aug 05 17:48:34 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 46 to 47
Aug 05 18:02:21 versuvio pvedaemon[160535]: <root@pam> successful auth for user 'root@pam'
-- Reboot --

The errors in the screenshot don't appear in the syslog. I'm out of options and hope somebody can explain what to try next.

-> https://imageupload.io/ib/3VOQuJ1Z73ZaRIC_1691254665.jpg
 
Now it's showing up in the syslog, but without crashing. I can't see what triggers it:

Code:
Aug 05 22:40:29 versuvio kernel: ------------[ cut here ]------------
Aug 05 22:40:29 versuvio kernel: WARNING: CPU: 2 PID: 44 at kernel/locking/rwsem.c:240 down_read+0x74/0xa0
Aug 05 22:40:29 versuvio kernel: Modules linked in: tcp_diag inet_diag xt_nat xt_tcpudp nf_conntrack_netlink xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat overlay cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm_intel ppdev mei_hdcp kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd snd_hda_codec_hdmi cryptd eeepc_wmi rapl intel_cstate snd_hda_intel asus_wmi snd_intel_dspcfg platform_profile sparse_keymap wmi_bmof nouveau snd_intel_sdw_acpi i915 snd_hda_codec pcspkr drm_ttm_helper snd_hda_core efi_pstore ttm snd_hwdep input_leds ee1004 mxm_wmi snd_pcm snd_timer
Aug 05 22:40:29 versuvio kernel:  drm_kms_helper cec joydev snd rc_core i2c_algo_bit fb_sys_fops syscopyarea apple_mfi_fastcharge sysfillrect soundcore mei_me sysimgblt mei parport_pc mac_hid parport acpi_pad vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb hid_apple hid_generic usbmouse usbkbd usbhid hid r8169 crc32_pclmul realtek nvme xhci_pci ahci i2c_i801 xhci_pci_renesas i2c_smbus nvme_core xhci_hcd libahci wmi video
Aug 05 22:40:29 versuvio kernel: CPU: 2 PID: 44 Comm: ksmd Tainted: P           O      5.15.108-1-pve #1
Aug 05 22:40:29 versuvio kernel: Hardware name: System manufacturer System Product Name/PRIME B250M-A, BIOS 1205 05/11/2018
Aug 05 22:40:29 versuvio kernel: RIP: 0010:down_read+0x74/0xa0
Aug 05 22:40:29 versuvio kernel: Code: cc cc cc 49 8b 44 24 08 65 48 8b 14 25 c0 fb 01 00 83 e0 02 48 09 d0 48 83 c8 01 49 89 44 24 08 4c 8b 65 f8 c9 c3 cc cc cc cc <0f> 0b 49 8b 44 24 08 a8 01 74 b7 a8 02 75 b3 48 89 c2 48 83 ca 02
Aug 05 22:40:29 versuvio kernel: RSP: 0018:ffffb1a3401dbe30 EFLAGS: 00010282
Aug 05 22:40:29 versuvio kernel: RAX: 0000000000000000 RBX: ffff9317dd664968 RCX: 00000000000000c3
Aug 05 22:40:29 versuvio kernel: RDX: fffff84d35dc0000 RSI: ffffffffb7feb480 RDI: ffff9318401ae238
Aug 05 22:40:29 versuvio kernel: RBP: ffffb1a3401dbe38 R08: ffff9324b70fd000 R09: 00000000000005f0
Aug 05 22:40:29 versuvio kernel: R10: fffff84d05697ee8 R11: 0000000000000080 R12: ffff9318401ae238
Aug 05 22:40:29 versuvio kernel: R13: ffff9318401ae238 R14: 00007ef8af9fd11e R15: ffff93244c392ef8
Aug 05 22:40:29 versuvio kernel: FS:  0000000000000000(0000) GS:ffff932746d00000(0000) knlGS:0000000000000000
Aug 05 22:40:29 versuvio kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 05 22:40:29 versuvio kernel: CR2: 0000153b176d6000 CR3: 0000000cc4210006 CR4: 00000000003726e0
Aug 05 22:40:29 versuvio kernel: Call Trace:
Aug 05 22:40:29 versuvio kernel:  <TASK>
Aug 05 22:40:29 versuvio kernel:  ksm_scan_thread+0xb57/0x1c30
Aug 05 22:40:29 versuvio kernel:  ? wait_woken+0x70/0x70
Aug 05 22:40:29 versuvio kernel:  ? try_to_merge_with_ksm_page+0xd0/0xd0
Aug 05 22:40:29 versuvio kernel:  kthread+0x127/0x150
Aug 05 22:40:29 versuvio kernel:  ? set_kthread_struct+0x50/0x50
Aug 05 22:40:29 versuvio kernel:  ret_from_fork+0x1f/0x30
Aug 05 22:40:29 versuvio kernel:  </TASK>
Aug 05 22:40:29 versuvio kernel: ---[ end trace 6bb26abdbee71c07 ]---
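To see at a glance which kernel thread raised a warning like this one, the trace can be boiled down with a small script. This is only a rough sketch: the helper name `summarize_kernel_warning` is made up for illustration, and it assumes journal lines shaped like the excerpt above (message text after `kernel: `, frames between `<TASK>` and `</TASK>`).

```python
import re

# A few lines copied from the trace above, used as sample input.
SAMPLE = """\
Aug 05 22:40:29 versuvio kernel: WARNING: CPU: 2 PID: 44 at kernel/locking/rwsem.c:240 down_read+0x74/0xa0
Aug 05 22:40:29 versuvio kernel: CPU: 2 PID: 44 Comm: ksmd Tainted: P           O      5.15.108-1-pve #1
Aug 05 22:40:29 versuvio kernel: Call Trace:
Aug 05 22:40:29 versuvio kernel:  <TASK>
Aug 05 22:40:29 versuvio kernel:  ksm_scan_thread+0xb57/0x1c30
Aug 05 22:40:29 versuvio kernel:  ? wait_woken+0x70/0x70
Aug 05 22:40:29 versuvio kernel:  kthread+0x127/0x150
Aug 05 22:40:29 versuvio kernel:  ret_from_fork+0x1f/0x30
Aug 05 22:40:29 versuvio kernel:  </TASK>
"""


def summarize_kernel_warning(log_text):
    """Return the task name, warning location, and call-trace frames.

    Hypothetical helper; assumes the journal format shown in SAMPLE.
    """
    comm = re.search(r"Comm: (\S+)", log_text)
    where = re.search(r"WARNING: .* at (\S+) (\S+)", log_text)

    frames = []
    in_trace = False
    for line in log_text.splitlines():
        # Keep only the message part after the "kernel: " prefix.
        _, sep, msg = line.partition("kernel: ")
        if not sep:
            continue
        msg = msg.strip()
        if msg == "<TASK>":
            in_trace = True
        elif msg == "</TASK>":
            in_trace = False
        elif in_trace and not msg.startswith("?"):
            # Frames marked "?" are unreliable guesses, so skip them.
            frames.append(msg.split("+")[0])

    return {
        "task": comm.group(1) if comm else None,
        "source": where.group(1) if where else None,
        "function": where.group(2) if where else None,
        "frames": frames,
    }


if __name__ == "__main__":
    print(summarize_kernel_warning(SAMPLE))
```

For the trace above this reports the `ksmd` task with `ksm_scan_thread` as the first reliable frame, so the warning comes from the KSM scanner thread.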
 
ksm_scan_thread might point to KSM. You can try disabling it. It might also indicate memory corruption (run memtest for a long time?) or other faulty hardware.
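Before disabling KSM it can help to check whether it is active at all. Below is a minimal sketch that reads the kernel's KSM state from sysfs. The directory `/sys/kernel/mm/ksm` and its files `run`, `pages_shared` and `pages_sharing` are the standard Linux KSM interface; the helper name `read_ksm_status` and the state labels are my own, for illustration only.

```python
from pathlib import Path

# Standard sysfs location of the Linux KSM interface.
KSM_DIR = Path("/sys/kernel/mm/ksm")

# Meaning of the value in the "run" file (labels are my own wording).
RUN_STATES = {0: "stopped", 1: "running", 2: "stopped, pages unmerged"}


def read_ksm_status(ksm_dir=KSM_DIR):
    """Read the KSM run state and sharing counters from a sysfs directory.

    Hypothetical helper; ksm_dir is a parameter so it can be tested
    against a directory of sample files.
    """
    def read_int(name):
        return int((Path(ksm_dir) / name).read_text().strip())

    run = read_int("run")
    return {
        "run": run,
        "state": RUN_STATES.get(run, "unknown"),
        "pages_shared": read_int("pages_shared"),
        "pages_sharing": read_int("pages_sharing"),
    }


if __name__ == "__main__":
    if KSM_DIR.is_dir():
        print(read_ksm_status())
    else:
        print("KSM sysfs interface not found on this system")
```

If `pages_sharing` is non-zero, KSM is currently merging pages. As far as I know, writing `2` to `/sys/kernel/mm/ksm/run` as root stops KSM and unmerges the shared pages, and on Proxmox VE the `ksmtuned` service has to be stopped as well or it may switch KSM back on.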
 
Today it crashed again, with different errors underneath <TASK>.

It was stable for 80+ days before I added the following hardware:

2x G.Skill Aegis F4-2666C19S-16GIS
2x Corsair MP600 NH 2TB (NVMe)
2x Seagate Exos X20 - 20 TB
1x PNY Nvidia Quadro P2000

Before:
2x Samsung 970 EVO SSDs
4x Seagate Exos X20 - 20 TB
2x G.Skill Aegis F4-2666C19S-16GIS

After:
2x Corsair MP600 NH 2TB
6x Seagate Exos X20 - 20 TB
4x G.Skill Aegis F4-2666C19S-16GIS
1x PNY Nvidia Quadro P2000

Memtest is running. The first pass completed after an hour with 0 errors. How many passes do you suggest? And looking at the list above, can you point out which component could trigger the errors?
 
It was stable for 80+ days before I added the following hardware:
...
1x PNY Nvidia Quadro P2000
I have a 4-port Quadro in my workstation. The benefit of having four screens hardly outweighs the number of X restarts, Plasma freezes and kernel crashes. It's my first nVidia card in years, having stayed with manufacturers that do provide proper open source drivers. I had not imagined the state to be so bad.

Your nVidia would be my first suspect!
 
In my case the freezes are happening on an Intel NUC with UHD graphics though :D
 
Have you ever resolved the freezes?

The problem was a faulty RAM module. Even though Memtest didn't show any issues, the problem was resolved after replacing the RAM modules. :)
 
