Proxmox keeps freezing, out of options

Sessix

Aug 5, 2023
Hi guys,

My Proxmox host freezes and needs a reboot every 4 to 5 hours, even under low CPU load.

What I've tried so far:
- Swapped the NVMe drives (2x 2 TB in RAID 1; also tried without RAID, same issue)
- Swapped the RAM modules
- Reinstalled Proxmox 7.4
- Reinstalled Proxmox 8.0
- Memtest completed without any issues

Code:
Aug 05 15:36:24 versuvio systemd[1]: user@0.service: Succeeded.
Aug 05 15:36:24 versuvio systemd[1]: Stopped User Manager for UID 0.
Aug 05 15:36:24 versuvio systemd[1]: Stopping User Runtime Directory /run/user/0...
Aug 05 15:36:24 versuvio systemd[1]: run-user-0.mount: Succeeded.
Aug 05 15:36:24 versuvio systemd[1]: user-runtime-dir@0.service: Succeeded.
Aug 05 15:36:24 versuvio systemd[1]: Stopped User Runtime Directory /run/user/0.
Aug 05 15:36:24 versuvio systemd[1]: Removed slice User Slice of UID 0.
Aug 05 15:36:56 versuvio pveproxy[157020]: worker exit
Aug 05 15:36:56 versuvio pveproxy[1715]: worker 157020 finished
Aug 05 15:36:56 versuvio pveproxy[1715]: starting 1 worker(s)
Aug 05 15:36:56 versuvio pveproxy[1715]: worker 234914 started
Aug 05 15:40:16 versuvio pveproxy[177526]: worker exit
Aug 05 15:40:16 versuvio pveproxy[1715]: worker 177526 finished
Aug 05 15:40:16 versuvio pveproxy[1715]: starting 1 worker(s)
Aug 05 15:40:16 versuvio pveproxy[1715]: worker 237604 started
Aug 05 15:48:33 versuvio smartd[1154]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 75 to 76
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 73 to 74
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 100
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sde [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 58 to 59
Aug 05 15:48:34 versuvio smartd[1154]: Device: /dev/sde [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 42 to 41
Aug 05 16:11:00 versuvio pvedaemon[164508]: <root@pam> successful auth for user 'root@pam'
Aug 05 16:17:01 versuvio CRON[267083]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 05 16:17:01 versuvio CRON[267084]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 05 16:17:01 versuvio CRON[267083]: pam_unix(cron:session): session closed for user root
Aug 05 17:15:18 versuvio pvedaemon[160535]: <root@pam> successful auth for user 'root@pam'
Aug 05 17:17:01 versuvio CRON[305614]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Aug 05 17:17:01 versuvio CRON[305615]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 05 17:17:01 versuvio CRON[305614]: pam_unix(cron:session): session closed for user root
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 74 to 75
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 64
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 54
Aug 05 17:18:33 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 46
Aug 05 17:48:34 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 53
Aug 05 17:48:34 versuvio smartd[1154]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 46 to 47
Aug 05 18:02:21 versuvio pvedaemon[160535]: <root@pam> successful auth for user 'root@pam'
-- Reboot --

The errors in the screenshot don't appear in the syslog. I'm out of options and hope somebody can suggest what to try next.

-> https://imageupload.io/ib/3VOQuJ1Z73ZaRIC_1691254665.jpg
 
Now it's showing up in the syslog, but without crashing. I can't see what triggers it:

Code:
Aug 05 22:40:29 versuvio kernel: ------------[ cut here ]------------
Aug 05 22:40:29 versuvio kernel: WARNING: CPU: 2 PID: 44 at kernel/locking/rwsem.c:240 down_read+0x74/0xa0
Aug 05 22:40:29 versuvio kernel: Modules linked in: tcp_diag inet_diag xt_nat xt_tcpudp nf_conntrack_netlink xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat overlay cmac nls_utf8 cifs cifs_arc4 cifs_md4 fscache netfs cfg80211 veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter nf_tables bonding tls nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm_intel ppdev mei_hdcp kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd snd_hda_codec_hdmi cryptd eeepc_wmi rapl intel_cstate snd_hda_intel asus_wmi snd_intel_dspcfg platform_profile sparse_keymap wmi_bmof nouveau snd_intel_sdw_acpi i915 snd_hda_codec pcspkr drm_ttm_helper snd_hda_core efi_pstore ttm snd_hwdep input_leds ee1004 mxm_wmi snd_pcm snd_timer
Aug 05 22:40:29 versuvio kernel:  drm_kms_helper cec joydev snd rc_core i2c_algo_bit fb_sys_fops syscopyarea apple_mfi_fastcharge sysfillrect soundcore mei_me sysimgblt mei parport_pc mac_hid parport acpi_pad vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c simplefb hid_apple hid_generic usbmouse usbkbd usbhid hid r8169 crc32_pclmul realtek nvme xhci_pci ahci i2c_i801 xhci_pci_renesas i2c_smbus nvme_core xhci_hcd libahci wmi video
Aug 05 22:40:29 versuvio kernel: CPU: 2 PID: 44 Comm: ksmd Tainted: P           O      5.15.108-1-pve #1
Aug 05 22:40:29 versuvio kernel: Hardware name: System manufacturer System Product Name/PRIME B250M-A, BIOS 1205 05/11/2018
Aug 05 22:40:29 versuvio kernel: RIP: 0010:down_read+0x74/0xa0
Aug 05 22:40:29 versuvio kernel: Code: cc cc cc 49 8b 44 24 08 65 48 8b 14 25 c0 fb 01 00 83 e0 02 48 09 d0 48 83 c8 01 49 89 44 24 08 4c 8b 65 f8 c9 c3 cc cc cc cc <0f> 0b 49 8b 44 24 08 a8 01 74 b7 a8 02 75 b3 48 89 c2 48 83 ca 02
Aug 05 22:40:29 versuvio kernel: RSP: 0018:ffffb1a3401dbe30 EFLAGS: 00010282
Aug 05 22:40:29 versuvio kernel: RAX: 0000000000000000 RBX: ffff9317dd664968 RCX: 00000000000000c3
Aug 05 22:40:29 versuvio kernel: RDX: fffff84d35dc0000 RSI: ffffffffb7feb480 RDI: ffff9318401ae238
Aug 05 22:40:29 versuvio kernel: RBP: ffffb1a3401dbe38 R08: ffff9324b70fd000 R09: 00000000000005f0
Aug 05 22:40:29 versuvio kernel: R10: fffff84d05697ee8 R11: 0000000000000080 R12: ffff9318401ae238
Aug 05 22:40:29 versuvio kernel: R13: ffff9318401ae238 R14: 00007ef8af9fd11e R15: ffff93244c392ef8
Aug 05 22:40:29 versuvio kernel: FS:  0000000000000000(0000) GS:ffff932746d00000(0000) knlGS:0000000000000000
Aug 05 22:40:29 versuvio kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 05 22:40:29 versuvio kernel: CR2: 0000153b176d6000 CR3: 0000000cc4210006 CR4: 00000000003726e0
Aug 05 22:40:29 versuvio kernel: Call Trace:
Aug 05 22:40:29 versuvio kernel:  <TASK>
Aug 05 22:40:29 versuvio kernel:  ksm_scan_thread+0xb57/0x1c30
Aug 05 22:40:29 versuvio kernel:  ? wait_woken+0x70/0x70
Aug 05 22:40:29 versuvio kernel:  ? try_to_merge_with_ksm_page+0xd0/0xd0
Aug 05 22:40:29 versuvio kernel:  kthread+0x127/0x150
Aug 05 22:40:29 versuvio kernel:  ? set_kthread_struct+0x50/0x50
Aug 05 22:40:29 versuvio kernel:  ret_from_fork+0x1f/0x30
Aug 05 22:40:29 versuvio kernel:  </TASK>
Aug 05 22:40:29 versuvio kernel: ---[ end trace 6bb26abdbee71c07 ]---
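For anyone reading a trace like the one above, here is a minimal Python sketch that pulls out the parts that usually matter: the source location of the WARNING, the task name (Comm), and the call-trace frames that are not marked with "?". The function name summarize_warning and the regular expressions are my own assumptions based on the log format pasted here, not part of any Proxmox or kernel tool.

```python
import re

# "WARNING: CPU: 2 PID: 44 at kernel/locking/rwsem.c:240 down_read+0x74/0xa0"
WARNING_RE = re.compile(r"WARNING: CPU: (\d+) PID: (\d+) at (\S+):(\d+) (\S+)")
# "Comm: ksmd" names the task that hit the warning.
COMM_RE = re.compile(r"Comm: (\S+)")
# A call-trace frame, optionally prefixed with "? " (an unreliable entry).
FRAME_RE = re.compile(
    r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def summarize_warning(lines):
    """Summarize one kernel WARNING block given as an iterable of log lines."""
    summary = {"source": None, "symbol": None, "comm": None, "frames": []}
    in_trace = False
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            summary["source"] = f"{m.group(3)}:{m.group(4)}"
            summary["symbol"] = m.group(5).split("+")[0]
        m = COMM_RE.search(line)
        if m:
            summary["comm"] = m.group(1)
        if "<TASK>" in line:
            in_trace = True
            continue
        if "</TASK>" in line:
            in_trace = False
            continue
        if in_trace:
            m = FRAME_RE.search(line)
            # Keep only frames without the "?" marker.
            if m and not m.group(1):
                summary["frames"].append(m.group(2))
    return summary


if __name__ == "__main__":
    import sys

    # Usage (assumed): journalctl -k | python3 this_script.py
    print(summarize_warning(sys.stdin))
```

Fed the block above, this should report the ksmd task and ksm_scan_thread as the first reliable frame, which is what points towards KSM or the memory it scans.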
 
ksm_scan_thread in the call trace might point at KSM (kernel samepage merging). You can try disabling it. It might also indicate memory corruption (run memtest for a long time?) or other faulty hardware.
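Before disabling KSM it can help to check whether it is active at all. A minimal Python sketch, assuming the standard sysfs location /sys/kernel/mm/ksm; the function name read_ksm_state and the returned dictionary layout are made up for illustration:

```python
from pathlib import Path

# Standard sysfs directory for the KSM tunables and counters.
KSM_DIR = Path("/sys/kernel/mm/ksm")

# Meaning of the "run" file according to the kernel KSM documentation.
RUN_STATES = {
    0: "stopped (merged pages kept)",
    1: "running",
    2: "stopped (merged pages unmerged)",
}


def read_ksm_state(ksm_dir=KSM_DIR):
    """Return the KSM run state and page-sharing counters as a dict."""
    ksm_dir = Path(ksm_dir)

    def read_int(name):
        return int((ksm_dir / name).read_text().strip())

    run = read_int("run")
    return {
        "run": run,
        "state": RUN_STATES.get(run, "unknown"),
        "pages_shared": read_int("pages_shared"),
        "pages_sharing": read_int("pages_sharing"),
    }


if __name__ == "__main__":
    print(read_ksm_state())
```

If run is 1 and pages_sharing is non-zero, KSM is actively merging pages. Writing 2 to the run file as root stops ksmd and unmerges the pages; on Proxmox the ksmtuned service would presumably also need to be stopped, otherwise it may switch KSM back on.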
 

Today it crashed again, with different errors under <TASK>.

It was stable for 80+ days before I added the following hardware:

2x G.Skill Aegis F4-2666C19S-16GIS
2x Corsair MP600 NH 2TB (NVME)
2x Seagate Exos X20 - 20 TB
1x PNY Nvidia Quadro P2000

Before:
2x Samsung 970 EVO SSDs
4x Seagate Exos X20 - 20 TB
2x G.Skill Aegis F4-2666C19S-16GIS

After:
2x Corsair MP600 NH 2TB
6x Seagate Exos X20 - 20 TB
4x G.Skill Aegis F4-2666C19S-16GIS
1x PNY Nvidia Quadro P2000

Memtest is running; the first pass completed after an hour with 0 errors. How many passes do you suggest? And looking at the list above, can you point out which component could trigger these errors?
 
I've got a 4-port Quadro in my workstation. The benefit of having four screens hardly outweighs the number of X restarts, Plasma freezes and kernel crashes. It's my first nVidia card in years; I had stayed with manufacturers that do provide proper open source drivers. I had not imagined the state of things to be so bad.

Your nVidia would be my first suspect!
 
In my case the freezes are happening on an Intel NUC with UHD graphics, though :D
 
Have you ever resolved the freezes?

The problem was a faulty RAM module. Even though Memtest didn't show any issues, replacing the RAM modules resolved the problem. :)