Proxmox PVE 8.2-1 unresponsive after a few days

islandtime · Jan 12, 2025

Greetings. Proxmox PVE 8.2-1 host becomes unresponsive after a few days. It responds to ping, but no SSH or Web Interface. VMs are also unresponsive. The server needs to be power cycled in order to recover. In a few days again the same thing will happen.

Here are the errors that repeat from a remote syslog server:

Code:

/usr/sbin/ksmtuned: line 105: [: too many arguments
/usr/sbin/ksmtuned: line 79: /usr/bin/awk: Input/output error
/usr/sbin/ksmtuned: line 79: /usr/bin/ps: Input/output error
/usr/sbin/ksmtuned: line 83: /usr/bin/awk: Input/output error
/usr/sbin/ksmtuned: line 131: /usr/bin/sleep: Input/output error
/usr/sbin/ksmtuned: line 111: [: -lt: unary operator expected
/usr/sbin/ksmtuned: line 105: [: too many arguments
/usr/sbin/ksmtuned: line 79: /usr/bin/awk: Input/output error

Any help is greatly appreciated. I am a newbie but very excited about Proxmox.

UdoB · Jan 12, 2025

You gave us zero information about your setup. The more details you post here, the greater the chance for a helpful reply ;-)

Did you watch the RAM usage? Did you overcommit RAM? Is your storage set up with redundancy? Which filesystem does you host use?

Please start by giving us some information, like the copy-n-pasted output of some commands, run on a PVE host (either via SSH or via "Datacenter --> <one Node> --> Shell", both ways allow copy-n-paste):

Hardware information:

lscpu # the CPU with its capabilities ("Flags")
free -g # RAM availability including Swap. Units set to "GiB"

PVE System information:

pveversion -v

PVE Storages/Disks:

pvesm status #
df -h | grep -v subvol- # disk usage without listing ZFS datasets used for containers

Those are examples. You may add/edit commands and options if you can enrich the information given. Oh, and please put each command in a separate [CODE]...[/CODE]-block for better readability.

islandtime · Jan 15, 2025

Greetings UdoB. Thanks for trying to help me with this.

This is a Dell Optiplex box with a consumer NVMe SSD. It has one Ubuntu guest and one Windows 10 guest. It is not over-committed on RAM or any other resources, unless I am mistaken. Qemu agent is running on both VMs. No redundancy. This is using LVMThin for the VM storage.

I did the following to try to lessen the wear and tear on the SSD:
-installed log2ram https://github.com/azlux/log2ram?tab=readme-ov-file#installation
-systemctl disable --now pve-ha-lrm.service pve-ha-crm.service corosync.service

Code:

lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   12
  On-line CPU(s) list:    0-11
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel(R) Corporation
  Model name:             Intel(R) Core(TM) i5-10505 CPU @ 3.20GHz
    BIOS Model name:      Intel(R) Core(TM) i5-10505 CPU @ 3.20GHz  CPU @ 3.1GHz
    BIOS CPU family:      205
    CPU family:           6
    Model:                165
    Thread(s) per core:   2
    Core(s) per socket:   6
    Socket(s):            1
    Stepping:             3
    CPU(s) scaling MHz:   44%
    CPU max MHz:          4600.0000
    CPU min MHz:          800.0000
    BogoMIPS:             6399.96
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 s
                          s ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
                          art arch_perfmon pebs bts rep_good nopl xtopology nons
                          top_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
                          ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm p
                          cid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                          er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
                           cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tp
                          r_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adj
                          ust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx sm
                          ap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves
                          dtherm ida arat pln pts hwp hwp_notify hwp_act_window
                          hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:         VT-x
Caches (sum of all):
  L1d:                    192 KiB (6 instances)
  L1i:                    192 KiB (6 instances)
  L2:                     1.5 MiB (6 instances)
  L3:                     12 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-11
Vulnerabilities:
  Gather data sampling:   Vulnerable: No microcode
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode;
                           SMT vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; Enhanced IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prct
                          l
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointe
                          r sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditiona
                          l; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop,
                          KVM SW loop
  Srbds:                  Vulnerable: No microcode
  Tsx async abort:        Not affected

Code:

 free -g
               total        used        free      shared  buff/cache   available
Mem:              15           8           7           0           0           7
Swap:              7           0           7

Code:

pvesm status
Name             Type     Status           Total            Used       Available                                                                                                %
NVMe          lvmthin   disabled               0               0               0                                                                                              N/A
data          lvmthin   disabled               0               0               0                                                                                              N/A
local             dir     active        51290592         3766908        44885860                                                                                            7.34%
local-lvm     lvmthin     active       880156672       118997182       761159489                                                                                           13.52%

Code:

df -h | grep -v subvol-
Filesystem            Size  Used Avail Use% Mounted on
udev                  7.7G     0  7.7G   0% /dev
tmpfs                 1.6G   12M  1.6G   1% /run
/dev/mapper/pve-root   49G  3.6G   43G   8% /
tmpfs                 7.7G   34M  7.7G   1% /dev/shm
tmpfs                 5.0M     0  5.0M   0% /run/lock
efivarfs              384K   71K  309K  19% /sys/firmware/efi/efivars
/dev/nvme0n1p2       1022M   12M 1011M   2% /boot/efi
log2ram               128M   21M  108M  17% /var/log
/dev/fuse             128M   40K  128M   1% /etc/pve
tmpfs                 1.6G     0  1.6G   0% /run/user/0

islandtime · Jan 16, 2025

The filesystem is the default ext4. Thank you

gfngfn256 · Jan 16, 2025

islandtime said:
It has one Ubuntu guest and one Windows 10 guest. It is not over-committed on RAM or any other resources, unless I am mistaken.

Your system has a total of 16GB? How much have you given to each VM? Your host also needs it's own RAM to operate. How much does it have?

islandtime · Jan 16, 2025

Of course that makes sense. 5.59GiB to the Windows VM and 2.0GiB to the Ubuntu VM.

gfngfn256 · Jan 16, 2025

islandtime said:
5.59GiB

Wow! Not 5.6 & not 5.58, what made you use that exact amount? I hope you haven't just written here how much the GUI reports what that VM is using. We need to know how much you have assigned in the VMs configuration.

Anyway you should probably start by testing that ram for errors.
Then look at thermals, NVMe & PSU.

Good luck.

islandtime · Jan 16, 2025

The Windows VM was converted from VMWare ESXi. I gave it the same amount of RAM that it was assigned in VMWare; 6GB=5.59GiB as far as I could tell from Google.

The RAM was tested 6 months ago, but only with the Dell Onboard Diagnostic quick test. Would you suggest I run something more intensive and longer?

Yep, thermals. I will work on setting up Zabbix or Observium so I can keep an eye on those.

Are there any clues that I can look for in the logs?

Thank you

proxale · Jan 17, 2025

What network controller is your ProxMox host using?
Are there any errors from dmesg ?

gfngfn256 · Jan 17, 2025

islandtime said:
Would you suggest I run something more intensive and longer?

Definitely. Best to test on a non running system. Maybe run a memtest86 from any live Linux. You can do it from the Proxmox ISO installer too - on boot.

islandtime said:
Zabbix or Observium so I can keep an eye on those.

Not sure I would get that fancy on a failing system - you'll probably just overwhelm it that way. Just do something simple CLI etc. Monitor the system logs etc.

Check / replace NVMe for testing. Smartctl data isn't the most intuitive or reliable, but still worth looking at.

islandtime · Jan 17, 2025

proxale said:
What network controller is your ProxMox host using?
Are there any errors from dmesg ?

The NIC is Intel Gigabit CT Desktop Adapter EXPI9301CTBLK.

No, no logs at all from dmesg on the syslog server.
Thank you

proxale · Jan 19, 2025

islandtime said:
The NIC is Intel Gigabit CT Desktop Adapter EXPI9301CTBLK.

No, no logs at all from dmesg on the syslog server.
Thank you

There is a somewhat similar issue which occurs with some Intel NIC where the server just randomly drops off the network after a while. May or may not be relevant to your situation since you don't see any errors from dmesg but you can try running some of the diagnostic steps from https://forum.proxmox.com/threads/network-card-drop-igc-0000-09-00-0-eno1-pcie-link-lost.121295/

Search

Search

Proxmox PVE 8.2-1 unresponsive after a few days

islandtime

New Member

UdoB

Distinguished Member

islandtime

New Member

islandtime

New Member

gfngfn256

Famous Member

islandtime

New Member

gfngfn256

Famous Member

islandtime

New Member

proxale

Active Member

gfngfn256

Famous Member

islandtime

New Member

proxale

Active Member

We value your privacy