Proxmox node crashing occasionally

mocanub

Hello,

I have a Proxmox node that occasionally crashes. I really want to figure out the reason behind these crashes, as the servers running on it are critical.

Node configuration:
CPU - AMD Ryzen 5 5600G (offering 12 vCPUs)
RAM - 128 GB (2-3 GB used by the iGPU)
OS disk - 120 GB SSD
VM disk - 2 TB SSD

Proxmox version running: 7.3-3

When the VMs become unavailable, I can still connect to the affected node over SSH, but I get this message and can't do anything else to control the server:
sudo: error while loading shared libraries: /usr/lib/sudo/libsudo_util.so.0: cannot read file data: Input/output error

I've also installed netdata on the Proxmox node, but I can't see anything obvious that could lead to the crash.
I would appreciate any advice on the steps needed to identify the root cause of my node's issues.

If needed I can post additional info about my environment.

Thanks in advance.
 
I really want to figure out the reason behind these crashes, as the servers running on it are critical.
Better move the VMs to another Proxmox system as your current system is unreliable. I assume you have a spare system and backups in case this one fails for the critical servers.
When the VMs become unavailable, I can still connect to the affected node over SSH, but I get this message and can't do anything else to control the server:
sudo: error while loading shared libraries: /usr/lib/sudo/libsudo_util.so.0: cannot read file data: Input/output error
Sounds like a failing drive. Better replace it and reinstall Proxmox because you don't know what other files are corrupted. Maybe check SMART of the drive and run a long self-test?
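
In case it helps, here is a minimal sketch of the SMART check and long self-test suggested above, wrapped in Python. It assumes smartmontools is installed and that the suspect disk is /dev/sda; the device path and the helper names (smartctl_commands, run) are placeholders for illustration, not anything taken from this thread. The smartctl options used (-H, -A, -t long, -l selftest) are the standard ones.

```python
import shlex
import shutil
import subprocess

# Assumption: the suspect disk is /dev/sda. Adjust to the real device.
DEVICE = "/dev/sda"


def smartctl_commands(device):
    """Return the smartctl invocations for a basic drive check."""
    return {
        "health": ["smartctl", "-H", device],               # overall health verdict
        "attributes": ["smartctl", "-A", device],           # vendor attribute table
        "start_long_test": ["smartctl", "-t", "long", device],  # starts the extended self-test
        "selftest_log": ["smartctl", "-l", "selftest", device],  # results of past self-tests
    }


def run(cmd):
    """Run a command and return its stdout, or None if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    cmds = smartctl_commands(DEVICE)
    # Only the read-only queries are executed here (they usually need root).
    for name in ("health", "attributes", "selftest_log"):
        out = run(cmds[name])
        print(f"--- {name} ---")
        print(out if out is not None else "smartctl not found (install smartmontools)")
    # The long test is printed rather than started, so it can be launched deliberately.
    print("To start the long self-test, run:", shlex.join(cmds["start_long_test"]))
```

The long test runs in the background on the drive itself; the self-test log shows the result once it has finished. Given the I/O errors on the OS disk, it is probably safer to run this from a live system or after moving the VMs.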
 
Thanks for your response @leesteken

I'll try to move the VMs to a different node and run a long self-test as you've suggested.

The SMART values don't look that bad apart from that single Runtime_Bad_Block:
1678959001870.png
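
Since the screenshot is hard to scan by eye, a rough sketch of how the `smartctl -A` attribute table could be filtered for non-zero error counters may be useful. The sample table below is illustrative only (it is not the data from the screenshot), and the set of watched attribute names is an assumption about which counters usually matter, not an official list.

```python
# Assumption: these counters are the ones worth flagging when non-zero.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Runtime_Bad_Block",
    "Uncorrectable_Error_Cnt",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
}

# Illustrative sample in the usual `smartctl -A` layout (hypothetical values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       1
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Map attribute name -> integer raw value for each table row."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with a numeric ID and have the raw value in column 10.
        if len(parts) >= 10 and parts[0].isdigit() and parts[9].isdigit():
            attrs[parts[1]] = int(parts[9])
    return attrs


def flag_suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


if __name__ == "__main__":
    print(flag_suspicious(parse_attributes(SAMPLE)))
```

A single bad block does not prove the drive is dying, but combined with the Input/output error on the OS disk it is worth watching whether the counter grows between checks.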