[HELP] Proxmox Linux VM crashing randomly with no logs or way to diagnose. I've tried everything!

rursache

New Member
Jul 7, 2022
Romania
radu.ursache.ro
Hey guys,

I've been pulling my hair out for over a week trying to troubleshoot a very weird issue.

I'm running Proxmox on an Intel NUC 11 (NUC11ATKPE) with 32GB RAM and a 256GB SSD.

I have 2 Debian VMs:
  1. A light one running only AdGuardHome in Docker (1 CPU core, 1GB RAM, 15GB storage), which works without any issue (5+ days of uptime)
  2. Another one running 13 Docker containers, SMB, and 4 passthrough USB devices (2 HDDs shared over SMB, a Zigbee adapter and a USB audio card) with 4 CPU cores, 8GB RAM, 80GB storage.
The second VM crashes randomly anywhere between 2 hours and 2 days of uptime. The crash is actually a complete freeze/lockup of the machine: no ping, no mouse interaction from Proxmox console, no VNC access, no nothing. I must hard reset the VM for it to work again.

I suspected memory at first; however, RAM usage stays under 40% at all times.

There are no logs in either the Proxmox syslog or anywhere in the /var/log directory on the VM.
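
One thing that may be worth ruling out, offered as a hedged sketch rather than a known fix: on systemd-based guests with journald's default `Storage=auto`, the journal is only kept across a hard reset if the `/var/log/journal` directory exists. The helper names below (`journal_is_persistent`, `report_journal_mode`) are made up for illustration and do not come from the thread.

```shell
#!/bin/sh
# Sketch only: report whether journald can keep logs across a hard reset.
# With journald's default Storage=auto, logs persist only if this directory exists.
JOURNAL_DIR="${JOURNAL_DIR:-/var/log/journal}"

journal_is_persistent() {
    # $1: directory to check (falls back to JOURNAL_DIR when empty)
    [ -d "${1:-$JOURNAL_DIR}" ]
}

report_journal_mode() {
    # Prints "persistent" or "volatile" for the given (or default) directory.
    if journal_is_persistent "$1"; then
        echo "persistent"
    else
        echo "volatile"
    fi
}
```

If this reports `volatile`, creating the directory and restarting `systemd-journald` (or rebooting) should switch the journal to persistent storage. After the next freeze, `journalctl -b -1 -e` shows the last messages of the previous boot.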

Here is the VM config and screenshots of the settings and hardware tabs of the VM with issues

What I tried that didn't fix it:
  • Multiple Linux distros from Ubuntu Server to normal Ubuntu, Linux Mint, Ubuntu Mate, Debian without DE, Debian with Cinnamon, Debian with LXDE. I'm currently running Ubuntu Mate.
  • Setting tdp_mmu as per Proxmox documentation
  • Updating the kernel in VM and in Proxmox to multiple versions from 5.4 to 5.18
  • Increasing swap size
  • Disabling ballooning in memory options
  • Switching machine type from q35 to i440fx
  • Changing processor type from host to kvm64 and qemu64
  • A memtest at host level - all good
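
For readers retracing the `tdp_mmu` item above: the setting is a parameter of the host's `kvm` kernel module, and the value currently in effect can usually be read from sysfs. This is a minimal sketch, assuming the host kernel exposes `/sys/module/kvm/parameters/tdp_mmu`. The function name `tdp_mmu_state` is illustrative only.

```shell
#!/bin/sh
# Sketch: print the current tdp_mmu setting of the kvm module on the Proxmox host.
# Assumption: the running kernel exposes the parameter at this sysfs path.
TDP_MMU_PARAM="${TDP_MMU_PARAM:-/sys/module/kvm/parameters/tdp_mmu}"

tdp_mmu_state() {
    # $1: optional path override (handy for testing on a machine without kvm)
    param_file="${1:-$TDP_MMU_PARAM}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "unknown"
    fi
}
```

A value of `N` means the TDP MMU is disabled. One common way to set it persistently is a line `options kvm tdp_mmu=N` in a file under `/etc/modprobe.d/`, followed by a reboot of the host.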
I currently have a cronjob in Proxmox that pings the VM every minute and resets it if the ping fails. However, this breaks any SMB file transfer in progress, along with other things like my HomeAssistant automations, various scripts and ongoing torrent downloads.
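
The post does not include the actual cron script, so the following is only a sketch of how such a ping watchdog might look on the Proxmox host. `VM_ID`, `VM_IP` and the script path in the usage note are placeholders. `qm reset` is the Proxmox CLI command for a hard reset of a VM. Sending three echo requests instead of one is a design choice made here, so that a single dropped packet does not trigger a reset.

```shell
#!/bin/sh
# Sketch of a ping watchdog for the Proxmox host, meant to be run from cron.
# VM_ID and VM_IP are placeholders, not values from the post.
VM_ID="${VM_ID:-100}"
VM_IP="${VM_IP:-192.0.2.10}"
PING_CMD="${PING_CMD:-ping}"   # overridable so the logic can be tested
QM_CMD="${QM_CMD:-qm}"

vm_is_alive() {
    # Send 3 echo requests, waiting up to 2 s for each reply;
    # ping exits 0 if at least one reply came back.
    "$PING_CMD" -c 3 -W 2 "$1" >/dev/null 2>&1
}

check_and_reset() {
    if vm_is_alive "$VM_IP"; then
        echo "alive"
    else
        # Hard reset, comparable to pressing the reset button on the VM.
        "$QM_CMD" reset "$VM_ID" >/dev/null && echo "reset"
    fi
}

# Run the check only when invoked as: <script> run
if [ "${1:-}" = "run" ]; then
    check_and_reset
fi
```

A crontab entry such as `* * * * * /usr/local/bin/vm-watchdog.sh run` (hypothetical path) would run the check every minute.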

EDIT: All the things running in the "crashing VM" were running for 1 year+ on a Raspberry Pi 4 without any issues. Same configs, containers and paths. I manually migrated them one by one.
EDIT 2: Here is a list of all the running docker containers
EDIT 3: Here is the `/var/log` directory of the VM

Any help or ideas?

Thanks!
 
I can't think of anything specific right now, but I would stop the containers and unplug the passthroughs, and see if the issue persists. If not, progressively enable them to find the culprit.
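
The container half of this suggestion could be scripted roughly as follows. It is a sketch under assumptions: the function names and the list file are invented for illustration, and the USB passthroughs still have to be removed and re-added by hand in the VM's hardware settings.

```shell
#!/bin/sh
# Sketch: stop every running container, then re-enable them a few at a time.
DOCKER_CMD="${DOCKER_CMD:-docker}"   # overridable so the logic can be tested

save_and_stop_all() {
    # $1: file that will hold the names of the containers that were running
    "$DOCKER_CMD" ps --format '{{.Names}}' > "$1"
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        "$DOCKER_CMD" stop "$name" >/dev/null </dev/null
    done < "$1"
}

start_first_n() {
    # $1: file written by save_and_stop_all, $2: how many containers to start
    head -n "$2" "$1" | while IFS= read -r name; do
        [ -n "$name" ] || continue
        "$DOCKER_CMD" start "$name" >/dev/null </dev/null && echo "started $name"
    done
}
```

For example, `save_and_stop_all /root/containers.txt` once, then `start_first_n /root/containers.txt 3`, waiting to see whether the freeze returns before raising the count. Containers that are already running are simply left running.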
 
Thanks for your answer.

I rely 100% on those passthroughs; if they are the culprit, I might as well go bare-metal.

I'll try stopping the docker containers and see what's up.
 
I think it's a problem between the CPU model and the container.

I have the same problem.

Before, I had an Intel Apollo Lake N3450 with a Debian VM running Docker for Bitwarden. Everything was working as expected.

I migrated to an N5105 (the same CPU family as yours) and the VM crashes. So far 2 times in two days, no logs.

None of my other VMs run Docker (proxy / mail / web server / mail gateway / PBS ...), and everything is fine for those.

I saw in another topic that a newer unofficial kernel may work better with this CPU.
 
I've been having the exact same issue. Even though this is an old thread, I hope someone out there has some ideas.
I have tried everything, with no effect so far.

Also running on an intel NUC 11.
I feel the issue is related to the NVMe drive, a Samsung 970 EVO Plus 2 TB.

Every time there is IO load, the system (the host) seems to crash. No syslog entries; it just freezes and needs a cold reset.

It got to the point where, whenever the system ran its nightly backup, it would crash around that same time.

Looking for any ideas.
 
I have exactly the same issue. The VM crashes randomly after some hours. No log entries on PVE or on the VM.

Intel Celeron J6412 (2,00 GHz, 4-Core, 1,5 MB)
3x 2,5GBit/s on Board LAN (Intel I225-V)
32 GB (1x 32768 MB) SO-DIMM DDR4 3200 RAM
128 GB ATP A600Vc Value M.2 SATA SSD
Sata-Samsung_SSD_850_EVO_2TB_S2RMNX0H500691N

I have not installed any Intel microcode myself.

Related to the Samsung EVO:

Code:
root@pve:~# fwupdmgr get-devices | grep -A 5 EVO
├─SSD 850 EVO 2TB:
│     Device ID:          e421b2fc248391f6fe3e55ddbb3c9043be068bd0
│     Summary:            ATA drive
│     Current version:    EMT02B6Q
│     Vendor:             Samsung (ATA:0x144D, OUI:002538)
│     Serial Number:      S2RMNX0H500691N
│     GUIDs:              e84efe7d-f45e-5643-80ac-b8f8d1dade5e ← IDE\Samsung_SSD_850_EVO_2TB_________________EMT02B6Q
│                         66af6b88-f065-561b-9f29-22561089d7b2 ← IDE\0Samsung_SSD_850_EVO_2TB_________________
│                         b023a3c8-ff60-5391-843b-4121cf2fe425 ← Samsung SSD 850 EVO 2TB
│     Device Flags:       • Internal device
│                         • Updatable
│                         • System requires external power source
│                         • Needs a reboot after installation
│                         • Device is usable for the duration of the update
 
I suspect most of these issues are related to the RAM configuration.

For instance:

@rursache runs an N6005 processor with 32GB of RAM, although it officially supports only up to 16GB:
intel.com

@hoggle runs a J6412 processor with a single 32GB stick. Although 32GB is consistent with the "32gb Max" in Intel's docs, those docs suggest the processor is somewhat picky about the exact memory configuration:
intel.com

Just my 2 cents!
 
Hello, my VMs started crashing randomly. The GUI kind of works, but all the VMs have a grey dot next to them. I can restart them manually, but the dot stays grey. In the app, when I click Resources I get an error saying "Null check operator used on a null value". If I do a hardware restart, everything is fine all day, then at around 2 am it has all crashed again.
 
Hi all, I am having exactly the same problem as described here. I am running on an N5105.
Everything had been working perfectly for a long time. I have 3 Debian VMs on the Proxmox server,
2 of them running Docker containers (with two cores each).
I updated the VMs (sudo apt update / upgrade) and the random crashes started after that.
First on one machine, where they stopped after a while. Now on the other machine, almost every day.

Funny though, on the third machine, with 4 cores, I see no issues...

I will check the kernel versions, as it looks like the problem comes from there. Maybe the number of cores assigned to each VM also has something to do with it.
 
