[HELP] Proxmox Linux VM crashing randomly with no logs or way to diagnose. I've tried everything!

rursache · Jul 25, 2022

Hey guys,

I've been pulling my hair out for over a week trying to troubleshoot a very weird issue.

I'm running Proxmox on an Intel NUC 11 (NUC11ATKPE) with 32GB RAM and a 256GB SSD.

I have 2 Debian VMs:

A light one running only AdGuardHome in Docker (1 CPU, 1GB RAM, 15GB storage), which works without any issue (5 days+ uptime)
Another one running 13 Docker containers, SMB and 4 passthrough USB devices (2 HDDs shared over SMB, a Zigbee adapter and a USB audio card) with 4 CPU cores, 8GB RAM and 80GB storage.

The second VM crashes randomly anywhere between 2 hours and 2 days of uptime. The crash is actually a complete freeze/lockup of the machine: no ping, no mouse interaction from Proxmox console, no VNC access, no nothing. I must hard reset the VM for it to work again.

I suspected memory at first; however, RAM usage stays under 40% at all times.

There are no logs in either the Proxmox syslog or the entire /var/log directory on the VM.

Here is the VM config and screenshots of the settings and hardware tabs of the VM with issues

What I tried that didn't fix it:

Multiple Linux distros from Ubuntu Server to normal Ubuntu, Linux Mint, Ubuntu Mate, Debian without DE, Debian with Cinnamon, Debian with LXDE. I'm currently running Ubuntu Mate.
Setting tdp_mmu as per Proxmox documentation
Updating the kernel in VM and in Proxmox to multiple versions from 5.4 to 5.18
Increasing swap size
Disabling ballooning in memory options
Switching machine type from q35 to i440fx
Changing processor type from host to kvm64 and qemu64
A memtest at host level - all good

I currently have a cronjob in Proxmox to ping the VM each minute and if the ping fails it will reset it. However this breaks any SMB file transfer currently in progress and other things like my HomeAssistant automations, different scripts or ongoing torrent downloads.
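
For anyone who wants to reproduce this kind of stopgap, here is a minimal sketch of such a watchdog. It is not the script from this post: the VM ID, IP address, script path and the `PING_CMD` / `RESET_CMD` override variables are all hypothetical. It assumes `qm reset <vmid>` is available on the Proxmox host and that the guest answers ICMP.

```shell
#!/bin/sh
# Hypothetical watchdog for the Proxmox host. VMID and VM_IP are placeholders.
VMID="${VMID:-101}"
VM_IP="${VM_IP:-192.168.1.50}"
# The commands can be overridden so the logic can be dry-run without Proxmox.
PING_CMD="${PING_CMD:-ping -c 3 -W 2}"
RESET_CMD="${RESET_CMD:-qm reset}"

check_vm() {
    # Return 0 if the guest answers; otherwise hard-reset it and return 1.
    if $PING_CMD "$VM_IP" >/dev/null 2>&1; then
        return 0
    fi
    # Leave a trace in the host syslog so resets can be correlated later.
    logger -t vm-watchdog "VM $VMID ($VM_IP) unreachable, resetting" 2>/dev/null || true
    $RESET_CMD "$VMID"
    return 1
}

# Only act when called with "run", so the file can also be sourced safely.
if [ "${1:-}" = "run" ]; then
    check_vm
fi
```

A crontab entry such as `* * * * * /usr/local/bin/vm-watchdog.sh run` (path assumed) would call it every minute. Sending three pings instead of one reduces the chance of resetting the VM over a single dropped packet.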

EDIT: All the things running in the "crashing VM" were running for 1 year+ on a Raspberry Pi 4 without any issues. Same configs, containers and paths. I manually migrated them one by one.
EDIT 2: Here is a list of all the running docker containers
EDIT 3: Here is the `/var/log` directory of the VM

Any help or ideas?

Thanks!

Matthias. · Jul 25, 2022

I can't think of anything specific right now, but I would stop the containers and unplug the passthroughs, and see if the issue persists. If not, progressively enable them to find the culprit.

rursache · Jul 25, 2022

Matthias. said:
I can't think of anything specific right now, but I would stop the containers and unplug the passthroughs, and see if the issue persists. If not, progressively enable them to find the culprit.

Thanks for your answer.

I 100% rely on those passthroughs; if those are the culprits I might just as well go bare-metal.

I'll try stopping the docker containers and see what's up.

Dark26 · Sep 11, 2022

I think it's a problem between the CPU model and the containers.

I have the same problem.

Before, I had an Intel Apollo Lake N3450 with a Debian VM running docker for Bitwarden. Everything was working as expected.

I migrated to an N5105 (the same CPU branch as you) and the VM crashes. So far 2 times in two days, no logs.

My other VMs (proxy / mail / web server / mail gateway / PBS ...) don't run docker, and everything is fine for those.

I saw in another topic that a newer unofficial kernel may do better with this CPU.

rursache · Sep 11, 2022

Hey @Dark26, thanks for the reply! I went bare-metal, couldn't waste more time with those crashes. Which kernel were you referring to? I remember trying pve-edge-kernel but it made no difference.

Dark26 · Sep 11, 2022

That's the one... but I haven't tried it myself (yet). I'll try it at the next crash.

Malvada · Jan 10, 2024

I've been having the exact same issue. Even though this is an old thread, I hope someone out there has some ideas.
I have tried everything, but nothing has helped yet.

Also running on an Intel NUC 11.
I feel the issue is related to the NVMe drive, a 2 TB Samsung 970 EVO Plus.

Every time there is IO load the system seems to crash (the host). No syslog, it just freezes and needs a cold reset.

It got to the point where, when the system ran a backup at night, it would crash around that same time.

Looking for any ideas.

sb-jw · Jan 10, 2024

Malvada said:
Every time there is IO load the system seems to crash (the host). No syslog, it just freezes and needs a cold reset.

It sounds like your NVMe might be overheating?
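
One way to check this is to log the drive temperature while the backup runs. The sketch below is only an illustration: the function name `nvme_temps` is made up, and it assumes the host kernel exposes NVMe sensors through hwmon (a `name` file containing `nvme` and `temp1_input` in millidegrees Celsius under `/sys/class/hwmon`).

```shell
#!/bin/sh
# Print the temperature of every NVMe hwmon sensor found.
# The optional first argument overrides the hwmon root (handy for dry runs).
nvme_temps() {
    root="${1:-/sys/class/hwmon}"
    for d in "$root"/hwmon*; do
        # Skip entries that are not NVMe sensors or have no reading.
        [ -r "$d/name" ] || continue
        [ "$(cat "$d/name")" = "nvme" ] || continue
        [ -r "$d/temp1_input" ] || continue
        milli=$(cat "$d/temp1_input")
        # temp1_input is in millidegrees Celsius; integer division is enough here.
        echo "$(basename "$d"): $((milli / 1000)) C"
    done
}
```

Calling `nvme_temps` in a loop (for example once every few seconds, appending to a file on another disk) during heavy IO should show whether the temperature climbs right before a freeze.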

Dark26 · Jan 10, 2024

Did you install the Intel microcode?
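
A quick way to see which microcode revision is currently loaded is to read it from `/proc/cpuinfo`. This is a small sketch with a made-up helper name (`microcode_rev`); it assumes an x86 host, where each CPU entry has a `microcode` line.

```shell
#!/bin/sh
# Print the microcode revision reported for the first CPU.
# The optional first argument overrides the file to read (for dry runs).
microcode_rev() {
    awk -F': *' '/^microcode/ {print $2; exit}' "${1:-/proc/cpuinfo}"
}
```

Comparing the output of `microcode_rev` before and after installing the `intel-microcode` package (on Debian-based systems it lives in the non-free component, so that repository has to be enabled first) and rebooting shows whether an update was actually applied.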

hoggle · Jan 29, 2024

I have exactly the same issue. The VM crashes randomly after some hours. No log entries on PVE or on the VM.

Intel Celeron J6412 (2,00 GHz, 4-Core, 1,5 MB)
3x 2,5GBit/s on Board LAN (Intel I225-V)
32 GB (1x 32768 MB) SO-DIMM DDR4 3200 RAM
128 GB ATP A600Vc Value M.2 SATA SSD
Sata-Samsung_SSD_850_EVO_2TB_S2RMNX0H500691N

I have not installed the Intel microcode myself.

Related to the Samsung EVO:

Code:

root@pve:~# fwupdmgr get-devices | grep -A 5 EVO
├─SSD 850 EVO 2TB:
│     Device ID:          e421b2fc248391f6fe3e55ddbb3c9043be068bd0
│     Summary:            ATA drive
│     Current version:    EMT02B6Q
│     Vendor:             Samsung (ATA:0x144D, OUI:002538)
│     Serial Number:      S2RMNX0H500691N
│     GUIDs:              e84efe7d-f45e-5643-80ac-b8f8d1dade5e ← IDE\Samsung_SSD_850_EVO_2TB_________________EMT02B6Q
│                         66af6b88-f065-561b-9f29-22561089d7b2 ← IDE\0Samsung_SSD_850_EVO_2TB_________________
│                         b023a3c8-ff60-5391-843b-4121cf2fe425 ← Samsung SSD 850 EVO 2TB
│     Device Flags:       • Internal device
│                         • Updatable
│                         • System requires external power source
│                         • Needs a reboot after installation
│                         • Device is usable for the duration of the update

gfngfn256 · Jan 29, 2024

I suspect most of these issues are RAM configuration related.

For instance:

@rursache runs on an N6005 processor with 32 GB, which technically only supports up to 16 GB:
intel.com

@hoggle runs his on a J6412 processor with a single 32 GB stick. Although the number 32 is consistent with Intel's docs ("32 GB max"), from their docs it seems to be a little picky about the exact configuration required:
intel.com

Just my 2 cents!

Torchwood1 · Feb 2, 2024

Hello, my VMs started randomly crashing. The GUI kind of works, but all the VMs have a grey dot next to them. I can restart them manually, but the dot stays grey. In the app, when I click Resources I get an error saying "Null check operator used on a null value". If I do a hardware restart everything is fine all day, then at around 2 am it has all crashed again.

jfdzar · Apr 19, 2024

Hi all, I am having exactly the same error as described here. I am running them on an N5105.
Everything had been working perfectly for a long time. I have 3 Debian VMs on the Proxmox server,
2 of them running docker containers (with two cores).
I updated the VMs (sudo apt update / upgrade) and since then the random crashes started.
First on one machine; after a while it stopped. Now on the other machine, almost every day.

Funny though, on the third machine with 4 cores, I see no issues...

I will check the kernel versions, as it looks like the error comes from there. Maybe the number of cores assigned to each VM also has something to do with it.

frijsdijk · May 25, 2024

I'm having the same issues (N6005 on Odroid H3+), and this might be very interesting for you all: https://forums.servethehome.com/ind...ke-proxmox-kvm-qemu-vm-guest-stability.38824/ (sorry if it was mentioned before)
