Proxmox 8.4.1 Host is crashing mysteriously

the1corrupted

I did a fresh install of Proxmox on this server, put up a container with a mount point, and got to work, but when I got up this morning the host was frozen. It had been freezing previously as well, but I thought a fresh install might address the issue. It has not.

The SMART Temperature_Celsius values reported by smartd don't match what lm-sensors shows after I installed it and ran the sensors command.

pveversion
pve-manager/8.4.1/2a5fa54a8503f96d (running kernel: 6.8.12-10-pve)

I can't find any evidence of the crash in the host logs:

Code:
Apr 28 04:17:01 proxmox-b CRON[285635]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Apr 28 04:17:01 proxmox-b CRON[285636]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Apr 28 04:17:01 proxmox-b CRON[285635]: pam_unix(cron:session): session closed for user root
Apr 28 04:33:10 proxmox-b smartd[1727]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 125
Apr 28 04:56:35 proxmox-b pmxcfs[2020]: [dcdb] notice: data verification successful
-- Reboot --
Apr 28 05:49:48 proxmox-b kernel: Linux version 6.8.12-10-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-10 (2025-04-18T07:39Z) ()
Apr 28 05:49:48 proxmox-b kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-10-pve root=/dev/mapper/pve-root ro quiet

Sensors:
Code:
nvme-pci-0300
Adapter: PCI adapter
Composite:    +46.9°C  (low  =  -0.1°C, high = +99.8°C)
                       (crit = +109.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8°C  

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +36.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +30.0°C  (high = +80.0°C, crit = +100.0°C)
Core 4:        +33.0°C  (high = +80.0°C, crit = +100.0°C)
Core 8:        +35.0°C  (high = +80.0°C, crit = +100.0°C)
Core 12:       +32.0°C  (high = +80.0°C, crit = +100.0°C)
Core 16:       +31.0°C  (high = +80.0°C, crit = +100.0°C)
Core 20:       +30.0°C  (high = +80.0°C, crit = +100.0°C)
Core 28:       +33.0°C  (high = +80.0°C, crit = +100.0°C)
Core 29:       +33.0°C  (high = +80.0°C, crit = +100.0°C)
Core 30:       +33.0°C  (high = +80.0°C, crit = +100.0°C)
Core 31:       +33.0°C  (high = +80.0°C, crit = +100.0°C)

nouveau-pci-0100
Adapter: PCI adapter
fan1:        1052 RPM
temp1:        +35.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

nvme-pci-0200
Adapter: PCI adapter
Composite:    +33.9°C  (low  = -273.1°C, high = +80.8°C)
                       (crit = +84.8°C)
Sensor 1:     +30.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +33.9°C  (low  = -273.1°C, high = +65261.8°C)
 
Hi the1corrupted,

Code:
Apr 28 04:33:10 proxmox-b smartd[1727]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 125

looks wildly hot for a drive. What kind of drive is your /dev/sdd? How is it connected?

lm-sensors by default will not show HDD temperatures, but smartmontools does, so there is no obvious contradiction between the outputs. You could, however, try loading the drivetemp kernel module
Code:
modprobe drivetemp
and check if lm-sensors picks it up [0].
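If drivetemp does show up in sensors, one way to have it loaded at every boot (just a sketch, assuming the standard Debian/PVE module handling; the file name is only a suggestion):
Code:
echo drivetemp > /etc/modules-load.d/drivetemp.conf
After the next boot (or a manual modprobe), sensors should list a drivetemp entry per SATA drive.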

Could you share the output of the following command?
Code:
smartctl -a /dev/sdd

[0] https://wiki.archlinux.org/title/Lm_sensors -- check for 'drivetemp'
 

looks wildly hot for a drive

I don't see any temp issue in the logs you provide.
SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 125
This is only the normalized SMART attribute value, not a temperature in degrees. AFAIK it is on a 0-255 scale where a higher value is the cooler end and a lower value is the hotter end, so 125 is actually cooler than 122.
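To see the actual temperature in degrees, check the RAW_VALUE column of that attribute instead of the normalized value (smartmontools is clearly already installed, since smartd is logging):
Code:
smartctl -A /dev/sdd | grep -i temperature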


You don't actually provide much detail to diagnose your issue.

Proxmox on this server
HW details? NW details? Cluster details? Storage details?

put up a container with a mount point and got to work but when I got up this morning, the host was frozen
host as in the complete node or just the container (I imagine the whole proxmox-b node from the logs)? Was it pingable from within the NW? If you have a monitor/keyboard attached, was it accessible?

If you don't run this "container" - does the host run stable? What are the config details of this, or any other "offending" LXC/VM?

What is the free space like on the root (/) directory?
 
By host, yes I mean the node.

It is one node in a cluster of 4. The workload is a standalone Ubuntu 24.04 LXC with a mount point on a ZFS pool that is stable.

The LXC's root disk is stored on local-lvm, on the same boot drive.

That is all the node is running.

Hardware platform is an ASUS Z690 board with an Intel i5-12600K and 64GB (2x32GB) of Crucial memory.

Boot drive is a brand new Samsung 990 Pro 1TB.

All I have done so far:
Reinstalled Proxmox to the boot drive
Created a new instance of my Ubuntu Samba server
Joined the Samba server to Active Directory
Created a 6TB mount point on the ZFS pool for the container
Tried to copy files from a backup (another Samba share running stable on another node)

There have been issues with the console on 24.04 and I had to add "lxc.apparmor.profile = unconfined".

It is a privileged container.
 
It is one in a cluster of 4
As you probably know, you should have an odd number of nodes (or add a QDevice) to avoid quorum/split-brain issues.
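You can check how the cluster currently votes with pvecm (part of every PVE node); look at the Quorum information block for expected votes vs. total votes and the Quorate flag:
Code:
pvecm status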

Boot drive is a brand new Samsung 990 Pro 1TB.
Problems have been known on these drives (search these forums & others) - you could try changing the drive to a different one for comparison.

You still have not provided the available space. Show output for:
Code:
df -h

Reinstalled Proxmox to the boot drive
What does the cluster look like after that?

Created a new instance of my Ubuntu Samba server
Show config with:
Code:
pct config <CTID>

Tried to copy files from a backup (another Samba share running stable on another node)
but unsuccessful? Is this when the node went down?


Points you have not yet responded to/addressed:

  • NW details
  • Was it pingable from within the NW?
  • If you have a monitor/keyboard attached, was it accessible?
  • If you don't run this "container" - does the host run stable?

Tip: Maybe change your username for better fate! ;)
 
I can't just go changing my fate!!

As I am currently away from the cluster, I can give at least some general answers.

I don't know if my Samba transfer was successful. I did a nohup background call to smbclient; the last line indicated that files were transferred, but there was no completion message at the end. It was a recursive "mget *" to pull the data over with an authenticated user.

This transfer takes 3-4 hours so I can't watch it the whole time. I did restart the transfer this morning, waiting to see if it crashes the host.
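For reference, the transfer was roughly of this shape (a sketch only; the share name, credentials file and log path are placeholders, not the actual command used):
Code:
nohup smbclient //other-node/backup -A /root/.smbcredentials \
  -c 'recurse; prompt; mget *' > /root/smb-transfer.log 2>&1 &
tail -f /root/smb-transfer.log
Tailing the log makes it easier to see whether smbclient finished cleanly or died together with the host.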

I do have a monitor and keyboard attached and it was inaccessible there too: unresponsive to input. It required a hard reboot to get going again.

I left out a piece: I am passing through "/dev/dri/by-path/pci-0000:00:02.0-card" and "/dev/dri/by-path/pci-0000:00:02.0-render" for graphics acceleration, but as far as I know nothing would be using it. This is the integrated Intel graphics on the chip.

In the end I will remove one of the other nodes from the cluster to get quorum back to 3. It's literally only there to give me a 3-node quorum while I am still working on the issues with this new node. No VMs or CTs are on it.

NW: IPv4 flat network, 192.168.69.0/24
No IPv6 enabled on this server yet but I do have managed IPv6.

DNS is a couple of Pi-holes running for redundancy on other nodes.

If the LXC container is not running, the host has been stable overnight.

I have also seen other threads reporting issues with the Ubuntu 24.04 template.
 
I left out a piece: I am passing through "/dev/dri/by-path/pci-0000:00:02.0-card" and "/dev/dri/by-path/pci-0000:00:02.0-render" for graphics acceleration, but as far as I know nothing would be using it.
Your physically attached monitor may very well be using it, or trying to, and that is why you cannot access it that way (as above).
Try removing the passthrough & retest the LXC for stability. Passthrough can sometimes/often cause unforeseen outcomes. I realize the passthrough is possibly critical to your workload - but at least for testing, remove it.
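If the passthrough is configured via dev0/dev1 entries in the container config, removing them temporarily could look like this (a sketch; <CTID> is a placeholder):
Code:
pct set <CTID> --delete dev0,dev1
pct reboot <CTID>
You can re-add the entries once the host has proven stable without them.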

Ubuntu 24.04 template
Try rebuilding with a Debian template? - I generally have zero issues with them.
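Getting a Debian template onto the node is quick (a sketch; the exact template file name will differ depending on what pveam lists):
Code:
pveam update
pveam available --section system | grep debian-12
pveam download local debian-12-standard_<version>_amd64.tar.zst
The downloaded template then shows up when creating a new CT in the GUI or via pct create.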

For the third time:
  • During "freeze" was the host node pingable from within the NW?
  • You still have not provided the available space. Show output for: df -h
I realize you may not have answers currently for the above (as you mention), but try to answer them when you can.
 
Not pingable. df -h output available later today as I am not currently on site with the cluster.

It feels like a kernel crash or panic that doesn't show up in journalctl.

Full update later today. I will try with Debian 12 as well, as I like that much better anyway.
 
I will try with Debian 12 as well, as I like that much better anyway.
Me too - but I'm really in love with Alpine templates, as they are absolutely the least resource-hungry; although/because they are so bare, you have more building/configuring to do. But once you get them sorted - there is no going back!
(Look also at the template sizes - you will be astounded. Another example: just today I created for a friend an LXC with Alpine for AdGuard Home + SSH that has a backup size of ~24MB; do that with Debian & it will be 10 times that amount. This is in addition to the above-mentioned resource gains while running.)

Anyway, enough off-topic ranting, good luck with your issue.
 
Here it is, the awaited:

Code:
root@proxmox-b:~# df -h
Filesystem                Size  Used Avail Use% Mounted on
udev                       32G     0   32G   0% /dev
tmpfs                     6.3G  2.3M  6.3G   1% /run
/dev/mapper/pve-root      125G  9.5G  110G   8% /
tmpfs                      32G   66M   31G   1% /dev/shm
tmpfs                     5.0M     0  5.0M   0% /run/lock
efivarfs                  192K  147K   41K  79% /sys/firmware/efi/efivars
/dev/nvme0n1p2           1022M   12M 1011M   2% /boot/efi
radz-b                    9.7T  256K  9.7T   1% /radz-b
radz-b/subvol-401-disk-0  6.5T  4.8T  1.8T  73% /radz-b/subvol-401-disk-0
/dev/fuse                 128M   60K  128M   1% /etc/pve
tmpfs                     6.3G     0  6.3G   0% /run/user/0

I fully expected to come home to a frozen host today, but it's been stable for 12 hours. The "crash" was around 5 AM local time; it is currently 6:11 PM.

Here is a copy of the conf file:

Code:
cat /etc/pve/lxc/401.conf
arch: amd64
cores: 4
dev0: /dev/dri/by-path/pci-0000:00:02.0-card,gid=44
dev1: /dev/dri/by-path/pci-0000:00:02.0-render,gid=108
hostname: jellyfin
memory: 4096
mp0: radz-b:subvol-401-disk-0,mp=/jellyfin-media,size=6646G
net0: name=eth0,bridge=vmbr0,firewall=1,gw=192.168.69.1,hwaddr=BC:24:11:C7:95:01,ip=192.168.69.40/24,type=veth
ostype: ubuntu
rootfs: local-lvm:vm-401-disk-0,size=32G
searchdomain: ad.necrodex.io
swap: 1024
lxc.apparmor.profile = unconfined

Can I use a different privilege type and still give it decent access to the PCIe device? Like if I make it unprivileged, not nested?

As for a rebuild, this is still scheduled, but I won't have time today unfortunately.
 
OK, so I tried to migrate to Debian, but I still can't get the Active Directory authentication actually working.

The LXC brought the host down again after it was up for a period of time.

The time to failure might be repeatable, but for now, what am I looking for?

I get nothing from the host.
Would the LXC have the diagnostic logging I need to start finding the root cause?
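One way to check both sides after the next freeze (a sketch, assuming CTID 401 from the config above and the persistent journal that Debian 12/PVE 8 uses by default):
Code:
# On the host, after the hard reboot: read the tail of the previous boot
journalctl --list-boots
journalctl -b -1 -e

# Inside the container
pct enter 401
journalctl -e
If the host truly hangs rather than panicking, there may simply be nothing written in either place.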
 
You might also want to try one of our newer opt-in kernels, 6.11 [0] or 6.14 [1], to check if this improves the situation for you. You might be running into issues that have been fixed in newer kernel versions. 'Random' crashes without journal messages might point at kernel crashes in the IO stack, where there is no time for anything to be written out.

[0] https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.156818/
[1] https://forum.proxmox.com/threads/o...e-8-available-on-test-no-subscription.164497/
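Installing an opt-in kernel is a one-liner per the linked announcements (assuming the proxmox-kernel-6.11 / proxmox-kernel-6.14 meta-package names from those threads):
Code:
apt update
apt install proxmox-kernel-6.14
reboot
The newest installed kernel is booted by default; if it misbehaves you can pin back to 6.8 (e.g. with proxmox-boot-tool kernel pin).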