PROXMOX CRASHING AND HANGING DURING SHUTDOWNS AND REBOOTS

dva411

New Member
Feb 13, 2024
I have had Proxmox for about 6 weeks. I'm running it on a new mini PC with an Intel Core i5-13500H CPU, 64GB DDR5, and integrated Intel Raptor Lake graphics. Proxmox runs on 2x 2TB internal NVMe drives in a ZFS RAID 1 configuration. I've probably created 50+ VMs and LXCs in the last month while learning and getting to the right setup. I'm finally happy with my services, but I'm having significant stability issues.

1) I have an Ubuntu 22.04 desktop VM that rarely shuts down or reboots cleanly. It has the guest agent installed, but it just hangs.

2) Even if I manually stop my VMs and LXCs, Proxmox often just hangs on reboot or shutdown. The power is still on, but I lose connectivity and the screen. I have to pull the plug to get it to restart.

3) I have frequent crashes. These happen most often during software installs (new VMs, sometimes installs from apt repositories, sometimes restores from backups), and occasionally in Jellyfin when fast-forwarding a movie. They are always preceded by a heavy ramp-up of my fan, as if the machine is working really hard, despite CPU and memory usage being low. Then it just clicks off: the fan goes silent, no connectivity, no display, but it still has power. I can't shut it down with the switch; I have to pull the plug. I checked the SMART statistics on my NVMe drives, and it seems they've been powered off abruptly over 150 times. Not good! I can sometimes hit the drives heavily (transcoding, copying, etc.) for hours on end and never have a problem. It really seems installation of new software is the most unstable activity.
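For anyone wanting to check the same counters, the unsafe-shutdown count mentioned above is readable with smartctl. A minimal sketch, assuming the drives appear as `/dev/nvme0` and `/dev/nvme1` (adjust device names for your system):

```shell
# Install smartmontools if not present (Debian/Proxmox)
apt install smartmontools

# Full NVMe SMART/health log; look for the "Unsafe Shutdowns" counter,
# which increments on every abrupt power loss
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1

# Or filter to just the interesting lines
smartctl -a /dev/nvme0 | grep -i -E 'unsafe|power cycles|temperature'
```

Watching whether that counter keeps climbing is a cheap way to confirm the hard power-pulls are being recorded by the drives themselves.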

4) I attached my logs from the last boot. It seems I have something about a kernel split lock, at least one memory conflict, and perhaps some PCI problems? I also noticed, unrelated, that my Bluetooth firmware isn't being loaded (less important, but I'd like to get that fixed). I also noticed some references to NVIDIA, even though I just have a single integrated Intel CPU/GPU.
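On the split-lock messages: Alder/Raptor Lake CPUs detect split locks and the kernel deliberately stalls the offending task, which can make guests appear to hang. A common workaround (a sketch, not a definitive fix, and assuming a ZFS-root Proxmox install that boots via proxmox-boot-tool) is disabling detection on the kernel command line:

```shell
# Confirm the kernel is actually reporting split-lock events
dmesg | grep -i "split lock"

# On a ZFS-root install the command line lives in /etc/kernel/cmdline
# (GRUB installs use GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub instead).
# Append split_lock_detect=off to the existing single-line file:
sed -i 's/$/ split_lock_detect=off/' /etc/kernel/cmdline

# Write the updated command line to the boot entries, then reboot
proxmox-boot-tool refresh
```

This only silences the split-lock stalls; if the freezes are thermal or power related it won't help, but it removes one variable.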

I had a kernel panic issue last week and had to reinstall proxmox. I think it was related to a bad initramfs refresh. I thought it would be a good opporunity to start with a new install of Proxmox, and that some of my issues would go away. It seems it was the opposite. The rebooting issue went from intermittent, to NEVER reboots or shuts down. I really need some help. I cant rely on this system at all right now. Its very fragile, and if i'm not here to reboot it. It will just hang indefinately. I really would appreciate some help triaging and getting this to stable! I really like the versatility proxmox provides. However, its unusable with the all of the stability issues.
 

Attachments

  • last_boot.txt
    152.8 KB
I would define a process to first identify whether it's hardware- or software-related, then eliminate/remove as many components as possible and slowly build back up while tracking whether the problem reappears. Lots of time and energy.

For me, with similar issues, that process led to finding:
1 - A non-Linux VM where I had assigned a number of CPUs that was not a power of 2 (i.e. not 2, 4, 8, 16). The VM would lock up at close/shutdown, unhappy with 6, 10, or 12 CPUs, for example.
2 - With a dedicated GPU, the BIOS Resizable BAR setting was causing strange lock-ups similar to your experience. That was not an iGPU like yours; it was a dedicated AMD GPU.
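If you want to test the power-of-two theory, the core count can be inspected and changed per VM with `qm` (VMID 100 below is just a placeholder; substitute your own):

```shell
# Show the current core/socket assignment for VM 100 (hypothetical VMID)
qm config 100 | grep -E 'cores|sockets'

# Set a power-of-two core count, then retest shutdown behavior
qm set 100 --cores 8
```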
 
Thanks, I've tried everything that I have the knowledge to do. I'm at a loss. I've removed everything but one Ubuntu LXC and one Ubuntu VM. The LXC, which is running Jellyfin, never boots up automatically; it always fails the pre-start hook. By the time I can launch it manually, it always starts successfully, and it shuts down fine. The VM won't shut down. The Proxmox host won't shut down. I really need some pointers in the right direction based on my logs, or suggestions on things I should be looking at.
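For the failing pre-start hook, a foreground debug start usually shows exactly which step is failing (container ID 101 is a placeholder here):

```shell
# Start the container in the foreground with debug logging to a file
lxc-start -n 101 -F -l DEBUG -o /tmp/lxc-101.log

# Or via the Proxmox wrapper, which prints hook/debug output inline
pct start 101 --debug
```

If the hook only fails at boot but succeeds later, comparing the debug log against a manual start often points at a race, e.g. a device or network dependency that isn't ready yet.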
 
I'm desperate for a lifeline. I can't go more than 36 hours without a crash, and I usually reboot at least once a day. Is this highly unusual, or are stability issues like this common? I really could use some assistance!
 
You don't provide enough HW/SW details. But here are some pointers as to what I would do in your position:

  • Check your network configuration / NICs / cables carefully - you provided no details.
  • I assume you have checked your RAM - properly! (Memtest etc.)
  • Don't know what Mini PC you're running - But what you describe looks a lot like a thermal problem.
  • Try running the Mini PC enclosure open - i.e.: Open the box as much as possible. Even put a room fan over it. See if this makes a difference.
  • Next I would look at the power supply. Can it cope with the setup you have? Is it faulty?
  • Then you could / should look at the NVMe drives. Don't know what they are - But start by trying a fresh setup with a single spare NVMe (your PCIe might not be coping) - see if you can get a stable system up & running that way.
  • Try running an alternate OS - bare metal - and see if that's stable.

Good luck.
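On the thermal theory above: it's easy to watch temperatures while reproducing the load, so you get data rather than a guess. A minimal sketch using lm-sensors (package name on Debian/Proxmox):

```shell
# Install lm-sensors and detect available sensor chips
apt install lm-sensors
sensors-detect --auto

# Watch CPU package (and usually NVMe) temperatures every 2 seconds
# while you reproduce the heavy-install / transcoding workload
watch -n 2 sensors
```

If the fan ramp the poster describes coincides with the package temperature racing toward Tjmax right before the freeze, that's strong evidence for a thermal or power-delivery problem rather than a software one.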
 
Thanks for the suggestions on the process of elimination. For reference, it's a NucBox K7 from GMKtec. I upgraded it with 64GB DDR5 and 2x 2TB NVMe (one is a Seagate FireCuda 520, the other a WD Black... unfortunately they don't make the highly durable version of the 520 any more, so I just went with a WD with similar specs that was on sale). The NVMe drives are in ZFS RAID 1 with the standard Proxmox pool setup. The cables are all brand new Cat 6e connected to a 2.5Gbps switch. Recently it has been freezing when idle, not when doing heavy disk I/O.

I've disconnected all my USB drives. At first I had a couple of NVMe drives in enclosures connected to USB 3.2 ports, and 2x Seagate X18 18TB hard drives connected via a dual SATA-to-USB dock with its own power supply. The NVMe enclosures were connected to USB hubs that also had their own power. I tried everything from direct connections of the drives to powered hubs. Recently I disconnected everything, connected the drives to an old Raspberry Pi 4, and shared them back to the NucBox via Samba. Yet I continue to have crashes/freezes daily.

I don't know what to look at to home in on the root cause. I see a number of things at boot when looking at journalctl, but I'm not sure if they are real problems. If I look back to right before the crash, I see nothing. I've eliminated all but one container and two VMs. The LXC and one of the VMs are Ubuntu 22.04 (the VM is Jammy desktop). The Ubuntu VM has Docker installed and is running about 30 Docker containers. The other VM is HassOS. The Ubuntu LXC is an unprivileged container with Jellyfin installed directly into the container, with GPU passthrough and hardware transcoding/acceleration. The Docker services within the Ubuntu VM seem rock solid.

I have a love/hate relationship with this machine. I spent 5 years on a Raspberry Pi, and it was so stable. It just worked for everything I threw at it and handled it well (even though it was a Pi 4B with only 4GB of RAM). The difference in speed, power, and flexibility with my current setup is massive, but I can't keep it running.
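Given the ZFS mirror, it's also worth ruling out pool-level errors before chasing anything else. A quick sketch, assuming the default Proxmox pool name `rpool`:

```shell
# Health summary; "-x" prints only pools that have problems
zpool status -x

# Full status with per-device read/write/checksum error counters
zpool status -v rpool

# Verify all data against checksums (runs in the background)
zpool scrub rpool
```

Nonzero checksum errors on one side of the mirror would point at a drive, cable/slot, or RAM problem rather than a Proxmox issue.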

Some of the things in my log at boot are below; I just pasted in some excerpts that contain warnings: some overlapping memory registers, ZFS, etc. I'm not sure if my GPU PCI passthrough is part of the problem... My full journalctl from a prior boot was attached to the first post. Any help or suggestions on other logs to pull would be greatly appreciated.

FYI, I did learn that my SATA dock/drives were causing my Ubuntu VM not to shut down/reboot. After connecting those to my Raspberry Pi, my Ubuntu VM and the Proxmox host shut down and reboot quickly and consistently. Is this normal? I'd like those drives connected to the NUC. Is there something else I should do to make that part of the puzzle work reliably (a different dock, some other config, or is that expected)? That's a bit secondary, though. My primary concern is to stop the crashing:

Mar 30 17:05:10 nuc kernel: efi: Remove mem69: MMIO range=[0xc0000000-0xcfffffff] (256MB) from e820 map
Mar 30 17:05:10 nuc kernel: e820: remove [mem 0xc0000000-0xcfffffff] reserved
Mar 30 17:05:10 nuc kernel: efi: Not removing mem74: MMIO range=[0xfee00000-0xfee00fff] (4KB) from e820 map
Mar 30 17:05:10 nuc kernel: efi: Remove mem75: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map
Mar 30 17:05:10 nuc kernel: e820: remove [mem 0xff000000-0xffffffff] reserved
Mar 30 17:05:10 nuc kernel: secureboot: Secure boot disabled
Mar 30 17:05:10 nuc kernel: SMBIOS 3.6.0 present.
Mar 30 17:05:10 nuc kernel: DMI: GMKtec NucBox K7/GMKtec, BIOS NucBox K71130 11/30/2023
Mar 30 17:05:10 nuc kernel: tsc: Detected 3200.000 MHz processor
Mar 30 17:05:10 nuc kernel: tsc: Detected 3187.200 MHz TSC
Mar 30 17:05:10 nuc kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
....
Mar 30 17:05:10 nuc kernel: Faking a node at [mem 0x0000000000000000-0x00000010afbfffff]
Mar 30 17:05:10 nuc kernel: NODE_DATA(0) allocated [mem 0x10afbd5000-0x10afbfffff]
Mar 30 17:05:10 nuc kernel: Zone ranges:
Mar 30 17:05:10 nuc kernel: DMA [mem 0x0000000000001000-0x0000000000ffffff]
Mar 30 17:05:10 nuc kernel: DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
Mar 30 17:05:10 nuc kernel: Normal [mem 0x0000000100000000-0x00000010afbfffff]
Mar 30 17:05:10 nuc kernel: Device empty
Mar 30 17:05:10 nuc kernel: Movable zone start for each node
Mar 30 17:05:10 nuc kernel: Early memory node ranges

Mar 30 17:05:10 nuc kernel: On node 0, zone DMA: 1 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA: 1 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA: 96 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA32: 16898 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone Normal: 8192 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone Normal: 1024 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: Reserving Intel graphics memory at [mem 0x4c800000-0x503fffff]
Mar 30 17:05:10 nuc kernel: ACPI: PM-Timer IO Port: 0x1808
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])

----
Mar 30 17:05:10 nuc kernel: IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
Mar 30 17:05:10 nuc kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Mar 30 17:05:10 nuc kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Mar 30 17:05:10 nuc kernel: ACPI: Using ACPI (MADT) for SMP configuration information
Mar 30 17:05:10 nuc kernel: ACPI: HPET id: 0x8086a201 base: 0xfed00000
Mar 30 17:05:10 nuc kernel: e820: update [mem 0x3bff8000-0x3c0fafff] usable ==> reserved

----

Mar 30 17:05:10 nuc kernel: PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
Mar 30 17:05:10 nuc kernel: PM: hibernation: Registered nosave memory: [mem 0x0009e000-0x0009efff]
Mar 30 17:05:10 nuc kernel: Kernel command line: initrd=\EFI\proxmox\6.5.13-1-pve\initrd.img-6.5.13-1-pve roo>
Mar 30 17:05:10 nuc kernel: DMAR: IOMMU enabled
Mar 30 17:05:10 nuc kernel: Unknown kernel command line parameters "boot=zfs", will be passed to user space.
Mar 30 17:05:10 nuc kernel: Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, linear)

Mar 30 17:05:10 nuc kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'

Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: reg 0x20: [io 0x3000-0x303f]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: BAR 2: assigned to efifb
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: reg 0x344: [mem 0x00000000-0x00ffffff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: VF(n) BAR0 space: [mem 0x00000000-0x06ffffff 64bit] (contains B>
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: reg 0x34c: [mem 0x00000000-0x1fffffff 64bit pref]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: VF(n) BAR2 space: [mem 0x00000000-0xdfffffff 64bit pref] (conta>
Mar 30 17:05:10 nuc kernel: pci 0000:00:06.0: [8086:a74d] type 01 class 0x060400


Mar 30 17:05:10 nuc kernel: PCI: Using ACPI for IRQ routing
Mar 30 17:05:10 nuc kernel: PCI: pci_cache_line_size set to 64 bytes
Mar 30 17:05:10 nuc kernel: pci 0000:00:1f.5: can't claim BAR 0 [mem 0xfe010000-0xfe010fff]: no compatible br>
Mar 30 17:05:10 nuc kernel: e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
Mar 30 17:05:10 nuc kernel: e820: reserve RAM buffer [mem 0x3bff8000-0x3bffffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: vgaarb: setting as boot VGA device
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: vgaarb: bridge control possible
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
Mar 30 17:05:10 nuc kernel: vgaarb: loaded

Mar 30 17:05:10 nuc kernel: system 00:00: [io 0x0680-0x069f] has been reserved
Mar 30 17:05:10 nuc kernel: system 00:00: [io 0x164e-0x164f] has been reserved
Mar 30 17:05:10 nuc kernel: system 00:01: [io 0x1854-0x1857] has been reserved
Mar 30 17:05:10 nuc kernel: pnp 00:02: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0>
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfedc0000-0xfedc7fff] has been reserved

Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfed90000-0xfed93fff] could not be reserved
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfed45000-0xfed8ffff] could not be reserved
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfee00000-0xfeefffff] could not be reserved
Mar 30 17:05:10 nuc kernel: system 00:04: [io 0x2000-0x20fe] has been reserved

Mar 30 17:05:10 nuc kernel: pci_bus 0000:00: max bus depth: 1 pci_try_num: 2
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: BAR 9: assigned [mem 0x4020000000-0x40ffffffff 64bit pref]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: BAR 7: assigned [mem 0x4010000000-0x4016ffffff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: BAR 13: assigned [io 0x4000-0x4fff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:15.0: BAR 0: assigned [mem 0x4017000000-0x4017000fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:15.1: BAR 0: assigned [mem 0x4017001000-0x4017001fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:19.0: BAR 0: assigned [mem 0x4017002000-0x4017002fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:19.1: BAR 0: assigned [mem 0x4017003000-0x4017003fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1e.0: BAR 0: assigned [mem 0x4017004000-0x4017004fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1e.3: BAR 0: assigned [mem 0x4017005000-0x4017005fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1f.5: BAR 0: assigned [mem 0x50400000-0x50400fff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:06.0: PCI bridge to [bus 01]
Mar 30 17:05:10 nuc kernel: pci 0000:00:06.0: bridge window [mem 0x5e800000-0x5e8fffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: PCI bridge to [bus 02-2b]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: bridge window [io 0x4000-0x4fff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: bridge window [mem 0x52000000-0x5e1fffff]

Mar 30 17:05:10 nuc kernel: pci 0000:00:1d.0: bridge window [mem 0x5e700000-0x5e7fffff]

Mar 30 17:05:10 nuc kernel: ACPI: thermal: Thermal Zone [TZ00] (28 C)
Mar 30 17:05:10 nuc kernel: Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
Mar 30 17:05:10 nuc kernel: hpet_acpi_add: no address or irqs in _CRS
Mar 30 17:05:10 nuc kernel: Linux agpgart interface v0.103
Mar 30 17:05:10 nuc kernel: loop: module loaded
Mar 30 17:05:10 nuc kernel: tun: Universal TUN/TAP device driver, 1.6
Mar 30 17:05:10 nuc kernel: PPP generic driver version 2.4.2
Mar 30 17:05:10 nuc kernel: i8042: PNP: PS/2 Controller [PNP0303:pS2K] at 0x60,0x64 irq 1
Mar 30 17:05:10 nuc kernel: i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please b>
Mar 30 17:05:10 nuc kernel: serio: i8042 KBD port at 0x60,0x64 irq 1
Mar 30 17:05:10 nuc kernel: mousedev: PS/2 mouse device common for all mice

Mar 30 17:05:10 nuc kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measure>
Mar 30 17:05:10 nuc kernel: device-mapper: uevent: version 1.0.3
Mar 30 17:05:10 nuc kernel: device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-devel@redhat.com
Mar 30 17:05:10 nuc kernel: platform eisa.0: Probing EISA bus 0
Mar 30 17:05:10 nuc kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Mar 30 17:05:10 nuc kernel: platform eisa.0: EISA: Detected 0 cards
Mar 30 17:05:10 nuc kernel: intel_pstate: Intel P-state driver initializing
Mar 30 17:05:10 nuc kernel: intel_pstate: HWP enabled

Mar 30 17:05:10 nuc kernel: usb usb3: SerialNumber: 0000:00:14.0
Mar 30 17:05:10 nuc kernel: hub 3-0:1.0: USB hub found
Mar 30 17:05:10 nuc kernel: hub 3-0:1.0: 12 ports detected
Mar 30 17:05:10 nuc kernel: nvme nvme0: missing or invalid SUBNQN field.
Mar 30 17:05:10 nuc kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Mar 30 17:05:10 nuc kernel: usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.05
Mar 30 17:05:10 nuc kernel: usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1


Mar 30 17:05:10 nuc kernel: input: Logitech Wireless Device PID:4008 Mouse as /devices/pci0000:00/0000:00:14.>
Mar 30 17:05:10 nuc kernel: hid-generic 0003:046D:4008.0004: input,hidraw1: USB HID v1.11 Mouse [Logitech Wir>
Mar 30 17:05:10 nuc kernel: input: Logitech K270 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7:1.2/0003:04>
Mar 30 17:05:10 nuc kernel: spl: loading out-of-tree module taints kernel.
Mar 30 17:05:10 nuc kernel: zfs: module license 'CDDL' taints kernel.
Mar 30 17:05:10 nuc kernel: Disabling lock debugging due to kernel taint
Mar 30 17:05:10 nuc kernel: zfs: module license taints kernel.
Mar 30 17:05:10 nuc kernel: logitech-hidpp-device 0003:046D:4003.0005: input,hidraw1: USB HID v1.11 Keyboard >
Mar 30 17:05:10 nuc kernel: input: Logitech M185 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7:1.2/0003:04>
Mar 30 17:05:10 nuc kernel: logitech-hidpp-device 0003:046D:4008.0004: input,hidraw2: USB HID v1.11 Mouse [Lo>
Mar 30 17:05:10 nuc kernel: ZFS: Loaded module v2.2.2-pve1, ZFS pool version 5000, ZFS filesystem version 5
Mar 30 17:05:10 nuc kernel: zd32: p1 p2 p3
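For "I look back to right before the crash and see nothing": journalctl can slice the previous boot directly, and making the journal persistent ensures the final minutes before a hard power-pull actually survive. A sketch:

```shell
# List recorded boots; index -1 is the boot before the current one
journalctl --list-boots

# Errors and worse from the previous boot
journalctl -b -1 -p err

# The last 200 lines before the freeze / power-pull
journalctl -b -1 -n 200 --no-pager

# Make the journal persistent across reboots (if /var/log/journal
# doesn't exist, logs live in volatile /run and vanish on power loss)
mkdir -p /var/log/journal
systemd-tmpfiles --create --prefix /var/log/journal
```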
 
FYI, the memtest suggestion was a good one. I ran one Saturday night, and there was good news and bad news. The good news was that my memory test passed 100%. The bad news was that it took two tries to get through it: the first time, my system froze less than 5 minutes into the test, and I had to pull the plug and reboot. Since memtest was running off a flash drive, I guess I can rule out my NVMe drives as well as my Proxmox software stack as the source of the problem. It seems I probably do have a heat-related problem. Amazon graciously agreed to exchange the unit.

I also discovered I had the BIOS set to performance and not balanced. My journalctl always reported that my system was set to balanced, and it used to be performance, so I never rechecked it. I switched it to balanced, and it's been that way for about 12 hours. The machine also has two top covers: the first holds a fan for the NVMe/RAM and screws into the unit; the second snaps on top of the first, covers the screws, and has a nice logo. I removed the second cover, which seems to give it slightly more breathing room on top.

I tried creating a few Windows VMs while streaming/transcoding, and it hasn't crashed (things it has struggled with in the past). Fan noise is down; it doesn't seem like it's pushing as hard. Between the exchange and the BIOS adjustments, I'm feeling cautiously optimistic that I might actually end up in a good place. I shouldn't say that out loud... I've thought I had it solved at least 20 times. Two additional questions:

1) I'm just curious about others' experiences with mini PCs. Is it common not to be able to run a mini PC on performance settings due to heat, or is this just poor design by this particular vendor? I guess I'm not terribly worried about it, as most things I've read indicate that the difference between balanced and high performance is somewhat negligible.

2) Once I get stability figured out, I'd like to reattach my SATA/USB dock (with the two 18TB hard drives) to the NucBox. I was passing them through to the VM as SCSI single devices. As stated before, I discovered those drives/dock were causing my VM not to reboot or shut down; I always had to force-stop it because it would hang during shutdown. Are there tips or suggestions on how to resolve that issue (getting the VM to shut down gracefully with large spinning drives mounted)?
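Two knobs worth trying before resorting to force-stop (a hedged sketch; VMID 100 and the disk ID are placeholders): give the guest longer to flush caches and spin down the big drives, and pass the disks by stable ID rather than `/dev/sdX` so the passthrough survives USB re-enumeration.

```shell
# Allow up to 5 minutes for a graceful shutdown before giving up
qm shutdown 100 --timeout 300

# List persistent identifiers for the docked drives
ls -l /dev/disk/by-id/

# Attach by persistent ID instead of /dev/sdX (fill in your own ID)
qm set 100 -scsi1 /dev/disk/by-id/<your-disk-id>
```

Slow spin-down of large mechanical drives during the guest's own sync is a common reason the default shutdown timeout expires, so the longer timeout alone may be enough.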
 
