You haven't provided enough HW/SW details, but here are some pointers as to what I would do in your position:
- Check your network configuration / NICs / cables carefully - you provided no details.
- I assume you have checked your RAM - properly! (Memtest etc.)
- Don't know what Mini PC you're running, but what you describe looks a lot like a thermal problem.
- Try running the Mini PC with the enclosure open, i.e. open the box as much as possible. Even put a room fan over it. See if this makes a difference.
- Next I would look at the power supply. Can it cope with the setup you have? Is it faulty?
- Then you could / should look at the NVMes. Don't know what they are, but start by trying a fresh setup with a spare NVMe (a single one, since your PCIe might not be coping) and see if you can get a stable system up & running that way.
- Try running an alternate OS - bare metal - and see if that's stable.
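For the thermal check, something like the following could log temperatures so you can see what they were just before a freeze. It's a rough sketch, assuming the box exposes its sensors under /sys/class/thermal (the usual Linux sysfs location, values in millidegrees Celsius). The function names, the log file name and the 10-second interval are my own arbitrary choices, not anything specific to your setup.

```python
import glob
import os
import time


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every thermal zone found under base."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as fh:
                # sysfs reports millidegrees Celsius, e.g. 28000 for 28 C
                temps[zone] = int(fh.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Some zones refuse reads or return junk; skip them
            continue
    return temps


def format_sample(temps, now=None):
    """Format one timestamped log line from a {zone: degrees_C} dict."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(f"{zone}={deg:.1f}C" for zone, deg in sorted(temps.items()))
    return f"{stamp} {readings}"


def log_temps(logfile="thermal.log", interval=10, count=None):
    """Append a sample every `interval` seconds; count=None means run until stopped."""
    written = 0
    with open(logfile, "a") as out:
        while count is None or written < count:
            out.write(format_sample(read_zone_temps()) + "\n")
            out.flush()
            # Force the line to disk so the last reading survives a hard freeze
            os.fsync(out.fileno())
            written += 1
            if count is None or written < count:
                time.sleep(interval)
```

Call `log_temps()` from a shell session you leave running, then read the tail of thermal.log after the next freeze. If the last readings are far above the idle values, that points towards cooling; if they are flat, thermal is less likely.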
Good luck.
Thanks for the suggestions on process of elimination. For reference, it's a NucBox K7 from GMKtec. I upgraded it with 64GB DDR5 and 2 x 2TB NVMe (one is a Seagate FireCuda 520, the other is a WD Black. Unfortunately they don't make the highly durable version of the 520 any more, so I just went with a WD with similar specs that was on sale). The NVMe drives are ZFS RAID 1 with the standard Proxmox pool setup. The cables are all brand new Cat 6e connected to a 2.5Gbps switch.

Recently it has been freezing when idle, and not when doing heavy disk I/O. I've disconnected all my USB drives. At first I had a couple of NVMe drives in enclosures connected to USB 3.2 ports. I also had 2 x Seagate X18 18TB hard drives connected via a dual SATA-to-USB dock that has its own power supply. I had the NVMe enclosures connected to USB hubs that also had their own power. I tried everything from direct connections of the drives to powered hubs. Recently I disconnected everything, connected the drives to an old Raspberry Pi 4 and shared them back to the NucBox via Samba. Yet I continue to have crashes/freezes daily.

I don't know what to look at to home in on the root cause. I see a number of things in my boot log when looking at journalctl, but I'm not sure if they are real problems. If I look back to right before the crash I see nothing.

I eliminated all but 1 container and 2 VMs. The LXC and one of the VMs are Ubuntu 22.04 (the VM is Jammy desktop). The Ubuntu VM has Docker installed and is running about 30 Docker containers. The other VM is HASSOS. The Ubuntu LXC is an unprivileged container with Jellyfin installed directly into the container, with GPU passthrough and hardware transcoding/acceleration. The Docker services within the Ubuntu VM seem rock solid.

I have a love-hate relationship with this machine. I was 5 years on a Raspberry Pi. It was sooo stable. It just worked for everything I threw at it and handled it well (even though it was a Pi 4B with only 4GB of RAM). The difference in speed, power and flexibility with my current setup is massive, but I can't keep it running.
Some of the things in my log at boot are below. I just pasted in some excerpts that contained some warnings: some overlapping memory ranges, ZFS, etc. I'm not sure if my GPU PCI passthrough is part of the problem. My full journalctl from a prior boot was attached to the first thread. Any help or suggestions on other logs to pull would be greatly appreciated.

FYI, I did learn that my SATA dock/drives were preventing my Ubuntu VM from shutting down/rebooting. After connecting those to my Raspberry Pi, my Ubuntu VM and Proxmox shut down and reboot quickly and consistently. Is this normal? I'd like those drives connected to the NUC. Is there something else I should do to make that part of the puzzle work reliably (a different dock, some other config), or is that expected? That's a bit secondary though. My primary concern is to stop the crashing:
Mar 30 17:05:10 nuc kernel: efi: Remove mem69: MMIO range=[0xc0000000-0xcfffffff] (256MB) from e820 map
Mar 30 17:05:10 nuc kernel: e820: remove [mem 0xc0000000-0xcfffffff] reserved
Mar 30 17:05:10 nuc kernel: efi: Not removing mem74: MMIO range=[0xfee00000-0xfee00fff] (4KB) from e820 map
Mar 30 17:05:10 nuc kernel: efi: Remove mem75: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map
Mar 30 17:05:10 nuc kernel: e820: remove [mem 0xff000000-0xffffffff] reserved
Mar 30 17:05:10 nuc kernel: secureboot: Secure boot disabled
Mar 30 17:05:10 nuc kernel: SMBIOS 3.6.0 present.
Mar 30 17:05:10 nuc kernel: DMI: GMKtec NucBox K7/GMKtec, BIOS NucBox K71130 11/30/2023
Mar 30 17:05:10 nuc kernel: tsc: Detected 3200.000 MHz processor
Mar 30 17:05:10 nuc kernel: tsc: Detected 3187.200 MHz TSC
Mar 30 17:05:10 nuc kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
....
Mar 30 17:05:10 nuc kernel: Faking a node at [mem 0x0000000000000000-0x00000010afbfffff]
Mar 30 17:05:10 nuc kernel: NODE_DATA(0) allocated [mem 0x10afbd5000-0x10afbfffff]
Mar 30 17:05:10 nuc kernel: Zone ranges:
Mar 30 17:05:10 nuc kernel: DMA [mem 0x0000000000001000-0x0000000000ffffff]
Mar 30 17:05:10 nuc kernel: DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
Mar 30 17:05:10 nuc kernel: Normal [mem 0x0000000100000000-0x00000010afbfffff]
Mar 30 17:05:10 nuc kernel: Device empty
Mar 30 17:05:10 nuc kernel: Movable zone start for each node
Mar 30 17:05:10 nuc kernel: Early memory node ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA: 1 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA: 1 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA: 96 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone DMA32: 16898 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone Normal: 8192 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: On node 0, zone Normal: 1024 pages in unavailable ranges
Mar 30 17:05:10 nuc kernel: Reserving Intel graphics memory at [mem 0x4c800000-0x503fffff]
Mar 30 17:05:10 nuc kernel: ACPI: PM-Timer IO Port: 0x1808
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
Mar 30 17:05:10 nuc kernel: ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
----
Mar 30 17:05:10 nuc kernel: IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-119
Mar 30 17:05:10 nuc kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Mar 30 17:05:10 nuc kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Mar 30 17:05:10 nuc kernel: ACPI: Using ACPI (MADT) for SMP configuration information
Mar 30 17:05:10 nuc kernel: ACPI: HPET id: 0x8086a201 base: 0xfed00000
Mar 30 17:05:10 nuc kernel: e820: update [mem 0x3bff8000-0x3c0fafff] usable ==> reserved
----
Mar 30 17:05:10 nuc kernel: PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
Mar 30 17:05:10 nuc kernel: PM: hibernation: Registered nosave memory: [mem 0x0009e000-0x0009efff]
....
Mar 30 17:05:10 nuc kernel: Kernel command line: initrd=\EFI\proxmox\6.5.13-1-pve\initrd.img-6.5.13-1-pve roo>
Mar 30 17:05:10 nuc kernel: DMAR: IOMMU enabled
Mar 30 17:05:10 nuc kernel: Unknown kernel command line parameters "boot=zfs", will be passed to user space.
Mar 30 17:05:10 nuc kernel: Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, linear)
....
Mar 30 17:05:10 nuc kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Mar 30 17:05:10 nuc kernel:
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: reg 0x20: [io 0x3000-0x303f]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: BAR 2: assigned to efifb
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: DMAR: Skip IOMMU disabling for graphics
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: reg 0x344: [mem 0x00000000-0x00ffffff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: VF(n) BAR0 space: [mem 0x00000000-0x06ffffff 64bit] (contains B>
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: reg 0x34c: [mem 0x00000000-0x1fffffff 64bit pref]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: VF(n) BAR2 space: [mem 0x00000000-0xdfffffff 64bit pref] (conta>
Mar 30 17:05:10 nuc kernel: pci 0000:00:06.0: [8086:a74d] type 01 class 0x060400
Mar 30 17:05:10 nuc kernel: PCI: Using ACPI for IRQ routing
Mar 30 17:05:10 nuc kernel: PCI: pci_cache_line_size set to 64 bytes
Mar 30 17:05:10 nuc kernel: pci 0000:00:1f.5: can't claim BAR 0 [mem 0xfe010000-0xfe010fff]: no compatible br>
Mar 30 17:05:10 nuc kernel: e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
Mar 30 17:05:10 nuc kernel: e820: reserve RAM buffer [mem 0x3bff8000-0x3bffffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: vgaarb: setting as boot VGA device
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: vgaarb: bridge control possible
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
Mar 30 17:05:10 nuc kernel: vgaarb: loaded
....
Mar 30 17:05:10 nuc kernel: system 00:00: [io 0x0680-0x069f] has been reserved
Mar 30 17:05:10 nuc kernel: system 00:00: [io 0x164e-0x164f] has been reserved
Mar 30 17:05:10 nuc kernel: system 00:01: [io 0x1854-0x1857] has been reserved
Mar 30 17:05:10 nuc kernel: pnp 00:02: disabling [mem 0xc0000000-0xcfffffff] because it overlaps 0000:00:02.0>
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfedc0000-0xfedc7fff] has been reserved
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfed90000-0xfed93fff] could not be reserved
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfed45000-0xfed8ffff] could not be reserved
Mar 30 17:05:10 nuc kernel: system 00:02: [mem 0xfee00000-0xfeefffff] could not be reserved
Mar 30 17:05:10 nuc kernel: system 00:04: [io 0x2000-0x20fe] has been reserved
....
Mar 30 17:05:10 nuc kernel: pci_bus 0000:00: max bus depth: 1 pci_try_num: 2
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: BAR 9: assigned [mem 0x4020000000-0x40ffffffff 64bit pref]
Mar 30 17:05:10 nuc kernel: pci 0000:00:02.0: BAR 7: assigned [mem 0x4010000000-0x4016ffffff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: BAR 13: assigned [io 0x4000-0x4fff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:15.0: BAR 0: assigned [mem 0x4017000000-0x4017000fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:15.1: BAR 0: assigned [mem 0x4017001000-0x4017001fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:19.0: BAR 0: assigned [mem 0x4017002000-0x4017002fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:19.1: BAR 0: assigned [mem 0x4017003000-0x4017003fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1e.0: BAR 0: assigned [mem 0x4017004000-0x4017004fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1e.3: BAR 0: assigned [mem 0x4017005000-0x4017005fff 64bit]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1f.5: BAR 0: assigned [mem 0x50400000-0x50400fff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:06.0: PCI bridge to [bus 01]
Mar 30 17:05:10 nuc kernel: pci 0000:00:06.0: bridge window [mem 0x5e800000-0x5e8fffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: PCI bridge to [bus 02-2b]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: bridge window [io 0x4000-0x4fff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:07.0: bridge window [mem 0x52000000-0x5e1fffff]
Mar 30 17:05:10 nuc kernel: pci 0000:00:1d.0: bridge window [mem 0x5e700000-0x5e7fffff]
Mar 30 17:05:10 nuc kernel: ACPI: thermal: Thermal Zone [TZ00] (28 C)
Mar 30 17:05:10 nuc kernel: Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
Mar 30 17:05:10 nuc kernel: hpet_acpi_add: no address or irqs in _CRS
Mar 30 17:05:10 nuc kernel: Linux agpgart interface v0.103
Mar 30 17:05:10 nuc kernel: loop: module loaded
Mar 30 17:05:10 nuc kernel: tun: Universal TUN/TAP device driver, 1.6
Mar 30 17:05:10 nuc kernel: PPP generic driver version 2.4.2
Mar 30 17:05:10 nuc kernel: i8042: PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
Mar 30 17:05:10 nuc kernel: i8042: PNP: PS/2 appears to have AUX port disabled, if this is incorrect please b>
Mar 30 17:05:10 nuc kernel: serio: i8042 KBD port at 0x60,0x64 irq 1
Mar 30 17:05:10 nuc kernel: mousedev: PS/2 mouse device common for all mice
Mar 30 17:05:10 nuc kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measure>
Mar 30 17:05:10 nuc kernel: device-mapper: uevent: version 1.0.3
Mar 30 17:05:10 nuc kernel: device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-devel@redhat.com
Mar 30 17:05:10 nuc kernel: platform eisa.0: Probing EISA bus 0
Mar 30 17:05:10 nuc kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Mar 30 17:05:10 nuc kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Mar 30 17:05:10 nuc kernel: platform eisa.0: EISA: Detected 0 cards
Mar 30 17:05:10 nuc kernel: intel_pstate: Intel P-state driver initializing
Mar 30 17:05:10 nuc kernel: intel_pstate: HWP enabled
Mar 30 17:05:10 nuc kernel: usb usb3: SerialNumber: 0000:00:14.0
Mar 30 17:05:10 nuc kernel: hub 3-0:1.0: USB hub found
Mar 30 17:05:10 nuc kernel: hub 3-0:1.0: 12 ports detected
Mar 30 17:05:10 nuc kernel: nvme nvme0: missing or invalid SUBNQN field.
Mar 30 17:05:10 nuc kernel: nvme nvme0: Shutdown timeout set to 10 seconds
Mar 30 17:05:10 nuc kernel: usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.05
Mar 30 17:05:10 nuc kernel: usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
....
Mar 30 17:05:10 nuc kernel: input: Logitech Wireless Device PID:4008 Mouse as /devices/pci0000:00/0000:00:14.>
Mar 30 17:05:10 nuc kernel: hid-generic 0003:046D:4008.0004: input,hidraw1: USB HID v1.11 Mouse [Logitech Wir>
Mar 30 17:05:10 nuc kernel: input: Logitech K270 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7:1.2/0003:04>
Mar 30 17:05:10 nuc kernel: spl: loading out-of-tree module taints kernel.
Mar 30 17:05:10 nuc kernel: zfs: module license 'CDDL' taints kernel.
Mar 30 17:05:10 nuc kernel: Disabling lock debugging due to kernel taint
Mar 30 17:05:10 nuc kernel: zfs: module license taints kernel.
Mar 30 17:05:10 nuc kernel: logitech-hidpp-device 0003:046D:4003.0005: input,hidraw1: USB HID v1.11 Keyboard >
Mar 30 17:05:10 nuc kernel: input: Logitech M185 as /devices/pci0000:00/0000:00:14.0/usb3/3-7/3-7:1.2/0003:04>
Mar 30 17:05:10 nuc kernel: logitech-hidpp-device 0003:046D:4008.0004: input,hidraw2: USB HID v1.11 Mouse [Lo>
Mar 30 17:05:10 nuc kernel: ZFS: Loaded module v2.2.2-pve1, ZFS pool version 5000, ZFS filesystem version 5
Mar 30 17:05:10 nuc kernel: zd32: p1 p2 p3
....
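To triage excerpts like the ones above, a small filter can pull out only the lines that look like warnings and count them by type. This is a minimal sketch, not a diagnosis tool. The names `PATTERNS`, `flag_lines` and `summarize` are made up for illustration, the pattern list is just the set of phrases that appear in this excerpt, and it assumes the short journalctl line format shown here ("Mar 30 17:05:10 nuc kernel: ..."). A match does not mean the line is related to the freezes.

```python
import re

# Substrings taken from the excerpt above; an illustrative list, not a complete one.
PATTERNS = [
    "could not be reserved",
    "can't claim",
    "Cannot allocate resource",
    "taints kernel",
    "missing or invalid",
    "overlaps",
]

# "Mar 30 17:05:10 nuc kernel: message" -> timestamp, host, source, message
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) ([^:]+): (.*)$")


def flag_lines(lines, patterns=PATTERNS):
    """Return (timestamp, message, matched_pattern) for each line that matches."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            # Truncated or continuation line: nothing reliable to parse
            continue
        message = m.group(4)
        for pat in patterns:
            if pat in message:
                hits.append((m.group(1), message, pat))
                break  # report each line once, under the first pattern it matches
    return hits


def summarize(hits):
    """Count how many flagged lines fall under each pattern."""
    counts = {}
    for _, _, pat in hits:
        counts[pat] = counts.get(pat, 0) + 1
    return counts
```

One way to use it: save the kernel messages from the boot that froze with `journalctl -k -b -1 > boot.log` (`-b -1` selects the previous boot), then run `summarize(flag_lines(open("boot.log")))`. Comparing the counts between a boot that froze and one that did not may show whether any of these messages actually differ.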