Node crashing randomly

howlecomputing

New Member
Jul 6, 2019
Every 2 or 3 days this node locks up and becomes unreachable from the network: I can't get to the GUI, it doesn't respond to ping, and it doesn't show up in the router's ARP table. The switch port shows TX traffic but no RX traffic, although the link stays up and the light on the switch stays on. I've been scratching my head for a month, and because the issue only happens every few days, troubleshooting is painfully slow. The machine is headless: it has a Ryzen 1700, which has no iGPU, and no graphics card, so I haven't been able to see anything on a monitor. Also, when I plug in a keyboard it doesn't light up, as if the node has crashed hard.
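Since the box is headless and nothing is visible when it hangs, one thing that may help is keeping the system journal across reboots, so the last messages before a lockup can be read afterwards. This is only a rough sketch: it assumes journald is still on its default `Storage=auto` setting (logs persist once `/var/log/journal` exists), and the helper name `enable_persistent_journal` is made up for illustration.

```shell
#!/bin/sh
# Sketch: make journald keep logs across reboots.
# Assumption: journald.conf uses the default Storage=auto, under which
# logs become persistent once the journal directory exists.
# enable_persistent_journal is a hypothetical helper, not a Proxmox tool.
enable_persistent_journal() {
    journal_dir="${1:-/var/log/journal}"
    mkdir -p "$journal_dir" && echo "created ${journal_dir}"
}

# On the node, as root:
#   enable_persistent_journal
#   systemctl restart systemd-journald
# After the next lockup and reboot, jump to the end of the previous boot's log:
#   journalctl -b -1 -e
```

If the hang is hard enough that nothing gets flushed to disk, the previous boot's log may simply stop without an error, which is still a useful data point.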

When it locks up, 1 of my 3 containers does not start back up after the reboot. I get this in the logs. I also often have to manually start the container more than once after a reboot before it comes up, and each failed attempt gives the same error. Once it has started, though, I can reboot the node and 95% of the time the container autostarts without issue.

Job for pve-container@105.service failed because the control process exited with error code.
See "systemctl status pve-container@105.service" and "journalctl -xe" for details.
TASK ERROR: command 'systemctl start pve-container@105' failed: exit code 1

Code:
root@proxmox:~# systemctl status pve-container@105.service
● pve-container@105.service - PVE LXC Container: 105
   Loaded: loaded (/lib/systemd/system/pve-container@.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2019-07-06 02:41:48 EDT; 7min ago
     Docs: man:lxc-start
           man:lxc
           man:pct
  Process: 1772 ExecStart=/usr/bin/lxc-start -n 105 (code=exited, status=1/FAILURE)

Jul 06 02:41:47 proxmox systemd[1]: Starting PVE LXC Container: 105...
Jul 06 02:41:48 proxmox lxc-start[1772]: lxc-start: 105: lxccontainer.c: wait_on_daemonized_start: 856 No such file or directory - Failed to receive
Jul 06 02:41:48 proxmox lxc-start[1772]: lxc-start: 105: tools/lxc_start.c: main: 330 The container failed to start
Jul 06 02:41:48 proxmox lxc-start[1772]: lxc-start: 105: tools/lxc_start.c: main: 333 To get more details, run the container in foreground mode
Jul 06 02:41:48 proxmox lxc-start[1772]: lxc-start: 105: tools/lxc_start.c: main: 336 Additional information can be obtained by setting the --logfil
Jul 06 02:41:48 proxmox systemd[1]: pve-container@105.service: Control process exited, code=exited status=1
Jul 06 02:41:48 proxmox systemd[1]: Failed to start PVE LXC Container: 105.
Jul 06 02:41:48 proxmox systemd[1]: pve-container@105.service: Unit entered failed state.
Jul 06 02:41:48 proxmox systemd[1]: pve-container@105.service: Failed with result 'exit-code'.
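The log output above suggests running the container in foreground mode with a log file. A minimal sketch of that, using container ID 105 from the output: `-F`, `-l` and `-o` are the foreground, log priority and log file options of `lxc-start`, while the helper name `debug_start_cmd` and the log path under `/tmp` are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: build a foreground debug start command for an LXC container.
# debug_start_cmd is a hypothetical helper; the default log path is an assumption.
debug_start_cmd() {
    ctid="$1"
    logfile="${2:-/tmp/lxc-${ctid}.log}"
    # -F: run in foreground, -l: log priority, -o: write the log to a file
    echo "lxc-start -n ${ctid} -F -l DEBUG -o ${logfile}"
}

# Print the command for container 105 so it can be reviewed first.
debug_start_cmd 105
# Then, on the node as root, run the printed command and read the log file.
```

The function only prints the command, so nothing is started until the printed line is run by hand on the node.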


I'm running 5.4-3. The container potentially causing the issue is an Ubuntu 18 container running UniFi Video, with a directory for recordings. Thanks in advance for any advice!

Code:
root@proxmox:~# pveversion -v
proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2
 
Hi,

Do you use ZFS for the rpool?
If yes, do you have a swap partition on the rpool?
 
Not using ZFS at this time. Earlier tonight I had a brief power outage (no UPS on this lab box), and when the box powered back on, that container didn't start back up. Same error. After I started it manually, it's working fine with no apparent issues. Do you think the container is causing the entire node to become unresponsive after a few days? I could try reinstalling it, but a failing container/VM taking down the entire node doesn't make sense to me.
 
