So I've learned a lot and made a lot of progress since my post, but I'm still experiencing some issues.
I realized that I was able to access the syslog from the UI, which made things a lot easier. I dug through it around the times the system went down and didn't see anything that stood out. I thought it could be hardware related and started to monitor utilization and thermals, so I installed Netdata and set up sensors.
CPU utilization was very high (consistently over 90%), especially on the Home Assistant VM, and I believe that was causing very high CPU temperatures (steadily over 90°C). I found out how to change the governor from performance to powersave and repasted the NUC. That helped a lot and I started to see better uptime for Proxmox, but it still ended up going down. I eventually tried switching the CPU type in Proxmox for both Home Assistant and Plex from kvm64 to host, and that improved things a ton. Overall CPU utilization is now under 20% and thermals hover in the 60-85°C range. Proxmox has now been up for 19 hours, where previously I wouldn't see half of that.
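For anyone wanting to check or change the governor the same way, here is a rough Python sketch. It assumes the host exposes the standard Linux cpufreq sysfs files under /sys/devices/system/cpu and that writes are done as root. The function names are made up for illustration, and which governors are actually available depends on the cpufreq driver in use.

```python
from pathlib import Path

# Assumption: standard Linux cpufreq sysfs location.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def read_governors(base: Path = CPU_SYSFS) -> dict[str, str]:
    """Return {cpu_name: governor} for every CPU that exposes cpufreq."""
    governors = {}
    for gov_file in sorted(base.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu_name = gov_file.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def set_governor(name: str, base: Path = CPU_SYSFS) -> None:
    """Write the governor name to every CPU (needs root on a real host)."""
    for gov_file in base.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov_file.write_text(name + "\n")


if __name__ == "__main__":
    for cpu, gov in read_governors().items():
        print(cpu, gov)
```

Note that a value written this way does not survive a reboot, so it would have to be reapplied at boot by whatever mechanism you prefer.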
The main thing I'm still experiencing is that Home Assistant goes down while the VM is still running. It has happened a couple of times now, most recently overnight last night. Syslog from last night is attached. Another thing I noticed that could be related is that when I try to reboot the VM to get Home Assistant back up, the reboot fails and I see this in the syslog:
Code:
Jan 09 07:16:03 pve pvedaemon[1111]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Jan 09 07:16:06 pve pvedaemon[1109]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Jan 09 07:17:01 pve CRON[498327]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 09 07:17:01 pve CRON[498328]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jan 09 07:17:01 pve CRON[498327]: pam_unix(cron:session): session closed for user root
Jan 09 07:17:35 pve pvedaemon[1109]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Jan 09 07:17:46 pve pvedaemon[498703]: requesting reboot of VM 100: UPID:pve:00079C0F:00670D52:63BC05EA:qmreboot:100:root@pam:
Jan 09 07:17:46 pve pvedaemon[1111]: <root@pam> starting task UPID:pve:00079C0F:00670D52:63BC05EA:qmreboot:100:root@pam:
Jan 09 07:17:55 pve pvedaemon[1111]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Jan 09 07:18:15 pve pvedaemon[1109]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - got timeout
Jan 09 07:18:35 pve pvedaemon[1111]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - unable to connect to VM 100 qga socket - timeout after 31 retries
Jan 09 07:18:54 pve pvedaemon[1109]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - unable to connect to VM 100 qga socket - timeout after 31 retries
Jan 09 07:19:13 pve pvedaemon[1111]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - unable to connect to VM 100 qga socket - timeout after 31 retries
Jan 09 07:27:12 pve pvedaemon[1109]: VM 100 qmp command failed - VM 100 qmp command 'guest-ping' failed - unable to connect to VM 100 qga socket - timeout after 31 retries
Jan 09 07:27:46 pve pvedaemon[498703]: VM 100 qmp command failed - VM 100 qmp command 'guest-shutdown' failed - got timeout
Jan 09 07:27:46 pve pvedaemon[498703]: VM quit/powerdown failed
Jan 09 07:27:46 pve pvedaemon[1111]: <root@pam> end task UPID:pve:00079C0F:00670D52:63BC05EA:qmreboot:100:root@pam: VM quit/powerdown failed
Jan 09 07:30:20 pve pvedaemon[1109]: <root@pam> successful auth for user 'root@pam'
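The guest-ping and guest-shutdown entries above are QEMU guest agent commands, so the log suggests the agent inside VM 100 had stopped answering, although that is only my reading of it. To make the pattern easier to see across a longer syslog, here is a small Python sketch that tallies these failures by VM, command and reason. The regex is written against the line format shown above and is an assumption for any other format. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches pvedaemon lines such as:
# "Jan 09 07:16:03 pve pvedaemon[1111]: VM 100 qmp command failed - VM 100
#  qmp command 'guest-ping' failed - got timeout"
QMP_FAIL = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ pvedaemon\[\d+\]: "
    r"VM (?P<vmid>\d+) qmp command failed - VM \d+ qmp command "
    r"'(?P<cmd>[^']+)' failed - (?P<reason>.+)$"
)


def summarize_qmp_failures(lines):
    """Count failed qmp commands, keyed by (vmid, command, reason)."""
    counts = Counter()
    for line in lines:
        match = QMP_FAIL.match(line.strip())
        if match:
            counts[(match["vmid"], match["cmd"], match["reason"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: pipe saved syslog text into the script on stdin.
    for (vmid, cmd, reason), n in summarize_qmp_failures(sys.stdin).items():
        print(f"VM {vmid} {cmd}: {reason} x{n}")
```

Lines that do not match the pattern (cron sessions, auth messages, task start/end) are simply skipped.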