PM shuts down host unexpectedly

Not sure this is even a PM issue, but:

I recently installed Proxmox VE on an (old) Apple Mac Pro (2013) after running a Debian derivative (Zorin 16.2) on it faultlessly for a few months.

Now, after running for random periods of time, the system shuts down. Actually, it seems to sleep (the power light pulses slowly), but it cannot resume from this, so a reboot is required.

Obviously, this could be a thermal event triggered by a hardware issue, but my research suggests the system can run OK up to 125C. It never exceeds 95C (according to lm-sensors and bpytop, which I run over SSH), so I wonder if there is anything I can configure in PM to alleviate this.
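Because the console goes blank at the moment of failure, it may help to keep temperature readings on disk so the last value before a halt survives. The following is a minimal sketch, assuming the sensors are exposed through the standard hwmon sysfs files (`temp*_input`, in millidegrees Celsius); `max_temp_c` and the log path are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Print the highest temperature (whole degrees C) found under a hwmon tree.
# hwmon "temp*_input" files hold millidegrees Celsius.
# The optional argument overrides the sysfs root (useful for trying it out).
max_temp_c() {
    root="${1:-/sys/class/hwmon}"
    max=""
    for f in "$root"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        milli="$(cat "$f" 2>/dev/null)" || continue
        # Skip anything that is not a plain integer.
        case "$milli" in ''|*[!0-9-]*) continue ;; esac
        if [ -z "$max" ] || [ "$milli" -gt "$max" ]; then
            max="$milli"
        fi
    done
    [ -n "$max" ] || return 1
    echo $((max / 1000))
}

# Example: append a timestamped reading every 10 seconds and flush it to disk
# (the log path is an assumption):
#   while sleep 10; do
#       printf '%s %sC\n' "$(date '+%F %T')" "$(max_temp_c)" >> /root/temps.log
#       sync
#   done
```

If the last logged value before a halt is well below the usual peak, a thermal trip becomes a less likely explanation.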

Otherwise, the system should be able to manage the load. It has 64GB RAM and 12 CPUs


MTIA

P
 
Have you tried looking in the syslogs? Generally, 95 degrees C already sounds pretty hot; in my experience that's already close to the max temperature. 125 degrees C seems way too hot. Where did you get that from?

You can check them via the following command (adjust date/time accordingly):

Code:
journalctl --since '2023-01-30 00:00' --until '2023-01-30 23:59'

If you want me to take a look as well you can export it via:

Code:
journalctl --since '2023-01-30 00:00' --until '2023-01-30 23:59' > output.txt
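
When the exact time of the halt is not known, the tail of the previous boot's journal is often the quickest place to look. A minimal sketch, assuming the journal is persistent (otherwise earlier boots are not kept); `previous_boot_tail` is a hypothetical helper name, while `-b -1`, `-n` and `--no-pager` are standard journalctl options.

```shell
#!/bin/sh
# Print the last N lines logged during the previous boot.
# "-b -1" selects the boot before the current one.
previous_boot_tail() {
    lines="${1:-200}"
    journalctl -b -1 -n "$lines" --no-pager
}

# Example (run as root on the host):
#   journalctl --list-boots                  # which boots are recorded
#   previous_boot_tail 200 > previous-boot.txt
```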
 
That was one of the first things I checked.

Unfortunately, there is nothing in the logs and nothing appears on the host console (actual hardware) - the screen just goes blank.

This is one of the reasons I suspect hardware. However, I have just been running another Debian derivative, Kaisen, for most of today from a USB live image and that works perfectly.

I believe the temperature is a red herring as, ironically, I was checking the logs again and it crashed just as I was scrolling through. The temp was only 71C with only two small VMs running.

I had also just upgraded to kernel 6.1, so that did not help either.

I am currently installing Zorin 16.2 again on an external SSD to see if this helps to troubleshoot the issue.

So, although it is feasible that the issue is hardware related, I suspect there is something in how PM implements some feature that triggers it and that is not triggered by other Debian derivatives. Perhaps I will eliminate some possibilities by selectively excluding or removing USB devices. I currently use a USB 3.0 docking station to hold an 8TB single-disc ZFS pool. I realise USB is not recommended for ZFS, but it had been working perfectly under Zorin, so I thought I would try my luck with PM. In any event, this is unlikely to cause such a catastrophic failure, I hope!

MTIA

;-}
P
 
Hi,

A spontaneous shutdown has also happened to me (twice now and only recently). Nothing suspicious in the syslog leading up to the halt.

I too have upgraded to kernel 6.1 and removed it again - I was troubleshooting freezes on my Asus PN51 (see separate threads). After updating the BIOS/Firmware of the PC I still had freezes on kernel 6.1 so removed it again to revert to the supported 5.15 kernel. Since then I have run without freezes for more than 4 days, but today the system suddenly halted. I am honestly not sure if the first time was before or after the 6.1 removal (I had never seen the PC shut down by itself so I actually suspected I had accidentally pressed the power button while moving a network switch ;)).

Could this have anything to do with the 6.1 install/removal?

Is there anywhere we can see if the shutdown was caused by a power management event and/or check if the shutdown was at least run properly (e.g. file systems unmounted properly, etc.)?

/Jeppe
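
One way to approach the question of whether a shutdown ran properly is to look for the shutdown sequence at the end of the previous boot's journal. The sketch below is a heuristic and an assumption rather than an official check: it looks for systemd's "Reached target ... Shutdown/Power-Off/Reboot" lines or journald's "Journal stopped" message. `looks_like_clean_shutdown` is a hypothetical name.

```shell
#!/bin/sh
# Reads journal text on stdin; succeeds if it contains lines that a normal
# shutdown sequence usually leaves behind (heuristic, case-insensitive).
looks_like_clean_shutdown() {
    grep -Eiq 'Reached target .*(shutdown|power-off|reboot)|Journal stopped'
}

# Example, on the host (needs a persistent journal):
#   journalctl -b -1 -n 100 --no-pager | looks_like_clean_shutdown \
#       && echo "previous boot ended with a normal shutdown sequence" \
#       || echo "no shutdown sequence logged (power loss, hang or crash?)"
#
# The wtmp records give a second opinion:
#   last -x shutdown reboot | head
```

If no shutdown sequence was logged, the file systems were most likely not unmounted cleanly, which points towards a power loss, hard hang or crash instead of an orderly power-management event.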
 
I have been running and stress testing Zorin most of today, using phoronix-test-suite (PTS) and manually copying/reading/deleting large amounts of data (400GB) to/from my external USB ZFS pool, all flawlessly.

I had the issue with both kernels, so not sure if they are the root cause. For me, when PM crashed/halted it did so most unceremoniously. So, each time I was half expecting one or other filesystem/pool to be corrupted/inaccessible but so far, so good.

Unless I can find a PTS test/suite that stresses the system the same way PM does I can only assume there is nothing wrong with the hardware and PM is doing something outside the envelope. But what? I ask. If I can find it I may be able to fix/workaround it.

I am even considering installing PM in Zorin. It is based on Ubuntu focal, after all, so should be compatible. But all this depends on root cause, without which I am snookered, or at least ill informed.

:-{
P
 
Unless I can find a PTS test/suite that stresses the system the same way PM does

Did you actually test (KVM-)virtualization on this Zorin-installation?

I am even considering installing PM in Zorin. It is based on Ubuntu focal, after all, so should be compatible.

Proxmox-products are based on Debian with a modified/extended Ubuntu-kernel.
So this might not work at all, but you might prove otherwise...
 
The PTS Suite I am running, cpu-massive, is currently hammering the system and is scheduled to take two days, so I will let it run to completion.

As you can see from the attached screenshots, the system is stressed far more than normal with PM, and in the background I am writing tons of data to the USB ZFS pool, in addition to what the PTS tests are doing, AND I am installing a VM under VirtualBox using KVM (also attached).

So far, everything looks extremely healthy, apart from the not unexpected sluggish behaviour with other apps.

;-}
P
 

Attachments

  • Screenshot 2023-01-31 at 17.41.54.png
  • Screenshot 2023-01-31 at 17.52.04.png
  • Screenshot 2023-01-31 at 22.53.56.png
  • Screenshot 2023-01-31 at 23.52.24.png
  • Screenshot 2023-01-31 at 23.46.28.png
 
After days of testing and some slightly strange behaviour involving ZFS, I have decided to reinstall PVE from scratch using BTRFS.

I have asked sales to 'reset' my subscription key as it no longer works:

Subscription Key: xxxxxxxxxxxxxxxxxxxxx
Status: invalid: Invalid Server ID
Server ID: xxxxxxxxxxxxxxxxxxxxxxxxxxx
 
@nobillgates why do you suspect ZFS? I believe I use ext4 and I also see periodic freezes and even some poweroffs with my PN51. I have upgraded the BIOS/firmware, tried the 6.1 kernel which was even more unstable and now reverted to current 5.x kernel.

So strange that there is no clear remedy or explanation for an issue experienced so similarly by quite a few people.
 
ZFS? I have no definitive truth table, but some strange things happened, with it not showing full capacity (1 GB instead of 1 TB). So I tried BTRFS on USB, but that also failed, although it is currently working on another system (a NUC) being used for a slightly different purpose.

Meanwhile, my PVE is running without any external DASD, only the internal 1 TB SSD. So far, all is well. I have even been able to use pvesm set to put ISO and backup content on a dir storage which happens to be an NFS mount (in fstab) to my NAS. So, backups and ISOs are offloaded and only VM disks are local.

Looking good.

;-}
P
 
I got this repeatedly every maybe 10 seconds and could not log in to the console - guess it was busy crashing…
BEC38CEE-FF2B-4A7F-96CF-56A40D311213.jpeg
 
I spoke too soon, but this time it was better news.

The system went to sleep after a couple of days, but on this occasion there were useful entries in the logs saying that the sleep target had been reached.

Further research led me to a couple of reports from which I gleaned I should try the following:

systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

and

/etc/systemd/sleep.conf:
[...]
[Sleep]
AllowSuspend=no
AllowHibernation=no
AllowSuspendThenHibernate=no
AllowHybridSleep=no

[...]
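
To confirm that the masking took effect, something like the following could be run. This is a minimal sketch: `check_sleep_targets` and the `SYSTEMCTL` override are hypothetical names, and it relies on `systemctl is-enabled` printing `masked` for a masked unit.

```shell
#!/bin/sh
# SYSTEMCTL is a hypothetical override so the check can be tried with a stub;
# by default the real systemctl is used.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print the state of each sleep-related target; succeed only if all are masked.
check_sleep_targets() {
    rc=0
    for unit in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
        # is-enabled exits non-zero for masked units, hence the "|| true".
        state="$($SYSTEMCTL is-enabled "$unit" 2>/dev/null || true)"
        printf '%s: %s\n' "$unit" "${state:-unknown}"
        [ "$state" = "masked" ] || rc=1
    done
    return "$rc"
}

# Example, on the host:
#   check_sleep_targets && echo "all sleep targets are masked"
#   systemd-analyze cat-config systemd/sleep.conf   # show the effective sleep.conf
```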

We have now been running for over 4 days without incident. Perhaps this time my confidence will be suitably rewarded.

;-}
P
 
Just an update: after a routine update/upgrade, my setup has now been running for 19 days without issues. I am afraid to jinx it with this message, but maybe recent kernel updates have solved the issue (fingers crossed)?

/Jeppe
 
Looking good here, too.

For some reason, without a restart or other intervention, it started performing really well. All VMs are using the host CPU type, not emulation, so this might help.

I have been able to run up to 15 VMs simultaneously, although to be fair I was only using 2-3 at any given time. Even so, it has been quite fun to create/destroy on a whim without the technology getting in the way.


Tentative plaudits to the team.

;-}
P
 

Attachments

  • Screenshot 2023-03-11 at 14.05.20.png
