[SOLVED] System broken by misconfigured backup

Nov 11, 2024
Recently my installation has started detaching itself from the network after 1-2 days, and I have to hard reset the machine. The problem has existed since an automated update on October 29th.
Does anyone know how to fix this or how to revert to a proper kernel version?
 
Hello, the command:

Code:
last reboot -F -n 20

will tell you the last 20 booted kernels. If you want to boot from a specified kernel version, please take a look at our documentation [1].

Having said that, is there anything in the system logs indicating why you lose network connectivity?

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_kernel_pin
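
As a rough sketch of what the pinning described in [1] looks like on the command line: the `pin_kernel` wrapper below is a hypothetical helper (not part of Proxmox) that only guards against an empty argument before calling `proxmox-boot-tool kernel pin`. The version string in the comments is a made-up example; use one that `proxmox-boot-tool kernel list` actually reports on your host.

```shell
# Hypothetical helper: pin the kernel version passed as $1.
# Refuses to run without an argument so nothing is pinned by accident.
pin_kernel() {
    if [ -z "${1:-}" ]; then
        echo "usage: pin_kernel <kernel-version>" >&2
        return 2
    fi
    proxmox-boot-tool kernel pin "$1"
}

# Typical sequence on the PVE host (run as root):
#   proxmox-boot-tool kernel list     # show the installed kernels
#   pin_kernel 6.8.12-2-pve           # example version, an assumption
#   proxmox-boot-tool kernel unpin    # return to booting the newest kernel
```

Pinning only changes which kernel is booted by default; the other installed kernels stay in place, so it is easy to undo.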
 
I looked through the logs and each time they just stop at 3 o'clock and some minutes. It looks like the system crashes: once during backup and once while emailing root@pam.
Then there are no more entries until the reboot remark.
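
For anyone wanting to look at the same spot in their own logs, here is a minimal sketch for printing the last lines written before a crash. It assumes the systemd journal is persistent, so that the previous boot is still available; `prev_boot_tail` is a hypothetical helper name and the default of 50 lines is arbitrary.

```shell
# Hypothetical helper: print the last N journal lines of the previous boot.
# Assumes a persistent journal; N defaults to 50.
prev_boot_tail() {
    n="${1:-50}"
    journalctl -b -1 -n "$n" --no-pager
}

# Usage on the host:
#   journalctl --list-boots    # see which boots the journal still holds
#   prev_boot_tail 100         # last 100 lines before the most recent reboot
```

If the output ends abruptly without any shutdown messages, that matches the picture of a hard crash rather than a clean reboot.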
 
once during backup and once while emailing root@pam
IDK your HW, but assuming the email is a backup notification, the crash happens either during the backup or just after it. Since the logs show no further entries after that point, this looks like a hard crash that takes down the whole system, triggered by the backup. I would check, in this order: cooling, RAM, and PSU.
 
One of the VMs runs TrueNAS, which produces a good log. There I could not see any correlation to load or exact time. Once the machine died shortly after 2h with hardly any load on the system, and the other time it died shortly after 3h during backup, when the TrueNAS VM was under significant load. If it were a cooling issue, it would more likely die during the day, when I work on it and there is more going on. The power supply is not a likely cause.
Is there a way to run a test cycle on the RAM?
 
You mention in your original post:
The problem has existed since an automated update on October 29th.
1. How long was the system running/working correctly before these updates?

2. What automated update do you refer to? PVE (by default) doesn't automatically update itself.

3. What convinces you that it is linked to these updates?
 
I was wrongly assuming that PVE updates itself automatically, which I have now found out is not the case. So I did a complete update cycle yesterday.
During the night it crashed again during backups, but I am now convinced that a stupid misconfiguration in the backup policy is the actual root cause.
I changed the config, and the next few days will show whether I was right. Should it crash again, I will need to run memtest for further investigation.
 
Are you willing to share some more details on what the misconfiguration was and how you resolved it?

This would allow us to reproduce the problem and find the underlying issue leading to the crash. Also, how does the system crash? Is it the PVE host or the VM that crashes? Are there any errors related to the crash in the systemd journal?
 
It was a pretty stupid mistake, to be honest:
One of the VMs runs TrueNAS, which has direct access to an SSD controller card via PCI passthrough. The TrueNAS boot directory is on the regular Proxmox NVMe. One of the TrueNAS disks is mounted in Proxmox itself to store the VM backups.
I accidentally added TrueNAS to the list of VMs to back up at night. This brought the whole system to an uncontrolled stop, and I could only hard reset it.
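
To double-check a setup like this, a small sketch that lists the VMs with a PCI passthrough device, so they can be compared against the VMs selected in the backup jobs. It assumes the VM configs live in `/etc/pve/qemu-server` and that passthrough shows up as `hostpciN:` lines; `list_passthrough_vms` is a hypothetical helper name.

```shell
# Hypothetical helper: print the IDs of VMs whose current config has a
# hostpciN entry. Snapshot sections (starting with "[name]") are ignored.
# The directory defaults to the usual PVE location and can be overridden.
list_passthrough_vms() {
    conf_dir="${1:-/etc/pve/qemu-server}"
    for f in "$conf_dir"/*.conf; do
        [ -e "$f" ] || continue
        # Stop at the first snapshot section; succeed only if a hostpci
        # line was seen before it.
        if awk '/^\[/ { exit } /^hostpci[0-9]+:/ { found = 1; exit } END { exit !found }' "$f"; then
            basename "$f" .conf
        fi
    done
}

# Usage on the host:
#   list_passthrough_vms
```

Any ID it prints is worth a second look in the backup job selection under Datacenter -> Backup, especially when the backup storage itself depends on that VM.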
 
Looks like you got it solved.
Maybe mark this thread as solved. At the top of the thread, choose the Edit thread button, then from the (no prefix) dropdown choose Solved.
I guess the Title of the thread looks misleading.
 
