Host reboots after few seconds restoring Backup

Firlefanz

New Member
Dec 21, 2023
Hey everyone,

I got some new hardware for my Proxmox host. The old one ran flawlessly for about 3 years.
While trying to restore backups from my external NAS, the new host reboots without any errors shown (--Reboot--) after a few seconds of fetching data from the NAS.
It's an ASRock N100-ITX with Proxmox 8.1.3 on it. It idles for long stretches without any issues; the reboot occurs only while restoring backups.

I already tried:
- switching the PSU
- running memtest for 2 h without errors
- switching the RAM
- trying different backups (which are not broken; they restore on the old hardware as they should)

but nothing helps.
I'm not that familiar with Linux, so if you need any logs/details, just let me know how I can get them for you.

Thx all!
 
Hi,
is that server part of a cluster?
If so, you might be overloading the network to a point where corosync (cluster engine) can't sync with other nodes and fences the node (hard reboot)[0]. In the restore dialog you can set a bandwidth limit to circumvent the overload.
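The bandwidth limit can also be set when restoring from the CLI. A minimal sketch: the GUI field takes MiB/s, while `pct restore` takes `--bwlimit` in KiB/s, so a conversion is needed (the VMID, archive name, and target storage below are taken from this thread; treat them as illustrative):

```shell
# The GUI limit is in MiB/s; pct restore expects --bwlimit in KiB/s.
LIMIT_MIB=10
LIMIT_KIB=$((LIMIT_MIB * 1024))
echo "bwlimit: ${LIMIT_KIB} KiB/s"

# On the PVE host (not run here), something like:
# pct restore 104 BackupNAS:backup/vzdump-lxc-104-2023_12_21-00_17_13.tar.zst \
#     --storage local-lvm --bwlimit "$LIMIT_KIB"
```

A CLI restore also has the advantage that any error message stays on the terminal instead of being lost when the web session dies.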

The log of corosync would be interesting, you can get that by running journalctl -eu corosync as root

[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#ha_manager_fencing
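A quick way to tell whether the node is clustered at all is the same condition the corosync unit checks, a sketch:

```shell
# corosync only starts if this config file exists
# (the unit's ConditionPathExists check).
if [ -f /etc/corosync/corosync.conf ]; then
    echo "clustered"
    # pvecm status            # cluster membership details (PVE host only)
    # journalctl -eu corosync # corosync log
else
    echo "standalone"
fi
```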
 
It's not part of a cluster, only a single Server.
Code:
Dec 21 11:12:09 Proxmox systemd[1]: corosync.service - Corosync Cluster Engine was skipped because of an unmet condition check (ConditionPathExists=/etc/corosync/corosync.conf).

I tried limiting the bandwidth to 80 MiB/s and then 10 MiB/s; it just took a bit longer until the system rebooted.

Maybe the mainboard is faulty; that's the only component I haven't swapped in this setup. But it somehow looks like a software problem :confused:
 
If I understood you correctly so far, the reboot occurs while the backup is still being transferred to the local machine, not when the VM actually gets started?
Even if no errors are logged, there may be some early hints at what is causing it. I'd look at the logs from the last boot with journalctl -b-1 and at PVE's logs with journalctl -eu pvedaemon.
Also, I would try to reproduce it in different ways: running a large file transfer, restoring a container instead of a VM, backing up and restoring a tiny container like Alpine Linux, and so on.
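One way to reproduce the sustained I/O load without a restore is to copy a large file from the NAS mount to local storage and see whether the host survives. A sketch, using paths and an archive name from this thread (the real command is commented out; the runnable part below demonstrates the same copy-and-verify idea on a locally generated 64 MiB file):

```shell
# On the host (not run here):
#   dd if=/mnt/pve/BackupNAS/dump/vzdump-lxc-104-2023_12_21-00_17_13.tar.zst \
#      of=/tmp/transfer-test bs=1M status=progress

# Self-contained stand-in: generate a 64 MiB file and copy it the same way.
dd if=/dev/zero of=/tmp/src.bin bs=1M count=64 status=none
dd if=/tmp/src.bin of=/tmp/transfer-test bs=1M status=none
ls -l /tmp/transfer-test
```

If a plain file copy also reboots the host, the restore logic (tar/zstd) is ruled out and the problem sits in the network or storage path.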
 
Thx for your answers.

I tried backups of 2 TB, 40 GB, 9 GB, and 7 GB. Only the 400 MB one was restored correctly.
Code:
()
recovering backed-up configuration from 'BackupNAS:backup/vzdump-lxc-104-2023_12_21-00_17_13.tar.zst'
  Logical volume "vm-104-disk-0" created.
Creating filesystem with 2097152 4k blocks and 524288 inodes
Filesystem UUID: 6028aa4d-d666-4120-bd23-bc093c2e2097
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
restoring 'BackupNAS:backup/vzdump-lxc-104-2023_12_21-00_17_13.tar.zst' now..
extracting archive '/mnt/pve/BackupNAS/dump/vzdump-lxc-104-2023_12_21-00_17_13.tar.zst'
Total bytes read: 1746862080 (1.7GiB, 135MiB/s)
merging backed-up and given configuration..
TASK OK

The "Total bytes read" part was never reached previously. You are right: it crashes after the restore job starts, some time into "extracting archive".

Dec 21 12:41:18 Proxmox pvedaemon[945]: starting server
Dec 21 12:41:18 Proxmox pvedaemon[945]: starting 3 worker(s)
Dec 21 12:41:18 Proxmox pvedaemon[945]: worker 946 started
Dec 21 12:41:18 Proxmox pvedaemon[945]: worker 947 started
Dec 21 12:41:18 Proxmox pvedaemon[945]: worker 948 started
Dec 21 12:41:18 Proxmox systemd[1]: Started pvedaemon.service - PVE API Daemon.
Dec 21 12:41:56 Proxmox pvedaemon[948]: <root@pam> successful auth for user 'root@pam'
Dec 21 12:42:08 Proxmox pvedaemon[1150]: starting termproxy UPID:proxmox:0000047E:0000229F:65842490:vncshell::root@pam:
Dec 21 12:42:08 Proxmox pvedaemon[946]: <root@pam> starting task UPID:proxmox:0000047E:0000229F:65842490:vncshell::root@pam:
Dec 21 12:42:08 Proxmox pvedaemon[947]: <root@pam> successful auth for user 'root@pam'
Dec 21 12:42:09 Proxmox login[1157]: pam_unix(login:session): session opened for user root(uid=0) by (uid=0)
Dec 21 12:56:57 Proxmox pvedaemon[947]: <root@pam> successful auth for user 'root@pam'
Dec 21 13:03:29 Proxmox pvedaemon[946]: <root@pam> end task UPID:proxmox:0000047E:0000229F:65842490:vncshell::root@pam: OK
Dec 21 13:03:48 Proxmox pvedaemon[4701]: starting termproxy UPID:proxmox:0000125D:00021E5C:658429A4:vncshell::root@pam:
Dec 21 13:03:48 Proxmox pvedaemon[948]: <root@pam> starting task UPID:proxmox:0000125D:00021E5C:658429A4:vncshell::root@pam:
Dec 21 13:03:48 Proxmox pvedaemon[946]: <root@pam> successful auth for user 'root@pam'
Dec 21 13:03:48 Proxmox login[4704]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)

The log from "journalctl -b-1" is 1300 lines long, which I can't copy with nano/micro/xclip. I just don't know how to get it to you while logged into the PVE Web-Shell.
 
You could just SSH into the machine. The user and password are the same as in the web UI if you are using the default realm.
You could also create the journal file with journalctl -b-1 > journal.txt and use FileZilla to pull the file via SFTP; the login credentials are again the same as the UI's.
Since the crash happens at the extraction stage I'd suspect that the system doesn't handle prolonged burst loads well, you could try to trigger it with a stress test:
Bash:
# Install stress-ng
apt install stress-ng

# Run CPU stress test
stress-ng --cpu 0 -t 5m

# Run IO stress test
stress-ng --io 0 -t 5m

With the small restore working, it could also be something as trivial as overheating due to a mis-mounted cooler or missing thermal paste. In that case, the CPU stress test should crash the system as well.
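A restore hits CPU, memory, and disk at once, so a combined stress run mimics it more closely than the single-subsystem tests above. A sketch, assuming stress-ng and lm-sensors are installed (the duration arithmetic is the only part run here; the host commands are commented):

```shell
# Derive the stress-ng timeout from a duration in minutes.
DURATION_MIN=10
DURATION_S=$((DURATION_MIN * 60))
echo "running for ${DURATION_S}s"

# On the host (not run here):
#   apt install lm-sensors stress-ng
#   stress-ng --cpu 0 --io 4 --vm 2 --vm-bytes 75% -t "${DURATION_S}s" &
#   watch -n 2 sensors   # keep an eye on package temperature while it runs
```

If combined load crashes the box where the individual tests did not, that points at the power delivery or thermals rather than any single component.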
 
Thank you, I attached the file.

Both stress tests ran through without an issue.
But oddly, the server somehow rebooted before the stress tests, after 1 h 30 min in idle. Maybe the failed FTP attempts, I don't know.
 

Attachments

  • journalctl-b-1.txt (129 KB)
The last two lines are kinda interesting: that service only logs its update time when it takes suspiciously long. That tells us there probably is some issue, but not what it actually is.
Code:
Dec 21 12:39:28 Proxmox pvestatd[896]: status update time (5.528 seconds)
Dec 21 12:39:37 Proxmox pvestatd[896]: status update time (5.374 seconds)

While it's unlikely to show anything, since the system hard-rebooted, running last reboot might reveal something, but I haven't actively used it until now.

At this point I'm pretty sure it's a hardware error, but which component exactly is hard to say. You can run a test on the hard drives with S.M.A.R.T. [0].
Enabling microcode updates might help if it's a CPU firmware bug that has already been resolved [1]. If you haven't already, updating the mainboard's firmware is always a good idea; it's not unheard of that problems like this have been solved by that.

[0] https://www.fosslinux.com/111631/us...-to-check-the-health-of-your-hdds-or-ssds.htm
[1] https://wiki.debian.org/Microcode
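A minimal SMART sketch, assuming smartmontools and that the system disk is /dev/sda (an assumption; check with lsblk). Only the attribute-parsing demonstration at the bottom runs here, on a sample line of typical smartctl output:

```shell
# On the host (not run here):
#   apt install smartmontools
#   smartctl -H /dev/sda        # overall health verdict
#   smartctl -t long /dev/sda   # start an extended self-test
#   smartctl -a /dev/sda        # full attribute table once the test is done

# Reading the raw value (last column) from a typical attribute line;
# a non-zero Reallocated_Sector_Ct is an early failure sign:
line="  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"
echo "$line" | awk '{print $NF}'
```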
 
Thank you for your patience.

I tried the microcode update before; SMART showed no issues.
Tried another HDD, and it crashed several times while installing Proxmox.

I've sent the mainboard back and am waiting for a new one.
I'll report how the new one works ;-)
Happy holidays!
 
Unfortunately, the new mainboard has the same issue..
But I got a new error, so there is some kind of progress.

recovering backed-up configuration from 'BackupNAS:backup/vzdump-lxc-112-2024_01_05-00_21_09.tar.zst'
Logical volume "vm-101-disk-0" created.
Creating filesystem with 20971520 4k blocks and 5242880 inodes
Filesystem UUID: 806bda64-b721-402a-9482-876eca026189
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000
restoring 'BackupNAS:backup/vzdump-lxc-112-2024_01_05-00_21_09.tar.zst' now..
extracting archive '/mnt/pve/BackupNAS/dump/vzdump-lxc-112-2024_01_05-00_21_09.tar.zst'
/*stdin*\ : Decoding error (36) : Restored data doesn't match checksum
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Logical volume "vm-101-disk-0" successfully removed.
TASK ERROR: unable to restore CT 101 - command 'tar xpf - --zstd --totals --one-file-system -p --sparse --numeric-owner --acls --xattrs '--xattrs-include=user.*' '--xattrs-include=security.capability' '--warning=no-file-ignored' '--warning=no-xattr-write' -C /var/lib/lxc/101/rootfs --skip-old-files --anchored --exclude './dev/*'' failed: exit code 2

The 400 MB recovery went through; bigger backups still won't work :(
Anyone got a recommendation? Reinstall Proxmox, maybe?
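The checksum error could mean the archive is bad on the NAS, or that it gets corrupted in transit. One way to tell them apart is to test decompression of the archive in place and again after copying it locally. A sketch using the archive name from the error above (the zstd commands are commented; the runnable part demonstrates the same integrity-test idea with a gzip archive, since zstd may not be installed everywhere):

```shell
# On the host (not run here):
#   zstd -t /mnt/pve/BackupNAS/dump/vzdump-lxc-112-2024_01_05-00_21_09.tar.zst
#   cp /mnt/pve/BackupNAS/dump/vzdump-lxc-112-2024_01_05-00_21_09.tar.zst /tmp/
#   zstd -t /tmp/vzdump-lxc-112-2024_01_05-00_21_09.tar.zst
# If the local copy passes while reading from the NAS fails, suspect the
# network path (NIC, cable, switch) rather than the backup itself.

# Stand-in demonstration: build an archive, then verify it can be listed.
echo "payload" > /tmp/f1
tar -czf /tmp/test.tar.gz -C /tmp f1
tar -tzf /tmp/test.tar.gz >/dev/null && echo "archive ok"
```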

EDIT: Redid some tests on the HDD, maybe that's the issue?
(screenshot of HDD test results attached)
 
