Restoring from backup powers off server

Brendon

Active Member
I recently ordered three new servers for a staging cluster (hosting provider with dedicated hardware). They provided me with the following:

01 has an ASRockRack model E3C252D4U-2T mobo.

02 and 03 have GIGABYTE model MX33-BSA-V1 mobos.

All three have:

Intel Xeon-E 2386G - 6c/12t - 3.5 GHz/4.7 GHz

64 GB ECC 3200 MHz

4×960 GB NVMe SSD (Samsung PM9A3, direct PCIe connections of course)

Intel Corporation Ethernet Controller X710 for 10GBASE-T

Latest PVE 7. (I'm going to use this cluster to test upgrading to 8.)

ZFS is striped mirrors (RAID 10).
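
(Conceptually, the striped mirrors look something like the sketch below; the pool and device names are just placeholders, not the real ones.)

Code:
# conceptual sketch of the RAID 10 layout only -- pool/device names are placeholders
zpool create tank \
  mirror /dev/disk/by-id/nvme-disk0 /dev/disk/by-id/nvme-disk1 \
  mirror /dev/disk/by-id/nvme-disk2 /dev/disk/by-id/nvme-disk3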

All three nodes are in a cluster, but no HA/fencing is configured.

Code:
# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

I restored a VM from an NFS backup share to 01. Worked fine. VM is up and running.

Then I restored a VM to 02. Within seconds of the restore process starting, the server shuts off. 03 does the same thing. I have to power them back on via the IPMI interface.
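
For reference, the restores are nothing special; on the CLI it boils down to something like this (the archive path, VMID and storage name below are placeholders):

Code:
# restore a vzdump archive from the NFS share to local ZFS storage
# (archive name, VMID 100 and storage "local-zfs" are placeholders)
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-100-2023_08_20-02_00_00.vma.zst 100 --storage local-zfs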

No logs anywhere to be found: nothing in the system logs, no stack traces, nothing on the console, etc. (outside of the Corosync logs when the node dies, of course).

The only log I've been able to find that references the power event is the system event log (SEL) via IPMI:

Code:
30 | 08/23/2023 | 21:22:13 | Power Unit #0xff | Power off/down | Asserted
31 | 08/23/2023 | 21:22:59 | Power Unit #0xff | Power off/down | Deasserted
32 | 08/23/2023 | 21:22:59 | System Event | OEM System boot event | Asserted
33 | 08/23/2023 | 21:23:35 | OS Boot #0x22 | boot completed - device not specified | Asserted
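
(That output is from the SEL; something along these lines should dump it, either locally on the node or remotely against the BMC. The address and credentials below are placeholders.)

Code:
# locally on the node (needs the ipmi_si/ipmi_devintf kernel modules)
ipmitool sel list
# or remotely against the BMC -- host, user and password are placeholders
ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret sel list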

This is the last thing the server logs before shutting off, and it looks like it happens on any restore:

Code:
2023-08-23T20:19:59.435888-04:00 staging-vmh02 kernel: [ 2037.724296] debugfs: Directory 'zd0' with parent 'block' already present!
2023-08-23T20:20:00.451914-04:00 staging-vmh02 kernel: [ 2038.740896] debugfs: Directory 'zd16' with parent 'block' already present!

Things I've tried (rough commands are sketched after this list):

- Network test with iperf3. All good.

- CPU burn-in test with sysbench. All good.

- Disk stress test with fio. All good.

- Restore VM to 02 from local storage. Powers off. Doesn't look like it's network/NFS related.

- Restore VM to 01 and migrate to 02/03. Works fine.
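
In case it helps anyone, the tests were roughly along these lines (hostnames, the fio target directory and the exact parameters are placeholders / from memory, not the literal commands):

Code:
# network: iperf3 server on one node, client on another (hostname is a placeholder)
iperf3 -s                              # on staging-vmh01
iperf3 -c staging-vmh01 -t 60          # on staging-vmh02

# CPU burn-in with sysbench
sysbench cpu --threads=12 --time=600 run

# disk stress with fio (target directory on the ZFS pool is a placeholder)
fio --name=stress --directory=/rpool/fio-test --rw=randrw --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=4 --size=8G \
    --runtime=600 --time_based --group_reporting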

At this point I'm at a loss. I've never seen anything like this. I've reached out to the hosting provider of course but I also wanted to post here to see if anyone has seen something like this before.

We have a production cluster that has been running fine for years with nearly the exact same configuration. Very strange!
 
Just ran Memtest86+ and both servers pass.

I keep thinking this has to be a hardware issue but it ONLY happens when I try to restore a Proxmox backup. I have the hosting provider looking into it now.
 
The hosting provider is swapping the mobos on the two faulty servers to ASRockRack E3C252D4U-2T. They've already done one, and everything is now working as expected. All three servers will then have the same mobos.

While I'm waiting for them to swap the remaining one, I pulled that node from the cluster and rebuilt it with PVE 8 (standalone). It does the same thing: within a few seconds of restoring from backup, it shuts off. No logs anywhere. It's as if the power is cut.
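
For what it's worth, after each power-off I also check the previous boot's journal (this only works if the journal is persistent), and there's nothing of interest right up to the point the power is cut:

Code:
# list the boots the journal knows about
journalctl --list-boots
# show the tail end of the previous boot's journal
journalctl -b -1 -e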

I really wish I could figure out what it is about the restore process that causes this! I've tried everything I can think of to trigger it some other way, but no luck so far.
 
Hi Brendon,

You're not the only one; I'm seeing a similar problem here with the same motherboard (Gigabyte MX33-BSA-V1F) and ZFS striped mirrors. Very frustrating.

In my case I have 10 PVE nodes, 3 of which have this motherboard, and I've been experiencing different issues with them, all related to unexpected power-offs:

- Node 1: It started to power off unexpectedly. The hosting provider changed the PSU and for now it works OK.
- Node 2: It started to power off / hang unexpectedly. The hosting provider changed the RAM and for now it works OK.
- Node 3: It started to power off unexpectedly, mostly when receiving replications. The hosting provider changed the PSU, but the power-off issues keep happening. I've isolated the node and tried to reinstall it from scratch, but it powers off during the process. And the weirdest thing is that I'm using the same ISO I used 3 months ago when the server was delivered.

I have other nodes in the same 10-node cluster with Asus P12R-M/10G-2T and ASRockRack E3C252D4U-2T boards and no problems at all. Judging from the motherboard models, I think we share the same hosting provider...

Have you managed to fix it completely?
 
Have you managed to fix it completely?
After they replaced the boards with ASRockRack E3C252D4U-2T, everything has been solid. It's a real shame, because those Gigabyte boards are otherwise great: they boot faster and have better, more stable IPMI. I'm thinking they have some sort of hardware fault, though; I'm not sure what else it could be at this point. I was never able to recreate this issue outside of restoring a PVE backup!

If I were you I'd have them replace the boards and be done. Hopefully they figure out what's causing this!
 
Thanks for your update. I agree that it may be a hardware fault, but it's very hard to replicate.

I'm waiting for the motherboard replacements. Let's see how it goes. I've also got some nodes with the GIGA MX33-BS1-V1, but so far I've had no problems with those.
 
