I recently ordered three new servers for a staging cluster (hosting provider with dedicated hardware). They provided me with the following:
01 has an ASRock Rack E3C252D4U-2T motherboard.
02 and 03 have GIGABYTE MX33-BSA-V1 motherboards.
All three have:
Intel Xeon E-2386G - 6c/12t - 3.5 GHz base / 4.7 GHz boost
64 GB ECC 3200 MHz
4× 960 GB NVMe SSDs (Samsung PM9A3, direct PCIe connections of course)
Intel Corporation Ethernet Controller X710 for 10GBASE-T
Latest PVE 7. (I'm going to use this cluster to test upgrading to 8.)
ZFS is striped mirrors (RAID 10).
All three nodes in a cluster but no HA/fencing is configured.
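For anyone unfamiliar with the layout: "striped mirrors" means two mirrored pairs striped together, which is ZFS's equivalent of RAID 10. The PVE installer builds this for you, but done by hand it would look roughly like this (device names are illustrative, not the actual ones on these boxes):

```shell
# Hypothetical device names; adjust to your system.
# Two 2-way mirror vdevs striped into one pool = ZFS "RAID 10".
zpool create rpool \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1
```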
I restored a VM from an NFS backup share to 01. Worked fine. VM is up and running.
Then I restored a VM to 02. Within seconds of the restore process starting, the server shuts off. 03 does the same thing. I have to power them back on via the IPMI interface.
No logs anywhere to be found: nothing in the system logs, no stack traces, nothing on the console, etc. (outside of the Corosync logs when the node dies, of course).
The only logs I've been able to find that reference power is the system event log via IPMI:
30 | 08/23/2023 | 21:22:13 | Power Unit #0xff | Power off/down | Asserted
31 | 08/23/2023 | 21:22:59 | Power Unit #0xff | Power off/down | Deasserted
32 | 08/23/2023 | 21:22:59 | System Event | OEM System boot event | Asserted
33 | 08/23/2023 | 21:23:35 | OS Boot #0x22 | boot completed - device not specified | Asserted
This is the last thing the server logs before shutting off, and it seems to happen on every restore:
2023-08-23T20:19:59.435888-04:00 staging-vmh02 kernel: [ 2037.724296] debugfs: Directory 'zd0' with parent 'block' already present!
2023-08-23T20:20:00.451914-04:00 staging-vmh02 kernel: [ 2038.740896] debugfs: Directory 'zd16' with parent 'block' already present!
Things I've tried:
- Network test with iperf3. All good.
- CPU burn-in test with sysbench. All good.
- Disk stress test with fio. All good.
- Restore VM to 02 from local storage. Powers off, so it doesn't look like it's network/NFS related.
- Restore VM to 01 and migrate to 02/03. Works fine.
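For reference, the burn-in commands were along these lines (hostnames, durations, and the fio target path are from memory / illustrative):

```shell
# Network: iperf3 server running on 02 ("iperf3 -s"), pushed traffic from 01.
iperf3 -c staging-vmh02 -t 60

# CPU: load all 12 threads for ten minutes.
sysbench cpu --threads=12 --time=600 run

# Disk: sustained random writes against the ZFS pool.
fio --name=stress --directory=/rpool/data --rw=randwrite \
    --bs=4k --size=4G --numjobs=4 --iodepth=32 \
    --runtime=600 --time_based
```

None of these ever triggered the power-off, which is part of what makes the restore-only failure so strange.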
At this point I'm at a loss. I've never seen anything like this. I've reached out to the hosting provider of course but I also wanted to post here to see if anyone has seen something like this before.
We have a production cluster that has been running fine for years with nearly the exact same configuration. Very strange!
Code:
# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1