I recently ordered three new servers for a staging cluster (hosting provider with dedicated hardware). They provided me with the following:
01 has an ASRock Rack E3C252D4U-2T motherboard.
02 and 03 have GIGABYTE MX33-BSA-V1 motherboards.
All three have:
Intel Xeon E-2386G - 6c/12t - 3.5 GHz base / 4.7 GHz boost
64 GB ECC 3200 MHz
4× 960 GB NVMe SSDs (Samsung PM9A3, direct PCIe connections of course)
Intel Corporation Ethernet Controller X710 for 10GBASE-T
Latest PVE 7. (I'm going to use this cluster to test upgrading to 8.)
ZFS is striped mirrors (RAID 10).
All three nodes in a cluster but no HA/fencing is configured.
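For anyone unfamiliar with the layout: "striped mirrors" means two mirrored pairs striped together, which is ZFS's equivalent of RAID 10. The PVE installer builds this for you, but done by hand it would look roughly like this (device names are illustrative, not the actual ones on these boxes):

```shell
# Hypothetical device names; adjust to your system.
# Two 2-way mirror vdevs striped into one pool = ZFS "RAID 10".
zpool create rpool \
  mirror /dev/nvme0n1 /dev/nvme1n1 \
  mirror /dev/nvme2n1 /dev/nvme3n1
```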
I restored a VM from an NFS backup share to 01. Worked fine. VM is up and running.
Then I restored a VM to 02. Within seconds of the restore process starting, the server shuts off. 03 does the same thing. I have to power them back on via the IPMI interface.
No logs anywhere to be found: nothing in the system logs, no stack traces, nothing on the console, etc. (outside of the Corosync logs when the node dies, of course).
The only logs I've been able to find that reference power is the system event log via IPMI:
30 | 08/23/2023 | 21:22:13 | Power Unit #0xff | Power off/down | Asserted
31 | 08/23/2023 | 21:22:59 | Power Unit #0xff | Power off/down | Deasserted
32 | 08/23/2023 | 21:22:59 | System Event | OEM System boot event | Asserted
33 | 08/23/2023 | 21:23:35 | OS Boot #0x22 | boot completed - device not specified | Asserted
This is the last thing the server logs before shutting off, and it seems to happen on every restore:
2023-08-23T20:19:59.435888-04:00 staging-vmh02 kernel: [ 2037.724296] debugfs: Directory 'zd0' with parent 'block' already present!
2023-08-23T20:20:00.451914-04:00 staging-vmh02 kernel: [ 2038.740896] debugfs: Directory 'zd16' with parent 'block' already present!
Things I've tried:
- Network test with iperf3. All good.
- CPU burn-in test with sysbench. All good.
- Disk stress test with fio. All good.
- Restore VM to 02 from local storage. Powers off, so it doesn't look like it's network/NFS related.
- Restore VM to 01 and migrate to 02/03. Works fine.
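For reference, the burn-in commands were along these lines (hostnames, durations, and the fio target path are from memory / illustrative):

```shell
# Network: iperf3 server running on 02 ("iperf3 -s"), pushed traffic from 01.
iperf3 -c staging-vmh02 -t 60

# CPU: load all 12 threads for ten minutes.
sysbench cpu --threads=12 --time=600 run

# Disk: sustained random writes against the ZFS pool.
fio --name=stress --directory=/rpool/data --rw=randwrite \
    --bs=4k --size=4G --numjobs=4 --iodepth=32 \
    --runtime=600 --time_based
```

None of these ever triggered the power-off, which is part of what makes the restore-only failure so strange.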
At this point I'm at a loss. I've never seen anything like this. I've reached out to the hosting provider of course but I also wanted to post here to see if anyone has seen something like this before.
We have a production cluster that has been running fine for years with nearly the exact same configuration. Very strange!
Code:
# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.102-1-pve: 5.15.102-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1