Howdy folks, I felt compelled to report this (and ask for help) as I have run out of ideas and noticed that another user is having a similar issue.
One of my nodes has begun restarting with no warning or logged info.
The system is a 6-NIC, fanless network appliance with an i7-8565U CPU, 32 GB DDR4 RAM (2x16 GB) and a pair of 250 GB SSDs in a ZFS Raidz1 mirror.
I only noticed this issue about 3 weeks ago when I upgraded to PVE 8. I didn't notice it before that, but I have been interacting with it much more since the upgrade.
At first it seemed to be random but since then I have found a few correlations:
- Updating the system (Updates > Upgrade) has triggered it at least three times.
- Creating large files full of random data on the local drive often (but not always) triggers it. I've also tested doing this inside a CT and it still triggers the restarts.
- Importing a large ISO (>4 GB) via the web UI to the local drive almost always triggers it, always at the point where the system copies the file from the temp directory to the ISO store (command: /usr/bin/scp -o BatchMode=yes -p -- /var/tmp/pveupload...:/var/lib/vz/template/iso/upload.iso)
So far I have tried the following:
- Removing each of the disk drives in turn and booting the system in a degraded ZFS state. I was able to cause a reboot with either disk on its own using the ISO upload method. This rules out the two disks.
- Running memtest for 8 hours, which found no errors.
- Running a CPU stress test to check the CPU and also to force the system into high power draw, thereby indirectly testing the power brick.
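In case anyone wants to try reproducing the random-data trigger, here is a minimal sketch of that kind of test in Python. It is not the exact method I used; the function name `write_random_file`, the chunk size and the example path are all assumptions for illustration. It syncs after every chunk and prints a timestamped progress line, so the last line printed before a restart gives a rough idea of how much had been written.

```python
import os
import time


def write_random_file(path, total_bytes, chunk_bytes=64 * 1024 * 1024):
    """Write total_bytes of random data to path, syncing after each chunk.

    Returns the number of bytes written. All names and defaults here are
    illustrative assumptions, not part of any Proxmox tooling.
    """
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            size = min(chunk_bytes, total_bytes - written)
            f.write(os.urandom(size))
            # Push the chunk out to disk so the load actually reaches storage
            f.flush()
            os.fsync(f.fileno())
            written += size
            stamp = time.strftime("%H:%M:%S")
            print(f"{stamp} wrote {written} of {total_bytes} bytes", flush=True)
    return written


# Example call (hypothetical path and size, adjust to suit):
# write_random_file("/var/lib/vz/random-test.bin", 8 * 1024**3)
```

Running it over SSH with the output redirected to a file on another machine should keep the progress lines even if the node goes down mid-write.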
Your thoughts and advice would be most welcome at this stage as I have completely run out of ideas and I can no longer consider this node to be stable.