I am investigating two incidents that seem to be related. On two different PVE servers, both running ZFS on a 2-disk mirror (RAID1), I get random reboots during backups of KVM machines. (No clustering, single hosts.)
During a backup the host simply reboots, with no indication of why; the log excerpt further down shows the typical pattern.
Both servers are low on RAM. One has 8 GB of RAM and a single VM with 1 GB allocated to it. The other has 64 GB of RAM and several VMs using between 30 GB (minimum) and 60 GB (maximum) in total.
The logs show nothing useful. A typical excerpt:
Code:
Apr 26 22:00:48 pve vzdump[1618]: INFO: Finished Backup of VM 101 (00:00:14)
Apr 26 22:00:49 pve vzdump[1618]: INFO: Starting Backup of VM 102 (qemu)
Apr 26 22:00:49 pve qm[1794]: <root@pam> update VM 102: -lock backup
Apr 26 22:02:11 pve kernel: [ 0.000000] Initializing cgroup subsys cpuset
Apr 26 22:02:11 pve kernel: [ 0.000000] Initializing cgroup subsys cpu
Apr 26 22:02:11 pve kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Apr 26 22:02:11 pve kernel: [ 0.000000] Linux version 4.4.35-1-pve (root@elsa) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Fri Dec 9 11:09:55 CET 2016 ()
I believe the hardware (disks, RAM) is OK; the RAM is ECC, of course.
I suspect ZFS is eating up enough RAM to make the host reboot, but I don't know whether PVE actually has some mechanism that reboots the host when it enters a critically unstable state.
I have googled all day long, learning about ZFS ARC limits (I have not yet tried forcing them to a lower value, since the hosts are in production) and about claims that swapping on ZFS, as PVE does, is a really bad idea.
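For reference, the ARC cap in ZFS on Linux is the `zfs_arc_max` module parameter, expressed in bytes. Below is a minimal sketch of the arithmetic only. The 2 GiB and 8 GiB caps are hypothetical values picked for illustration, not recommendations, and the helper names are made up for this example.

```python
GIB = 1024 ** 3  # zfs_arc_max is given in bytes


def arc_max_bytes(cap_gib: float) -> int:
    """Convert an ARC cap in GiB to the byte value zfs_arc_max expects."""
    return int(cap_gib * GIB)


def modprobe_line(cap_gib: float) -> str:
    """Build the options line for a modprobe file such as /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_max_bytes(cap_gib)}"


# Hypothetical caps (assumptions): 2 GiB on the 8 GB host, 8 GiB on the 64 GB host.
for host_ram_gb, cap_gib in ((8, 2), (64, 8)):
    print(f"{host_ram_gb} GB host: {modprobe_line(cap_gib)}")
```

The resulting line would typically go in a modprobe options file and take effect when the module is next loaded. The same byte value can also be written to /sys/module/zfs/parameters/zfs_arc_max at runtime. On a host that boots from ZFS, the initramfs probably needs to be regenerated as well.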
I came up with two credible (but still only theoretical) scenarios:
1- ZFS eats up memory under backup load, the host starts swapping for lack of RAM, swapping to ZFS only makes things worse, and the host somehow exhausts all memory (or enters some sort of deadlock) and reboots.
2- ZFS eats up memory under backup load and, regardless of any swap issue, exhausts all memory (or enters some sort of deadlock), and the host reboots.
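One way to tell the two scenarios apart might be to record ARC size, available memory and swap usage every few seconds during the backup window, and to look at the last samples written before the reboot. The sketch below assumes the usual Linux locations /proc/spl/kstat/zfs/arcstats and /proc/meminfo. The sample text and all function names are invented for illustration.

```python
import os
import time

# Made-up sample data in the shape of the real files (assumption, not taken from the hosts).
ARC_SAMPLE = (
    "13 1 0x01 96 26112 5063468927 2040484869813\n"
    "name type data\n"
    "hits 4 100\n"
    "size 4 16106127360\n"
    "c_max 4 17179869184\n"
)
MEM_SAMPLE = (
    "MemTotal:       32768000 kB\n"
    "MemAvailable:    2097152 kB\n"
    "SwapTotal:       8388608 kB\n"
    "SwapFree:          83886 kB\n"
)


def parse_arcstats(text: str) -> dict:
    """Keep every 'name type data' row whose data column is an integer (bytes or counts)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def parse_meminfo(text: str) -> dict:
    """Parse 'Field:   value kB' lines into {field: value}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def sample_line(arc_text: str, mem_text: str, now=None) -> str:
    """Format one timestamped sample: ARC size/limit, available memory, swap in use (MiB)."""
    arc = parse_arcstats(arc_text)
    mem = parse_meminfo(mem_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    swap_used_kb = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    return (
        f"{stamp} arc_mib={arc.get('size', 0) // 2**20} "
        f"arc_max_mib={arc.get('c_max', 0) // 2**20} "
        f"mem_available_mib={mem.get('MemAvailable', 0) // 1024} "
        f"swap_used_mib={swap_used_kb // 1024}"
    )


def append_durably(path: str, line: str) -> None:
    """Append one line and fsync it, so the last sample has a chance to survive a hard reset."""
    with open(path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


print(sample_line(ARC_SAMPLE, MEM_SAMPLE))
```

On a real host the two text arguments would come from reading the files themselves in a loop, and the output file would ideally live on a disk outside the pool. Which pattern corresponds to which scenario is still a guess: swap filling up just before the reboot would point towards the first, memory collapsing with swap barely touched towards the second.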
I'm also baffled by the fact that another PVE host, with 32 GB of RAM and a lot of VMs, has 15 GB in use by ZFS and its 8 GB of swap 99% full (maybe swappiness is too high on PVE?), and still runs fine.
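To put numbers on the swappiness question, a short sketch such as the following could report swap usage as a percentage and read the current `vm.swappiness` value (the stock kernel default is 60). The function names are hypothetical, and the kB figures in the demo are invented to match "8 GB at 99%".

```python
from pathlib import Path


def swap_used_percent(swap_total_kb: int, swap_free_kb: int) -> float:
    """Percentage of swap in use, from the SwapTotal/SwapFree values of /proc/meminfo."""
    if swap_total_kb == 0:
        return 0.0
    return 100.0 * (swap_total_kb - swap_free_kb) / swap_total_kb


def read_swappiness(path: str = "/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness as an int, or None if the file is not present."""
    p = Path(path)
    return int(p.read_text().strip()) if p.exists() else None


# Invented figures: 8 GiB of swap with roughly 1% still free.
print(f"swap used: {swap_used_percent(8388608, 83886):.0f}%")
print(f"vm.swappiness: {read_swappiness()}")
```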
To sum up, I'm at a loss. I have hosts running fine under heavy memory usage, and hosts crashing while running backups. All hosts share the same basic setup: PVE 4.4.1 (latest V4 ISO with no updates), ZFS mirroring, and backups to a locally mounted single disk (ext4 formatted).
Would it be a good idea to log to a remote host, in the hope of capturing something more in the syslog at the moment of the crash and reboot? Locally I just don't see anything useful.
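As a rough illustration of the remote-logging idea, the sketch below sends messages to a remote syslog receiver over UDP using Python's standard `logging.handlers.SysLogHandler`. It only forwards what the script itself logs (for example periodic memory samples). Kernel messages would need the host's syslog daemon to forward them, or something like netconsole. The host name in the usage note is a placeholder, and `make_remote_logger` is a made-up helper name.

```python
import logging
import logging.handlers


def make_remote_logger(host: str, port: int = 514, name: str = "pve-heartbeat") -> logging.Logger:
    """Return a logger that sends each record as a UDP syslog datagram to host:port."""
    handler = logging.handlers.SysLogHandler(address=(host, port))  # UDP is the default transport
    handler.setFormatter(logging.Formatter("pve-heartbeat: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # do not also print through the root logger
    return logger
```

Usage would look like `log = make_remote_logger("loghost.example")` followed by `log.info("backup of VM 102 running")` at regular intervals. Since UDP is fire-and-forget, the last datagram received before the gap at least narrows down when the host died.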