Sudden reboots while running backups (maybe ZFS ram issue?)

Kurgan

Apr 27, 2018
I am investigating two incidents that seem to be similar. On two different PVE servers, both running ZFS on a two-disk mirror, I experience random reboots during backups of KVM machines. (No clustering, single hosts.)

Both servers are low on RAM. One has 8 GB RAM and a single VM with 1 GB RAM allocated to it. The other has 64 GB of RAM and some VMs using a total of 30 GB minimum and 60 GB maximum.

The logs do not show anything useful. A typical log looks like this:
Code:
Apr 26 22:00:48 pve vzdump[1618]: INFO: Finished Backup of VM 101 (00:00:14)
Apr 26 22:00:49 pve vzdump[1618]: INFO: Starting Backup of VM 102 (qemu)
Apr 26 22:00:49 pve qm[1794]: <root@pam> update VM 102: -lock backup
Apr 26 22:02:11 pve kernel: [    0.000000] Initializing cgroup subsys cpuset
Apr 26 22:02:11 pve kernel: [    0.000000] Initializing cgroup subsys cpu
Apr 26 22:02:11 pve kernel: [    0.000000] Initializing cgroup subsys cpuacct
Apr 26 22:02:11 pve kernel: [    0.000000] Linux version 4.4.35-1-pve (root@elsa) (gcc version 4.9.2 (Debian 4.9.2-10) ) #1 SMP Fri Dec 9 11:09:55 CET 2016 ()
As you can see, the host reboots during a backup with no indication as to why.

I believe the hardware (disks, RAM) is OK (the RAM is ECC, of course).

I suspect that ZFS is eating up enough RAM to make the host reboot, but I don't know whether PVE actually has some mechanism that reboots the host when it enters a critically unstable state.

I have googled all day long, learning about ZFS ARC limits (I have not yet tried forcing them to a lower value, since the hosts are in production) and about claims that swapping to ZFS, as PVE does, is a really bad idea.

I came up with two credible (but still only theoretical) scenarios:
1- ZFS eats up memory under backup load, the host swaps because there is not enough RAM, swapping to ZFS only makes things worse, and the host somehow fills up all memory (or enters some sort of deadlock) and reboots.
2- ZFS eats up memory under backup load and, regardless of any swap issues, fills up memory (or enters some sort of deadlock) and the host reboots.

I'm also baffled by the fact that another PVE host, with 32 GB RAM and a lot of VMs has 15 GB in use by ZFS, 8 GB swap in use at 99% (maybe swappiness is too high on PVE?), and still runs fine.
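The ARC and swap figures quoted above can be read straight from the host. A minimal sketch, assuming ZFS on Linux exposes its counters in /proc/spl/kstat/zfs/arcstats (it does when the zfs module is loaded); the function name arc_report is made up for illustration:

```shell
#!/bin/sh
# Print the current ARC size ("size") and its ceiling ("c_max") in MiB.
# The stats file has three columns: name, type, data.
# The path can be overridden with the first argument (handy for testing).
arc_report() {
    stats="${1:-/proc/spl/kstat/zfs/arcstats}"
    awk '$1 == "size" || $1 == "c_max" { printf "%s %d MiB\n", $1, $3 / 1048576 }' "$stats"
}

# Only query the live file when it exists, i.e. when ZFS is loaded.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_report
    free -m    # RAM and swap usage in MiB, for comparison
fi
```

Running this before and during a backup should show whether the ARC really grows until the host starts swapping.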

Summing it up, I'm at a loss. I have hosts running fine under heavy memory usage, and I have hosts crashing while running backups. All hosts share the same basic structure: PVE 4.4.1 (latest V4 ISO with no updates), ZFS mirroring, and backup to a locally mounted single disk (ext4 formatted).

Maybe it would be a good idea to log to a remote host, hoping to see something more in the syslog at the time of the crash and reboot? Locally I just don't see anything useful.
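One low-effort way to try this, assuming the hosts run rsyslog and that a log receiver is listening on UDP port 514: a single forwarding rule in a file under /etc/rsyslog.d/. The address 192.0.2.10 is a placeholder, and forward_rule is a hypothetical helper that only prints the rule:

```shell
#!/bin/sh
# Print an rsyslog rule that forwards all facilities and priorities
# to a remote host over UDP (a single "@"; "@@" would mean TCP).
# $1 = log host (placeholder below), $2 = port
forward_rule() {
    printf '*.* @%s:%s\n' "$1" "$2"
}

forward_rule 192.0.2.10 514
```

Redirect the output into a file such as /etc/rsyslog.d/90-remote.conf (the file name is arbitrary) and restart rsyslog. If the kernel dies abruptly, the last messages may still never leave the host, so an empty remote log does not rule anything out.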
 
Hi

I had the same reboot behavior with Proxmox 4.4 and ZFS in mirror mode:

root@xxxx:~# pveversion -v
proxmox-ve: 4.4-109 (running kernel: 4.4.117-1-pve)
pve-manager: 4.4-22 (running version: 4.4-22/2728f613)
pve-kernel-4.4.98-2-pve: 4.4.98-101
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.79-1-pve: 4.4.79-95
pve-kernel-4.4.98-3-pve: 4.4.98-103
pve-kernel-4.4.117-1-pve: 4.4.117-109
pve-kernel-4.4.98-4-pve: 4.4.98-104
pve-kernel-4.4.98-5-pve: 4.4.98-105
pve-kernel-4.4.95-1-pve: 4.4.95-99
pve-kernel-4.4.67-1-pve: 4.4.67-92
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.114-1-pve: 4.4.114-108
pve-kernel-4.4.83-1-pve: 4.4.83-96
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-2~pve4+2
libqb0: 1.0.1-1
pve-cluster: 4.0-54
qemu-server: 4.0-115
pve-firmware: 1.1-11
libpve-common-perl: 4.0-96
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-2
pve-docs: 4.4-4
pve-qemu-kvm: 2.9.1-9~pve4
pve-container: 1.0-105
pve-firewall: 2.0-33
pve-ha-manager: 1.0-41
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-4
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-9
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
 
Mfgamma, how frequently? I have one host rebooting every 2 days, and another only occasionally, about once a month or so...
 
I'm still struggling with ZFS eating up too much of my RAM.

I have disabled swap (swapoff -a) completely on the PVE host, and since that day, no more reboots.

No more reboots, true, but now VMs get killed when RAM on the host is exhausted, and the ARC does not even try to decrease its RAM usage, even when there is no more RAM available to applications (that is, VMs).

So it seems to me that, contrary to what I have read, ZFS does not release ARC memory when the system is low on RAM.
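To confirm that the VMs are really being killed by the kernel's OOM killer, the kernel log can be searched for its messages. A rough sketch; the log path /var/log/kern.log is an assumption (the usual Debian location) and count_oom_kills is a hypothetical helper name:

```shell
#!/bin/sh
# Count lines mentioning the OOM killer in a kernel log.
# The match is case-insensitive because the exact wording
# differs between kernel versions.
count_oom_kills() {
    log="${1:-/var/log/kern.log}"
    # grep exits non-zero when nothing matches; the count (0) is still printed.
    grep -ci 'out of memory' "$log" || true
}

# Only read the default log when it is present and readable.
if [ -r /var/log/kern.log ]; then
    count_oom_kills
fi
```

A non-zero count on the days the VMs died would support the idea that the ARC was not shrinking fast enough for the VMs' allocations.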

I may be a newbie at ZFS (yes, I surely am) but I really, really, really would like PVE to use mdraid and LVM instead of ZFS for mirroring.
 
You should set zfs_arc_max (and optionally zfs_arc_min) according to the amount of RAM you want the ZFS ARC to use.

In /etc/modprobe.d/zfs.conf, you put the lines:
Code:
# EXAMPLE ZFS ARC MIN - 512MB
options zfs zfs_arc_min=536870912

# EXAMPLE ZFS ARC MAX - 2G
options zfs zfs_arc_max=2147483648

Every time you change that file, you have to run:
Code:
$ update-initramfs -u
and then reboot.
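The two values in the example are plain byte counts. A small sketch for working out other sizes; mib_to_bytes is a made-up helper name. The commented-out lines write the same parameter through /sys/module/zfs/parameters/, which ZFS on Linux provides for runtime tuning, but whether a runtime change actually takes effect on the 0.6.5.x release shown in this thread is an assumption to check on a test machine first:

```shell
#!/bin/sh
# Convert a size in MiB to the byte value expected by
# zfs_arc_min / zfs_arc_max.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 512     # value used for zfs_arc_min in the example
mib_to_bytes 2048    # value used for zfs_arc_max in the example

# Runtime change without a reboot (as root, untested assumption here):
# echo "$(mib_to_bytes 2048)" > /sys/module/zfs/parameters/zfs_arc_max
# cat /sys/module/zfs/parameters/zfs_arc_max
```

A runtime write does not survive a reboot, so the entry in /etc/modprobe.d/zfs.conf is still needed for a permanent limit.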
 
Mbaldini, I have read elsewhere (on PVE wiki, too) about this configuration. I will try it, but I need to schedule a maintenance window on the server, and I have to HOPE that something does not go wrong, otherwise I will end up with a server that does not boot any more.
 
You should already have a maintenance window in which you install kernel updates and reboot the server; make the change in that timeframe.
Changing a ZFS parameter should not normally cause problems, but you can try it on a test machine first.
 
