Hi,
I've been troubleshooting an issue for a couple of months now on a new server with 2 KVM VMs. Initially I found that both VMs logged the "BUG: soft lockup" error, but after searching for possible causes I ended up narrowing it down to the ZFS raidz2 pool stalling under certain circumstances. Typically on I/O intensive workloads (e.g. backups) the system load goes up on the host, it becomes unresponsive, and the VMs start throwing all kinds of backtraces. The host doesn't log any errors and behaves normally afterwards.
Now some details on my setup:
- Server: IBM System x3650 M4, 12 cores, 16 GB RAM, 5 x 1 TB SATA disks
- ZFS: 5 disks raidz2
# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
rpool 4.53T 2.26T 2.27T - 15% 49% 1.00x ONLINE -
# cat /etc/modprobe.d/zfs
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=8589934592
options zfs zfs_prefetch_disable=1
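(For reference, those zfs_arc_min/zfs_arc_max byte values are just 4 GiB and 8 GiB, i.e. half of the host's 16 GB RAM for the ARC ceiling:)

```python
# The zfs_arc_min / zfs_arc_max module option values above, expressed in GiB
GiB = 1024 ** 3
zfs_arc_min = 4 * GiB
zfs_arc_max = 8 * GiB
print(zfs_arc_min, zfs_arc_max)  # 4294967296 8589934592
```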
# cat /sys/module/zfs/version
0.6.5.4-1
- Proxmox: started with standard 4.1 installation, now running latest packages
# pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.6-1-pve: 4.2.6-34
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie
# pveperf
CPU BOGOMIPS: 62399.16
REGEX/SECOND: 1337646
HD SIZE: 243.78 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 2120.17
- VM's: 1 samba server (4GB RAM, 6 cores, zfs backend), 1 proxy server (2GB RAM, 4 cores, zfs backend)
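In case it's relevant, here's the rough memory budget on the host (just my own arithmetic from the numbers above, not measured):

```python
# Rough host memory budget in GiB, from the figures in this post
host_ram = 16
arc_max  = 8        # zfs_arc_max = 8589934592 bytes = 8 GiB
vm_ram   = 4 + 2    # samba VM + proxy VM guest RAM
headroom = host_ram - arc_max - vm_ram
print(headroom)     # 2 GiB left for kernel, QEMU overhead, everything else
```

So if the ARC grows to its maximum during a backup, only about 2 GiB remain for everything else, which makes me wonder whether memory pressure is part of the stall.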
So, any clues on how to fix this? Maybe a newer ZFS version would fix the issue (any plans on upgrading it?); there are some deadlock fixes in the 0.6.5.5 and 0.6.5.6 changelogs that might help.
Any help will be welcome, as I'm running out of hair by now.
TIA