Server stalls with ZFS

marianitn

Member
Mar 22, 2016
Hi,
I've been troubleshooting an issue for a couple of months now on a new server with 2 KVM VMs. Initially I found that both VMs logged the "BUG: soft lockup" error, but after searching for possible causes I ended up narrowing it down to the ZFS raidz2 stalling under certain circumstances. Typically, during I/O-intensive workloads (e.g. backups) the system load goes up on the host, the host becomes unresponsive, and the VMs start throwing all kinds of backtraces. The host doesn't log any errors and behaves normally afterwards.
Now some details on my setup:

- Server: IBM System x3650 M4, 12 cores, 16 GB RAM, 5 x 1 TB SATA disks
- ZFS: 5 disks raidz2

# zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rpool  4.53T  2.26T  2.27T         -    15%    49%  1.00x  ONLINE  -

# cat /etc/modprobe.d/zfs
options zfs zfs_arc_min=4294967296
options zfs zfs_arc_max=8589934592
options zfs zfs_prefetch_disable=1

# cat /sys/module/zfs/version
0.6.5.4-1
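
Since rpool is also the root pool, the zfs module is loaded from the initramfs, so the options above only take effect at boot after rebuilding it; the value actually in effect can be checked under /sys/module/zfs/parameters. Roughly, assuming a standard Proxmox/Debian jessie install:

# update-initramfs -u
# reboot
# cat /sys/module/zfs/parameters/zfs_arc_max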

- Proxmox: started with a standard 4.1 installation, now running the latest packages

# pveversion -v
proxmox-ve: 4.1-39 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-22 (running version: 4.1-22/aca130cf)
pve-kernel-4.2.6-1-pve: 4.2.6-34
pve-kernel-4.2.8-1-pve: 4.2.8-39
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-36
qemu-server: 4.0-64
pve-firmware: 1.1-7
libpve-common-perl: 4.0-54
libpve-access-control: 4.0-13
libpve-storage-perl: 4.0-45
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-9
pve-container: 1.0-52
pve-firewall: 2.0-22
pve-ha-manager: 1.0-25
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 2.0.0-pve2
cgmanager: 0.39-pve1
criu: 1.6.0-1
zfsutils: 0.6.5-pve7~jessie

# pveperf
CPU BOGOMIPS: 62399.16
REGEX/SECOND: 1337646
HD SIZE: 243.78 GB (rpool/ROOT/pve-1)
FSYNCS/SECOND: 2120.17

- VMs: 1 Samba server (4 GB RAM, 6 cores, ZFS backend), 1 proxy server (2 GB RAM, 4 cores, ZFS backend)

So, any clues on how to fix this? Maybe a newer ZFS version would fix the issue (any plans on upgrading it?); there are some deadlock fixes in the 0.6.5.5 and 0.6.5.6 changelogs that might help.
Any help will be welcome, as I'm running out of hair by now.

TIA
 
Hi,

the next version of ZFS is in preparation.
 
You can also install atop to try and figure out which resources are the bottleneck in those I/O situations (since you only have 8 GB RAM for ZFS and the host, I would keep an eye on memory usage, for example).
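
If the stalls are hard to catch live, atop can also write raw samples to a file for later replay, for example (just a sketch; the file name and the 10-second interval are arbitrary):

# atop -w /var/log/atop-stall.raw 10
# atop -r /var/log/atop-stall.raw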
 

OK, monitoring with atop right now. Regarding memory, why 8 GB? I count 10 GB for ZFS and the host, and 6 GB for the VMs. I changed zfs_arc_max to 5 GB to ensure there's memory available for ZFS other than the ARC. A backup is running; so far so good, with 3 GB of free memory on average:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15892      15798         93         43       3677         56
-/+ buffers/cache:      12064       3827
Swap:        15359          1      15358
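
In case it's useful: zfs_arc_max can also be changed at runtime through the module parameter, no reboot needed (a sketch; 5 GiB expressed in bytes):

# echo 5368709120 > /sys/module/zfs/parameters/zfs_arc_max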

Another thing I captured some days ago while it was failing is that perf top was reporting a high usage of native_queued_spin_lock_slowpath:

Samples: 154K of event 'cycles', Event count (approx.): 26450183436
Overhead  Shared Object            Symbol
  36.34%  [kernel]                 [k] native_queued_spin_lock_slowpath
   4.08%  [kernel]                 [k] kvm_handle_hva_range
   3.49%  [kernel]                 [k] _raw_spin_lock
   2.59%  [kernel]                 [k] __srcu_read_unlock
   2.56%  [kernel]                 [k] kvm_zap_rmapp
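
Next time it stalls I could also grab a full system-wide profile to share; something along these lines should do (a sketch; 30-second sample, output file name arbitrary):

# perf record -a -g -- sleep 30
# perf report --stdio > perf-stall.txt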
 
Sorry, I misread. You are of course right about the 6 vs. 8 GB.
 
Just to add information: the same issue is happening on another Proxmox server (4 cores, 12 GB RAM, 500 GB raidz1, 6 GB VM). I was able to monitor it and gathered the following info during the failure:


Code:
atop
ATOP - srv-virt2                            2016/04/01  10:13:34                            ------                             10s elapsed
PRC |  sys   68.38s  | user   1.89s  |  #proc    610  |  #trun     16 |  #tslpi   630  |  #tslpu     6  |  #zombie    0 |  #exit     17  |
CPU |  sys     681%  | user     10%  |  irq       0%  |  idle     81% |  wait     23%  |  guest     5%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      94%  | user      1%  |  irq       0%  |  idle      3% |  cpu001 w  2%  |  guest     0%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      90%  | user      1%  |  irq       0%  |  idle      8% |  cpu002 w  2%  |  guest     0%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      89%  | user      1%  |  irq       0%  |  idle      8% |  cpu003 w  1%  |  guest     0%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      89%  | user      1%  |  irq       0%  |  idle      1% |  cpu007 w  9%  |  guest     0%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      84%  | user      3%  |  irq       0%  |  idle      8% |  cpu004 w  4%  |  guest     2%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      84%  | user      2%  |  irq       0%  |  idle     11% |  cpu000 w  2%  |  guest     1%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      77%  | user      2%  |  irq       0%  |  idle     19% |  cpu005 w  1%  |  guest     1%  |  curf 2.70GHz |  curscal  61%  |
cpu |  sys      74%  | user      0%  |  irq       0%  |  idle     22% |  cpu006 w  3%  |  guest     0%  |  curf 2.70GHz |  curscal  61%  |
CPL |  avg1   13.82  | avg5    7.12  |  avg15   3.14  |               |  csw    33651  |  intr  588380  |               |  numcpu     8  |
MEM |  tot    11.6G  | free    1.2G  |  cache  68.4M  |  dirty   0.0M |  buff    1.2M  |  slab    3.1G  |               |                |
SWP |  tot    11.0G  | free   10.9G  |                |               |                |                |  vmcom   8.8G |  vmlim  16.8G  |
PAG |  scan  2674e4  |               |  stall      0  |               |                |  swin      68  |               |  swout      0  |
DSK |           sda  | busy     98%  |  read      55  |  write    228 |  KiB/w      9  |  MBr/s   0.05  |  MBw/s   0.21 |  avio 35.0 ms  |
DSK |           sdb  | busy     98%  |  read      92  |  write    211 |  KiB/w      9  |  MBr/s   0.07  |  MBw/s   0.20 |  avio 32.7 ms  |
NET |  transport     | tcpi       3  |  tcpo       3  |  udpi       0 |  udpo       0  |  tcpao      0  |  tcppo      0 |  tcprs      0  |
NET |  network       | ipi       27  |  ipo        3  |  ipfrw      0 |  deliv     27  |                |  icmpi      0 |  icmpo      0  |
NET |  tap100i   4%  | pcki      71  |  pcko     274  |  si  427 Kbps |  so   21 Kbps  |  erri       0  |  erro       0 |  drpo       0  |
NET |  eth0      0%  | pcki     299  |  pcko     418  |  si   26 Kbps |  so  449 Kbps  |  erri       0  |  erro       0 |  drpo       0  |
NET |  vmbr0   ----  | pcki      42  |  pcko       3  |  si    2 Kbps |  so    1 Kbps  |  erri       0  |  erro       0 |  drpo       0  |

  PID   RUID       EUID        THR    SYSCPU    USRCPU    VGROW    RGROW    RDDSK    WRDSK   ST   EXC   S   CPUNR    CPU    CMD       1/16
2531   root       root         19    16.75s     1.52s       0K     -60K   10188K     832K   --     -   R       0   181%    kvm
1124   root       root          1     7.37s     0.00s       0K       0K       0K       0K   --     -   S       4    73%    z_wr_iss
1126   root       root          1     6.19s     0.00s       0K       0K       0K       0K   --     -   R       4    61%    z_wr_iss
1127   root       root          1     4.62s     0.00s       0K       0K       0K       0K   --     -   R       2    46%    z_wr_iss
1123   root       root          1     4.48s     0.00s       0K       0K       0K       0K   --     -   R       5    44%    z_wr_iss
1125   root       root          1     4.16s     0.00s       0K       0K       0K       0K   --     -   R       6    41%    z_wr_iss
1122   root       root          1     4.03s     0.00s       0K       0K       0K       0K   --     -   R       3    40%    z_wr_iss
4414   root       root          1     3.19s     0.00s       0K     524K       0K       0K   --     -   R       7    32%    arcstat.py
2416   root       root          7     2.43s     0.00s       0K     564K       0K      18K   --     -   S       4    24%    pmxcfs
2504   root       root          1     1.51s     0.00s       0K     508K       0K       0K   --     -   S       7    15%    pvedaemon
1140   root       root          1     1.39s     0.00s       0K       0K       0K       0K   --     -   S       4    14%    z_wr_int_0
1216   root       root          1     1.28s     0.00s       0K       0K       0K       0K   --     -   S       3    13%    z_wr_int_6
29995   root       root          1     1.18s     0.00s       0K       0K       0K       0K   --     -   S       0    12%    kworker/u16:2
2288   root       root         10     1.14s     0.00s       0K     236K       0K       0K   --     -   S       0    11%    rrdcached


Code:
perf top

Samples: 192M of event 'cycles', Event count (approx.): 306817932810                                                                     
Overhead  Shared Object            Symbol                                                                                                 
  61.60%  [kernel]                 [k] native_queued_spin_lock_slowpath                                                                   
   5.41%  [kernel]                 [k] kvm_age_rmapp                                                                                     
   4.88%  [kernel]                 [k] kvm_handle_hva_range                                                                               
   2.57%  [kernel]                 [k] __srcu_read_unlock                                                                                 
   2.42%  [kernel]                 [k] __page_check_address                                                                               
   2.02%  [kernel]                 [k] isolate_lru_pages.isra.45                                                                         
   1.91%  [kernel]                 [k] __srcu_read_lock                                                                                   
   1.90%  [kernel]                 [k] _raw_spin_lock                                                                                     
   1.45%  [kernel]                 [k] rmap_walk


Code:
zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
rpool       73.9G   370G      2     48  14.2K   343K
rpool       73.9G   370G      0     11      0  75.9K
rpool       73.9G   370G      0     12      0   140K
rpool       73.9G   370G     31     60   172K   596K
rpool       73.9G   370G     45      9   233K   206K
rpool       73.9G   370G      0      6      0  52.0K
rpool       73.9G   370G     45      6   200K  80.0K
rpool       73.9G   370G     31      7   131K  44.0K
rpool       73.9G   370G    103     40   627K   659K


Code:
iostat -zmx

Linux 4.2.8-1-pve (srv-virt2)   04/01/2016      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.05    0.00    1.20    2.95    0.00   93.81

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1.06     0.05   10.40   44.56     0.07     0.77    31.39     0.22    4.03    6.41    3.47   2.57  14.13
sdb               1.08     0.05   11.08   44.56     0.08     0.77    31.18     0.20    3.64    5.61    3.15   2.35  13.09
zd0               0.00     0.00    0.08    0.13     0.00     0.00     8.04     0.00   16.40    3.43   23.82  15.21   0.32
zd16              0.00     0.00    0.00    0.00     0.00     0.00    14.47     0.00    0.00    0.00    0.00   0.00   0.00
zd32              0.00     0.00    1.11   28.47     0.05     0.10    10.40     0.06    2.08   13.55    1.63   1.50   4.44
zd48              0.00     0.00    4.37   68.55     0.09     0.22     8.88     0.26    3.62   12.87    3.03   1.68  12.22


Code:
arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c 
10:15:06   739    54      7    54    7     0    0     1  100   3.0G  3.0G 
10:15:08   469    64     13    64   13     0    0    10   66   3.0G  3.0G 
10:15:09   181    18      9    18    9     0    0     0    0   3.0G  3.0G 
10:15:10   360    30      8    30    8     0    0     2   40   3.0G  3.0G 
10:15:12   332    56     16    56   16     0    0     0    0   3.0G  3.0G 
10:15:13   970    12      1    12    1     0    0     3   50   3.0G  3.0G 
10:15:14   190     9      4     9    4     0    0     0    0   3.0G  3.0G 
10:15:15   449    17      3    17    3     0    0    12  100   3.0G  3.0G

Code:
free -m
             total       used       free     shared    buffers     cached
Mem:         11896      10676       1219         59          0         69
-/+ buffers/cache:      10606       1289
Swap:        11263         77      11186

According to this, there's still free memory left unused, and the disks are not overloaded. There are just plenty of z_wr_iss threads consuming all the CPU, and a very high native_queued_spin_lock_slowpath in perf top.
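
To see where the z_wr_iss threads are actually spinning, their kernel stacks and any blocked tasks could be dumped during the next stall and attached to a report, roughly like this (a sketch; SysRq "w" writes blocked-task traces to the kernel log):

Code:
cat /proc/$(pgrep z_wr_iss | head -n1)/stack
echo w > /proc/sysrq-trigger
dmesg | tail -n 200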

Do you think it's a ZFS bug? Or maybe something in my setups?