Server hang during heavy IO

MasterPhi

Following this post, I have some updates.
https://forum.proxmox.com/threads/p...-blocked-for-more-than-120s.57765/post-277073

System Information:

Code:
CPU: Xeon E3-1285L V3
RAM: 32GB DDR3-1600 ECC, 50GB swap partition (not on ZFS) on the same SSD as rpool
HBA: LSI9305-24i

Two SSD pools: rpool and S860; one hard-disk pool: NAS

zpool status
  pool: NAS
state: ONLINE
  scan: scrub repaired 0B in 0 days 07:58:47 with 0 errors on Sun Nov 10 08:22:49 2019
config:

        NAME                                   STATE     READ WRITE CKSUM
        NAS                                    ONLINE       0     0     0
          raidz1-0                             ONLINE       0     0     0
            wwn-0x5000cca26bd9cc69             ONLINE       0     0     0
            ata-HGST_HUH721010ALE600_7JJ3GG4C  ONLINE       0     0     0
            ata-HGST_HUH721010ALE600_7JJ535MC  ONLINE       0     0     0

errors: No known data errors

  pool: S860
state: ONLINE
  scan: scrub repaired 0B in 0 days 00:06:37 with 0 errors on Sun Nov 10 00:30:44 2019
config:

        NAME                                           STATE     READ WRITE CKSUM
        S860                                           ONLINE       0     0     0
          ata-Samsung_SSD_860_EVO_4TB_S3YPNB0KA00732Y  ONLINE       0     0     0
          ata-Samsung_SSD_860_EVO_4TB_S3YPNB0KA00919A  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
state: ONLINE
  scan: scrub repaired 0B in 0 days 00:16:00 with 0 errors on Sun Nov 10 00:40:14 2019
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda3    ONLINE       0     0     0
            sdh3    ONLINE       0     0     0

errors: No known data errors

Code:
proxmox-ve: 6.0-2 (running kernel: 5.0.21-4-pve)
pve-manager: 6.0-11 (running version: 6.0-11/2140ef37)
pve-kernel-helper: 6.0-11
pve-kernel-5.0: 6.0-10
pve-kernel-5.0.21-4-pve: 5.0.21-9
pve-kernel-5.0.21-3-pve: 5.0.21-7
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-3
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-7
libpve-guest-common-perl: 3.0-2
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-8
pve-cluster: 6.0-7
pve-container: 3.0-10
pve-docs: 6.0-8
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-4
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.1-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-13
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2

ZFS module parameters, suggested by somebody to resolve a similar symptom:
options zfs zfs_arc_max=25769803776
options zfs zfetch_max_distance=268435456
options zfs zfs_prefetch_disable=1
options zfs zfs_nocacheflush=1
options zfs zfs_vdev_scheduler=none
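
For reference, these options normally live in a modprobe config file, and since rpool is the root pool the initramfs has to be refreshed for them to apply at boot. A minimal sketch (the file name /etc/modprobe.d/zfs.conf is the usual convention, not something taken from this thread; zfs_arc_max=25769803776 is the 24 GiB ARC limit visible in the arc_summary output further down):

Code:
# /etc/modprobe.d/zfs.conf contains the five "options zfs ..." lines above
update-initramfs -u -k all    # root is on ZFS, so bake the options into the initramfs
# after a reboot, verify that a value actually took effect:
cat /sys/module/zfs/parameters/zfs_arc_max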


Problem:
The Proxmox server hangs during an fio benchmark run on the S860 and NAS pools at the same time, with no automatic recovery after hours. No VM is running during the test.
Code:
[global]
name=fio-seq-write
filename=fio-seq-write
rw=write
bs=256K
direct=0
numjobs=1
time_based=1
runtime=300

[file1]
size=50G
ioengine=libaio
iodepth=16

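Since filename is relative, the job writes into whatever directory fio is started from, so the same job file can be pointed at both pools by launching it from their mountpoints. Roughly like this (a sketch; /S860 and /NAS are the assumed default mountpoints, and the job file path is arbitrary):

Code:
# run the job against both pools at the same time
( cd /S860 && fio /root/fio-seq-write.job ) &
( cd /NAS  && fio /root/fio-seq-write.job ) &
wait    # each run lasts ~300 s (runtime=300)
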
A txg_sync hang is observed in the system log; if the IPMI watchdog or softdog is enabled, the server is rebooted within 10 seconds.
Code:
[  968.292727] INFO: task txg_sync:2130 blocked for more than 120 seconds.
[  968.292762]        Tainted: P           O      5.0.21-4-pve #1
[  968.292784] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  968.292815] txg_sync        D    0  2130      2 0x80000000
[  968.292837] Call Trace:
[  968.292855]  __schedule+0x2d4/0x870
[  968.292872]  schedule+0x2c/0x70
[  968.292887]  schedule_timeout+0x157/0x360
[  968.292906]  ? __next_timer_interrupt+0xd0/0xd0
[  968.292925]  io_schedule_timeout+0x1e/0x50
[  968.292946]  __cv_timedwait_common+0x12f/0x170 [spl]
[  968.292968]  ? wait_woken+0x80/0x80
[  968.292985]  __cv_timedwait_io+0x19/0x20 [spl]
[  968.293049]  zio_wait+0x13a/0x280 [zfs]
[  968.293067]  ? _cond_resched+0x19/0x30
[  968.293115]  dsl_pool_sync+0xdc/0x4f0 [zfs]
[  968.293168]  spa_sync+0x5b2/0xfc0 [zfs]
[  968.293221]  ? spa_txg_history_init_io+0x106/0x110 [zfs]
[  968.293278]  txg_sync_thread+0x2d9/0x4c0 [zfs]
[  968.293331]  ? txg_thread_exit.isra.11+0x60/0x60 [zfs]
[  968.293355]  thread_generic_wrapper+0x74/0x90 [spl]
[  968.293376]  kthread+0x120/0x140
[  968.293393]  ? __thread_exit+0x20/0x20 [spl]
[  968.293410]  ? __kthread_parkme+0x70/0x70
[  968.293428]  ret_from_fork+0x35/0x40


Update 11/13:

I've disabled swap; now there is plenty of free RAM at idle:
Code:
              total        used        free      shared  buff/cache   available
Mem:       32894512     6502496    22191116       54452      200900    21998720
Swap:             0           0           0

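For reference, disabling it amounted to something like this (a sketch of the usual approach, not the exact commands used):

Code:
swapoff -a    # release all active swap immediately
# then comment out the swap line in /etc/fstab so it stays off after a reboot
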
During the test, RAM usage is between 24 and 30 GB. CPU usage is quite high, possibly because there is no SIMD support(?) (there is a check sketched after the output below):

Code:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                    
2337 root       1 -19       0      0      0 S  45.0   0.0  10:56.39 z_wr_iss                                                                                  
2340 root       1 -19       0      0      0 S  45.0   0.0  10:58.38 z_wr_iss                                                                                  
2338 root       1 -19       0      0      0 S  44.7   0.0  10:58.54 z_wr_iss                                                                                  
2339 root       1 -19       0      0      0 S  44.7   0.0  10:58.87 z_wr_iss                                                                                  
2341 root       1 -19       0      0      0 S  44.7   0.0  10:55.02 z_wr_iss                                                                                  
2342 root       1 -19       0      0      0 S  44.7   0.0  10:56.07 z_wr_iss                                                                                  
32655 root      20   0  789228   8808    636 D  17.2   0.0   0:43.01 fio                                                                                        
  453 root      39  19       0      0      0 S   4.3   0.0   0:35.99 dbuf_evict                                                                                
  447 root       0 -20       0      0      0 S   2.6   0.0   0:09.61 spl_kmem_cache                                                                            
2350 root       0 -20       0      0      0 R   2.0   0.0   0:24.96 z_wr_int                                                                                  
2344 root       0 -20       0      0      0 S   1.7   0.0   0:24.77 z_wr_int                                                                                  
2345 root       0 -20       0      0      0 S   1.7   0.0   0:24.85 z_wr_int                                                                                  
2348 root       0 -20       0      0      0 S   1.7   0.0   0:24.97 z_wr_int                                                                                  
2349 root       0 -20       0      0      0 S   1.7   0.0   0:24.75 z_wr_int                                                                                  
2351 root       0 -20       0      0      0 S   1.7   0.0   0:25.33 z_wr_int                                                                                  
2346 root       0 -20       0      0      0 S   1.3   0.0   0:24.82 z_wr_int                                                                                  
2347 root       0 -20       0      0      0 R   1.3   0.0   0:24.95 z_wr_int

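As a side note on the SIMD question: with ZFS 0.8.x it is possible to check which checksum/RAID-Z implementations the module actually selected (a sketch; the kstat paths are the ZoL 0.8 ones, and the module holding the parameter may differ between versions):

Code:
# the benchmark tables list the available implementations and mark the one in use
cat /proc/spl/kstat/zfs/fletcher_4_bench
cat /proc/spl/kstat/zfs/vdev_raidz_bench
# currently selected fletcher4 implementation (zcommon module in 0.8.x)
cat /sys/module/zcommon/parameters/zfs_fletcher_4_impl
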

I've run the test multiple times and no hang is observed. It seems like swap is the issue, but I can't explain it, since swap is not on a zvol:
Code:
Device          Start        End    Sectors  Size Type
/dev/sdh1          34       2047       2014 1007K BIOS boot
/dev/sdh2        2048    1050623    1048576  512M EFI System
/dev/sdh3     1050624 3649044480 3647993857  1.7T Solaris /usr & Apple ZFS
/dev/sdh4  3649044488 3750748814  101704327 48.5G Linux swap
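
Once swap is re-enabled, confirming that it really sits on this raw partition rather than a zvol could be done with something like (a sketch):

Code:
swapon --show                                # active swap devices and their type
lsblk -o NAME,TYPE,FSTYPE,SIZE /dev/sdh4     # should show a plain 48.5G swap partition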
 
If you enable swap, can you monitor your I/O (e.g., using "zpool iostat" or "iostat") and memory ("arcstat"/"arc_summary" and "free")?
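
Something along these lines, in separate shells while the benchmark runs (a sketch):

Code:
zpool iostat -v 5     # per-vdev throughput every 5 seconds
arcstat 5             # ARC size, target and hit rate every 5 seconds
watch -n 5 free -m    # overall memory and swap usage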
 
You were right, the web interface seems dead, but from the IPMI console the server could be recovered from the hang.

ARC usage is quite high, even full:

Code:
root@pve:~# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Nov 13 14:12:45 2019
Linux 5.0.21-4-pve                                            0.8.2-pve2
Machine: pve.hifiphile.com (x86_64)                           0.8.2-pve2

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   100.1 %   24.0 GiB
        Target size (adaptive):                       100.0 %   24.0 GiB
        Min size (hard limit):                         4.1 %  1003.9 MiB
        Max size (high water):                           24:1   24.0 GiB
        Most Frequently Used (MFU) cache size:          0.1 %   21.1 MiB
        Most Recently Used (MRU) cache size:           99.9 %   16.8 GiB
        Metadata cache size (hard limit):              75.0 %   18.0 GiB
        Metadata cache size (current):                  0.8 %  145.1 MiB
        Dnode cache size (hard limit):                 10.0 %    1.8 GiB
        Dnode cache size (current):                     0.3 %    5.3 MiB

ARC hash breakdown:
        Elements max:                                             218.4k
        Elements current:                              83.7 %     182.8k
        Collisions:                                                17.2k
        Chain max:                                                     3
        Chains:                                                     3.9k

ARC misc:
        Deleted:                                                  470.6k
        Mutex misses:                                               1.7k
        Eviction skips:                                             2.7M

ARC total accesses (hits + misses):                               913.9k
        Cache hit ratio:                               98.5 %     900.4k
        Cache miss ratio:                               1.5 %      13.5k
        Actual hit ratio (MFU + MRU hits):             98.2 %     897.6k
        Data demand efficiency:                        90.4 %      79.0k
        Data prefetch efficiency:                       0.0 %        113

Cache hits by cache type:
        Most frequently used (MFU):                    45.6 %     410.8k
        Most recently used (MRU):                      54.1 %     486.8k
        Most frequently used (MFU) ghost:               0.2 %       2.1k
        Most recently used (MRU) ghost:                 0.1 %       1.1k

Cache hits by data type:
        Demand data:                                    7.9 %      71.4k
        Demand prefetch data:                           0.0 %          0
        Demand metadata:                               91.8 %     826.2k
        Demand prefetch metadata:                       0.3 %       2.8k

Cache misses by data type:
        Demand data:                                   56.0 %       7.6k
        Demand prefetch data:                           0.8 %        113
        Demand metadata:                               28.4 %       3.8k
        Demand prefetch metadata:                      14.8 %       2.0k

DMU prefetch efficiency:                                               0
        Hit ratio:                                        n/a          0
        Miss ratio:                                       n/a          0

L2ARC not detected, skipping section


ZIL committed transactions:                                         1.0k
        Commit requests:                                             108
        Flushes to stable storage:                                   108
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:      315.1 KiB         97

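With the ARC allowed to grow to 24 GiB on a 32 GiB box while fio pushes a 50G file through the page cache (direct=0), memory pressure looks plausible. One experiment would be to shrink the ARC limit at runtime (a sketch; the 8 GiB value is just an example, not a recommendation from this thread):

Code:
# 8 GiB in bytes; the ARC shrinks towards the new limit without a reboot
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
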

sda & sdh are rpool; sdb, sdc & sdd are the NAS pool; sdi & sdj are the S860 pool:

Code:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.14    0.00   71.51   21.79    0.00    5.56

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdc              4.00  680.33     49.33 124872.00     0.00     1.67   0.00   0.24   14.83    8.52   4.72    12.33   183.55   0.78  53.60
sdd              4.00  641.33     42.67 124584.00     0.00     3.00   0.00   0.47   10.92    9.04   4.74    10.67   194.26   0.81  52.40
sdb              1.33  631.00      5.33 123612.00     0.00     3.00   0.00   0.47    0.25   10.67   5.67     4.00   195.90   0.81  51.47
sdj              4.00 3661.33    122.67 458577.33     0.00     5.33   0.00   0.15    1.83    1.54   0.44    30.67   125.25   0.26  94.00
sde              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdf              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdh              0.67    0.00      2.67      0.00     0.00     0.00   0.00   0.00    0.50    0.00   0.00     4.00     0.00   4.00   0.27
sdg              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdi              4.33 3719.67    150.67 466154.67     0.00     2.33   0.00   0.06    2.23    1.41   0.43    34.77   125.32   0.26  95.60
zd0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
zd16             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
zd32             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
zd48             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
zd64             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00


Every time it hangs, swap usage changes; vm.swappiness is 10 by default:

Code:
Nov 13 13:30:33 pve kernel: [  262.506697] softdog: Triggered - Reboot ignored

root@pve:~# free
              total        used        free      shared  buff/cache   available
Mem:       32894512    30487644     1475068      875436      931800     1237624
Swap:      50852156        2048    50850108
root@pve:~# free
              total        used        free      shared  buff/cache   available
Mem:       32894512    18273060    14489332       53908      132120    14262820
Swap:      50852156        6656    50845500

Maybe it's a generic swap thrashing problem?
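
If so, lowering vm.swappiness before giving up on swap entirely would be the obvious knob (a sketch; the value 1 is just an example):

Code:
sysctl vm.swappiness=1                                          # apply immediately
echo 'vm.swappiness = 1' > /etc/sysctl.d/99-swappiness.conf     # persist (file name arbitrary)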
 
Looks like it.
 
