High iowait on a ZFS pool

djbobyd
Aug 28, 2020
Hello everyone,
I've been using Proxmox for some time and I really like it. The posts and advice here in this forum were very useful to me throughout all the setups and optimizations I did on my VMs.
The time has come for me to write in this forum as well, unfortunately with a plea for help.
I recently created a new ZFS pool on a couple of disks I have, then moved all my VMs to it. ZFS is really nice as a whole, but since I started using it I have been getting these weird hangs. At first I thought it was related to the ARC configuration, so I read some documentation, again here in the forum, and tuned the dirty ratio and the ARC size. That seemed to work for a while, but the problem came back today. This time I caught it while it was building up, and it turns out to be some kind of iowait issue. When I looked at the logs (/var/log/messages) I found the following dump:

Code:
Aug 28 12:06:11 pve kernel: [98445.956030] fwbr116i0: port 2(tap116i0) entered forwarding state
Aug 28 12:52:08 pve kernel: [101203.260579] fwbr116i0: port 2(tap116i0) entered disabled state
Aug 28 12:52:08 pve kernel: [101203.293818] fwbr116i0: port 1(fwln116i0) entered disabled state
Aug 28 12:52:08 pve kernel: [101203.293890] vmbr0: port 9(fwpr116p0) entered disabled state
Aug 28 12:52:08 pve kernel: [101203.296943] device fwln116i0 left promiscuous mode
Aug 28 12:52:08 pve kernel: [101203.296948] fwbr116i0: port 1(fwln116i0) entered disabled state
Aug 28 12:52:08 pve kernel: [101203.343154] device fwpr116p0 left promiscuous mode
Aug 28 12:52:08 pve kernel: [101203.343156] vmbr0: port 9(fwpr116p0) entered disabled state
Aug 28 18:19:35 pve kernel: [120849.445985] PGD 0 P4D 0
Aug 28 18:19:35 pve kernel: [120849.446005] Oops: 0000 [#1] SMP PTI
Aug 28 18:19:35 pve kernel: [120849.446032] CPU: 5 PID: 1326 Comm: z_wr_iss Tainted: P           O      5.4.55-1-pve #1
Aug 28 18:19:35 pve kernel: [120849.446080] Hardware name: ASUSTeK COMPUTER INC. Z10PE-D16 Series/Z10PE-D16 Series, BIOS 3502 11/24/2017
Aug 28 18:19:35 pve kernel: [120849.446205] RIP: 0010:lz4_compress_zfs+0x3e5/0x7e0 [zfs]
Aug 28 18:19:35 pve kernel: [120849.446239] Code: c8 48 29 df 49 29 d8 c1 ee 13 66 41 89 7c 75 00 69 31 b1 79 37 9e c1 ee 13 49 8d 74 75 00 0f b7 3e 66 44 89 06 8b 31 48 01 df <39> 37 0f 85 86 03 00 00 c6 00 00 4c 8d 40 01 48 89 ce 4c 8d 59 04
Aug 28 18:19:35 pve kernel: [120849.446338] RSP: 0018:ffff972249bd7cb0 EFLAGS: 00010202
Aug 28 18:19:35 pve kernel: [120849.446358] RAX: ffff8b0bf02a411b RBX: 0000000023658001 RCX: ffff8afdddb4a178
Aug 28 18:19:35 pve kernel: [120849.446383] RDX: ffff8afdddb4b000 RSI: 00000000bac58001 RDI: 0000000023658001
Aug 28 18:19:35 pve kernel: [120849.446409] RBP: ffff972249bd7d08 R08: ffff8afdba4f2177 R09: ffff8afdddb4aff4
Aug 28 18:19:35 pve kernel: [120849.446434] R10: ffff8b0bf02a4e00 R11: ffff8afdddb4a178 R12: 0000000000001000
Aug 28 18:19:35 pve kernel: [120849.446460] R13: ffff8b0be7b2c000 R14: ffff8afdddb4a112 R15: 0000000000000023
Aug 28 18:19:35 pve kernel: [120849.446485] FS:  0000000000000000(0000) GS:ffff8b0bff540000(0000) knlGS:0000000000000000
Aug 28 18:19:35 pve kernel: [120849.446514] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 18:19:35 pve kernel: [120849.446535] CR2: 0000000023658001 CR3: 00000010de20a006 CR4: 00000000003626e0
Aug 28 18:19:35 pve kernel: [120849.446560] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 28 18:19:35 pve kernel: [120849.446585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 28 18:19:35 pve kernel: [120849.446610] Call Trace:
Aug 28 18:19:35 pve kernel: [120849.446670]  zio_compress_data+0x85/0xc0 [zfs]
Aug 28 18:19:35 pve kernel: [120849.446732]  zio_write_compress+0x426/0x770 [zfs]
Aug 28 18:19:35 pve kernel: [120849.446794]  zio_execute+0x99/0xf0 [zfs]
Aug 28 18:19:35 pve kernel: [120849.446817]  taskq_thread+0x2ec/0x4d0 [spl]
Aug 28 18:19:35 pve kernel: [120849.446838]  ? wake_up_q+0x80/0x80
Aug 28 18:19:35 pve kernel: [120849.446895]  ? zio_taskq_member.isra.12.constprop.17+0x70/0x70 [zfs]
Aug 28 18:19:35 pve kernel: [120849.446922]  kthread+0x120/0x140
Aug 28 18:19:35 pve kernel: [120849.446940]  ? task_done+0xb0/0xb0 [spl]
Aug 28 18:19:35 pve kernel: [120849.446958]  ? kthread_park+0x90/0x90
Aug 28 18:19:35 pve kernel: [120849.446975]  ret_from_fork+0x35/0x40
Aug 28 18:19:35 pve kernel: [120849.446991] Modules linked in: tcp_diag inet_diag rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache veth ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables msr iptable_filter bpfilter bonding softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul zfs(PO) crc32_pclmul ghash_clmulni_intel aesni_intel zunicode(PO) crypto_simd zlua(PO) cryptd glue_helper zavl(PO) rapl icp(PO) ipmi_ssif dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ast drm_vram_helper intel_cstate ttm pcspkr drm_kms_helper drm fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me mei ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp sunrpc libiscsi_tcp libiscsi scsi_transport_iscsi nct6775 hwmon_vid coretemp ip_tables x_tables autofs4 btrfs xor zstd_compress hid_generic
Aug 28 18:19:35 pve kernel: [120849.447026]  usbhid hid raid6_pq libcrc32c xhci_pci ehci_pci ahci i2c_i801 lpc_ich ehci_hcd xhci_hcd igb libahci i2c_algo_bit dca wmi
Aug 28 18:19:35 pve kernel: [120849.452519] CR2: 0000000023658001
Aug 28 18:19:35 pve kernel: [120849.453471] ---[ end trace c626264b918d11a4 ]---
Aug 28 18:19:35 pve kernel: [120849.461752] RIP: 0010:lz4_compress_zfs+0x3e5/0x7e0 [zfs]
Aug 28 18:19:35 pve kernel: [120849.462666] Code: c8 48 29 df 49 29 d8 c1 ee 13 66 41 89 7c 75 00 69 31 b1 79 37 9e c1 ee 13 49 8d 74 75 00 0f b7 3e 66 44 89 06 8b 31 48 01 df <39> 37 0f 85 86 03 00 00 c6 00 00 4c 8d 40 01 48 89 ce 4c 8d 59 04
Aug 28 18:19:35 pve kernel: [120849.464543] RSP: 0018:ffff972249bd7cb0 EFLAGS: 00010202
Aug 28 18:19:35 pve kernel: [120849.465478] RAX: ffff8b0bf02a411b RBX: 0000000023658001 RCX: ffff8afdddb4a178
Aug 28 18:19:35 pve kernel: [120849.466442] RDX: ffff8afdddb4b000 RSI: 00000000bac58001 RDI: 0000000023658001
Aug 28 18:19:35 pve kernel: [120849.467372] RBP: ffff972249bd7d08 R08: ffff8afdba4f2177 R09: ffff8afdddb4aff4
Aug 28 18:19:35 pve kernel: [120849.468310] R10: ffff8b0bf02a4e00 R11: ffff8afdddb4a178 R12: 0000000000001000
Aug 28 18:19:35 pve kernel: [120849.469253] R13: ffff8b0be7b2c000 R14: ffff8afdddb4a112 R15: 0000000000000023
Aug 28 18:19:35 pve kernel: [120849.470191] FS:  0000000000000000(0000) GS:ffff8b0bff540000(0000) knlGS:0000000000000000
Aug 28 18:19:35 pve kernel: [120849.471119] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 18:19:35 pve kernel: [120849.472046] CR2: 0000000023658001 CR3: 00000010de20a006 CR4: 00000000003626e0
Aug 28 18:19:35 pve kernel: [120849.472961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 28 18:19:35 pve kernel: [120849.473850] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

I looked around the internet, but all I found was one person who had a memory problem. I am already checking the RAM for issues with memtest86+; it is more than halfway through and still no errors. That's why I decided to ask whether the dump above makes sense to any of you and whether you can point me to a possible solution.
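In case it helps anyone hitting similar hangs: a rough way to confirm that iowait is what is piling up is to read it straight out of /proc/stat (the fifth value on the "cpu" line is time spent waiting on I/O, in clock ticks); for the ZFS side, `zpool iostat -v` shows per-vdev latency. The pool name below is just an example.

```shell
# System-wide iowait accumulated since boot, from /proc/stat.
# Fields on the "cpu" line: user nice system idle iowait irq softirq ...
read -r _ user nice system idle iowait _ < /proc/stat
total=$((user + nice + system + idle + iowait))
echo "iowait ticks since boot: $iowait (of $total total)"

# Per-vdev view of the pool itself, refreshed every 5 seconds
# ("tank" is a placeholder for your pool name):
# zpool iostat -v tank 5
```

Watching the iowait counter grow while `zpool iostat` shows a vdev with inflated latency usually narrows the problem down to one disk or to the pool as a whole.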

Thanks a lot in advance.
 
Hi @wolfgang ,
Thanks for your reply. I've done what you suggested, but unfortunately I am not sure what to look for. Here is the output of the mlc tool. Please let me know if you see something wrong.

Code:
Intel(R) Memory Latency Checker - v3.9
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0          99.0   145.7
       1         139.7    95.0

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      46886.8
3:1 Reads-Writes :      55974.1
2:1 Reads-Writes :      55339.9
1:1 Reads-Writes :      48901.8
Stream-triad like:      52289.3

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0        30793.5 20205.6
       1        20302.0 30511.2

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  492.35    59823.3
 00002  513.11    59751.8
 00008  505.13    59823.1
 00015  468.78    59241.4
 00050  474.19    59732.5
 00100  476.53    59460.6
 00200  422.41    59225.6
 00300  144.15    49331.8
 00400  118.84    37897.4
 00500  118.87    30317.1
 00700  111.86    22409.5
 01000  104.93    16023.3
 01300  103.11    12346.7
 01700   96.43     9717.1
 02500   94.64     6852.7
 03500  114.55     4848.7
 05000  153.17     3433.5
 09000  102.99     2316.0
 20000   91.66     1479.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        43.2
Local Socket L2->L2 HITM latency        47.9
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   106.4
            1    105.2       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1
            0        -   106.0
            1    105.4       -

Thanks!
 
Hi,

These values are generally fast enough that they should not be a problem.
Am I correct in assuming that you use ECC memory?

If so, memory problems should show up in the log.
Do you have enough memory left, or do you use ZRAM?
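For reference, free memory, current ARC size, and any swap/ZRAM usage can all be checked with something like the sketch below (the arcstats path is where ZFS on Linux exposes its counters):

```shell
# Any swap or ZRAM device in use? (empty output means none)
swapon --show

# Current ARC size in bytes; ZFS on Linux publishes it under /proc/spl:
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    awk '$1 == "size" {printf "ARC size: %.1f GiB\n", $3 / 1073741824}' \
        /proc/spl/kstat/zfs/arcstats
fi

# How much RAM is actually free right now:
free -h
```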
 
Hi @wolfgang ,
Your assumption is correct; I am using DDR4 ECC modules. I haven't seen any memory-related errors in kern.log, syslog, or the messages log.
I have set the maximum ARC size to 20 GB of RAM, but the system generally has around 50 GB free at all times.
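For anyone finding this thread later: on ZFS on Linux the ARC cap is set via the `zfs_arc_max` module option. A sketch of the modprobe config for a 20 GiB cap (the value is in bytes, i.e. 20 * 2^30):

```shell
# /etc/modprobe.d/zfs.conf -- cap the ARC at 20 GiB (value in bytes):
#
#   options zfs zfs_arc_max=21474836480
#
# After editing, refresh the initramfs so the option applies at boot:
#
#   update-initramfs -u
```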

I have additionally updated the BIOS to the latest version and installed the intel-microcode package; I saw reports that these might help. So far I have three days of uptime after these latest changes. That is far too short a period to judge, as last time the system ran for almost a week before hanging, but we will see...
 
Hi all, after the changes mentioned above the problem has disappeared. Thanks for your time.
 
