kernel-4.15.18-8 ZFS freeze

Discussion in 'Proxmox VE: Installation and configuration' started by aa007, Nov 17, 2018.

  1. aa007

    aa007 New Member

    Joined:
    Feb 6, 2014
    Messages:
    7
    Likes Received:
    0
    Hi,

    I upgraded the kernel to this version just yesterday, and today our hypervisor showed a kernel panic; journald was complaining that it can't write anything. The VMs were still running, but when I tried to write anything to the disk it froze. After 10 more minutes the VMs stopped responding.
    After resetting the host, everything went back to normal.

    I don't have much more information, but I saw there were some patches regarding ZFS in this version, so for now I am downgrading to 4.15.18-7.
     
    #1 aa007, Nov 17, 2018
    Last edited: Nov 17, 2018
  2. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,445
    Likes Received:
    386
    Please post details about your hardware; maybe this helps with debugging (e.g. the storage controller).
     
  3. aa007

    aa007 New Member

    Joined:
    Feb 6, 2014
    Messages:
    7
    Likes Received:
    0
    It's a Fujitsu PRIMERGY RX2530 M1 with a PRAID EP400i controller and 8 Seagate ST900MM0018 drives put in JBOD mode so Proxmox sees all the drives - one of them is a hot spare.
    We have attached two M.2 SSDs (Samsung 970 PRO 512GB and Samsung 860 EVO 250GB) for ZIL / L2ARC using an I-TEC PCI-E 2x M.2 card - we were out of disk slots for attaching SSDs, so one is attached via PCIe and the other via SATA.
    There are 2 partitions on each device (32GB/96GB). We have a mirror of the first two 32GB partitions for the ZIL, and we use the 96GB partition on the Samsung 970 for L2ARC.
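
    For reference, a minimal sketch of how such a log/cache layout could be attached to an existing pool (the pool name "rpool" and the device paths are assumptions, not taken from this post):
    Code:
    # mirrored SLOG across the 32GB partitions of both SSDs (hypothetical device names)
    zpool add rpool log mirror /dev/nvme0n1p1 /dev/sdi1
    # L2ARC on the 96GB partition of the Samsung 970 PRO
    zpool add rpool cache /dev/nvme0n1p2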
     
  4. marsian

    marsian Member
    Proxmox Subscriber

    Joined:
    Sep 27, 2016
    Messages:
    37
    Likes Received:
    3
    Are you using the latest BIOS and firmware on it? We've seen sporadic hangs with older firmware on FTS devices, but could fix all of them with recent upgrades.
     
  5. aa007

    aa007 New Member

    Joined:
    Feb 6, 2014
    Messages:
    7
    Likes Received:
    0
    It happened again this morning, even with the older version of the kernel.
    I found that the BIOS was outdated; the other components are up to date. I have upgraded the BIOS to the latest version, and since a new kernel 4.15.18-9 was available I have installed that as well.
    Will report if it happens again.
     
  6. aa007

    aa007 New Member

    Joined:
    Feb 6, 2014
    Messages:
    7
    Likes Received:
    0
    So unfortunately it has happened again. This time we managed to get the stack trace:
    Code:
    [223849.690311] kernel BUG at mm/slub.c:296!
    [223849.690345] invalid opcode: 0000 [#1] SMP PTI
    [223849.690368] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment xt_tcpudp xt_set xt_addrtype xt_conntrack xt_mark ip_set_hash_net ip_set xt_multiport iptable_filter openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif mgag200 ttm irqbypass crct10dif_pclmul drm_kms_helper crc32_pclmul ghash_clmulni_intel pcbc snd_pcm drm aesni_intel snd_timer aes_x86_64 crypto_simd snd i2c_algo_bit glue_helper cryptd fb_sys_fops syscopyarea sysfillrect soundcore mei_me intel_cstate joydev input_leds sysimgblt
    [223849.690618]  intel_rapl_perf ipmi_si pcspkr mei lpc_ich ipmi_devintf ipmi_msghandler shpchp wmi acpi_power_meter mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm sunrpc ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq uas usb_storage hid_generic usbkbd usbmouse usbhid hid ahci libahci i2c_i801 ixgbe be2net igb(O) dca ptp pps_core mdio megaraid_sas
    [223849.690758] CPU: 28 PID: 40604 Comm: z_wr_int_4 Tainted: P           O     4.15.18-9-pve #1
    [223849.690781] Hardware name: FUJITSU PRIMERGY RX2530 M1/D3279-A1, BIOS V5.0.0.9 R1.36.0 for D3279-A1x                     06/06/2018
    [223849.690816] RIP: 0010:__slab_free+0x1a2/0x330
    [223849.690830] RSP: 0018:ffffb84c5c8bfa70 EFLAGS: 00010246
    [223849.690847] RAX: ffff943781796f60 RBX: ffff943781796f60 RCX: 00000001002a0020
    [223849.691793] RDX: ffff943781796f60 RSI: ffffda0c5705e580 RDI: ffff9441ff407600
    [223849.692728] RBP: ffffb8
    The rest was captured in a photo (attached).
     
  7. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,591
    Likes Received:
    305
    JBOD mode on a RAID controller is still RAID and is not supported with ZFS.
    ZFS has problems with transparent caches.
     
  8. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    This also happens on non-JBOD setups.

    There is an issue with ZFS on Proxmox, because this does not happen outside of Proxmox. There are several threads about the kernel freezing under high disk load (the 120s messages). This seems to be an issue caused by the Proxmox kernel (probably the timer should be 1000 instead of 250). An alternative workaround was to have a separate ZFS pool for the OS and the VMs, which helps a little; some people set I/O limits on VMs, but this does not help for backups.
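
    For illustration, per-disk I/O limits can be set on a VM; a minimal sketch, assuming VM ID 100, storage "local-zfs" and that volume name (none of which are from this thread):
    Code:
    # cap a VM disk at 100 MB/s read and write by re-specifying the volume with limit options
    qm set 100 --scsi0 local-zfs:vm-100-disk-0,mbps_rd=100,mbps_wr=100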
     
  9. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,135
    Likes Received:
    147
    What? Where did you get that from? Kernel ticks shouldn't influence this at all...

    Which ZFS version, on which distro, with which kernel did you test (long-term) where this does not happen? Maybe we can look at what differences there could solve this.

    Yes, but "task hung for 120s" is a general error and can be the result of a lot of things... We and upstream ZFS have already fixed a lot of the reasons such things have happened, and quite a few of those threads' issues are no longer valid.
     
  10. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    This happened on Arch Linux too, and the fix was to increase the timer to 1000. What worries me: the config of Proxmox is 250, but when you measure it you get 100, and I don't know why that is. The easiest way would be to build a kernel with 1000 and retry. I have this issue on all my ZFS Proxmox installations - just dd to a VM image and after several tens of GB it stops.

    You see reports from ZFS x.7 to x.15 with this issue. Mainly it comes from ZFS not giving disk I/O back to the kernel scheduler, and so it freezes. I did some debugging with a scheduled task which writes to sysfs to unfreeze ZFS, so you get control back when it freezes, but only for a short while.


    The 120s message comes 100% from ZFS - I saw it blocking disk I/O completely, so only programs that are already in memory can keep running. There are reports that when you create two ZFS pools, one for root and one for the VMs, then a VM can no longer block it (as it does now), but if you run backups and access root it still happens.
     
    #10 at24106, Mar 30, 2019
    Last edited: Mar 30, 2019
  11. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,135
    Likes Received:
    147
    Got any links or sources for that? I couldn't find anything, and the Arch Linux kernel is configured to tick at 300 Hz.
    And yes, we're using 250 Hz in combination with "CONFIG_NO_HZ_IDLE" (disable ticks on idle CPU cores), which is a good trade-off between timer accuracy and wake-ups per second (which are costly, CPU- and power-wise; only the former is really important for PVE, the latter matters more for mobile devices).

    We're using dynamic ticks now, so you already get anything between 100 and 1500, depending on need; see:
    https://elinux.org/Kernel_Timer_Systems#Dynamic_ticks
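
    As a side note, one can check how the running kernel was configured in this respect; a small sketch, assuming the usual Debian/PVE config file naming:
    Code:
    # show the tick rate and NO_HZ settings the running kernel was built with
    grep -E 'CONFIG_HZ(=|_)|CONFIG_NO_HZ' /boot/config-$(uname -r)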

    This sounds like a bug in ZFS which you could report to ZFS on Linux; such a thing would be a general issue and not solved by increasing timer ticks - maybe reduced in likelihood, but that's never a solution for such bugs.

    It can come from ZFS (I never stated that it couldn't), and it does in your case and some others, but it can also come from bugs in certain NICs or their driver/firmware (for which you will also find reports here), or from anything else blocking for longer than 120s, like doing IO on a dead NFS mount. Also, the ZFS ones can come from different issues (that's what I tried to say in my last reply), some of them already solved...
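
    For what it's worth, the 120s value is the kernel's generic hung-task watchdog, and the kernel itself can show which task was blocked; a small debugging sketch using standard kernel interfaces (nothing Proxmox- or ZFS-specific):
    Code:
    # the 120s threshold behind the "task hung" warning
    sysctl kernel.hung_task_timeout_secs
    # dump all blocked (D-state) tasks with their call traces to the kernel log
    echo w > /proc/sysrq-trigger
    # check which task was stuck and in which code path
    dmesg | grep -A 20 'blocked for more than'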

    That's the issue, I cannot reproduce this:

    Code:
    # zfs create -V $[128 * 1<<30] toms-big-pool/foo                                                  # creates a zvol of 128 GiB (~137 GB)
    # dd if=/dev/urandom of=/dev/toms-big-pool/foo bs=1M count=$[1<<16]             # write random data (from urandom, so unblocked), ~ 64 GB
    65536+0 records in
    65536+0 records out
    68719476736 bytes (69 GB, 64 GiB) copied, 543.355 s, 252 MB/s
    
    Or do you do something else? Clear steps from you would be best, so we can make sure there's no difference when trying to reproduce this.

    What hardware do you have (the more details the better: disks, RAM, CPU, vendor, HW RAID or not, ...)?
     
    #11 t.lamprecht, Mar 30, 2019
    Last edited: Mar 30, 2019
  12. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    Hello,

    here is how to reproduce this:

    Standard Proxmox ZFS root install from the ISO on a server (Hetzner EX42, 64 GB RAM, 4 cores) with:
    a) 4TB
    b) 4TB
    enterprise SATA drives in ZFS RAID-Z1.

    After the install, add a 100 GB SSD partition as read cache and a 100 GB SSD partition as log device, but it happens without this step too. No dedup, no compression.

    Create a VM with a 1.5 TB ZFS-backed disk using virtio.
    Boot the VM with any Linux rescue system (e.g. SystemRescueCd).
    Stream a partition (here 1.5 TB in size) from a remote KVM server into the VM, writing to its local disk (which is a ZFS disk image):
    ssh root@remoteserver.com "dd if=/dev/vg1/partiton bs=4M | gzip -1 -" | gunzip -1 - | pv -s 1500G | dd of=/dev/sda bs=4M

    It starts at 60 MByte/s; after about 10-20 minutes it slows down to 200 KB/s - 2 MByte/s, and later it freezes completely with no CPU load.
     
    #12 at24106, Mar 30, 2019
    Last edited: Mar 30, 2019
  13. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    It seems there is a mitigation: when virtio is not activated, the disk speed drops by about 60%, but it doesn't come to this freeze.
     
  14. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    Hello,

    the mitigation without virtio only holds for lower load than above. If the load is increased it still freezes with SATA. See this example without gzip:
    ssh root@remoteserver.com "dd if=/dev/vg1/partiton bs=4M" | pv -s 1500G | dd of=/dev/sda bs=4M
     
  15. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    Hello,

    I got the extreme case now: after a reboot the host does not come up because ZFS blocks the kernel (120s message) ...

    ZFS on Proxmox is dead for me .. far too unstable for production.
     
  16. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,135
    Likes Received:
    147
    What, are you really doing a RAIDZ1 with only two disks? That would make no sense, and our installer doesn't even allow it... With two disks only mirror mode (RAID1) is supported and makes sense, as long as one wants a production system and not an accident waiting to happen. So how did you even install Proxmox VE to get such a setup? Can you please post the output of:
    Code:
    pve# zpool status
    pve# pveversion -v
    
    Virtio what - SCSI on VirtIO-SCSI, or VirtIO Block? Because I'd really suggest SCSI on the virtio-scsi bus, as VirtIO Block not only lacks some features, it's also a bit older and hasn't seen as much work for quite a while.
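
    For illustration, switching an existing disk from VirtIO Block to SCSI on virtio-scsi could look roughly like this (VM ID 100 and the volume name are assumptions; the detached disk shows up as "unusedX" before being re-attached):
    Code:
    # use the virtio-scsi controller for the VM
    qm set 100 --scsihw virtio-scsi-pci
    # detach the old virtio-blk disk
    qm set 100 --delete virtio0
    # re-attach the same volume as a SCSI disk
    qm set 100 --scsi0 local-zfs:vm-100-disk-0
    (The boot device/order inside the VM config may need adjusting afterwards.)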

    Your reproducers are quite complicated; can you reduce this to something host-only, removing VMs from the equation completely?

    Code:
    pve# zfs create -V 1.5T POOLNAME/freeze-test
    pve# dd if=/dev/urandom status=progress of=/dev/POOLNAME/freeze-test bs=1M count=1G
    
    (FYI: the status=progress option of newer dd versions lets you omit things like "pv"; it directly reports how much and how fast it has written.)

    If you really have a RAIDZ1 setup, I'd strongly suggest retrying with a RAID1 setup, or adding a third disk so that a RAIDZ1 starts to make sense.
    We internally, and thousands of our users, run ZFS on PVE successfully in production. There are surely issues (as with every storage technology on some setups), but let's not generalize from the few issues.
     
    at24106 likes this.
  17. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    Dear,

    Sorry, it was not RAIDZ+1 - that was on the second server; on this one it was RAID1.

    Again, thanks for help, highly appreciated!

    I am now doing a plain Debian Stretch base install with RAID1 and LVM, with Proxmox on top from the repository. I'd like to find out if it works without ZFS on that hardware.
     
  18. t.lamprecht

    t.lamprecht Proxmox Staff Member
    Staff Member

    Joined:
    Jul 28, 2015
    Messages:
    1,135
    Likes Received:
    147
    OK, then it makes sense again, thanks for clarifying.

    Hmm, at least this:
    Code:
    pve# zfs create -V 1.5T toms-big-pool/freeze-test
    # write at least 1.5 TiB
    pve# dd if=/dev/urandom status=progress of=/dev/toms-big-pool/freeze-test bs=1M count=2M
    1649213964288 bytes (1.6 TB, 1.5 TiB) copied, 13731 s, 120 MB/s           
    dd: error writing '/dev/toms-big-pool/freeze-test': No space left on device
    1572865+0 records in
    1572864+0 records out
    1649267441664 bytes (1.6 TB, 1.5 TiB) copied, 13752.4 s, 120 MB/s
    
    went through without issues on 4TB spinning HDDs; the speed stayed quite stable.
    If you could, it would be great to run the same test as I did (see the code above), so we can check whether this can be reproduced on the host alone, without a VM in between.
     
    #18 t.lamprecht, Mar 31, 2019
    Last edited: Mar 31, 2019
  19. at24106

    at24106 New Member

    Joined:
    Jul 29, 2018
    Messages:
    10
    Likes Received:
    2
    Dear,

    This command causes the same freeze on the host.

    In the meantime I migrated to LVM/LVM-thin with an SSD cache on the same hardware, and everything has been working fine for 3 weeks (zero issues).
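
    For illustration only, a generic lvmcache sketch of putting an SSD cache in front of an LVM volume (the volume group "pve", the LV names and the device names are assumptions, not the poster's exact layout):
    Code:
    # create the cache data and metadata LVs on the SSD
    lvcreate -n cache0 -L 100G pve /dev/nvme0n1p1
    lvcreate -n cache0meta -L 1G pve /dev/nvme0n1p1
    # combine them into a cache pool
    lvconvert --type cache-pool --poolmetadata pve/cache0meta pve/cache0
    # attach the cache pool to the (slow) data LV
    lvconvert --type cache --cachepool pve/cache0 pve/data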


    The hardware is as follows - maybe another person with the same problem has the same hardware components. It is the following root server from Hetzner.de: https://www.hetzner.de/dedicated-rootserver/ex42 (please do not take this as advertisement).

    Intel® Core™ i7-6700 Quad-Core
    64 GB DDR4 RAM
    2 x 4 TB SATA Enterprise Hard Drive 7200 rpm
    00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
    00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
    00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
    00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
    00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
    00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
    00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
    00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
    00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
    00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)

    DMI: FUJITSU /D3401-H2, BIOS V5.0.0.12 R1.19.0.SR.1 for D3401-H2x 08/24/2018

    The dmesg output is in the attached file.
     

    Attached Files:
