kernel-4.15.18-8 ZFS freeze

aa007

Hi,

I upgraded the kernel to this version just yesterday, and today our hypervisor showed a kernel panic, with journald complaining that it can't write anything. The VMs were still running, but whenever I tried to write anything to disk, it froze. After about 10 more minutes the VMs stopped responding.
After resetting, everything went back to normal.

I don't have much more information, but I saw that this version includes some patches regarding ZFS. So for now I am downgrading to 4.15.18-7.
 
Please post details about your hardware (e.g. the storage controller); this may help with debugging.
 
It's a Fujitsu PRIMERGY RX2530 M1 with PRAID EP400i - 8 Seagate ST900MM0018 drives put in JBOD mode so Proxmox sees all the drives - one of them is a hot spare.
We have attached two M.2 SSDs (Samsung 970 PRO 512GB and Samsung 860 EVO M.2 250GB) for ZIL / L2ARC using an I-TEC PCI-E 2x M.2 card. We were out of disk slots for attaching SSDs, so one is attached via PCIe and the other via SATA.
There are 2 partitions on each device (32GB/96GB). We have a mirror of the two 32GB partitions for ZIL, and we use the 96GB partition on the Samsung 970 for L2ARC.
 
Are you using the latest BIOS and Firmware on it? We've seen sporadic hangs with older firmwares on FTS devices, but could fix all of them with recent upgrades....
 
It happened again this morning, even with the older version of the kernel.
I then found that the BIOS was outdated; the other components are up to date. I have upgraded the BIOS to the latest version, and since a new kernel version, 4.15.18-9, was available, I have installed that as well.
Will report if it happens again.
 
So unfortunately it has happened again. This time we managed to get the stack trace:
Code:
[223849.690311] kernel BUG at mm/slub.c:296!
[223849.690345] invalid opcode: 0000 [#1] SMP PTI
[223849.690368] Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_physdev xt_comment xt_tcpudp xt_set xt_addrtype xt_conntrack xt_mark ip_set_hash_net ip_set xt_multiport iptable_filter openvswitch nsh nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack libcrc32c softdog nfnetlink_log nfnetlink intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif mgag200 ttm irqbypass crct10dif_pclmul drm_kms_helper crc32_pclmul ghash_clmulni_intel pcbc snd_pcm drm aesni_intel snd_timer aes_x86_64 crypto_simd snd i2c_algo_bit glue_helper cryptd fb_sys_fops syscopyarea sysfillrect soundcore mei_me intel_cstate joydev input_leds sysimgblt
[223849.690618]  intel_rapl_perf ipmi_si pcspkr mei lpc_ich ipmi_devintf ipmi_msghandler shpchp wmi acpi_power_meter mac_hid vhost_net vhost tap ib_iser rdma_cm iw_cm ib_cm sunrpc ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) btrfs xor zstd_compress raid6_pq uas usb_storage hid_generic usbkbd usbmouse usbhid hid ahci libahci i2c_i801 ixgbe be2net igb(O) dca ptp pps_core mdio megaraid_sas
[223849.690758] CPU: 28 PID: 40604 Comm: z_wr_int_4 Tainted: P           O     4.15.18-9-pve #1
[223849.690781] Hardware name: FUJITSU PRIMERGY RX2530 M1/D3279-A1, BIOS V5.0.0.9 R1.36.0 for D3279-A1x                     06/06/2018
[223849.690816] RIP: 0010:__slab_free+0x1a2/0x330
[223849.690830] RSP: 0018:ffffb84c5c8bfa70 EFLAGS: 00010246
[223849.690847] RAX: ffff943781796f60 RBX: ffff943781796f60 RCX: 00000001002a0020
[223849.691793] RDX: ffff943781796f60 RSI: ffffda0c5705e580 RDI: ffff9441ff407600
[223849.692728] RBP: ffffb8

The rest is captured on a photo:
2jcudkh.jpg
 
- 8 Seagate ST900MM0018 drives put in JBOD mode so Proxmox
JBOD mode is still a Raid and not supported with ZFS.
ZFS has problems with transparent caches.
 
Dear, this also happens on non-JBOD setups.

There is an issue with ZFS on Proxmox because this does not happen outside Proxmox. There are several threads about this issue, where the kernel freezes under high disk load (120s messages). This seems to be an issue caused by the Proxmox kernel (probably timer should be 1000 instead of 250). An alternative workaround was to have separate ZFS pools for the OS and the VMs, which helps a little. Some people set I/O limits on VMs, but this does not help for backups.
 
probably timer should be 1000 instead of 250

What? Where did you get that from? Kernel ticks shouldn't influence this at all...

There is an issue with ZFS on Proxmox because this does not happen outside Proxmox.

Which ZFS version, on which distro, with which kernel did you test (over a long time) where this does not happen? Maybe we can work out which difference there could solve this.

There are several threads about this issue, where the kernel freezes under high disk load (120s messages)
Yes, but "task hung for 120s" is a general error and can be the result of a lot of things... We and upstream ZFS have already fixed a lot of the causes of such hangs, and quite a few of those threads' issues are no longer valid.
 
This happened on Archlinux too, and the fix was to increase the timer to 1000. What worries me: the Proxmox config says 250, but when you measure it you get 100, and I don't know why. The easiest way would be to build a kernel with 1000 and retry. I have this issue on all my ZFS Proxmox installations - just make a dd to a VM image and after several 10s of GB it stops.

You see reports of this issue from ZFS x.7 to x.15 .. it mainly comes from ZFS not giving control of disk I/O back to the kernel, so it freezes. I did some debugging with a scheduled task that sets a sys value to unfreeze ZFS; that way you get control back when it freezes, but only briefly.


The 120s message comes 100% from ZFS .. I saw it block disk I/O completely, so that only programs already in memory can run. There are reports that when you create two ZFS pools, one for root and one for VMs, a VM can no longer block it (as it does now), but if you run backups and access root it still happens.
 
This happened on Archlinux too, and the fix was to increase the timer to 1000.

Got any links or sources on that? I couldn't find anything, and the Archlinux kernel is configured to tick at 300 Hz.
And yes, we're using 250 Hz in combination with "CONFIG_NO_HZ_IDLE" (which disables ticks on idle CPU cores). That is a good trade-off between timer accuracy and wake-ups per second, which are costly in both CPU and power terms (only the former really matters for PVE; the latter is more relevant for mobile devices).

We're using dynamic ticks now, so you already get anything between 100 and 1500, depending on needs, see:
https://elinux.org/Kernel_Timer_Systems#Dynamic_ticks
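
To see which tick rate a given kernel was actually built with, the kernel config file can be checked directly. The following is only a minimal sketch: `kernel_hz` is a made-up helper name, and the `/boot/config-$(uname -r)` path is an assumption about where the distribution installs the config. Note also that `getconf CLK_TCK` reports USER_HZ (the unit used for userspace-visible values), which is 100 on x86 Linux regardless of CONFIG_HZ. That may be one explanation for a "measured" value of 100, although the thread does not say how the value was measured.

```shell
# Print the tick rate the kernel was built with (CONFIG_HZ) from a config file.
# kernel_hz is a hypothetical helper name, not part of any existing tool.
kernel_hz() {
    # Match only the exact "CONFIG_HZ=<number>" line, not CONFIG_HZ_250=y etc.
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$1"
}

# Typical use on a Debian-based host (path is an assumption, see above):
#   kernel_hz "/boot/config-$(uname -r)"
#   getconf CLK_TCK    # USER_HZ as seen by userspace, not CONFIG_HZ
```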

it mainly comes from ZFS not giving control of disk I/O back to the kernel, so it freezes

Sounds like a bug in ZFS, which you could report to ZFS on Linux. Such a thing would be a general issue and would not be solved by increasing timer ticks. That might reduce the chance of hitting it, but that's never a solution for such bugs.

120s comes 100% from zfs ..

It can come from ZFS (I never stated that it couldn't) and does in your case and some others. But it can also come from bugs in certain NICs or their driver/firmware (you will find reports of that here too), or from anything else that blocks for longer than 120s, like doing I/O on a dead NFS mount. The ZFS hangs themselves can also come from different issues (that's what I tried to say in my last reply), some of them already solved...
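
Since the "blocked for more than 120 seconds" message can have several causes, it helps to see which tasks are reported as hung before blaming one subsystem. A minimal sketch, assuming the usual hung-task line format of the kernel log; `hung_tasks` is a made-up helper name:

```shell
# Read kernel log lines on stdin and print "<task> <pid> <seconds>" for every
# hung-task report. hung_tasks is a hypothetical helper name.
hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}

# Typical use on the affected host:
#   dmesg | hung_tasks
#   cat /proc/sys/kernel/hung_task_timeout_secs    # the configured timeout
```

If the listed tasks are ZFS threads, that points towards the pool. Tasks stuck on network or NFS I/O would point elsewhere.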

I have this issue on all my ZFS Proxmox installations - just make a dd to a VM image and after several 10s of GB it stops.

That's the issue, I cannot reproduce this:

Code:
# zfs create -V $((128 << 30)) toms-big-pool/foo                                      # creates a zvol of 128 GiB
# dd if=/dev/urandom of=/dev/toms-big-pool/foo bs=1M count=$((1 << 16))               # write random data (from urandom, so non-blocking), 64 GiB
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 543.355 s, 252 MB/s

Or do you do something else? Clear steps would be best, so that we can make sure there's no difference when we try to reproduce this.

What hardware do you have (the more details the better: disks, RAM, CPU, vendor, HW RAID or not, ...)?
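
For reference, the size expressions in the reproducer above can be checked with plain shell arithmetic. This is only an illustrative sketch with made-up helper names; it confirms that 65536 blocks of 1 MiB add up to the 68719476736 bytes reported by dd.

```shell
# Hypothetical helper names, for illustration only.
gib_to_bytes() { echo $(( $1 << 30 )); }          # N GiB -> bytes
mib_blocks_to_bytes() { echo $(( $1 << 20 )); }   # count of 1 MiB blocks -> bytes

gib_to_bytes 128                      # zvol size for "zfs create -V": 137438953472
mib_blocks_to_bytes $(( 1 << 16 ))    # data written by dd with bs=1M: 68719476736
```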
 
Hello,

here is how to reproduce this:

Standard Proxmox ZFS root install on a server from iso (Hetzner EX42, 64 GB RAM, 4 Cores)
a) 4TB
b) 4TB
Enterprise SATA with ZFS RAID Z+1

After the install, add a 100 GB SSD partition as read cache and a 100 GB SSD partition as log device; the problem also happens without this step. No dedup, no compression.

Create a VM with a 1,5 TB zfs-disk with virtio
Boot the VM with any Linux rescue system (e.g. Sysrescuecd)
Inside the VM, stream a partition (here 1.5 TB in size) from a remote KVM server to the local disk (which is a ZFS disk image):
ssh root@remoteserver.com "dd if=/dev/vg1/partiton bs=4M | gzip -1 -" | gunzip -1 - | pv -s 1500G | dd of=/dev/sda bs=4M

It starts at 60 MByte/s ... after about 10-20 minutes it slows down to 200 KB/s - 2 MByte/s, and later it freezes completely with no CPU load.
 
It seems that there is a mitigation .. when virtio is not activated, the disk speed drops by about 60% and the freeze does not occur.
 
Hello,

the mitigation without virtio only works for lower load, as above. If the load is increased, it still freezes with SATA. See this example without gzip:
ssh root@remoteserver.com "dd if=/dev/vg1/partiton bs=4M" | pv -s 1500G | dd of=/dev/sda bs=4M
 
Hello,

I've got the extreme case now: after a reboot the system does not come up because ZFS blocks the kernel (120s message) ...

ZFS on Proxmox is dead for me .. far too unstable for production.
 
Enterprise SATA with ZFS RAID Z+1

What, are you really running a raidz1 with only two disks? That makes no sense, and our installer doesn't even allow it... With two disks, only mirror mode (RAID1) is supported and makes sense, at least if one wants a production system and not an accident waiting to happen. So how did you even install Proxmox VE to get such a setup? Can you please post the output of:
Code:
pve# zpool status
pve# pveversion -v

Create a VM with a 1,5 TB zfs-disk with virtio

virtio what: SCSI on VirtIO-SCSI, or VirtIO Block? I'd really suggest SCSI on the virtio-scsi bus, as VirtIO Block not only lacks some features, it's also a bit older and hasn't been worked on as much for quite a while.

Your reproducers are quite complicated. Can you reduce this to something host-only, removing VMs from the equation completely?

Code:
pve# zfs create -V 1.5T POOLNAME/freeze-test
pve# dd if=/dev/urandom status=progress of=/dev/POOLNAME/freeze-test bs=1M count=1G

(FYI: the status=progress option of newer dd versions makes tools like "pv" unnecessary; it reports directly how much it wrote and how fast)

I've got the extreme case now: after a reboot the system does not come up because ZFS blocks the kernel (120s message) ...
ZFS on Proxmox is dead for me .. far too unstable for production.

If you really have a RAIDZ1 setup, I'd strongly suggest retrying with a RAID1 setup, or adding a third disk so that RAIDZ1 starts to make sense.
We internally, and thousands of our users, run ZFS on PVE successfully in production. There surely are issues (as with every storage tech on some setups), but let's not generalize from a few issues.
 
Dear,

sorry, it was not RAIDZ+1 .. that was on the second server; on this one it was RAID1.

Again, thanks for the help, highly appreciated!

I'm just installing a Debian Stretch base with RAID1 and LVM, with Proxmox from the repository on top. I'd like to find out whether it works without ZFS on that hardware.
 
sorry, it was not RAIDZ+1 .. that was on the second server; on this one it was RAID1.

OK, then it makes sense again, thanks for clarifying.

Hmm, at least this:
Code:
pve# zfs create -V 1.5T toms-big-pool/freeze-test
# write at least 1.5 TiB
pve# dd if=/dev/urandom status=progress of=/dev/toms-big-pool/freeze-test bs=1M count=2M
1649213964288 bytes (1.6 TB, 1.5 TiB) copied, 13731 s, 120 MB/s           
dd: error writing '/dev/toms-big-pool/freeze-test': No space left on device
1572865+0 records in
1572864+0 records out
1649267441664 bytes (1.6 TB, 1.5 TiB) copied, 13752.4 s, 120 MB/s

went through without issues on 4TB spinning HDDs, speed stayed quite stable..
I'm just installing a Debian Stretch base with RAID1 and LVM, with Proxmox from the repository on top. I'd like to find out whether it works without ZFS on that hardware.
If you could, it would be great to run the same test as I did (see the code above), to see whether we can reproduce this on the host alone, without a VM in between.
 
Dear,

this command causes the same freeze on the host.

In the meantime I migrated to LVM/LVM thin with an SSD cache on the same hardware, and everything has worked fine for 3 weeks (0 issues).


The hardware is listed below - maybe someone else with the same problem has the same hardware components. It is the following root server from Hetzner.de : https://www.hetzner.de/dedicated-rootserver/ex42 (please do not see this as advertisement)

Intel® Core™ i7-6700 Quad-Core
64 GB DDR4 RAM
2 x 4 TB SATA Enterprise Hard Drive 7200 rpm
00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)

DMI: FUJITSU /D3401-H2, BIOS V5.0.0.12 R1.19.0.SR.1 for D3401-H2x 08/24/2018

dmesg in attached file
 


Hi,

I've now read this thread and, as I wrote in this post https://forum.proxmox.com/threads/proxmox-5-4-stops-to-work-zfs-issue.63849/#post-298631
I have the exact same problem (last time two days ago) and I don't know what to do anymore. I have updated the BIOS, Proxmox (with pve-subscription) and everything else to the latest 5.4 version:

pveversion -v:

Code:
proxmox-ve: 5.4-2 (running kernel: 4.15.18-26-pve)
pve-manager: 5.4-13 (running version: 5.4-13/aee6f0ec)
pve-kernel-4.15: 5.4-14
pve-kernel-4.15.18-26-pve: 4.15.18-54
pve-kernel-4.15.18-24-pve: 4.15.18-52
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-16-pve: 4.15.18-41
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.17-1-pve: 4.15.17-9
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-12
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-56
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-14
libpve-storage-perl: 5.0-44
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-7
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-28
pve-cluster: 5.0-38
pve-container: 2.0-41
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-22
pve-firmware: 2.0-7
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 3.0.1-4
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-55
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

zpool status:

Code:
  pool: HDD-Pool
 state: ONLINE
  scan: scrub repaired 0B in 22h26m with 0 errors on Sun Mar  8 22:50:54 2020
config:

    NAME                                               STATE     READ WRITE CKSUM
    HDD-Pool                                           ONLINE       0     0     0
      raidz2-0                                         ONLINE       0     0     0
        scsi-35000cca26ac15b54                         ONLINE       0     0     0
        scsi-35000cca26ad40cbc                         ONLINE       0     0     0
        scsi-35000cca26ac4cfdc                         ONLINE       0     0     0
        scsi-35000cca26abed87c                         ONLINE       0     0     0
        scsi-35000cca26ad81e74                         ONLINE       0     0     0
        scsi-35000cca26ace9470                         ONLINE       0     0     0
    logs
      nvme-eui.334842304b3049160025384100000004-part1  ONLINE       0     0     0
    cache
      nvme-eui.334842304b3049160025384100000004-part3  ONLINE       0     0     0

errors: No known data errors

  pool: SSD-Pool
 state: ONLINE
  scan: scrub repaired 0B in 0h34m with 0 errors on Sun Mar  8 00:58:36 2020
config:

    NAME                                               STATE     READ WRITE CKSUM
    SSD-Pool                                           ONLINE       0     0     0
      raidz2-0                                         ONLINE       0     0     0
        ata-SAMSUNG_MZ7KM480HMHQ-00005_S3F4NX0K604708  ONLINE       0     0     0
        ata-SAMSUNG_MZ7KM480HMHQ-00005_S3F4NX0K604709  ONLINE       0     0     0
        ata-SAMSUNG_MZ7KM480HMHQ-00005_S3F4NX0K604707  ONLINE       0     0     0
        ata-SAMSUNG_MZ7KM480HMHQ-00005_S3F4NX0K604677  ONLINE       0     0     0
        ata-SAMSUNG_MZ7KM480HMHQ-00005_S3F4NX0K604706  ONLINE       0     0     0
        ata-SAMSUNG_MZ7KM480HMHQ-00005_S3F4NX0K604672  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sun Mar  8 00:24:18 2020
config:

    NAME        STATE     READ WRITE CKSUM
    rpool       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdm2    ONLINE       0     0     0
        sdn2    ONLINE       0     0     0

errors: No known data errors
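
When comparing several hosts, the pool state shown above can also be checked in script form. A minimal sketch, assuming the tab-separated output of `zpool list -H -o name,health`; `unhealthy_pools` is a made-up helper name:

```shell
# Read "name<TAB>health" lines on stdin and print the names of all pools
# whose health is not ONLINE. unhealthy_pools is a hypothetical helper name.
unhealthy_pools() {
    awk -F'\t' '$2 != "ONLINE" { print $1 }'
}

# Typical use on the host:
#   zpool list -H -o name,health | unhealthy_pools
```

For the three pools listed here this would print nothing, since all of them report ONLINE.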

Continued...