ZFS Bug / INFO: task txg_sync:1943 blocked for more than 120 seconds.

stuartbh

ProxMox forum members,

I recently have started to see this in my kern.log when trying to copy a large (4TB) file between two USB devices both formatted with ZFS under the latest ProxMox (fully updated). The rsync is running from the command line on the ProxMox node, not in a VM. This error has occurred several times during the course of the copy (the copy is still ongoing).

2023-09-10T14:29:59.434665-07:00 pve4-x3650-m3 kernel: [ 5926.088403] INFO: task txg_sync:1943 blocked for more than 120 seconds.
2023-09-10T14:29:59.434692-07:00 pve4-x3650-m3 kernel: [ 5926.095979] Tainted: P O 6.2.16-12-pve #1
2023-09-10T14:29:59.449081-07:00 pve4-x3650-m3 kernel: [ 5926.102588] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-09-10T14:29:59.449088-07:00 pve4-x3650-m3 kernel: [ 5926.111093] task:txg_sync state:D stack:0 pid:1943 ppid:2 flags:0x00004000
2023-09-10T14:29:59.458889-07:00 pve4-x3650-m3 kernel: [ 5926.120221] Call Trace:
2023-09-10T14:29:59.464104-07:00 pve4-x3650-m3 kernel: [ 5926.123358] <TASK>
2023-09-10T14:29:59.464110-07:00 pve4-x3650-m3 kernel: [ 5926.126126] __schedule+0x402/0x1510
2023-09-10T14:29:59.468379-07:00 pve4-x3650-m3 kernel: [ 5926.130400] schedule+0x63/0x110
2023-09-10T14:29:59.472638-07:00 pve4-x3650-m3 kernel: [ 5926.134648] schedule_timeout+0x95/0x170
2023-09-10T14:29:59.482274-07:00 pve4-x3650-m3 kernel: [ 5926.139250] ? __pfx_process_timeout+0x10/0x10
2023-09-10T14:29:59.482282-07:00 pve4-x3650-m3 kernel: [ 5926.144341] io_schedule_timeout+0x51/0x80
2023-09-10T14:29:59.487087-07:00 pve4-x3650-m3 kernel: [ 5926.149104] __cv_timedwait_common+0x140/0x180 [spl]
2023-09-10T14:29:59.492723-07:00 pve4-x3650-m3 kernel: [ 5926.154735] ? __pfx_autoremove_wake_function+0x10/0x10
2023-09-10T14:29:59.498606-07:00 pve4-x3650-m3 kernel: [ 5926.160609] __cv_timedwait_io+0x19/0x30 [spl]
2023-09-10T14:29:59.508403-07:00 pve4-x3650-m3 kernel: [ 5926.165702] zio_wait+0x13d/0x300 [zfs]
2023-09-10T14:29:59.513296-07:00 pve4-x3650-m3 kernel: [ 5926.170416] dsl_pool_sync+0xce/0x4e0 [zfs]
2023-09-10T14:29:59.518053-07:00 pve4-x3650-m3 kernel: [ 5926.175358] spa_sync+0x560/0x1010 [zfs]
2023-09-10T14:29:59.518944-07:00 pve4-x3650-m3 kernel: [ 5926.180117] ? spa_txg_history_init_io+0x120/0x130 [zfs]
2023-09-10T14:29:59.529539-07:00 pve4-x3650-m3 kernel: [ 5926.186322] txg_sync_thread+0x21a/0x3e0 [zfs]
2023-09-10T14:29:59.530341-07:00 pve4-x3650-m3 kernel: [ 5926.191554] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
2023-09-10T14:29:59.541558-07:00 pve4-x3650-m3 kernel: [ 5926.197325] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
2023-09-10T14:29:59.541565-07:00 pve4-x3650-m3 kernel: [ 5926.203567] thread_generic_wrapper+0x5c/0x70 [spl]
2023-09-10T14:29:59.550892-07:00 pve4-x3650-m3 kernel: [ 5926.209114] kthread+0xe6/0x110
2023-09-10T14:29:59.550898-07:00 pve4-x3650-m3 kernel: [ 5926.212894] ? __pfx_kthread+0x10/0x10
2023-09-10T14:29:59.559422-07:00 pve4-x3650-m3 kernel: [ 5926.217286] ret_from_fork+0x29/0x50
2023-09-10T14:29:59.559430-07:00 pve4-x3650-m3 kernel: [ 5926.221488] </TASK>
root@pve4-x3650-m3:/var/log#

Any ideas from anyone?

Thanks in advance as always to anyone putting time on the concern at issue herein.

Stuart
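Since the message recurred several times during the copy, it may help to count how often each task was reported. The following is a minimal sketch, assuming the message format shown in the log above and task names without spaces; `hung_task_summary` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Summarise hung-task reports in a kernel log, one line per task name.
# Assumes the message format seen in the log above:
#   "INFO: task <name>:<pid> blocked for more than <N> seconds."
hung_task_summary() {
    # $1: path to a kernel log file (for example /var/log/kern.log)
    grep -o 'INFO: task [^ ]* blocked for more than [0-9]* seconds' "$1" |
        sed 's/^INFO: task \(.*\):[0-9]* blocked.*$/\1/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running `hung_task_summary /var/log/kern.log` prints each task name followed by the number of reports, which shows whether only txg_sync is affected or other tasks stall as well.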
 
And is there enough memory? High IO combined with running out of RAM has caused this in the past.
Or maybe the drives are just too slow, like SMR HDDs that can't write the transaction group from RAM to disk in under 120 seconds when a lot of transactions are piling up?
 

Perhaps, but I doubt that is the overarching issue in the present case before us.

Here follows the output of the free command:
Code:
$ free -h --giga --committed
               total        used        free      shared  buff/cache   available
Mem:            177G        109G         68G         56M        678M         67G
Swap:           7.9G          0B        7.9G
Comm:            96G        2.2G         94G

Would an SSD ZIL disk help?
Can I offer ZFS more than 120 seconds to write the transactions?

The drives are SAS drives all 10,000RPM.

These drives are connected via an IBM 1015 controller (not in IT mode, but going to be cross-flashed soon).

Code:
$ for dasd in $(ls /dev/sd[a-h]); do sudo smartctl -i ${dasd} | grep -i 'rotation'; done
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm
Rotation Rate:        10025 rpm

One USB drive is 5400 RPM.
The other USB drives sit in an enclosure as a 4-disk raid-z and are 7200 or 10,000 RPM (though connected via USB).

Both USB connections are USB2.

Stuart
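To put the USB2 link in perspective, here is a rough estimate of how long a 4TB copy takes at a given sustained rate. This is only a sketch: USB 2.0 signals at 480 Mbit/s (60 MB/s), and the 35 MB/s figure below is an assumed real-world rate, not something measured on this system. `copy_hours` is a hypothetical helper.

```shell
#!/bin/sh
# Estimate hours needed to copy SIZE_GB gigabytes at RATE_MBS megabytes/second.
# Uses decimal units (1 GB = 1000 MB) and rounds up to whole hours.
copy_hours() {
    size_gb=$1
    rate_mbs=$2
    seconds=$(( size_gb * 1000 / rate_mbs ))
    echo $(( (seconds + 3599) / 3600 ))
}

copy_hours 4000 35   # 4 TB at an assumed 35 MB/s: prints 32
copy_hours 4000 60   # 4 TB at the USB 2.0 signalling rate: prints 19
```

Even in the best case the copy runs for most of a day, so the pool is under sustained write pressure for a long time.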
 
Would an SSD ZIL disk help?
I don't think so. By default ZFS collects writes in RAM and then flushes them from RAM to the disks every 5 seconds. That flush is the txg_sync, which is either stuck (maybe the disk or USB link is not working properly) or just too slow.
The drives are SAS drives all 10,000RPM.
You said you are copying from USB to USB. So there are 10K RPM SAS disks attached to some kind of USB enclosure?
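The 5-second interval mentioned above corresponds to the OpenZFS module parameter zfs_txg_timeout, which on Linux is exposed under /sys/module/zfs/parameters. A small sketch for reading it follows; `zfs_param` is a hypothetical helper, and the optional second argument exists only so the function can be pointed at another directory.

```shell
#!/bin/sh
# Print a ZFS module parameter, or "unavailable" if it cannot be read
# (for example when the zfs module is not loaded).
# $1: parameter name, $2: optional parameter directory override
zfs_param() {
    dir=${2:-/sys/module/zfs/parameters}
    if [ -r "$dir/$1" ]; then
        cat "$dir/$1"
    else
        echo "unavailable"
    fi
}

zfs_param zfs_txg_timeout
```

On a default installation this should report 5. Note that this is only the interval at which a transaction group is closed; it does not limit how long the sync itself takes on slow disks.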
 
Dunuin, et alia:

I misspoke. The 10,000RPM drives are a pool that is connected internally via an IBM/LSI SAS controller. There are also two USB devices: one is a Western Digital 8TB drive and the other is a 4-drive enclosure (OWC) with four 8TB Seagate drives in it. I believe that the WD and the Seagate drives are all 7200RPM drives.

Some of the files are being copied from the 10,000RPM pool and some from the Seagate pool all to the WD 8TB drive.

I also updated the original post just preceding your last reply as well.

Stuart
 
And did you check that these WD/Seagate disks are actually using CMR and not SMR? Because Seagate is selling 8TB SMR disks.
 
Dunuin,

As far as I can tell:

Both of these models are 5400RPM I realize now.
The 8TB Seagate drives (model ST8000DM004) are SMR.
The WDC WD80EZZX-11CSGA0 is CMR.

Stuart
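For checking several disks at once, a model string can be compared against a short list of known SMR models. This is a sketch only: the list holds just the two Seagate models named in this thread, so "not listed" does not prove a drive is CMR. `is_known_smr` and `smr_label` are hypothetical helper names.

```shell
#!/bin/sh
# Models reported as SMR in this thread; hand-maintained and incomplete.
KNOWN_SMR_MODELS="ST8000DM004 ST4000DM004"

# Succeeds if the model string starts with one of the listed model numbers.
is_known_smr() {
    for m in $KNOWN_SMR_MODELS; do
        case $1 in
            "$m"*) return 0 ;;
        esac
    done
    return 1
}

# Print a label for a model string.
smr_label() {
    if is_known_smr "$1"; then
        echo "SMR (listed)"
    else
        echo "not listed"
    fi
}

smr_label "ST8000DM004-2CX188"   # prints: SMR (listed)
smr_label "WD80EZZX-11CSGA0"     # prints: not listed
```

The model string itself can be taken from the `smartctl -i` output used earlier in the thread; drives behind a USB bridge may need an extra device-type option for smartctl to reach them.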
 
SMR HDDs have terrible write performance once their caches are full. Similar to what you see with QLC SSDs...but even worse.
SMR HDDs shouldn't be used with Copy-on-Write filesystems like ZFS.
So this could be problematic.
I, for example, have an ST4000DM004, and that one is so slow that the whole machine stops working for hours when writing more than a few GBs at once, with the average access time measured in minutes instead of milliseconds...
At least with my ST4000DM004 it is not uncommon for a write operation to take more than 120 seconds, so ZFS thinks the disk is dead because it is so busy that it can't answer in time.
 

Since I am using these for backups / archives, perhaps there is a way to tell ZFS to increase the 120 second write time as I am not concerned about how long a write takes. Alternatively, I could try to convert the drives to use btrfs I suppose.

Stuart
 
Since I am using these for backups / archives,
Yes, but if the 8TB model is as slow as my 4TB model of the same series, it especially won't work for backups, where you usually write a lot of data at a time.
perhaps there is a way to tell ZFS to increase the 120 second write time as I am not concerned about how long a write takes.
I don't know of such an option. Maybe you could set "sync=always" so that no async writes are used that could pile up in RAM when the disks are too slow to actually write them down.
I could try to convert the drives to use btrfs I suppose.
Maybe not that much overhead, but still Copy-on-Write. I'm not into btrfs, but I would guess they also don't recommend using SMR HDDs.
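On the 120 seconds: going by the log in the first post, that figure is the kernel's hung-task warning threshold (/proc/sys/kernel/hung_task_timeout_secs) rather than a ZFS write deadline. A minimal sketch for reading it follows; `hung_task_timeout` is a hypothetical helper, and its optional argument only exists to point it at a different file.

```shell
#!/bin/sh
# Print the hung-task warning threshold in seconds, or "unavailable"
# if the file cannot be read (the kernel may lack hung-task detection).
# $1: optional path override
hung_task_timeout() {
    f=${1:-/proc/sys/kernel/hung_task_timeout_secs}
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unavailable"
    fi
}

hung_task_timeout
# To raise the threshold until the next reboot (as root), for example to 600 s:
#   echo 600 > /proc/sys/kernel/hung_task_timeout_secs
```

Raising the value only changes when the warning is logged. It does not make the sync finish any sooner, so it hides the symptom rather than fixing the slow writes.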
 
Dunuin, et alia:

I am still researching it, but it does seem that BTRFS now has code to properly deal with SMR drives. I have to admit, it is going to be a royal pain to move all my data around to reformat my 4x8TB array from ZFS to BTRFS (if that is the solution I end up using), but I suppose it could be worth the aggravation in the end.

It seems that the drives I am using (Seagate ST8000DM004-2CX188) are drive-managed. I am not yet sure what impact their being drive-managed has on ZFS or BTRFS.

I found this documentation rather instructive too.
https://zonedstorage.io/docs/getting-started/smr-disk
https://btrfs.readthedocs.io/en/latest/Zoned-mode.html
https://youtu.be/eWZD6OwvJ1M?si=8xBHgmUJxfAqH_Xa

Interesting disquisition on the issue I found thus far:

https://www.youtube.com/watch?v=6s7BJrT00pg

Stuart
 