ZFS mirror on 2x Crucial T705 (PCIe 5.0) causing txg_sync hangs under write load – no NVMe errors in dmesg

cpaglietti · Mar 1, 2026

Hi,
I’m running into repeatable ZFS I/O stalls on a Proxmox host and I’d like some technical feedback before I start swapping hardware.
Hardware

CPU: Ryzen 9 7900
Motherboard: ASUS Pro WS B850M-ACE SE (AM5)
RAM: 64GB DDR5 (non-ECC)
Storage: 2x Crucial T705 2TB (CT2000T705SSD3)
Firmware: PACR5111 (both drives)
Both NVMe drives running at PCIe 5.0 x4 (32GT/s confirmed via lspci)
Pool: ZFS mirror (rpool).
Software
Proxmox VE (latest kernel 6.17.x)
ZFS mirror on the two T705
Guest: Ubuntu VM with LVM inside

The Problem
Under heavy write load (e.g. vzdump backup, snapshot, large writes), the system eventually:

Load average spikes (~10+)
Multiple ZFS threads enter D state:
- txg_sync
- zvol_tq-*
- flush-zfs
Even unrelated processes end up blocked
SSH eventually drops
No NVMe reset or I/O error in dmesg
zpool status still shows ONLINE, no errors
Only recovery is full reboot (power cycle sometimes required)

Example of stuck processes:
D [txg_sync]
D [zvol_tq-0]
D [dbuf_evict]
D flush-zfs
D vzdump
No:

nvme timeout
controller reset
blk_update_request error

Observations

Both drives are PCIe Gen5 x4
No ASPM enabled in BIOS
No explicit NVMe power saving tuning
Scrub completes fine when idle
Issue appears only under sustained write / flush pressure
Happens even when backup target is local (so not network-related)

Interrupts still active on both NVMe devices.

Has anyone seen txg_sync hangs on Phison E26 (T705) under ZFS?
Would forcing PCIe Gen4 instead of Gen5 be a reasonable stability test?
Is this a known flush latency issue with consumer Gen5 NVMe?
Any ZFS tunables worth testing (before replacing hardware)?

I’m considering:

Forcing both slots to PCIe Gen4
Temporarily detaching one disk and testing single-device pool
Updating firmware (if newer than PACR5111 exists)

Any technical input appreciated.

ness1602 · Mar 1, 2026

Yeah i have 2x these nvmes are depending on case and ventilation,they can go up to 90c and restart.

cpaglietti · Mar 1, 2026

Update – possible root cause identified

After several freezes under ZFS load (snapshots and vzdump), I forced both NVMe slots from PCIe Gen5 (32GT/s) to Gen4 (16GT/s) in the BIOS.
Since downgrading to Gen4:

No more tasks stuck in D state
No more ZFS txg_sync stalls
SMART queries no longer hang
Backups complete successfully
System remains responsive under sustained write load

At this point, PCIe Gen5 link instability seems to have been the trigger (Ryzen 9 7900 + ASUS board + dual Crucial T705 Gen5 in ZFS mirror).
SMART shows no media errors, temperatures are normal, and ZFS reports no data corruption.
I will monitor the system for 48 hours under load before considering the issue definitively resolved.

_gabriel · Mar 1, 2026

Heat sinks present, isn't it ?

cpaglietti · Mar 1, 2026

Temp: 36° original asus heat sinks.

cpaglietti · Mar 2, 2026

Update:
I checked the ASPM policy:
cat /sys/module/pcie_aspm/parameters/policy
Result:
[default] performance powersave powersupersave

ASPM was active.
I then disabled it entirely via kernel parameter.

Edit GRUB

File: /etc/default/grub
Changed:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

Then:

update-grub
reboot

Verification after reboot:
cat /proc/cmdline
Output included:
pcie_aspm=off

After:

Forcing Gen4 in BIOS
Disabling ASPM with pcie_aspm=off

The system:

Completed heavy Proxmox backups
Ran overnight under load
No more txg_sync in D state
No freeze
No NVMe errors
No AER / PCIe errors in logs

Current Conclusion (provisional)

Forcing Gen4 alone did NOT fix the issue.

Disabling PCIe ASPM appears to have resolved it (so far).

spirit · Mar 2, 2026

(small reminder: don't use zfs on consumer ssd/nvme . they can't handle a lot of fsync because they don't have a PLP/powercapacitor), and zfs do a lot of sync. It's really like 200~1000 iops max with this kind of drive.

cpaglietti · Mar 4, 2026

SOLVED!

ASPM was active.
I then disabled it entirely via kernel parameter.

File: /etc/default/grub
Changed:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off"

cpaglietti · Mar 19, 2026

Afrter some week the problem came back !

IsThisThingOn · Mar 19, 2026

cpaglietti said:
Under heavy write load (e.g. vzdump backup, snapshot, large writes), the system eventually:

vzdump onto the same disk? Snapshot or large sequential writes should not be a heavy load.

cpaglietti said:
Would forcing PCIe Gen4 instead of Gen5 be a reasonable stability test?

I would say yes.

cpaglietti said:
Is this a known flush latency issue with consumer Gen5 NVMe?

No, but it would be far from the first time consumer SSDs have problems. Especially the ones that only buy Phison controllers and slap on a buggy firmware.

What @spirit was tryint to tell you is that while gamer drive like the 705 have huge sequential numbers, they don't necessarly have good 16k random sync write performance. So something like a Seagate IronWolf 525, while on paper only being able to read at 5GB/s, might for Proxmox be faster.

If you are interested: Or in other words: While on paper the 850 EVO is 5 times faster, in this workload the Intel S3510 is 37 times faster!

So I would get one server or at least Prosumer or NAS drive like the Seagate or Samsung 990 Pro or whatever and switch one of the drives for that.
It is in general a good idea to have different drives. We had many bugs in the past, from broken sync writes on Phison controllers, to Samsung shutdowns. If two SSDs suffer a bug in the same mirror, that is bad. If only one goes south, that is fine.

spirit · Mar 20, 2026

10year old ceph blog, also apply for zfs

https://ceph.io/en/news/blog/2014/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
https://www.sebastien-han.fr/blog/2...-if-your-ssd-is-suitable-as-a-journal-device/

don't expect any good performance with consumer ssd/nvme with zfs && ceph. you can look at the forum, the same question come back again && again with same problem. and th response is still the same: don't use consumer ssd/nvme without plp.

note that your are going to burn them fast too because of write over-amplification.

cpaglietti · Apr 3, 2026

Case resolved: the issue was the RAM. Both memory modules were failing (confirmed by MemTest). After replacing the memory, no further faults occurred.

ZFS mirror on 2x Crucial T705 (PCIe 5.0) causing txg_sync hangs under write load – no NVMe errors in dmesg

cpaglietti

New Member

ness1602

Famous Member

cpaglietti

New Member

_gabriel

Distinguished Member

cpaglietti

New Member

cpaglietti

New Member

Edit GRUB

Current Conclusion (provisional)

spirit

Distinguished Member

cpaglietti

New Member

cpaglietti

New Member

IsThisThingOn

Renowned Member

spirit

Distinguished Member

cpaglietti

New Member

We value your privacy

ZFS mirror on 2x Crucial T705 (PCIe 5.0) causing txg_sync hangs under write load – no NVMe errors in dmesg

New Member

Famous Member

New Member

Distinguished Member

New Member

New Member

Edit GRUB​

Current Conclusion (provisional)​

Distinguished Member

New Member

New Member

Renowned Member

Distinguished Member

New Member

We value your privacy

Edit GRUB

Current Conclusion (provisional)