ZFS issues after installing AMD EPYC 7532

Jsingh

Well-Known Member
Oct 23, 2018
I was running the following config:

AMD EPYC 7351P
128 GB RAM (2666 MHz), 8 × 16 GB
ASRock Rack EPYCD8-2T
AMD FirePro S7150 x2
Adaptec 71605 in HBA mode, running 15 × 4 TB Seagate Exos 7E8 HDDs
ZFS with an Intel 750 Series SSD as SLOG device
Proxmox 7.2

Recently I decided to upgrade the processor to the 32-core AMD EPYC 7532. I installed the processor, but initially the system showed a blank screen with BIOS code b2, even though the IPMI was detecting the processor, all 8 sticks of RAM, and the PCIe cards. I then reseated the processor and reset the BIOS.

I was still getting stuck at code b2, so I removed the graphics card and the system booted up. It also booted fine with an NVIDIA P40. I then reinstalled the AMD FirePro S7150 x2, set the OpROM mode to legacy, and the system booted up.

Initially I got errors reporting that the system was out of memory and that ZFS had stopped rebuilding the L2ARC. Removing my cache and log devices cleared that error.
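(For anyone following along: removing them was just the standard commands. A rough sketch, with 'tank' standing in for my pool name and a placeholder device ID:)

# check the layout and get the exact device names first
zpool status tank
# detach the L2ARC (cache) and SLOG (log) devices
zpool remove tank nvme-INTEL_SSDPE2MW400G4_PLACEHOLDER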

With only one or two Windows VMs running and a practically empty ZFS pool, I am getting major I/O delays, random stuttering, and ZFS "blocked" messages in the kernel log.

P.S. I am still getting the errors even after replacing the AMD S7150 x2 with a 1070 Ti.

Dec 12 01:03:30 hoserver kernel: INFO: task txg_sync:1544 blocked for more than 241 seconds.
Dec 12 01:04:50 hoserver kernel: Tainted: P O 5.15.74-1-pve #1
Dec 12 01:04:50 hoserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 01:04:50 hoserver kernel: task:txg_sync state:D stack: 0 pid: 1544 ppid: 2 flags:0x00004000
Dec 12 01:04:50 hoserver kernel: Call Trace:
Dec 12 01:04:50 hoserver kernel: <TASK>
Dec 12 01:04:50 hoserver kernel: __schedule+0x34e/0x1740
Dec 12 01:04:50 hoserver kernel: ? lock_timer_base+0x3b/0xd0
Dec 12 01:04:50 hoserver kernel: ? __mod_timer+0x271/0x440
Dec 12 01:04:50 hoserver kernel: schedule+0x69/0x110
Dec 12 01:04:50 hoserver kernel: schedule_timeout+0x87/0x140
Dec 12 01:04:50 hoserver kernel: ? __bpf_trace_tick_stop+0x20/0x20
Dec 12 01:04:50 hoserver kernel: io_schedule_timeout+0x51/0x80
Dec 12 01:04:50 hoserver kernel: __cv_timedwait_common+0x135/0x170 [spl]
Dec 12 01:04:50 hoserver kernel: ? wait_woken+0x70/0x70
Dec 12 01:04:50 hoserver kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 12 01:04:50 hoserver kernel: zio_wait+0x137/0x300 [zfs]
Dec 12 01:04:50 hoserver kernel: ? __cond_resched+0x1a/0x50
Dec 12 01:04:50 hoserver kernel: dsl_pool_sync+0xcc/0x4f0 [zfs]
Dec 12 01:04:50 hoserver kernel: ? spa_suspend_async_destroy+0x60/0x60 [zfs]
Dec 12 01:04:50 hoserver kernel: ? add_timer+0x20/0x30
Dec 12 01:04:50 hoserver kernel: spa_sync+0x55a/0x1020 [zfs]
Dec 12 01:04:50 hoserver kernel: ? spa_txg_history_init_io+0x10a/0x120 [zfs]
Dec 12 01:04:50 hoserver kernel: txg_sync_thread+0x278/0x400 [zfs]
Dec 12 01:04:50 hoserver kernel: ? txg_init+0x2c0/0x2c0 [zfs]
Dec 12 01:04:50 hoserver kernel: thread_generic_wrapper+0x64/0x80 [spl]
Dec 12 01:04:50 hoserver kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 12 01:04:50 hoserver kernel: kthread+0x12a/0x150
Dec 12 01:04:50 hoserver kernel: ? set_kthread_struct+0x50/0x50
Dec 12 01:04:50 hoserver kernel: ret_from_fork+0x22/0x30
Dec 12 01:04:50 hoserver kernel: </TASK>
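While those messages pile up, the stalls can be watched live. A minimal sketch ('tank' is again a placeholder pool name; iostat needs the sysstat package):

# per-vdev latency/throughput, refreshed every second
zpool iostat -v tank 1
# block-layer view of the same disks
iostat -x 1
# how many hung-task reports so far
dmesg | grep -c 'blocked for more than'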
 
Last edited:
Yeah, I was on BIOS 2.7, the custom build with fan control that I had requested. I also tried 2.6 and the 2.75 beta with fan control from ASRock's website.
Nothing helped.
I was not using PCIe passthrough (though that was the plan); I was using SR-IOV on the S7150 x2.

I think updating the kernel to 5.19 might fix it; I had hit a similar, though less severe, error on my NAS.
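If I remember right, 5.19 is available as an opt-in kernel on Proxmox 7; the package name below is from memory, so treat it as an assumption:

# opt-in newer kernel on Proxmox VE 7.x, then reboot into it
apt update
apt install pve-kernel-5.19
reboot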

While I was installing the kernel, the ZFS pool crashed. The kernel installation itself went fine, but I can't see half of my drives in /dev anymore and the zpool is gone. Is there any way I can recover it?
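The first checks I can think of look roughly like this ('tank' is a placeholder for my pool name):

# do the disks still enumerate at all?
ls -l /dev/disk/by-id/
# scan for importable pools without actually importing anything
zpool import -d /dev/disk/by-id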
 
Last edited:
While I was installing the kernel, the ZFS pool crashed. The kernel installation itself went fine, but I can't see half of my drives in /dev anymore and the zpool is gone. Is there any way I can recover it?

What does it look like if you boot with the/an older kernel again [1]?
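A minimal sketch of the pinning described in [1] (the version string is just an example):

# list installed kernels, then pin one for subsequent boots
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.13.19-6-pve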

But to me the whole history starting with the CPU exchange sounds suspicious. Did you get the CPU new from a well-known, reputable shop, or was it used?
I would check the hardware and its stability first, maybe with some other (live) Linux ISOs or even some Windows, doing several (long) stress tests of all of the components (see the sketch below).
Maybe swap the cards in the PCIe slots around, especially the HBA.
Check whether the ~10-year-old HBA still has good Linux (driver) support, or whether any problems have become known lately.
You already did a full clear-CMOS, yes?
Maybe even put the old CPU back in again to counter-check.
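For the stress tests, something like stress-ng would do; a sketch only, with arbitrary load sizes and duration:

# several hours of combined CPU + memory load on all cores
stress-ng --cpu 0 --vm 4 --vm-bytes 80% --timeout 4h --metrics-brief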

I am afraid I cannot really help further, sorry. :(

[1] https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_kernel_pin
 
It's a used CPU, but a new pull. I was thinking the CPU might have issues, which I am testing. But I have another platform based on a Ryzen 3900X with an X470D4U motherboard; it does not have an HBA. That platform I know for a fact works well, with no issues.

I created a simple pool of two HDDs and faced similar issues on that box as well. Otherwise it was rock solid when not using the pool.
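The test pool was nothing fancy; roughly like this (mirror assumed, device IDs are placeholders):

# plain two-disk mirror with 4K-sector alignment
zpool create -o ashift=12 testpool mirror \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2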

This seems to suggest it is a ZFS or kernel problem.
I will try booting into older kernels, 5.13 or earlier, and check. I booted my Ryzen 3900X box with the 5.13 kernel and ZFS reported no issues.
 
Last edited:
Not completely. The issue seems to have disappeared on my Ryzen-based systems.
On my EPYC-based system, the pool behind the HBA seems to have started working fine after I downgraded to an older BIOS version.
However, the EPYC system still has problems on the rpool (a mirror of SATA SSDs) with any file larger than 5 GB.
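Anything that writes past the ~5 GB mark triggers it; a rough way to reproduce (dataset path and sizes are examples):

# write an 8 GB stream onto the rpool and fsync at the end
fio --name=bigwrite --directory=/rpool/data --rw=write \
    --bs=1M --size=8G --ioengine=psync --end_fsync=1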

I am still running tests, though.
 
Last edited: