ZFS issues after installing AMD EPYC 7532

Jsingh

Well-Known Member
Oct 23, 2018
I was running the following config:

AMD EPYC 7351P
128 GB RAM (2666 MHz), 8 × 16 GB
ASRock Rack EPYCD8-2T
AMD FirePro S7150 x2
Adaptec 71605 in HBA mode, running 15 × 4 TB Seagate Exos 7E8 HDDs
ZFS with an Intel 750 Series SSD as SLOG device
Proxmox 7.2

Recently I decided to upgrade the processor to the 32-core AMD EPYC 7532. I installed the processor, but initially the system showed a blank screen with BIOS code b2, even though the IPMI was detecting the processor, all 8 sticks of RAM, and the PCIe cards. I then reseated the processor and reset the BIOS.

I was still getting stuck at code b2, so I removed the graphics card and the system booted up. It also booted fine with an NVIDIA P40. I then reinstalled the AMD FirePro S7150 x2, set the OpROM mode to legacy, and the system booted up.

Initially I got errors reporting that the system was out of memory and that ZFS had stopped rebuilding the L2ARC. Removing my cache and log devices cleared that error.
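(For anyone following along: removing them was just the standard commands. A rough sketch, with 'tank' standing in for my pool name and a placeholder device ID:)

# check the layout and get the exact device names first
zpool status tank
# detach the L2ARC (cache) and SLOG (log) devices
zpool remove tank nvme-INTEL_SSDPE2MW400G4_PLACEHOLDER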

With only one or two Windows VMs running and a practically empty ZFS pool, I am getting major I/O delays, random stuttering, and ZFS "blocked" messages in the kernel log.

P.S. I am still getting the errors even after replacing the AMD S7150 x2 with a 1070 Ti.

Dec 12 01:03:30 hoserver kernel: INFO: task txg_sync:1544 blocked for more than 241 seconds.
Dec 12 01:04:50 hoserver kernel: Tainted: P O 5.15.74-1-pve #1
Dec 12 01:04:50 hoserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 01:04:50 hoserver kernel: task:txg_sync state:D stack: 0 pid: 1544 ppid: 2 flags:0x00004000
Dec 12 01:04:50 hoserver kernel: Call Trace:
Dec 12 01:04:50 hoserver kernel: <TASK>
Dec 12 01:04:50 hoserver kernel: __schedule+0x34e/0x1740
Dec 12 01:04:50 hoserver kernel: ? lock_timer_base+0x3b/0xd0
Dec 12 01:04:50 hoserver kernel: ? __mod_timer+0x271/0x440
Dec 12 01:04:50 hoserver kernel: schedule+0x69/0x110
Dec 12 01:04:50 hoserver kernel: schedule_timeout+0x87/0x140
Dec 12 01:04:50 hoserver kernel: ? __bpf_trace_tick_stop+0x20/0x20
Dec 12 01:04:50 hoserver kernel: io_schedule_timeout+0x51/0x80
Dec 12 01:04:50 hoserver kernel: __cv_timedwait_common+0x135/0x170 [spl]
Dec 12 01:04:50 hoserver kernel: ? wait_woken+0x70/0x70
Dec 12 01:04:50 hoserver kernel: __cv_timedwait_io+0x19/0x20 [spl]
Dec 12 01:04:50 hoserver kernel: zio_wait+0x137/0x300 [zfs]
Dec 12 01:04:50 hoserver kernel: ? __cond_resched+0x1a/0x50
Dec 12 01:04:50 hoserver kernel: dsl_pool_sync+0xcc/0x4f0 [zfs]
Dec 12 01:04:50 hoserver kernel: ? spa_suspend_async_destroy+0x60/0x60 [zfs]
Dec 12 01:04:50 hoserver kernel: ? add_timer+0x20/0x30
Dec 12 01:04:50 hoserver kernel: spa_sync+0x55a/0x1020 [zfs]
Dec 12 01:04:50 hoserver kernel: ? spa_txg_history_init_io+0x10a/0x120 [zfs]
Dec 12 01:04:50 hoserver kernel: txg_sync_thread+0x278/0x400 [zfs]
Dec 12 01:04:50 hoserver kernel: ? txg_init+0x2c0/0x2c0 [zfs]
Dec 12 01:04:50 hoserver kernel: thread_generic_wrapper+0x64/0x80 [spl]
Dec 12 01:04:50 hoserver kernel: ? __thread_exit+0x20/0x20 [spl]
Dec 12 01:04:50 hoserver kernel: kthread+0x12a/0x150
Dec 12 01:04:50 hoserver kernel: ? set_kthread_struct+0x50/0x50
Dec 12 01:04:50 hoserver kernel: ret_from_fork+0x22/0x30
Dec 12 01:04:50 hoserver kernel: </TASK>
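While those messages pile up, the stalls can be watched live. A minimal sketch ('tank' is again a placeholder pool name; iostat needs the sysstat package):

# per-vdev latency/throughput, refreshed every second
zpool iostat -v tank 1
# block-layer view of the same disks
iostat -x 1
# how many hung-task reports so far
dmesg | grep -c 'blocked for more than'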
 
Last edited:
Yeah, I was on BIOS 2.7, the custom build with fan control that I had requested. I also tried 2.6 and the 2.75 beta with fan control from ASRock's website.
Nothing helped.
I was not using PCIe passthrough (though that was the plan); I was using SR-IOV on the S7150 x2.

I think updating the kernel to 5.19 might fix it; I had hit a similar, though less severe, error on my NAS.
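If I remember right, 5.19 is available as an opt-in kernel on Proxmox 7; the package name below is from memory, so treat it as an assumption:

# opt-in newer kernel on Proxmox VE 7.x, then reboot into it
apt update
apt install pve-kernel-5.19
reboot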

While I was installing the kernel, the ZFS pool crashed. The kernel installation itself went fine, but I can't see half of my drives in /dev anymore and the zpool is gone. Is there any way I can recover it?
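The first checks I can think of look roughly like this ('tank' is a placeholder for my pool name):

# do the disks still enumerate at all?
ls -l /dev/disk/by-id/
# scan for importable pools without actually importing anything
zpool import -d /dev/disk/by-id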
 
Last edited:
While I was installing the kernel, the ZFS pool crashed. The kernel installation itself went fine, but I can't see half of my drives in /dev anymore and the zpool is gone. Is there any way I can recover it?

What does it look like if you boot with the/an older kernel again [1]?
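A minimal sketch of the pinning described in [1] (the version string is just an example):

# list installed kernels, then pin one for subsequent boots
proxmox-boot-tool kernel list
proxmox-boot-tool kernel pin 5.13.19-6-pve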

But to me the whole history starting with the CPU exchange sounds suspicious. Did you get the CPU new from a well-known, reputable shop, or was it used?
I would check the hardware and its stability first, maybe with some other (live) Linux ISOs or even some Windows, doing several (long) stress tests of all of the components (see the sketch below).
Maybe swap the cards in the PCIe slots around, especially the HBA.
Check whether the ~10-year-old HBA still has good Linux (driver) support, or whether any problems have become known lately.
You already did a full clear-CMOS, yes?
Maybe even put the old CPU back in again to counter-check.
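For the stress tests, something like stress-ng would do; a sketch only, with arbitrary load sizes and duration:

# several hours of combined CPU + memory load on all cores
stress-ng --cpu 0 --vm 4 --vm-bytes 80% --timeout 4h --metrics-brief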

I am afraid I cannot really help further, sorry. :(

[1] https://pve.proxmox.com/wiki/Host_Bootloader#sysboot_kernel_pin
 
It's a used CPU, but a new pull. I was thinking the CPU might have issues, which I am testing. But I have another platform based on a Ryzen 3900X with an X470D4U motherboard; it does not have an HBA. That platform I know for a fact works well, with no issues.

I created a simple pool of two HDDs and faced similar issues on that box as well. Otherwise it was rock solid when not using the pool.
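The test pool was nothing fancy; roughly like this (mirror assumed, device IDs are placeholders):

# plain two-disk mirror with 4K-sector alignment
zpool create -o ashift=12 testpool mirror \
    /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2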

This seems to suggest it is a ZFS or kernel problem.
I will try booting into older kernels, 5.13 or earlier, and check. I booted my Ryzen 3900X box with the 5.13 kernel and ZFS reported no issues.
 
Last edited:
Not completely. The issue seems to have disappeared on my Ryzen-based systems.
On my EPYC-based system, the pool behind the HBA seems to have started working fine after I downgraded to an older BIOS version.
However, the EPYC system still has problems on the rpool (a mirror of SATA SSDs) with any file larger than 5 GB.
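Anything that writes past the ~5 GB mark triggers it; a rough way to reproduce (dataset path and sizes are examples):

# write an 8 GB stream onto the rpool and fsync at the end
fio --name=bigwrite --directory=/rpool/data --rw=write \
    --bs=1M --size=8G --ioengine=psync --end_fsync=1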

I am still running tests, though.
 
Last edited: