Kernel 6.8.x NFS server bug

Power2All

Hi,

I'm having trouble with NFS after upgrading.
It's literally crashing every time. I'm still going to try 6.8.8, but I might need to switch back to 6.5 if there is still no fix for this.

Code:
2024-09-11T19:22:31.190378+02:00 server2 kernel: [72134.823446] INFO: task nfsd:2056 blocked for more than 1228 seconds.
2024-09-11T19:22:31.190416+02:00 server2 kernel: [72134.823462] Tainted: P OE 6.8.12-1-pve #1
2024-09-11T19:22:31.190420+02:00 server2 kernel: [72134.823467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2024-09-11T19:22:31.190422+02:00 server2 kernel: [72134.823470] task:nfsd state:D stack:0 pid:2056 tgid:2056 ppid:2 flags:0x00004000
2024-09-11T19:22:31.190427+02:00 server2 kernel: [72134.823483] Call Trace:
2024-09-11T19:22:31.190430+02:00 server2 kernel: [72134.823488] <TASK>
2024-09-11T19:22:31.190433+02:00 server2 kernel: [72134.823510] __schedule+0x3ec/0x1520
2024-09-11T19:22:31.190436+02:00 server2 kernel: [72134.823551] schedule+0x33/0xf0
2024-09-11T19:22:31.190439+02:00 server2 kernel: [72134.823562] schedule_timeout+0x157/0x170
2024-09-11T19:22:31.190443+02:00 server2 kernel: [72134.823585] wait_for_completion+0x88/0x150
2024-09-11T19:22:31.190445+02:00 server2 kernel: [72134.823607] __flush_workqueue+0x131/0x3d0
2024-09-11T19:22:31.190448+02:00 server2 kernel: [72134.823634] ? nfsd4_run_cb+0x30/0x70 [nfsd]
2024-09-11T19:22:31.190451+02:00 server2 kernel: [72134.823836] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
2024-09-11T19:22:31.190454+02:00 server2 kernel: [72134.823966] nfsd4_destroy_session+0x18e/0x270 [nfsd]
2024-09-11T19:22:31.190457+02:00 server2 kernel: [72134.824110] nfsd4_proc_compound+0x3cb/0x730 [nfsd]
2024-09-11T19:22:31.190472+02:00 server2 kernel: [72134.824285] nfsd_dispatch+0x106/0x220 [nfsd]
2024-09-11T19:22:31.191332+02:00 server2 kernel: [72134.824427] svc_process_common+0x309/0x720 [sunrpc]
2024-09-11T19:22:31.191353+02:00 server2 kernel: [72134.824648] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
2024-09-11T19:22:31.191356+02:00 server2 kernel: [72134.824783] svc_process+0x132/0x1b0 [sunrpc]
2024-09-11T19:22:31.191359+02:00 server2 kernel: [72134.824945] svc_recv+0x828/0xa00 [sunrpc]
2024-09-11T19:22:31.191362+02:00 server2 kernel: [72134.825112] ? __pfx_nfsd+0x10/0x10 [nfsd]
2024-09-11T19:22:31.192286+02:00 server2 kernel: [72134.825260] nfsd+0x8b/0xf0 [nfsd]
2024-09-11T19:22:31.192303+02:00 server2 kernel: [72134.825431] kthread+0xf2/0x120
2024-09-11T19:22:31.192307+02:00 server2 kernel: [72134.825442] ? __pfx_kthread+0x10/0x10
2024-09-11T19:22:31.192309+02:00 server2 kernel: [72134.825455] ret_from_fork+0x47/0x70
2024-09-11T19:22:31.192311+02:00 server2 kernel: [72134.825464] ? __pfx_kthread+0x10/0x10
2024-09-11T19:22:31.192314+02:00 server2 kernel: [72134.825475] ret_from_fork_asm+0x1b/0x30
Linux server2 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64 GNU/Linux

Please look into this, as 6.8.13 is also affected.
 
Looks more like a problem with your host: we use NFS as part of the storage from one PVE node to the cluster with all the 6.* kernels without problems, so I'm a bit sorry for you. On the other hand, you can solve it yourself on the hw/sw side and aren't dependent on the next release to fix anything, which is the good part.
 
I'm afraid so too.
I'm going to try it without NFS, mounting the data through a bind mount in an LXC container (CT).
Hopefully I can get Docker working as expected in a Debian 12 container.
I think the HDD that the OS is running on is at fault, but I can't be certain, so I'll try a different approach, I guess.
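In case it helps someone else, this is roughly the bind-mount approach I mean; the CT ID 101 and the paths are just placeholders, adjust them to your own setup:

Code:
# bind-mount a host directory into the LXC container (CT 101 and paths are examples)
pct set 101 -mp0 /tank/docker-data,mp=/srv/docker-data
# equivalent line in /etc/pve/lxc/101.conf:
# mp0: /tank/docker-data,mp=/srv/docker-data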
 
Well, I've been looking into this, and it seems the OS HDD is the culprit right now: NFS stores (so it seems) status updates and caching data on it, and eventually it gets too slow, killing NFS. Once I replace this HDD with an SSD (or move the place it stores its status data to an SSD or /dev/shm), it shouldn't be a problem anymore.

From what I understand from other people, this is a hardware-related issue: when the system cannot keep up with the requests, it destroys NFS. However, this could still be handled better, by optionally moving the cache/status storage to a faster place and by returning an error instead of locking up NFS completely. I still see work being done on the NFS 4.1 stack, while I use NFS 4.2, so it would be great to have this handled better than locking up the whole NFS stack, which makes even rebooting the system very, very slow...
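For reference, by "moving the status data" I mean something like relocating the nfsd state directory; this sketch assumes the state lives under /var/lib/nfs (the Debian default) and that /ssd is an example SSD mount point:

Code:
# relocate the nfsd state directory (rmtab, v4 recovery data, ...) to faster storage
systemctl stop nfs-server
cp -a /var/lib/nfs /ssd/nfs-state
echo '/ssd/nfs-state /var/lib/nfs none bind 0 0' >> /etc/fstab
mount /var/lib/nfs
systemctl start nfs-server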
 
task nfsd:2056 blocked for more than 1228 seconds
Don't get this wrong: nfsd waited 20 minutes on I/O requests to serve results to/from the NFS client, so this doesn't look like an NFS problem at all.
PS: NFS 4.1 is meant as parallel NFS (pNFS), while NFS 4.0 and 4.2 are the non-parallel variants. With 4.1 you get much more complexity, and probably more access errors than you ever dreamed of when hw/sw errors occur in the resulting multi-host stack - distributed data means distributed problems too :)
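If you want to confirm that it is the underlying disk and not nfsd itself, you can watch the device latency while a client access hangs; this assumes the sysstat package is installed, and the sysrq trigger is just an optional extra to list blocked tasks:

Code:
# high await / %util on the export's disk points at storage, not nfsd
iostat -x 5
# dump all tasks currently blocked in uninterruptible (D) state to dmesg
echo w > /proc/sysrq-trigger
dmesg | tail -n 50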
 
Yeah, the idea behind NFS was to simply deploy a Docker stack with Apache, PHP and similar containers, and auto-mount the persistent data through NFS. This worked very well for a long time, until these issues started happening out of the blue.
But I guess I should have used NFS 4.1? The problem with NFS 3.x, 4.0 and 4.1 was stale file handles; all three have this issue, but NFS 4.2 doesn't.
Maybe you could explain to me why that happens, and how to fight it?
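For reference, forcing 4.2 on a client looks roughly like this; the server IP, paths and options are just a made-up example, not my actual config:

Code:
# /etc/fstab entry pinning NFSv4.2 with hard/bg mounts (IP and paths are examples)
192.168.1.10:/export/docker-data  /srv/docker-data  nfs4  vers=4.2,hard,bg,_netdev  0  0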
 
Stale file handles on NFS clients come from the NFS file server being temporarily unreachable for too long, which can have different causes such as file server reboots and network outages. By default an NFS (v3 and v4.*) mount is done in hard and bg (background) mode, which is normally always good as long as your NFSv4 filesystem is not switching file servers (I assume you use just one, otherwise you should also define an fsid), and it can even help with reconnecting in single-server configurations.
But even if you are stuck with an unreliable OS/NFS configuration mix, you could run a short while-loop script via systemd on your NFS clients that checks the status with mountpoint (-q|-d) "/<mount>", unmounts a stale mount, and remounts it if the file server is reachable. The systemd unit should not be of type oneshot; instead it should restart your script automatically if it dies. (So that's a kind of self-healing through self-monitoring.) :)
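A minimal sketch of that self-healing loop could look like this; the mount point, script path and unit name are made-up examples:

Code:
#!/bin/sh
# /usr/local/sbin/nfs-selfheal.sh - remount an NFS mount if it drops (example path)
MNT=/srv/docker-data
while true; do
    if ! mountpoint -q "$MNT"; then
        # not (or no longer) mounted: force-unmount any stale leftover and remount
        umount -f -l "$MNT" 2>/dev/null
        mount "$MNT"    # takes its options from /etc/fstab
    fi
    sleep 30
done

# /etc/systemd/system/nfs-selfheal.service (not oneshot, restarts the script if it dies)
[Unit]
Description=Self-healing NFS mount watchdog
After=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/sbin/nfs-selfheal.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target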
 
