pvestatd locks up if network share misbehaves

masgo

Active Member
Jun 24, 2019
The NAS that my Proxmox host uses as backup storage had a problem that left it very unresponsive, at times hanging completely. This behaviour apparently also affected pvestatd in such a way that it became a hung process that cannot be killed.

dmesg shows the following errors. I can say for sure that the CIFS NAS had a problem and that it (most likely) really did not answer for 120 seconds or more. Since there are no VMs stored on it and no backup was running at that time, I would expect such a failure to have little to no impact on Proxmox. Unfortunately it somehow made pvestatd hang in such a way that I cannot restart it, or even kill it. ps shows it in state "D" (uninterruptible sleep), and the parent process is PID 1, so only a reboot will help me here.
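For anyone wanting to check for the same condition, here is a minimal sketch that picks tasks in state "D" out of ps output. The function name filter_dstate is made up for illustration, and the column layout assumes procps ps called with the format shown in the comment.

```shell
# Print only the tasks whose STAT column starts with "D" (uninterruptible sleep).
# Reads `ps` output on stdin; expected columns: PID PPID STAT WCHAN COMMAND.
# filter_dstate is a hypothetical helper name, not part of any Proxmox tool.
filter_dstate() {
    # NR > 1 skips the header line; $3 is the STAT column (D, Ds, D+, ...).
    awk 'NR > 1 && $3 ~ /^D/ { print }'
}

# Usage on a live system (assumes procps ps):
#   ps -eo pid,ppid,stat,wchan:32,comm | filter_dstate
```

The WCHAN column shows the kernel function the task is sleeping in, which helps tell a CIFS hang from other blocked I/O.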

How can this happen? Is there a way to mitigate such a problem should it happen again in the future?

Code:
[23225.274454] CIFS VFS: Server xxx.xxx.xxx.xxx has not responded in 120 seconds. Reconnecting...
[23323.060336] INFO: task kworker/12:0:14135 blocked for more than 120 seconds.
[23323.061232]       Tainted: P           O      5.0.21-1-pve #1
[23323.061948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23323.062588] kworker/12:0    D    0 14135      2 0x80000000
[23323.062646] Workqueue: cifsiod smb2_reconnect_server [cifs]
[23323.062648] Call Trace:
[23323.062656]  __schedule+0x2d4/0x870
[23323.062659]  schedule+0x2c/0x70
[23323.062661]  schedule_preempt_disabled+0xe/0x10
[23323.062662]  __mutex_lock.isra.10+0x2e4/0x4c0
[23323.062666]  __mutex_lock_slowpath+0x13/0x20
[23323.062666]  mutex_lock+0x2c/0x30
[23323.062682]  smb2_reconnect+0x102/0x7d0 [cifs]
[23323.062688]  ? lock_timer_base+0x6b/0x90
[23323.062692]  ? wait_woken+0x80/0x80
[23323.062707]  smb2_reconnect_server+0x18c/0x2d0 [cifs]
[23323.062710]  process_one_work+0x20f/0x410
[23323.062712]  worker_thread+0x34/0x400
[23323.062714]  kthread+0x120/0x140
[23323.062715]  ? process_one_work+0x410/0x410
[23323.062716]  ? __kthread_parkme+0x70/0x70
[23323.062718]  ret_from_fork+0x35/0x40
[23323.062733] INFO: task pvestatd:8008 blocked for more than 120 seconds.
[23323.063336]       Tainted: P           O      5.0.21-1-pve #1
[23323.064099] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23323.064837] pvestatd        D    0  8008   2633 0x80000004
[23323.064839] Call Trace:
[23323.064844]  __schedule+0x2d4/0x870
[23323.064848]  schedule+0x2c/0x70
[23323.064852]  schedule_preempt_disabled+0xe/0x10
[23323.064854]  __mutex_lock.isra.10+0x2e4/0x4c0
[23323.064864]  __mutex_lock_slowpath+0x13/0x20
[23323.064865]  mutex_lock+0x2c/0x30
[23323.064887]  cifs_mark_open_files_invalid+0x5b/0xa0 [cifs]
[23323.064908]  smb2_reconnect+0x149/0x7d0 [cifs]
[23323.064929]  smb2_plain_req_init+0x34/0x260 [cifs]
[23323.064946]  SMB2_open_init+0x69/0x760 [cifs]
[23323.064963]  SMB2_open+0x148/0x510 [cifs]
[23323.064980]  open_shroot+0x170/0x210 [cifs]
[23323.064997]  ? open_shroot+0x170/0x210 [cifs]
[23323.065014]  smb2_query_path_info+0x137/0x1c0 [cifs]
[23323.065016]  ? _cond_resched+0x19/0x30
[23323.065018]  ? _cond_resched+0x19/0x30
[23323.065022]  ? kmem_cache_alloc_trace+0x153/0x1d0
[23323.065047]  cifs_get_inode_info+0x283/0xb40 [cifs]
[23323.065067]  ? build_path_from_dentry_optional_prefix+0xc4/0x410 [cifs]
[23323.065090]  cifs_revalidate_dentry_attr+0xdd/0x3a0 [cifs]
[23323.065113]  cifs_getattr+0x5a/0x1a0 [cifs]
[23323.065120]  vfs_getattr_nosec+0x73/0x90
[23323.065123]  vfs_getattr+0x36/0x40
[23323.065124]  vfs_statx+0x8d/0xe0
[23323.065126]  __do_sys_newstat+0x3d/0x70
[23323.065128]  __x64_sys_newstat+0x16/0x20
[23323.065131]  do_syscall_64+0x5a/0x110
[23323.065133]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 
Hi,
"How can this happen? Is there a way to mitigate such a problem should it happen again in the future?"
The problem is the NFS protocol.
It is designed to be insensitive to latency, which means it waits forever.
And because it is implemented in the kernel, a process waiting for its I/O cannot be interrupted.
 
Oh sorry, I forgot to mention that I use CIFS/SMB to access the NAS. It was configured via the GUI, therefore it uses the default settings imposed by the GUI.
 
Most likely the same issue I had after upgrading to Proxmox 6.0.

Try mounting your share via fstab instead and force SMB protocol version 3.0 (instead of 3.1.1, which is the default on kernel 5.0 and higher) using the mount option vers=3. That solved the problem for me and probably for others; this is not specific to Proxmox but affects any server running Linux kernel 5.0 or higher.
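As a rough sketch of what such an fstab entry could look like: the server name, share, mount point, and credentials file below are placeholders and not taken from this thread. As far as I know, mount.cifs accepts vers=3 and vers=3.0 for the same dialect; _netdev and nofail are optional extras so that boot does not block when the NAS is unreachable.

```
# /etc/fstab (placeholder names; adjust server, share, mount point, credentials file)
//nas.example/backup  /mnt/nas-backup  cifs  credentials=/root/.smbcredentials,vers=3.0,_netdev,nofail  0  0
```

After mounting, running "mount | grep cifs" should list the negotiated vers= value among the options, which confirms the setting took effect.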
 