pvestatd locks up if network share misbehaves

masgo · Sep 30, 2019

The NAS that my Proxmox host uses as backup storage had a problem that left it very unresponsive, at times hanging completely. This apparently also affected pvestatd and left it hanging as an unkillable process.

dmesg shows the following errors. I can say for sure that the CIFS NAS had a problem and that it (most likely) really did not answer for 120 seconds or more. Since no VMs are stored on it and no backup was running at the time, I would expect such a failure to have little to no impact on Proxmox. Unfortunately, it somehow made pvestatd hang in such a way that I cannot restart it or even kill it. ps shows it in state "D" (uninterruptible sleep) and the parent process is PID 1, so only a reboot will help me here.

How can this happen? Is there a way to mitigate such a problem should it happen again in the future?

Code:

[23225.274454] CIFS VFS: Server xxx.xxx.xxx.xxx has not responded in 120 seconds. Reconnecting...
[23323.060336] INFO: task kworker/12:0:14135 blocked for more than 120 seconds.
[23323.061232]       Tainted: P           O      5.0.21-1-pve #1
[23323.061948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23323.062588] kworker/12:0    D    0 14135      2 0x80000000
[23323.062646] Workqueue: cifsiod smb2_reconnect_server [cifs]
[23323.062648] Call Trace:
[23323.062656]  __schedule+0x2d4/0x870
[23323.062659]  schedule+0x2c/0x70
[23323.062661]  schedule_preempt_disabled+0xe/0x10
[23323.062662]  __mutex_lock.isra.10+0x2e4/0x4c0
[23323.062666]  __mutex_lock_slowpath+0x13/0x20
[23323.062666]  mutex_lock+0x2c/0x30
[23323.062682]  smb2_reconnect+0x102/0x7d0 [cifs]
[23323.062688]  ? lock_timer_base+0x6b/0x90
[23323.062692]  ? wait_woken+0x80/0x80
[23323.062707]  smb2_reconnect_server+0x18c/0x2d0 [cifs]
[23323.062710]  process_one_work+0x20f/0x410
[23323.062712]  worker_thread+0x34/0x400
[23323.062714]  kthread+0x120/0x140
[23323.062715]  ? process_one_work+0x410/0x410
[23323.062716]  ? __kthread_parkme+0x70/0x70
[23323.062718]  ret_from_fork+0x35/0x40
[23323.062733] INFO: task pvestatd:8008 blocked for more than 120 seconds.
[23323.063336]       Tainted: P           O      5.0.21-1-pve #1
[23323.064099] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[23323.064837] pvestatd        D    0  8008   2633 0x80000004
[23323.064839] Call Trace:
[23323.064844]  __schedule+0x2d4/0x870
[23323.064848]  schedule+0x2c/0x70
[23323.064852]  schedule_preempt_disabled+0xe/0x10
[23323.064854]  __mutex_lock.isra.10+0x2e4/0x4c0
[23323.064864]  __mutex_lock_slowpath+0x13/0x20
[23323.064865]  mutex_lock+0x2c/0x30
[23323.064887]  cifs_mark_open_files_invalid+0x5b/0xa0 [cifs]
[23323.064908]  smb2_reconnect+0x149/0x7d0 [cifs]
[23323.064929]  smb2_plain_req_init+0x34/0x260 [cifs]
[23323.064946]  SMB2_open_init+0x69/0x760 [cifs]
[23323.064963]  SMB2_open+0x148/0x510 [cifs]
[23323.064980]  open_shroot+0x170/0x210 [cifs]
[23323.064997]  ? open_shroot+0x170/0x210 [cifs]
[23323.065014]  smb2_query_path_info+0x137/0x1c0 [cifs]
[23323.065016]  ? _cond_resched+0x19/0x30
[23323.065018]  ? _cond_resched+0x19/0x30
[23323.065022]  ? kmem_cache_alloc_trace+0x153/0x1d0
[23323.065047]  cifs_get_inode_info+0x283/0xb40 [cifs]
[23323.065067]  ? build_path_from_dentry_optional_prefix+0xc4/0x410 [cifs]
[23323.065090]  cifs_revalidate_dentry_attr+0xdd/0x3a0 [cifs]
[23323.065113]  cifs_getattr+0x5a/0x1a0 [cifs]
[23323.065120]  vfs_getattr_nosec+0x73/0x90
[23323.065123]  vfs_getattr+0x36/0x40
[23323.065124]  vfs_statx+0x8d/0xe0
[23323.065126]  __do_sys_newstat+0x3d/0x70
[23323.065128]  __x64_sys_newstat+0x16/0x20
[23323.065131]  do_syscall_64+0x5a/0x110
[23323.065133]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

wolfgang · Oct 2, 2019

Hi,

masgo said:
How can this happen? Is there a way to mitigate such a problem should it happen again in the future?

The problem is the NFS protocol.
It is designed to tolerate latency, which means it waits forever.
And because it is implemented in the kernel, a process waiting for its I/O is uninterruptible.
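
To check which tasks are currently stuck this way, something like the following may help. It is only a rough sketch: it assumes the procps version of ps, and filter_blocked is a made-up helper name, not a Proxmox tool. It keeps the header line plus every task whose state starts with "D", and the WCHAN column shows the kernel function the task is sleeping in when the kernel exposes it.

```shell
# Keep the header line plus every row whose STAT column (field 3) starts with D.
# filter_blocked is a hypothetical helper name used only for this sketch.
filter_blocked() {
    awk 'NR == 1 || $3 ~ /^D/'
}

# pid, parent pid, state, kernel wait channel (up to 32 chars), command name
ps -eo pid,ppid,stat,wchan:32,comm | filter_blocked
```

A task listed here ignores signals, including SIGKILL, until the kernel call it is blocked in returns, which matches the behaviour described for pvestatd above.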

masgo · Oct 2, 2019

Oh sorry, I forgot to mention that I use CIFS/SMB to access the NAS. It was configured via the GUI, so it uses the GUI's default settings.

ckt · Oct 2, 2019

Most likely the same issue I had after upgrading to Proxmox 6.0.

Try mounting your share via fstab instead and force SMB protocol version 3.0 (instead of 3.1.1, the default on kernel 5.0 and higher) using the mount option vers=3. That solved the problem for me (and probably others); this is not specific to Proxmox but affects any server running Linux kernel 5.0 or higher.
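
As a rough illustration of such an fstab entry, the sketch below prints one line for a CIFS share. Everything specific in it is a placeholder I made up for the example: the server name, share, mount point, credentials file, and the helper name cifs_fstab_line. It writes vers=3.0 rather than vers=3, on the assumption (from my reading of the mount.cifs man page) that newer kernels may treat vers=3 as "3.0 or later"; _netdev and nofail are optional extras so that boot does not stall if the NAS is down.

```shell
# Print an fstab line for a CIFS share pinned to SMB 3.0.
# Server, share, mount point and credentials path are hypothetical placeholders.
cifs_fstab_line() {
    server=$1
    share=$2
    mountpoint=$3
    creds=$4
    printf '//%s/%s %s cifs credentials=%s,vers=3.0,_netdev,nofail 0 0\n' \
        "$server" "$share" "$mountpoint" "$creds"
}

cifs_fstab_line nas.example.lan backup /mnt/nas-backup /etc/samba/nas.cred
# prints: //nas.example.lan/backup /mnt/nas-backup cifs credentials=/etc/samba/nas.cred,vers=3.0,_netdev,nofail 0 0
```

Review the printed line before adding it to /etc/fstab, and remove the GUI-defined CIFS storage first so the share is not mounted twice.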
