Soft / interruptible mounts for backup targets

ozdjh · Jan 15, 2020

During some testing tonight sending backups to a CIFS volume, there was a problem with Samba on the target server (it consumed over 80GB of RAM and swap, which trashed the server). The unavailability of the CIFS server affected the backups that were running. I'd expected the backup tasks to fail, but they didn't. They just blocked.

Trying to stop the backups through the GUI did not work. Trying to kill the vzdump process on the PVE nodes did not work either. Everything was locked up because of the CIFS volume. Looking back through the logs, there were lots of messages about hung tasks etc. Restarting smbd and even rebooting the CIFS server didn't help the situation.

We ended up stopping the VMs and rebooting the nodes. But even a reboot wouldn't complete, as things were still hung up trying to unmount the CIFS volume. We ended up having to do a hard reset of the PVE node to get it functional again. A hard reset just to get over a memory leak in Samba on the box we're sending backups to.

Is this expected behaviour or is there something wrong with our setup? I thought CIFS mounts were soft or interruptible by default. Shouldn't all mounts for volumes like CIFS and NFS be soft or at least interruptible in case something goes wrong? Having to crash a node just because a fileserver had issues is pretty drastic for a production environment.

Thanks
David

czechsys · Jan 15, 2020

We had the same problem with NFS and a soft mount: unkillable PVE tasks.

ozdjh · Jan 15, 2020

Really? I noticed snmpd stopped responding when we had an NFS server unavailable, but I didn't check whether it was using soft mounts. If processes don't fail and we can't kill them when a CIFS or NFS volume goes away, that's a major problem.

czechsys · Jan 15, 2020

Yes, every service monitored via SNMP times out when the monitored NFS storage goes away from the PVE nodes. It's an SNMP problem, but annoying.

Sprinterfreak · Sep 4, 2020

I have the same trouble with my lab servers doing offsite backups over unreliable connections. If CIFS sees fluctuations, the kernel is blocked entirely by the hung cifs module. If the server is left alone for a while, it doesn't even respond to network traffic anymore. CIFS totally locks up the kernel. It's only recoverable by rebooting the host via IPMI or pulling the plug.

CIFS seems to be implemented in the kernel, which in this case is really bad.

SNMPd not responding in the first place is caused by timeouts while it tries to gather information on mount points. Same applies to storage related REST calls on the WebGUI.
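
For anyone writing their own monitoring checks, here is a minimal sketch of how a check could avoid inheriting the hang: run the filesystem call in a child process and stop waiting after a timeout. The names `probe` and `mount_responds` are made up for illustration, and using `df` as the probe command is an assumption. This does not unstick the mount or the child; it only keeps the caller responsive.

```python
import subprocess


def probe(cmd, timeout_s=5.0):
    """Run cmd in a child process and give up after timeout_s seconds.

    Returns "ok", "failed", or "timeout". (Hypothetical helper.)
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do NOT wait for the child afterwards: a process
        # blocked in uninterruptible sleep (state D) may not exit until the
        # mount recovers, and waiting here would hang the caller as well.
        proc.kill()
        return "timeout"
    return "ok" if rc == 0 else "failed"


def mount_responds(path, timeout_s=5.0):
    """Check a mount point by running `df` on it (assumed probe command)."""
    return probe(["df", path], timeout_s)
```

Note that `subprocess.run(..., timeout=...)` is deliberately not used here, because it waits for the child after killing it, which can block on a process that the kernel cannot terminate yet.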

So NFS and CIFS are both only suitable for local networks between HRLE servers. Both tend to lock up everything badly on connection loss.

By the way, killing processes does not work at all, and `umount -l -a -t cifs` doesn't work either. Stuck processes continue hanging the kernel. The only solution I've found is to reboot and keep my fingers crossed that it doesn't happen too often.

What really bothers me is that there is no protocol left which allows me to push my backups to a remote target without risking that the host dies completely on connection hiccups.

That's the result from the syslog perspective. Sit back and watch CIFS tear apart the running kernel.

RudyBzh · Sep 4, 2020

Same here... https://forum.proxmox.com/threads/backup-vzdump-fails-and-hangs-forever.75372/
And for me, the CIFS share is on a server on the local LAN, so apparently it's not a connectivity issue.
Searching the forum, lots of similar issues exist without a solution (at least, I didn't find one).
Why it hangs and, even more, why it doesn't time out to avoid having to hard reset the whole Proxmox host is something I'd like an answer to...

Sprinterfreak · Sep 12, 2020

This is obviously not a Proxmox issue alone but a general design issue of the CIFS and NFS implementations.
https://stackoverflow.com/questions...ed-with-cifs-hangs-when-disconnected/19101647

niziak · Dec 28, 2020

It is not a Proxmox issue, but a well-known NFS/CIFS issue in Linux. I remember this kind of problem since kernel 2.0, and all the problems still exist!

It seems that CIFS storage should be "forbidden" for production.

In my case the remote CIFS storage gets full and problems start accumulating.
Every host command which "touches" mount points hangs.

Unreachable CIFS mounts on hosts also break LXC guests!
A simple call of df on the host or inside an LXC guest (which shouldn't see host mounts) also hangs.

On PVE hosts I found over 900 kworkers:

Code:

root     4030552  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:102-cifsiod]
root     4030553  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:107-cifsiod]
root     4030554  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:108-cifsiod]
root     4030555  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:109-cifsiod]

As a workaround, I temporarily disabled the CIFS storage in the PVE web GUI and waited until all CIFS timeouts expired. After a while all hung kworkers disappeared.
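
To watch whether the hung workers are really draining, a count of the `cifsiod` kworkers can be taken from `ps` output. This is only a rough sketch: the function name and the regular expression are illustrative and based solely on the worker names shown in the listing above.

```python
import re

# Matches kernel worker names such as "[kworker/2:102-cifsiod]".
CIFSIOD_WORKER = re.compile(r"\[kworker/[^\]]*-cifsiod\]")


def count_cifsiod_workers(ps_output):
    """Count lines of ps output that name a cifsiod kworker."""
    return sum(
        1 for line in ps_output.splitlines() if CIFSIOD_WORKER.search(line)
    )


# Sample taken from the listing in this post.
sample = """\
root     4030552  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:102-cifsiod]
root     4030553  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:107-cifsiod]
root     4030554  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:108-cifsiod]
root     4030555  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:109-cifsiod]
"""

print(count_cifsiod_workers(sample))  # prints 4
```

On a live host the input would be the text output of `ps aux`, for example captured with `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout`.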

Please consider modifying the CIFS storage plugin to mount CIFS with the option echo_interval=1 (by default it is not set, so the value is 60).

echo_interval=n

sets the interval at which echo requests are sent to the server on an idling connection. This setting also affects the time required for a connection to an unresponsive server to timeout. Here n is the echo interval in seconds. The reconnection happens at twice the value of the echo_interval set for an unresponsive server. If this option is not given then the default value of 60 seconds is used. The minimum tunable value is 1 second and maximum can go up to 600 seconds.
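
To put numbers on the description above, here is a small sketch of the arithmetic, based only on the quoted text (reconnection at twice the echo interval, allowed range 1 to 600 seconds, default 60). The function and constant names are made up for illustration, and the time until a blocked process actually returns may differ in practice.

```python
ECHO_INTERVAL_DEFAULT_S = 60  # default when the option is not given
ECHO_INTERVAL_MIN_S = 1       # minimum tunable value
ECHO_INTERVAL_MAX_S = 600     # maximum tunable value


def reconnect_delay_s(echo_interval=ECHO_INTERVAL_DEFAULT_S):
    """Seconds until reconnection to an unresponsive server is attempted.

    Per the quoted description, this is twice the echo interval.
    """
    if not ECHO_INTERVAL_MIN_S <= echo_interval <= ECHO_INTERVAL_MAX_S:
        raise ValueError(
            f"echo_interval must be between {ECHO_INTERVAL_MIN_S} "
            f"and {ECHO_INTERVAL_MAX_S} seconds"
        )
    return 2 * echo_interval


for interval in (1, 60, 600):
    print(interval, reconnect_delay_s(interval))
# prints:
# 1 2
# 60 120
# 600 1200
```

So with the default, an unresponsive server is only acted on after about two minutes, while echo_interval=1 brings that down to about two seconds.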

His.Dudeness · Jan 14, 2022

niziak said:
It is not a Proxmox issue, but a well-known NFS/CIFS issue in Linux. I remember this kind of problem since kernel 2.0, and all the problems still exist!

It seems that CIFS storage should be "forbidden" for production.

In my case the remote CIFS storage gets full and problems start accumulating.
Every host command which "touches" mount points hangs.

Unreachable CIFS mounts on hosts also break LXC guests!
A simple call of df on the host or inside an LXC guest (which shouldn't see host mounts) also hangs.

On PVE hosts I found over 900 kworkers:

Code:

root     4030552  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:102-cifsiod]
root     4030553  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:107-cifsiod]
root     4030554  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:108-cifsiod]
root     4030555  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:109-cifsiod]

As a workaround, I temporarily disabled the CIFS storage in the PVE web GUI and waited until all CIFS timeouts expired. After a while all hung kworkers disappeared.

Please consider modifying the CIFS storage plugin to mount CIFS with the option echo_interval=1 (by default it is not set, so the value is 60).

Hi!

Is there any way I can set this echo_interval=1 manually in PVE?

cheers
Michael
