Soft / interruptible mounts for backup targets

ozdjh

Oct 8, 2019
During some testing tonight sending backups to a CIFS volume, there was a problem with Samba on the target server (it consumed over 80GB of RAM and swap, which trashed the server). The unavailability of the CIFS server affected the backups that were running. I'd expected the backup tasks to fail, but they didn't. They just blocked.

Trying to stop the backups through the GUI did not work. Trying to kill the vzdump processes on the PVE nodes did not work either. Everything was locked up because of the CIFS volume. Looking back through the logs, there were lots of messages about hung tasks, etc. Restarting smbd and even rebooting the CIFS server didn't help the situation.

We ended up stopping the VMs and rebooting the nodes. But even a reboot wouldn't complete, as things were still hung up trying to unmount the CIFS volume. We ended up having to do a hard reset of the PVE node to get it functional again. A hard reboot just to get over a memory leak in Samba on the box we're sending backups to.

Is this expected behaviour or is there something wrong with our setup? I thought CIFS mounts were soft or interruptible by default. Shouldn't all mounts for volumes like CIFS and NFS be soft or at least interruptible in case something goes wrong? Having to crash a node just because a fileserver had issues is pretty drastic for a production environment.
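
(For reference, this is the kind of behaviour I mean. A rough sketch with placeholder hostnames, shares and mount points, not our actual config:)

Code:
# NFS: "soft" plus timeo/retrans makes I/O fail with an error once the retries
# are exhausted instead of blocking forever (values here are examples only)
mount -t nfs -o soft,timeo=100,retrans=3 fileserver:/export/backups /mnt/backup-nfs

# CIFS: the mount.cifs man page lists "soft" as the default, but reconnect
# handling is a separate knob (see echo_interval later in this thread)
mount -t cifs -o soft,username=backup //fileserver/backups /mnt/backup-cifs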


Thanks
David
 
Really? I noticed snmpd stopped responding when we had an NFS server unavailable, but I didn't check to see if it was using soft mounts. If processes don't fail and we can't kill them when a CIFS or NFS volume goes away, that's a major problem.
 
Yes, every service monitored via SNMP times out when a monitored NFS storage goes away from the PVE nodes. It's an SNMP problem, but annoying.
 
I have the same trouble with my lab servers doing offsite backups over unreliable connections. If CIFS sees fluctuations, the kernel is blocked entirely by the hung cifs module. If the server is left alone for a while, it doesn't even respond to network traffic anymore. CIFS totally locks up the kernel. It's only recoverable by rebooting the host via IPMI or pulling the plug.

CIFS is implemented in the kernel, which in this case is really bad.

snmpd not responding in the first place is caused by timeouts while it tries to gather information on mount points. The same applies to storage-related REST calls in the web GUI.
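
(If the snmpd in question is net-snmp, one partial mitigation is to tell it to skip network filesystems when building its filesystem tables. That only keeps the agent responsive; it does nothing about the underlying hang:)

Code:
# /etc/snmp/snmpd.conf (net-snmp): skip NFS-type entries in hrFSTable so a
# dead network mount doesn't stall the whole agent
skipNFSInFSTable true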

So NFS and CIFS are both only suitable for local networks between HRLE servers. Both tend to badly lock up everything on connection loss.

Btw, killing processes does not work at all, and `umount -l -a -t cifs` doesn't work either. Stuck processes continue hanging the kernel. The only solution I've found is a reboot and fingers crossed that it doesn't happen too often.
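
(For the record, this is roughly how the stuck processes look: they sit in uninterruptible sleep, state D, inside the kernel waiting on the dead mount, which is why not even SIGKILL gets rid of them. A quick way to list them:)

Code:
# processes blocked in uninterruptible sleep (D state); these cannot be
# killed, not even with kill -9, until the kernel I/O they wait on returns
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'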

What really bothers me is that there is no protocol left which allows me to push my backups to a remote target without risking the host completely dying on connection hiccups.

That's the result from the syslog perspective (see the attached syslog.txt). Sit back and watch CIFS tear apart the running kernel.
 

Attachments

  • syslog.txt
It is not a Proxmox issue, but a well-known NFS/CIFS issue in Linux. I remember this kind of problem since kernel 2.0, and all the problems still exist!

It seems that CIFS storage should be "forbidden" for production.


In my case the remote CIFS storage got full and problems started accumulating.
Every host command which "touches" the mount points hangs.

Unreachable CIFS mounts on the host also break LXC guests!
A simple call of df on the host or inside an LXC container (which shouldn't see the host mounts) also hangs.

On the PVE hosts I found over 900 kworkers:

Code:
root     4030552  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:102-cifsiod]
root     4030553  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:107-cifsiod]
root     4030554  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:108-cifsiod]
root     4030555  0.0  0.0      0     0 ?        I    04:47   0:00 [kworker/2:109-cifsiod]

As a workaround I temporarily disabled the CIFS storage in the PVE web GUI and waited until all the CIFS timeouts expired. After a while all the hung kworkers disappeared.
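
(For anyone who prefers the CLI, disabling the storage can also be done with pvesm; "cifs-backup" below is just an example storage ID:)

Code:
# same effect as disabling the storage in the GUI
pvesm set cifs-backup --disable 1
# re-enable once the hung cifsiod kworkers have timed out and gone away
pvesm set cifs-backup --disable 0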

Please consider modifying the CIFS storage plugin to mount CIFS with the option echo_interval=1 (by default it is not set, so the value is 60).

echo_interval=n

sets the interval at which echo requests are sent to the server on an idling connection. This setting also affects the time required for a connection to an unresponsive server to timeout. Here n is the echo interval in seconds. The reconnection happens at twice the value of the echo_interval set for an unresponsive server. If this option is not given then the default value of 60 seconds is used. The minimum tunable value is 1 second and maximum can go up to 600 seconds.
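
To illustrate, with a manual mount the option would look like this (server, share, credentials and mount point are placeholders):

Code:
# reconnect decisions happen at roughly 2x echo_interval, so 1s instead of the
# default 60s makes a dead server fail fast rather than hanging for minutes
mount -t cifs -o username=backup,echo_interval=1 //fileserver/backups /mnt/pve/cifs-backup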
 

Hi!

Is there any way I can set this echo_interval=1 manually in PVE?

cheers
Michael
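
One way to try this, assuming a PVE version whose CIFS storage type accepts an extra "options" line in /etc/pve/storage.cfg (check the storage documentation for your release): add the mount option there. Otherwise you can mount the share manually with the option and add it as a directory storage. A sketch of the storage.cfg variant, with placeholder names:

Code:
# /etc/pve/storage.cfg - the "options" line is only accepted on newer PVE
# releases; storage ID, server, share and username below are placeholders
cifs: cifs-backup
        path /mnt/pve/cifs-backup
        server fileserver
        share backups
        username backup
        content backup
        options echo_interval=1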
 
