Soft / interruptible mounts for backup targets

ozdjh

Member
Oct 8, 2019
During some testing tonight sending backups to a CIFS volume, there was a problem with Samba on the target server (it consumed over 80GB of RAM and swap, which trashed the server). The unavailability of the CIFS server affected the backups that were running. I'd expected the backup tasks to fail, but they didn't. They just blocked.

Trying to stop the backups through the GUI did not work. Trying to kill the vzdump process on the pve nodes did not work either. Everything was locked up because of the CIFS volume. Looking back through the logs, there are lots of messages about hung tasks etc. Restarting smbd and even rebooting the CIFS server didn't help the situation.

We ended up stopping the VMs and rebooting the nodes. But even a reboot wouldn't complete, as things were still hung up trying to unmount the CIFS volume. We ended up having to do a hard reset of the pve node to get it functional again. A hard reboot just to get over a memory leak in Samba on the box we're sending backups to.

Is this expected behaviour or is there something wrong with our setup? I thought CIFS mounts were soft or interruptible by default. Shouldn't all mounts for volumes like CIFS and NFS be soft or at least interruptible in case something goes wrong? Having to crash a node just because a fileserver had issues is pretty drastic for a production environment.
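For anyone wanting to check which options a node actually mounted with, the kernel lists the effective options for each mount in /proc/mounts. Here is a minimal sketch in Python, assuming the usual whitespace-separated layout of that file (device, mount point, type, options, ...). The function name `network_mounts` and the set of file system type names are illustrative choices, not anything taken from Proxmox. Whether `soft` or `hard` is printed explicitly may depend on the kernel version, hence the "unlisted" fallback.

```python
# File system types treated as "network mounts" here (an assumption for
# this sketch; extend it if your setup uses other type names).
NETWORK_FS = {"cifs", "smb3", "nfs", "nfs4"}


def network_mounts(mounts_text):
    """Return (mount_point, fs_type, mode) for each CIFS/NFS entry.

    mode is "soft" or "hard" when that option is listed, else "unlisted".
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expect at least: device, mount point, type, options.
        if len(fields) < 4 or fields[2] not in NETWORK_FS:
            continue
        options = fields[3].split(",")
        if "soft" in options:
            mode = "soft"
        elif "hard" in options:
            mode = "hard"
        else:
            mode = "unlisted"
        found.append((fields[1], fields[2], mode))
    return found
```

To try it on a node, read the file and pass the text in, e.g. `network_mounts(open("/proc/mounts").read())`.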


Thanks
David
 

ozdjh

Member
Oct 8, 2019
Really? I noticed snmpd stopped responding when we had an nfs server unavailable but I didn't check to see if it was using soft mounts. If processes don't fail and we can't kill them when a CIFS or NFS volume goes away that's a major problem.
 

czechsys

Active Member
Nov 18, 2015
Yes, every service monitored via SNMP times out when a monitored NFS storage goes away from the pve nodes. It's an SNMP problem, but annoying.
 

Sprinterfreak

Member
Mar 26, 2018
I have the same trouble with my lab servers doing offsite backups via unreliable connections. If CIFS sees fluctuations, the kernel is blocked entirely by the hung cifs module. If the server is left alone for a while, it doesn't even respond to network traffic anymore. CIFS totally locks up the kernel. It is only recoverable by rebooting the host via IPMI or pulling the plug.

CIFS seems to be implemented in the kernel, which in this case is really bad.

SNMPd not responding in the first place is caused by timeouts while it tries to gather information on mount points. Same applies to storage related REST calls on the WebGUI.
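A monitoring script can avoid getting stuck the same way by never touching the mount point in its own process. The following is a rough sketch, not what snmpd or the Proxmox API do: it runs `stat` in a child process and gives up after a deadline. The function name `mount_responds` and the 5 second default are assumptions for illustration. It deliberately avoids `subprocess.run(..., timeout=...)`, because that waits for the child after killing it, and a child stuck in uninterruptible I/O may never exit.

```python
import subprocess
import time


def mount_responds(path, timeout=5.0):
    """Return True if `stat path` finishes successfully within `timeout` seconds."""
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = proc.poll()  # non-blocking check; never waits on the child
        if rc is not None:
            return rc == 0
        time.sleep(0.05)
    # Best effort only: a task blocked inside the kernel may ignore the
    # signal until the I/O returns, so we do not wait for it to exit.
    proc.kill()
    return False
```

The caller stays responsive even when the child hangs, at the cost of possibly leaving a stuck `stat` process behind until the mount recovers or the node is rebooted.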

So NFS and CIFS are both only suitable for local networks between HRLE servers. Both tend to badly lock up everything on connection loss.

Btw. killing processes does not work at all, and `umount -l -a -t cifs` doesn't work either. Stuck processes continue hanging the kernel. The only solution I found is to reboot and keep my fingers crossed that it doesn't happen too often.
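To see which processes are stuck like this, you can look for tasks in state `D` (uninterruptible sleep) in /proc. This is a small diagnostic sketch, assuming a Linux /proc layout. The helper names `parse_stat` and `uninterruptible_tasks` are made up for this example. It only lists the tasks and does nothing to unblock them.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left, _, rest = text.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, command)] for every process currently in state D."""
    tasks = []
    if not os.path.isdir(proc_root):
        return tasks
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while reading, or unexpected format
        if state == "D":
            tasks.append((pid, comm))
    return tasks
```

Tasks can show up in state D briefly during normal disk I/O, so it is the ones that stay there across repeated runs that point at a hung mount.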

What really bothers me is that there is no protocol left which allows me to push my backups to a remote target without risking that the host dies completely on connection hiccups.

The attached syslog shows the result. Sit back and watch CIFS tear apart the running kernel.
 

Attachments

  • syslog.txt
