Samba server in container - failed transfers when sending from same node

Nelluk

New Member
Dec 7, 2024
I just added a second Proxmox 8.3 node with a 14 TB HDD intended to be used as network storage. I followed this guide to set up an unprivileged container running a Samba server to share the HDD with my mixed-OS network.
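
For reference, the share definition that kind of guide produces looks roughly like this (the path and user name below are placeholders, not my exact config):

```
# /etc/samba/smb.conf inside the unprivileged LXC - illustrative share only
# (path and user name are placeholders)
[storage]
    # the 14 TB HDD, bind-mounted into the container
    path = /mnt/storage
    browseable = yes
    read only = no
    valid users = shareuser
```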

This works fine when sending from other devices. However, I also mounted the share on the local node and tried backing up my local VMs, and noticed the transfer would fail after a few GB of copying, lock up the fileserver LXC, and require a forceful reset of the node. (I tried a variety of ways to recover gracefully; a hard reset is the only thing that has worked.)

I started testing with `dd` to take the backup job out of the equation (example commands after the list below), and found a few things:

- Limiting the transfer to ~60 MB/s succeeds. At anything above ~80 MB/s, smbd locks up after 8-12 GB.
- Transfers from other devices (the other Proxmox node, a Mac, a PC) all succeed and reach 150-250 MB/s.
- If I feed `dd` compressible zero-filled data it has no problem and hits high speeds. The problem only crops up with random data.
- I tried various smb.conf changes found online, with no improvement.
- I tried increasing the fileserver container's resources from 512 MB RAM / 2 cores to 2048 MB RAM / 4 cores, which gave a slight improvement (it gets a few more GB in before failing).
- In the failed state the `smbd` process is pegged at the container's CPU limit (a zombie smbd consuming 200% CPU with 2 cores allocated). Even when the container's terminal is still responsive, I can't kill the server from inside it and recover gracefully.
- I tried spinning up a different Samba container, this time from a TurnKey Linux file-sharing template. It fails in exactly the same way.
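
Roughly what I was running from the node against the mounted share (paths are examples, and the ~60 MB/s cap is shown here with `pv`, which is just one way to do it):

```
# Zero-filled (compressible) data: completes at full speed
dd if=/dev/zero of=/mnt/pve/smb-share/test-zero.bin bs=1M count=20480 status=progress

# Random (incompressible) data: smbd locks up after ~8-12 GB
dd if=/dev/urandom of=/mnt/pve/smb-share/test-rand.bin bs=1M count=20480 status=progress

# Capped at ~60 MB/s via pv: completes reliably
dd if=/dev/urandom bs=1M count=20480 | pv -L 60m > /mnt/pve/smb-share/test-capped.bin
```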

I could work around this by simply not using the Samba share on the local node, or by limiting the speed, but I'd like a better understanding of what's going on, or a cleaner solution if possible.
 
The above description matches my situation, with the exception that the disk is an NVMe SSD, the same disk PVE resides on. I created a zfspool there, which an Ubuntu Docker VM accesses via SMB served from a Cockpit LXC. I am not sure exactly what triggers the state, but it's always during file transfers between nodes, and once it hits I have to physically power down to recover (although the VMs and LXCs stay responsive for other things).

Did you ever resolve this?

I am on day two of my vacation and do not have physical access to the host, so for now I will just "live" with some services frozen in order to keep the rest of PVE functional until I get physical access, which is quite frustrating.
 
I never found a good solution, unfortunately. If I remember correctly, I kept using the SMB server as a target for other servers on the network, but stopped trying to mount it on the local node for backups. I believe I just gave the SMB server a directory, and gave the node a ZFS mount to use for backups.
Note that in my situation transferring between nodes never seemed to be a problem; it was only writing to the SMB share from the node the SMB server was on.
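
From memory, the node-side part looked something like this (the pool and storage names are placeholders):

```
# dataset on the big disk, used directly by the node instead of going through SMB
zfs create tank/node-backups

# register it as a directory storage in Proxmox for backups
pvesm add dir node-backups --path /tank/node-backups --content backup
```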
 
Note that SMB is async and your storage HAS to respect and use fsync for Samba to work properly.

You are describing behavior I found when the storage was also async. What happens then is that your server writes at full speed to RAM until the session ends (an SMB session ends every 15 minutes); the new SMB session then can't start until all the data has been written to disk, after which the session continues. In that time your client has to hang or retry, and may time out in some cases, resulting in corrupt data being written. Note that SMB does not crash; it just can't start a new session while an old session is still writing to disk.

The SMB protocol is quite bad; if you need reliable data transfer, use NFS.
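
To illustrate, the storage-side knob looks roughly like this (the dataset name is an example, and forcing sync writes has a real performance cost, so test it first):

```
# honour application/client sync requests (the ZFS default) ...
zfs set sync=standard tank/share
# ... or force every write to stable storage before acknowledging it
# zfs set sync=always tank/share
```

On the Samba side the related share options are `strict sync` (honour client flush requests, enabled by default in recent Samba releases) and `sync always` (fsync after every write, safest but slow).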
 
Sorry for the confusion; I meant between a VM and an LXC on the same (and only) node.

Outside the node (to other clients on the network), SMB seemed to work fine.

I will try to set up NFS as a workaround (or run the containers that shuffle the data in a Docker LXC with the zfspool mounted directly as a directory).
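
For the second option, the bind mount would be something like this (the CT ID and paths are made up), and for NFS it would be a plain export of the same dataset:

```
# bind-mount the zfspool dataset straight into the Docker LXC, bypassing SMB
pct set 105 -mp0 /tank/share,mp=/mnt/share

# or export it over NFS instead (line in /etc/exports on the host)
# /tank/share  192.168.1.0/24(rw,sync,no_subtree_check)
```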
 
Did any of you try setting a local tmpdir in the /etc/vzdump.conf file, as described in the docs? This causes the backup to first be created locally and only then be transferred to the backup storage (SMB). This is also recommended when the target is NFS/CIFS under certain conditions; see those docs.
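
For reference, it's a one-line change (the path is just an example and needs enough free space for a full backup):

```
# /etc/vzdump.conf
tmpdir: /var/tmp/vzdump
```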
 
I didn't use backups here; the data was written by a VM (typically ffmpeg merging files and so on, with other services then transferring the finished files elsewhere outside the SMB share), so I'm not sure it would be applicable in my case?