[SOLVED] [Unsupported] NFS doesn't work on PBS

Nov 16, 2022
70
8
8
Sweden
manager is the correct one. I wrote that off the top of my head at work, and now I've checked. It works here, but it still fails during backup.

This time:

ERROR: backup write data failed: command error: connection reset

I'm starting to suspect a default scheduled task that aborts the whole thing. I didn't add any, so maybe PBS has something built in that restarts a service every now and then?
 

fabian

Proxmox Staff Member
Staff member
Jan 7, 2016
8,680
1,776
174
there is no task that aborts backups (for obvious reasons ;)). even a package upgrade only reloads the service, and running backup tasks are still handled by the old process.

what do the logs on the PBS side say (both journal, and the backup job task there)?
 
Nov 16, 2022
70
8
8
Sweden
there is no task that aborts backups
No, what I meant was some kind of job that messes with the disk/daemon/network and in turn aborts the backup, since it's done over LAN. Even a 1 ms interruption seems to be enough for the chunks to get corrupted, or for something else bad to happen.

I also find it strange that it manages to back up smaller VMs, up to around 40 GB, but not larger ones over 150 GB. I tried raising RAM to 16 GB on the PBS server (it was 4 GB), but it didn't help. I also removed DHCP on the firewall (it had both static and DHCP before, same address), but that didn't help either.

Got a night's sleep, and now I'm starting to think that it's not the QNAP mount over NFS that's the issue here, but the connection between PVE and PBS. I don't know if I told you yet, but PBS is running as a VM on PVE. The same setup worked with Veeam for years, and I don't see why it wouldn't work here as well. But again, you seem picky with your choices. ;)

PVE = 192.168.1.20
PBS = 192.168.1.21

192.168.1.X is a regular LAN without any VLAN tags. I use it for management, and it's on 10 GbE, like the rest of the installation. In my normal setups I use DHCP on the client and lock the address in the firewall, but I noticed that this doesn't work well with Proxmox, so I removed DHCP altogether on those hosts. I still run DHCP for other devices in the VLAN though. I use the latest OPNsense as firewall.

Another strange thing is that when I get those "timeouts" (PBS datastore can't be reached from PVE + web dies + ssh dies), I can still log in through the console and run ls -la /mnt/qnap. That's why I'm starting to think there's something wrong with running PBS as a VM on the same host that does the actual backups.

Anyway, here are the most recent logs, and the config for PBS:

1670020607997.png

1670020581357.png

Btw, thanks for noticing this issue even if it's not supported! :) Would indeed be nice to solve this.
 
Nov 16, 2022
70
8
8
Sweden
Also, it always seems to happen after this shows up in syslog:
Dec 01 07:23:18 pbs proxmox-backup-proxy[1071]: starting rrd data sync
Dec 01 07:23:18 pbs proxmox-backup-proxy[1071]: rrd journal successfully committed (20 files in 0.025 seconds)

But I'm just guessing wildly here.
 
Nov 16, 2022
70
8
8
Sweden
And the connection seems to be OK when running iperf between PBS and PVE:

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.1.20 port 5001 connected with 192.168.1.21 port 35604
[ ID] Interval Transfer Bandwidth
[ 4] 0.0000-9.9962 sec 15.9 GBytes 13.6 Gbits/sec

Curious though, can I force NFS to talk TCP only? My mount looks like this:

10.255.255.10:/PBS on /mnt/qnap type nfs (rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.255.10,mountvers=3,mountport=30000,mountproto=tcp,local_lock=none,addr=10.255.255.10)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=1637832k,nr_inodes=409458,mode=700,inode64)
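
On the TCP question: the mount output above already lists proto=tcp and mountproto=tcp, which suggests this mount is using TCP for both the NFS traffic and the MOUNT protocol. A minimal sketch for checking that from a `mount` line follows; the function names `parse_mount_options` and `uses_tcp_only` are made up for illustration and are not part of any Proxmox or NFS tooling.

```python
import re

# Sample line copied from the `mount` output in this post.
MOUNT_LINE = (
    "10.255.255.10:/PBS on /mnt/qnap type nfs "
    "(rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,"
    "proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.255.10,"
    "mountvers=3,mountport=30000,mountproto=tcp,local_lock=none,"
    "addr=10.255.255.10)"
)


def parse_mount_options(line: str) -> dict[str, str]:
    """Return the options inside the trailing parentheses as a dict.

    Flags without a value (for example `rw` or `hard`) map to an empty string.
    """
    match = re.search(r"\(([^)]*)\)\s*$", line)
    if match is None:
        raise ValueError("no option list found in mount line")
    options = {}
    for item in match.group(1).split(","):
        key, _, value = item.partition("=")
        options[key] = value
    return options


def uses_tcp_only(options: dict[str, str]) -> bool:
    """True when both the NFS transport and the MOUNT protocol are set to TCP."""
    return options.get("proto") == "tcp" and options.get("mountproto") == "tcp"


opts = parse_mount_options(MOUNT_LINE)
print("TCP only:", uses_tcp_only(opts))
print("NFS version:", opts["vers"])
```

To pin this explicitly instead of relying on negotiation, the same two options (proto=tcp and mountproto=tcp) can be listed in the fstab entry.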
 
Nov 16, 2022
70
8
8
Sweden
Question: would it be possible to run PBS on the same physical server as PVE by just installing the Debian packages? Is that against recommendations for some reason?

If it works, that would remove a layer between PBS and the backup server. In that case I'll just mount the QNAP on the PVE host instead. Just a thought.
 

Dunuin

Famous Member
Jun 30, 2020
9,532
2,501
156
Germany
Question: would it be possible to run PBS on the same physical server as PVE by just installing the Debian packages? Is that against recommendations for some reason?
That works and is even covered by the PBS documentation: https://pbs.proxmox.com/docs/installation.html#install-proxmox-backup-server-on-proxmox-ve
The downside: PVE host backups are on the to-do list, but once they exist you of course wouldn't be able to restore a PVE host from your PBS if that PBS is installed bare metal on the PVE host that failed.
 
Nov 16, 2022
70
8
8
Sweden
OK @Dunuin, since you've had this configured since PBS 1.0 and say it's working well, how did you do it?

This is a basic sketch of the layout:
1670106946339.png

The purple text refers to this:
1670106987693.png
It happens randomly, and a reboot of PBS is needed for it to come back. I get a successful status on garbage collection even if the state is as above, so that tells me the datastore is alive. It's rather the connection between PBS and PVE that now seems to be the issue.

PVE backups work without a hitch, but PBS fails due to the 500 error. I use the setup shown above for the PBS VM, and I haven't yet tried another NIC model such as Intel E1000.

I can also mention that the qemu-guest-agent is installed on PBS.

Any ideas?
 

Dunuin

Famous Member
Jun 30, 2020
9,532
2,501
156
Germany
fstab in PBS VM:
Code:
#NFS PBS
192.168.49.4:/mnt/HDDpool/VeryLongDatasetName/VLT/NRM/PBS /mnt/pbs  nfs      defaults,nfsvers=3    0       0

datastore.cfg:
Code:
cat /etc/proxmox-backup/datastore.cfg
datastore: PBS_DS1
        comment for weekly and manual stop backups
        gc-schedule sun 07:00
        path /mnt/pbs

PBS Dataset in TrueNAS:
truenas1.png


Dataset rights in TrueNAS:
truenas2.png

NFS share in TrueNAS:
truenas3.png

NFS service:
truenas4.png

PBS Datastore:
pbs1.png

PBS VM on TrueNAS Core:
vm1.png

So nothing really special.
 
Nov 16, 2022
70
8
8
Sweden
Thanks @Dunuin! Seems like you only back up smaller VMs?

So, I just tried removing all the large VMs (over 150 GB) and everything went smoothly, all successful.
Now I'm trying another run with 40 GB RAM on the PBS VM, since I noticed my 16 GB got used up when the large ones were starting.

Crossing fingers now!
 

Dunuin

Famous Member
Jun 30, 2020
9,532
2,501
156
Germany
Thanks @Dunuin! Seems like you only back up smaller VMs?
Yup, the biggest backup is 2x 200 GB disks. I don't store a large amount of data on my guests. If there is any cold data, the guest will use SMB/NFS shares. And the cold data is backed up by ZFS replication to another server with ZFS. So my guests only contain the system, DBs and so on.
 
Nov 16, 2022
70
8
8
Sweden
Hmm, I think I need to re-think my structure a bit. I was building this setup to save energy, and combined a VMware ESXi host and TrueNAS fileserver into one, and still benefit from ZFS. But right now backups are suffering and I'm starting to think it might be a good idea to start the old TrueNAS host again and make it a Proxmox Backup server....

So much for saving energy. :(

I can also conclude there must be a memory leak somewhere, because I gave PBS 40 GB of RAM, and as soon as it started on the larger VM (after around 50 GB transferred) it started to eat RAM. As soon as the 40 GB was used up, it timed out. This is a screenshot of the current I/O:

1670149950010.png

And the average usage (day)
1670150026947.png

And the backup log:

INFO: Starting Backup of VM 130 (qemu)
INFO: Backup started at 2022-12-04 01:12:19
INFO: status = running
INFO: VM Name: Windows10PRO
INFO: include disk 'scsi0' 'mainstorage:vm-130-disk-0' 100G
INFO: include disk 'scsi1' 'mainstorage:vm-130-disk-1' 4T
INFO: include disk 'scsi2' 'mainstorage:vm-130-disk-2' 2T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/130/2022-12-04T00:12:19Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '646c9bdd-e9cf-47d3-897e-963cc245e38f'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi1: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi2: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: 0% (360.0 MiB of 6.1 TiB) in 3s, read: 120.0 MiB/s, write: 109.3 MiB/s
INFO: 1% (62.4 GiB of 6.1 TiB) in 11m 56s, read: 89.2 MiB/s, write: 69.3 MiB/s
INFO: 2% (124.9 GiB of 6.1 TiB) in 23m 53s, read: 89.2 MiB/s, write: 60.6 MiB/s
INFO: 2% (142.9 GiB of 6.1 TiB) in 28m 14s, read: 70.6 MiB/s, write: 63.3 MiB/s
ERROR: backup write data failed: command error: connection reset
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 130 failed - backup write data failed: command error: connection reset
INFO: Failed at 2022-12-04 01:40:37
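
For a sense of scale, the last progress line before the failure can be turned into an average rate and a rough estimate for the full 6.1 TiB. This is only a back-of-the-envelope sketch: it assumes the read rate stays constant and that nothing is skipped (no sparse or already-known chunks), and the helper name `parse_progress` and the `UNIT_TO_MIB` table are invented for this example.

```python
import re

# Last progress line from the backup log before "connection reset".
LAST_PROGRESS = (
    "INFO: 2% (142.9 GiB of 6.1 TiB) in 28m 14s, "
    "read: 70.6 MiB/s, write: 63.3 MiB/s"
)

UNIT_TO_MIB = {"MiB": 1, "GiB": 1024, "TiB": 1024 * 1024}

PROGRESS_RE = re.compile(
    r"\(([\d.]+) (MiB|GiB|TiB) of ([\d.]+) (MiB|GiB|TiB)\) "
    r"in (?:(\d+)h )?(?:(\d+)m )?(\d+)s"
)


def parse_progress(line: str) -> tuple[float, float, int]:
    """Return (MiB done, MiB total, elapsed seconds) from a progress line."""
    match = PROGRESS_RE.search(line)
    if match is None:
        raise ValueError("not a progress line")
    done = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    total = float(match.group(3)) * UNIT_TO_MIB[match.group(4)]
    hours = int(match.group(5) or 0)
    minutes = int(match.group(6) or 0)
    seconds = int(match.group(7))
    return done, total, hours * 3600 + minutes * 60 + seconds


done_mib, total_mib, elapsed_s = parse_progress(LAST_PROGRESS)
avg_rate = done_mib / elapsed_s                   # MiB/s averaged over the run
estimated_hours = total_mib / avg_rate / 3600     # full-backup estimate

print(f"average rate: {avg_rate:.1f} MiB/s")
print(f"estimated full run: {estimated_hours:.1f} h")
```

With the numbers from this log that works out to roughly 86 MiB/s on average and about 20 hours for a complete first run, so the connection has to stay up for most of a day for this VM to finish.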

I also changed to file for the transfer instead of filesystem.
 
Nov 16, 2022
70
8
8
Sweden
OK, update:

Since RAM usage kept building up on the PBS server, I started to investigate the NFS cache. I checked the export options on the QNAP export, and added sync with wdelay according to this: https://www.qnap.com/en/how-to/faq/...wdelay-and-secure-in-nfs-host-access-settings

I also added this in the mount on PBS: https://stackoverflow.com/a/57916352

I then lowered RAM on PBS to 8 GB, and started a new sync. So far it's been steady at 1.5 GB RAM usage, and even if the transfer is slower, it works (200 GB transferred)!

Next step will be to see if I can enable any cache at all on either side - but this backup will take 24 hours+ if it succeeds, so new update will be tomorrow.
 

Dunuin

Famous Member
Jun 30, 2020
9,532
2,501
156
Germany
Did you check what that RAM is used for?

I've got a dedicated host just for backups and wrote a script that will power on the backup host, boot it, unlock the encrypted ZFS pools, monitor the backup/replication and scrub tasks using the APIs, and then shut down that host and cut the power when everything has finished. That way it is still energy efficient, as the backup server doesn't have to run longer than actually needed.
 
Nov 16, 2022
70
8
8
Sweden
Did you check what that RAM is used for?

Running htop only showed 5 GB used, but free stated that it (around 30 GB) was used for cache. dmesg showed something about FS-Cache so that's why I started to look into that.

Probably it was a combination of async on the client side and no sync (wdelay) on the server side that was one of the issues.
Right now I can report that it stopped at ~300 GB while using 8 GB RAM. I will try some more tweaks and see where I end up.
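
To tell page cache apart from memory that processes really hold, the Cached, Dirty and Writeback fields in /proc/meminfo can be watched while a backup runs. A minimal sketch, assuming a Linux guest; the `SAMPLE` values are invented for illustration (they are not measurements from this PBS VM), and the helper names are made up.

```python
from pathlib import Path

# Illustrative values only, NOT measured on the PBS VM in this thread.
SAMPLE = """\
MemTotal:       41943040 kB
MemFree:         1048576 kB
Cached:         31457280 kB
Dirty:           4194304 kB
Writeback:        524288 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields: dict[str, int]) -> dict[str, float]:
    """Return page cache, dirty and writeback sizes in GiB."""
    keys = ("Cached", "Dirty", "Writeback")
    return {key: round(fields.get(key, 0) / 1024 / 1024, 2) for key in keys}


# Use the live file on Linux, fall back to the sample elsewhere.
meminfo = Path("/proc/meminfo")
text = meminfo.read_text() if meminfo.exists() else SAMPLE
print(cache_summary(parse_meminfo(text)))
```

A Dirty value that keeps growing during a backup would point at writes being buffered on the client faster than the NFS server accepts them, which would fit the async theory, though that remains an assumption until it is actually observed.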
 
Nov 16, 2022
70
8
8
Sweden
Current status: I'm now using jumbo frames (MTU 9000) on the dedicated DAC link between PBS and the QNAP. I also removed defaults from /etc/fstab to avoid the async option.

Running mount on PBS:
10.255.255.10:/PBS on /mnt/qnap type nfs (rw,relatime,sync,vers=3,rsize=262144,wsize=262144,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,noac,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.255.10,mountvers=3,mountport=30000,mountproto=tcp,lookupcache=none,local_lock=none,addr=10.255.255.10)

cat /etc/exports on QNAP:
"/share/CACHEDEV1_DATA/PBS" 10.255.255.12(sec=sys,rw,sync,wdelay,secure,no_subtree_check,no_root_squash,fsid=79f31dbada044d5b1ad055457a33c8bf) 192.168.1.20(sec=sys,rw,sync,wdelay,secure,no_subtree_check,no_root_squash,fsid=79f31dbada044d5b1ad055457a33c8bf) 192.168.1.21(sec=sys,rw,sync,wdelay,secure,no_subtree_check,no_root_squash,fsid=79f31dbada044d5b1ad055457a33c8bf)

Current /etc/fstab on PBS:
# NFS
10.255.255.10:/PBS /mnt/qnap nfs rw,suid,dev,exec,auto,sync,nouser,fg,noac,lookupcache=none,mountproto=tcp,nfsvers=3

Also set "ballooning device" on PBS with 8 GB minimum, and 32 GB max.

Running now, let's see how it turns out.
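
One thing worth checking with jumbo frames is whether a full 9000-byte packet actually passes between PBS and the QNAP without fragmentation. A small sketch that builds the matching Linux `ping` command: the payload is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header, and `-M do` forbids fragmentation. The helper names are made up, and it assumes plain IPv4 without IP options and the iputils `ping` found on Debian.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError("MTU too small for an ICMP echo request")
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host: str, mtu: int, count: int = 3) -> list[str]:
    """Build an iputils ping invocation that prohibits fragmentation (-M do)."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", str(count), "-s", str(payload), host]


# 10.255.255.10 is the QNAP address from the fstab entry in this post.
print(" ".join(ping_command("10.255.255.10", 9000)))
```

If that ping fails while a normal-sized one works, some hop (NIC, bridge or the QNAP port) is probably not set to MTU 9000, and large NFS writes could stall in a similar way.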
 
Nov 16, 2022
70
8
8
Sweden
OK, it died in the middle of a transfer, for no apparent reason. RAM wasn't filled, it just died.

Now reverting jumbo frames and removing the ballooning device. Giving it 32 GB static.
 
