[SOLVED] [Unsupported] NFS doesn't work on PBS

Dunuin · Dec 1, 2022

Above I see

enoch85 said:
proxmox-backup-manager datastore create truenasnfs /path/to/nfs/share

as well as

proxmox-backup-client datastore create truenasnfs /path/to/nfs/share

enoch85 · Dec 1, 2022

manager is the correct one. I wrote that off the top of my head at work; now I've checked. Works here, but still fails during backup.

This time:

ERROR: backup write data failed: command error: connection reset

I'm starting to suspect a default scheduled task that aborts the whole thing. I didn't add any, so maybe PBS has something built in that restarts a service every now and then?

enoch85 · Dec 1, 2022

Works directly from PVE, but not from PBS. Below is PVE. Notice the read speed. On PBS that's around 50 MiB/s. Also strange to me. They use the same cable.

ZooKeeper · Dec 2, 2022

enoch85 said:
Are you running the command from within PBS?

View attachment 44037

Yes, I am running it from PBS. By the way, if you compare your command with the one @Dunuin quoted in his reply, it seems someone edited it.

Here is new error:

fabian · Dec 2, 2022

enoch85 said:
manager is the correct one. I wrote that off the top of my head at work; now I've checked. Works here, but still fails during backup.

This time:

ERROR: backup write data failed: command error: connection reset

I'm starting to suspect a default scheduled task that aborts the whole thing. I didn't add any, so maybe PBS has something built in that restarts a service every now and then?

there is no task that aborts backups (for obvious reasons). even a package upgrade will only reload the service, with old backup tasks still handled by the old process.

what do the logs on the PBS side say (both journal, and the backup job task there)?

enoch85 · Dec 2, 2022

fabian said:
there is no task that aborts backups

No, what I meant was some kind of job that messes with the disk/daemon/network that in turn aborts the backup since it's done over LAN. Because 1 ms seems to be enough for the chunks to get corrupted, or something else bad happening.

I also find it strange that it manages to back up smaller VMs, like max 40 GB, but not larger ones over 150 GB. I tried raising RAM to 16 GB on the PBS server (was 4 GB) but it didn't help. I also removed DHCP on the firewall (it had both static and DHCP before, same address), but that didn't help either.

Got a night's sleep, and now I'm starting to think that it's not the QNAP mount over NFS that's the issue here; it's the connection between PVE and PBS. I don't know if I told you yet, but PBS is running as a VM on PVE. The same setup worked with Veeam for years, and I don't see why it wouldn't work here as well. But again, you seem picky with your choices.

PVE = 192.168.1.20
PBS = 192.168.1.21

192.168.1.X is a regular LAN without any VLAN tags. I use it for management, and it's over 10 GbE, like the rest of the installation. In my normal setups I use DHCP on the client and lock that in the firewall, but I noticed that doesn't play well with Proxmox, so I removed DHCP altogether on those hosts. I still run DHCP on other stuff in the VLAN though. I use the latest OPNsense for the firewall.

Another strange thing is that when I get those "timeouts" (PBS datastore can't be reached from PVE + web dies + ssh dies) I can still log in through the console and ls -la /mnt/qnap. So that's why I'm starting to think there's something up with running PBS as a VM on the same host that does the actual backups.

Anyway, here are the most recent logs, and the config for PBS:

Btw, thanks for noticing this issue even if it's not supported!

Would indeed be nice to solve this.

enoch85 · Dec 3, 2022

Also, it always seems to happen after this shows up in syslog:

Dec 01 07:23:18 pbs proxmox-backup-proxy[1071]: starting rrd data sync
Dec 01 07:23:18 pbs proxmox-backup-proxy[1071]: rrd journal successfully committed (20 files in 0.025 seconds)

But I'm just wild guessing here.

enoch85 · Dec 3, 2022

And connection seems to be OK when doing iPerf between PBS and PVE:

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.20 port 5001 connected with 192.168.1.21 port 35604
[ ID] Interval       Transfer     Bandwidth
[  4] 0.0000-9.9962 sec  15.9 GBytes  13.6 Gbits/sec

Curious though, can I force NFS to talk TCP only? My mount looks like this:

10.255.255.10:/PBS on /mnt/qnap type nfs (rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.255.10,mountvers=3,mountport=30000,mountproto=tcp,local_lock=none,addr=10.255.255.10)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=1637832k,nr_inodes=409458,mode=700,inode64)
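
For what it's worth, the mount line above already answers this: per nfs(5), proto=tcp is the transport for the NFS traffic itself and mountproto=tcp is the transport for the MOUNT protocol, so both are on TCP here. A small sketch for pulling those values out of mount output; nfs_opt is a hypothetical helper name, and the sample line is a shortened copy of the one above.

```shell
#!/bin/sh
# Print the value of one option from a line in `mount` output format.
# Usage: nfs_opt <option-name> <mount-line>
# (nfs_opt is a made-up helper for illustration, not part of any NFS tooling.)
nfs_opt() {
    # Keep only the parenthesised option list, split it on commas,
    # then print the value of the option whose name matches exactly.
    printf '%s\n' "$2" | sed -n 's/.*(\(.*\)).*/\1/p' | tr ',' '\n' |
        sed -n "s/^$1=//p"
}

# Shortened copy of the mount line from this thread.
line='10.255.255.10:/PBS on /mnt/qnap type nfs (rw,relatime,vers=3,hard,proto=tcp,timeo=600,mountproto=tcp,addr=10.255.255.10)'

echo "data transport:  $(nfs_opt proto "$line")"
echo "mount transport: $(nfs_opt mountproto "$line")"
```

On the live system the line could come from `mount | grep ' /mnt/qnap '` instead of the hard-coded sample.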

enoch85 · Dec 3, 2022

Question, would it be possible to run PBS on the same physical server as PVE by just installing the Debian packages? Is that against recommendation for some reason?

If it works, that would remove a layer between PBS and the backup server. In that case I'll just mount QNAP to the PVE instead. Just a thought.

Dunuin · Dec 3, 2022

enoch85 said:
Question, would it be possible to run PBS on the same physical server as PVE by just installing the Debian packages? Is that against recommendation for some reason?

That works and is even covered by the PBS documentation: https://pbs.proxmox.com/docs/installation.html#install-proxmox-backup-server-on-proxmox-ve
The downside: PVE host backups are on the to-do list, but once they exist you of course wouldn't be able to restore a PVE host from your PBS if that PBS is installed bare metal on the PVE host that failed.

enoch85 · Dec 3, 2022

OK @Dunuin, since you've had this configured since PBS 1.0 and you say it's working well, how did you do it?

This is a basic sketch of the layout:

The purple text refers to this:

It happens randomly, and a reboot of PBS is needed for it to come back. I get a successful status on garbage collection even if the state is as above, so that tells me the datastore is alive. It's rather the connection between PBS and PVE that now seems to be the issue.

PVE backups work without a hitch but PBS fails due to the 500 error. I use the setup as shown above for the PBS VM, and I haven't yet tried to use another NIC like Intel 1000 or something else.

I can also mention that the qemu-guest-agent is installed on PBS.

Any ideas?

Dunuin · Dec 3, 2022

fstab in PBS VM:

Code:

#NFS PBS
192.168.49.4:/mnt/HDDpool/VeryLongDatasetName/VLT/NRM/PBS /mnt/pbs  nfs      defaults,nfsvers=3    0       0

datastore.cfg:

Code:

cat /etc/proxmox-backup/datastore.cfg
datastore: PBS_DS1
        comment for weekly and manual stop backups
        gc-schedule sun 07:00
        path /mnt/pbs

PBS Dataset in TrueNAS:

Dataset rights in TrueNAS:

NFS share in TrueNAS:

NFS service:

PBS Datastore:

PBS VM on TrueNAS Core:

So nothing really special.

enoch85 · Dec 4, 2022

Thanks @Dunuin! Seems like you only back up smaller VMs?

So, just tried to remove all the large VMs (over 150 GB) and everything went smooth, all successful.
So now trying another run with 40 GB RAM to the PBS VM since I noticed my 16 GB got smashed when the large ones were starting.

Crossing fingers now!

Dunuin · Dec 4, 2022

enoch85 said:
Thanks @Dunuin! Seems like you only back up smaller VMs?

Yup, the biggest backup is 2x 200 GB disks. I don't store a large amount of data on my guests. If there is any cold data, the guest will use SMB/NFS shares. And the cold data is backed up by ZFS replication to another server with ZFS. So my guests only contain the system, DBs and so on.

enoch85 · Dec 4, 2022

Hmm, I think I need to re-think my structure a bit. I was building this setup to save energy, and combined a VMware ESXi host and TrueNAS fileserver into one, and still benefit from ZFS. But right now backups are suffering and I'm starting to think it might be a good idea to start the old TrueNAS host again and make it a Proxmox Backup server....

So much for saving energy.

I can also conclude there must be a memory leak somewhere, because I gave PBS 40 GB of RAM, and as soon as it started with the larger VM (after around 50 GB) it started to eat RAM. As soon as the 40 GB was used up, it timed out. This is a screenshot from the current I/O:

And the average usage (day)

And the backup log:

INFO: Starting Backup of VM 130 (qemu)
INFO: Backup started at 2022-12-04 01:12:19
INFO: status = running
INFO: VM Name: Windows10PRO
INFO: include disk 'scsi0' 'mainstorage:vm-130-disk-0' 100G
INFO: include disk 'scsi1' 'mainstorage:vm-130-disk-1' 4T
INFO: include disk 'scsi2' 'mainstorage:vm-130-disk-2' 2T
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating Proxmox Backup Server archive 'vm/130/2022-12-04T00:12:19Z'
INFO: issuing guest-agent 'fs-freeze' command
INFO: issuing guest-agent 'fs-thaw' command
INFO: started backup task '646c9bdd-e9cf-47d3-897e-963cc245e38f'
INFO: resuming VM again
INFO: scsi0: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi1: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO: scsi2: dirty-bitmap status: existing bitmap was invalid and has been cleared
INFO:   0% (360.0 MiB of 6.1 TiB) in 3s, read: 120.0 MiB/s, write: 109.3 MiB/s
INFO:   1% (62.4 GiB of 6.1 TiB) in 11m 56s, read: 89.2 MiB/s, write: 69.3 MiB/s
INFO:   2% (124.9 GiB of 6.1 TiB) in 23m 53s, read: 89.2 MiB/s, write: 60.6 MiB/s
INFO:   2% (142.9 GiB of 6.1 TiB) in 28m 14s, read: 70.6 MiB/s, write: 63.3 MiB/s
ERROR: backup write data failed: command error: connection reset
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 130 failed - backup write data failed: command error: connection reset
INFO: Failed at 2022-12-04 01:40:37
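
For scale, a back-of-the-envelope estimate from the log above: at the roughly 89 MiB/s read rate it reports, the full 6.1 TiB would need about 20 hours even if the connection never dropped. A minimal sketch of that arithmetic; eta_hours is a hypothetical helper, and it assumes a steady rate, which the log suggests is optimistic.

```shell
#!/bin/sh
# Estimated hours to read <size_tib> TiB at a steady <rate_mib_s> MiB/s.
# Usage: eta_hours <size_tib> <rate_mib_s>
# (eta_hours is a made-up helper for illustration.)
eta_hours() {
    # TiB -> MiB is a factor of 1024 * 1024; seconds -> hours is / 3600.
    # LC_ALL=C keeps the decimal separator a dot regardless of locale.
    LC_ALL=C awk -v t="$1" -v r="$2" \
        'BEGIN { printf "%.1f\n", t * 1024 * 1024 / r / 3600 }'
}

# Figures from the backup log: 6.1 TiB total, 89.2 MiB/s read rate.
eta_hours 6.1 89.2
```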

I also changed to file for the transfer instead of filesystem.

enoch85 · Dec 4, 2022

OK, update:

Since RAM was building on the PBS server, I started to investigate NFS cache. I checked the export options on the QNAP export, and added sync with wdelay according to this: https://www.qnap.com/en/how-to/faq/...wdelay-and-secure-in-nfs-host-access-settings

I also added this in the mount on PBS: https://stackoverflow.com/a/57916352

I then lowered RAM on PBS to 8 GB, and started a new sync. So far it's been steady at 1.5 GB RAM usage, and even if the transfer is slower, it works (200 GB transferred)!

Next step will be to see if I can enable any cache at all on either side - but this backup will take 24 hours+ if it succeeds, so new update will be tomorrow.

Dunuin · Dec 4, 2022

Did you check what that RAM is used for?

I've got a dedicated host just for backups and wrote a script that will power on the backup host, boot it, unlock the encrypted ZFS pools, monitor the backups/replication and scrub tasks using the APIs, and then shut down that host and cut the power when everything has finished. That way it is still energy efficient, as the backup server doesn't have to run longer than actually needed.

enoch85 · Dec 4, 2022

Dunuin said:
Did you check what that RAM is used for?

Running htop only showed 5 GB used, but free stated that it (around 30 GB) was used for cache. dmesg showed something about FS-Cache so that's why I started to look into that.
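
That gap between htop and free is what page cache usually looks like: htop generally leaves cache out of its "used" figure, while free lists it under buff/cache. A rough sketch for watching cache and not-yet-written (dirty) data while a backup runs, assuming a Linux /proc/meminfo; mem_summary is a hypothetical helper name, not a PBS tool.

```shell
#!/bin/sh
# Summarise page cache and pending writes from meminfo-formatted text on stdin.
# /proc/meminfo reports values in kB; they are converted to whole MiB here.
# (mem_summary is a made-up helper for illustration.)
mem_summary() {
    awk '$1 ~ /^(MemTotal|MemAvailable|Cached|Dirty|Writeback):$/ {
             printf "%-14s %d MiB\n", $1, $2 / 1024
         }'
}

# On the PBS VM itself, while a backup is running:
if [ -r /proc/meminfo ]; then
    mem_summary < /proc/meminfo
fi
```

A large Cached value on its own is harmless because the kernel can reclaim it; Dirty and Writeback growing steadily would point at writes piling up faster than the NFS server accepts them.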

Probably it was a combination of async on the client side and no sync (wdelay) on the server side that was one of the issues.
Right now I can report that it stopped at ~300 GB using 8 GB RAM. I will try to do some more tweaks and see where I end up.

enoch85 · Dec 4, 2022

Current status: now using Jumbo Frames (MTU 9000) on the dedicated DAC between the PBS and QNAP. Also removed defaults in /etc/fstab to avoid the async option.
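
One way to confirm that 9000-byte frames really pass end to end is a ping with fragmentation prohibited. The payload has to leave room for the 20-byte IPv4 header and the 8-byte ICMP header. A minimal sketch; icmp_payload is a hypothetical helper, and the ping flags assume Linux iputils. The script only prints the command to run from PBS.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
# (icmp_payload is a made-up helper for illustration.)
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

size=$(icmp_payload 9000)

# -M do sets the Don't Fragment bit (Linux iputils ping), so the ping
# fails if any hop on the path has a smaller MTU.
echo "ping -M do -s $size -c 3 10.255.255.10"
```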

Running mount on PBS:

10.255.255.10:/PBS on /mnt/qnap type nfs (rw,relatime,sync,vers=3,rsize=262144,wsize=262144,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,noac,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.255.255.10,mountvers=3,mountport=30000,mountproto=tcp,lookupcache=none,local_lock=none,addr=10.255.255.10)

cat /etc/exports on QNAP:

"/share/CACHEDEV1_DATA/PBS" 10.255.255.12(sec=sys,rw,sync,wdelay,secure,no_subtree_check,no_root_squash,fsid=79f31dbada044d5b1ad055457a33c8bf) 192.168.1.20(sec=sys,rw,sync,wdelay,secure,no_subtree_check,no_root_squash,fsid=79f31dbada044d5b1ad055457a33c8bf) 192.168.1.21(sec=sys,rw,sync,wdelay,secure,no_subtree_check,no_root_squash,fsid=79f31dbada044d5b1ad055457a33c8bf)

Current /etc/fstab on PBS:

# NFS
10.255.255.10:/PBS /mnt/qnap  nfs rw,suid,dev,exec,auto,sync,nouser,fg,noac,lookupcache=none,mountproto=tcp,nfsvers=3
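
Per mount(8), defaults expands to rw, suid, dev, exec, auto, nouser and async, which is why the entry above spells the options out and swaps in sync. A small sketch for checking that an option string contains what you expect; missing_opts is a hypothetical helper name.

```shell
#!/bin/sh
# Print each wanted option that is absent from a comma-separated option string.
# Usage: missing_opts <options> <wanted>...
# (missing_opts is a made-up helper for illustration.)
missing_opts() {
    # Wrap the list in commas so every option can be matched as ",name,".
    opts=",$1,"
    shift
    for want in "$@"; do
        case "$opts" in
            *",$want,"*) ;;                  # present, nothing to report
            *) printf '%s\n' "$want" ;;      # absent, print it
        esac
    done
}

# Option string copied from the fstab entry above.
fstab_opts='rw,suid,dev,exec,auto,sync,nouser,fg,noac,lookupcache=none,mountproto=tcp,nfsvers=3'

# Prints nothing when all three options are present.
missing_opts "$fstab_opts" sync noac lookupcache=none
```

The same check can be run against the option list of the live mount, since the kernel may report different effective options than the ones requested in fstab.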

Also set "ballooning device" on PBS with 8 GB minimum, and 32 GB max.

Running now, let's see how it turns out.

enoch85 · Dec 4, 2022

OK, it died in the middle of a transfer for no apparent reason. RAM wasn't filled, it just died.

Now reverting Jumbo Frames and removing the ballooning device. Giving it 32 GB static.
