atkuzmanov

New Member
May 24, 2024
Sofia, Bulgaria
Hope everyone is doing well!


Thanks to everyone at ProxMox for their amazing work and effort, I am a huge fan. And thanks to anyone willing to help; I really appreciate it, it means a lot to me!


I am trying to use rclone to first sync and then check several source and destination directories.


The destination is an SSD.


I am using rclone to sync around 2TB of data to an SSD.

When I tried to run the rclone check command against the SSD destination, the Ubuntu machine froze up and became absolutely unresponsive.
The only way out was to do a HARD shutdown and start it again.

Lowering the number of checkers, even down to 1 (`--checkers=1`), did not help.

I played with all the parameters/flags in the rclone documentation which seemed relevant, but could not solve the problem.

It looks like when it runs against faster media such as an SSD, it goes over some kind of limit and crashes the Ubuntu machine.

I thought it could be an out-of-memory issue, but all the indicators on the Ubuntu Server machine seem fine.
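(For what it's worth, the only way I know to check whether the OOM killer fired, assuming the journal from the crashed boot survived the hard reset, would be something like this; `-b -1` selects the previous boot.)

Code:
# Look for OOM killer activity in the kernel log of the previous (crashed) boot
journalctl -k -b -1 | grep -i "out of memory"
# Or in the current kernel ring buffer
sudo dmesg | grep -i oom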

This is the rclone command which is causing the issue:

Code:
rclone check "$src" "$dest" \
  --checkers=1 \
  --fast-list \
  --multi-thread-streams=0 \
  --buffer-size=0 \
  --one-way \
  --checksum \
  --log-file="$LOG_FILE" \
  --log-level=DEBUG \
  --retries 3 \
  --retries-sleep 3s \
  --progress


The full script I use to run the rclone sync and rclone check commands is in the pastebin below.

https://pastebin.com/PxLjAPv5


---

Ubuntu Server version:

Code:
ubuntu 24.04 (64 bit)

---

Rclone version:

Code:
rclone --version
rclone v1.67.0
- os/version: ubuntu 24.04 (64 bit)
- os/kernel: 6.8.0-40-generic (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.22.4
- go/linking: static
- go/tags: none

---

Rclone config:

https://pastebin.com/FVWG21Ab

---

Here are some logs in pastebin:


The log file was more than 90MB; I had to cut out the mundane output from the sync and the check to fit it within pastebin's limits.


The log just cuts off while the check is running; the last part of the log is genuine and has not been cut down to fit in pastebin.


https://pastebin.com/HRpim1yX


This log is a bit different, as it has some strange symbols at the end:


This is the latest log; for some reason it does not show in full in pastebin unless you click on "raw" to view it in its entirety:


https://pastebin.com/aA0ERkmu


The log files from the rclone sync and rclone check commands don't contain anything which seems to point to an issue.

When the crash happens, the log file just stops at the file which was being processed at that moment.

---

It's a brand new SSD, and I ran a SMART test on it; all seems good:

Code:
sudo smartctl -H /dev/sdc
[sudo] password for usertemp:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-40-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
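(The `-H` health flag alone does not say much; a fuller dump of the attributes and the device error log, if that helps, would be something like:)

Code:
# Full SMART attributes, capabilities and error log for the destination SSD
sudo smartctl -x /dev/sdc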

---

The setup:


Code:
ProxMox PVE 8.2.4


All sources are locally mounted `TrueNAS SCALE` SMB shares.

TrueNAS SCALE is another VM running on the same `ProxMox PVE` node, as is the `Ubuntu Server` VM running `rclone`.


The TrueNAS SCALE VM shows no sign of unusual load before, during, or after the crash; it just seems to be chugging along without feeling stressed.


The HDDs and SSD are all directly connected to the computer via a `SAS controller`. The SAS controller is exclusively given to the VM running Ubuntu Server and I see no errors on this side.


Some of the shares have normal or larger files, others have loads of small files.


Ubuntu Server VM:

Code:
- 8 Cores
- 8GB RAM

TrueNAS SCALE VM:

Code:
- 4 Cores
- 22GB RAM
- 2 x 6TB NAS HDDs 5400RPM in RAID1
- 1 x 1TB SSD for read cache

---

I researched how to debug Ubuntu Server as best I can, and here is what I have come up with so far:

Ubuntu Server DEBUG information:

Code:
sudo cat /var/log/kern.log | grep error

https://pastebin.com/S5VS61Tx


Code:
sudo cat /var/log/syslog | grep error

https://pastebin.com/kM3jxVwb
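(One more thing I can pull, assuming the journal from the crashed boot survived the hard reset, is the kernel log of the previous boot, filtered for warnings and hung-task messages:)

Code:
# Kernel messages from the previous boot, warnings and above
journalctl -k -b -1 -p warning
# Hung-task messages usually read "task ... blocked for more than 120 seconds"
journalctl -k -b -1 | grep -i "blocked for more"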

---

It seems like it's not rclone itself that is freezing the OS, but something related to I/O when reading/writing large amounts of data; I just don't know how to debug it.


I even tried using `ionice -c2 -n7 nice -n 10 rclone check` to reduce the priority of the rclone command to try and throttle the read/write ops, but no joy.
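(To actually watch the I/O pressure while the check runs, the tools I know of are iostat, from the sysstat package, and vmstat, kept running in a second SSH session, something like:)

Code:
# Per-device utilisation, queue depth and latency, refreshed every second
iostat -xz 1
# Memory, swap and the number of processes blocked on I/O (the "b" column), once per second
vmstat 1
# Follow kernel messages live in case something is printed just before the freeze
sudo dmesg --follow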


When the crash happens, the Ubuntu Server VM becomes absolutely unresponsive: the guest agent stops running, I cannot SSH into the machine, and I cannot shut it down. Only issuing `STOP` to the VM works, or shutting down the whole ProxMox PVE node to attempt a more graceful shutdown of the VM itself.

I just want it to complete successfully.


I am going crazy trying to figure out what I am doing wrong.


Any help in debugging and fixing this is appreciated!
 

Attachments

  • Screenshot 2024-08-13 at 17.50.11.png (540.2 KB)
  • Screenshot 2024-08-13 at 17.50.50.png (615.4 KB)
  • Screenshot 2024-08-13 at 17.50.47.png (613.3 KB)
  • Screenshot 2024-08-13 at 17.50.43.png (601.4 KB)
  • Screenshot 2024-08-13 at 17.50.40.png (619 KB)
  • Screenshot 2024-08-13 at 17.50.33.png (545.8 KB)
  • Screenshot 2024-08-13 at 17.50.25.png (613.5 KB)
  • Screenshot 2024-08-13 at 17.50.22.png (614.8 KB)
  • Screenshot 2024-08-13 at 17.50.18.png (601.4 KB)
  • Screenshot 2024-08-13 at 17.50.15.png (606.5 KB)
Last edited:
I guess I will be deleting this thread and posting a new, more concise one, since I did not get any replies.

Can someone please delete this thread? I thought I could do it myself, but I don't see how to delete it.
 
Can someone please delete this thread? I thought I could do it myself, but I don't see how to delete it.
Deleting threads is not possible (except for staff members, but they only seem to do that for spam).

It's not clear to me what SSD drive you are using. ZFS on a QLC SSD is known to become unresponsive. ZFS with heavy swapping on the host can also lead to some kind of deadlock.
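To check the swapping case, something like this on the Proxmox host would show it (arc_summary comes with the ZFS tools):

Code:
# On the PVE host: swap usage and ZFS ARC size
free -h
swapon --show
arc_summary | head -n 30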
 
Deleting threads is not possible (except for staff members, but they only seem to do that for spam).

It's not clear to me what SSD drive you are using. ZFS on a QLC SSD is known to become unresponsive. ZFS with heavy swapping on the host can also lead to some kind of deadlock.
Hi and thank you for your reply!
I see...

So the SSD is a Samsung 870 EVO, 4TB, 2.5", SATA III. I just checked, and this is what I found:

"The 870 EVO uses TLC (or 3bit MLC, as dubbed by Samsung) 3D V-NAND, is available in capacities from 256GB up to 4TB, and features the company’s newest in-house controller." https://www.storagereview.com/review/samsung-870-evo-ssd-review

So, the flow of the data should be like this:

Code:
TrueNas Scale VM:                                                |           Ubuntu Server VM: 
[SOURCE: NAS-HDD5400RPM]->[CACHE: SSD-SAMSUNG-870EVO-1TB]---SMB-mounted-to-rclone--->[TARGET: SSD-SAMSUNG-870EVO-4TB - exFAT formatted]

ProxMox itself is running on a KINGSTON NV2 M.2-2280 PCIe 4.0 NVMe 2000GB SSD, which I guess is QLC, but if that were the issue, wouldn't the whole host freeze up?
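(To double-check the exact model and capacity from the host, something like this should list it; `nvme list` needs the nvme-cli package:)

Code:
# On the PVE host: block devices with their models
lsblk -o NAME,MODEL,SIZE,TYPE
# Or, with nvme-cli installed
nvme list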
 
ProxMox itself is running on a KINGSTON NV2 M.2-2280 PCIe 4.0 NVMe 2000GB SSD, which I guess is QLC, but if that were the issue, wouldn't the whole host freeze up?
Please note that it's called Proxmox VE, or PVE for short, or just Proxmox, but not ProxMox.
Are you running Proxmox on ZFS (maybe show /etc/pve/storage.cfg also)? I would not expect that to matter much if the VM and the source and destination are not on that drive. Unless Proxmox is logging a lot and that brings down the QLC drive.
Unless you use VirtIO SCSI single with IO Thread on each virtual disk, the virtual network and disks share a single thread for I/O (maybe show the relevant VM configuration files also) and that might make it unresponsive.
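For illustration, with VirtIO SCSI single and IO Thread enabled on the disk, the relevant lines in /etc/pve/qemu-server/<vmid>.conf would look something like this (the storage and disk names are just an example):

Code:
scsihw: virtio-scsi-single
scsi0: local-zfs:vm-100-disk-1,iothread=1,size=64G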
 
Please note that it's called Proxmox VE, or PVE for short, or just Proxmox, but not ProxMox.
Are you running Proxmox on ZFS (maybe show /etc/pve/storage.cfg also)? I would not expect that to matter much if the VM and the source and destination are not on that drive. Unless Proxmox is logging a lot and that brings down the QLC drive.
Unless you use VirtIO SCSI single with IO Thread on each virtual disk, the virtual network and disks share a single thread for I/O (maybe show the relevant VM configuration files also) and that might make it unresponsive.
Thank you again for your time, it means a lot to me!

Apologies, noted. For some reason typing it out as "ProxMox" comes naturally to me; it's not intentional, and I will try to be more careful.

I am a newbie to Proxmox, so please let me know if there are other files I can also provide for debug info.

I am not sure if I am using "VirtIO SCSI single with IO Thread on each virtual disk". Can you tell from the info below, or please let me know what other debug info you need?

Here is the /etc/pve/storage.cfg:

Bash:
dir: local
  path /var/lib/vz
  content backup,vztmpl,iso

zfspool: local-zfs
  pool rpool/data
  content images,rootdir
  sparse 1

pbs: pbs-store-1
  datastore tnsproxbkps1
  server 192.168.0.45
  content backup
  fingerprint XXX
  prune-backups keep-all=1
  username root@pam!rootpam-pbs-store-1

And here is the /etc/pve/qemu-server/100.conf:

Bash:
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;net0
cores: 4
cpu: x86-64-v2-AES
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:08:00,pcie=1
machine: q35
memory: 8192
meta: creation-qemu=9.0.0,ctime=1720772404
name: oddjob-workhorse-ubus-1
net0: virtio=BC:24:11:CC:55:4F,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: snap-man-vm100-oddjob-2024-07-30-1148
scsi0: local-zfs:vm-100-disk-1,cache=writethrough,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=cf16d928-21db-43da-877f-88b9f05bd29e
sockets: 1
tablet: 0
vmgenid: cae7a5bf-0063-4e33-9425-dfc0b8faf21c

[snap-man-vm100-oddjob-2024-07-30-1147]
#snap-man-vm100-oddjob-2024-07-30-1147 RAM inc
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v2-AES
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:08:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 8192
meta: creation-qemu=9.0.0,ctime=1720772404
name: oddjob-workhorse-ubus-1
net0: virtio=BC:24:11:CC:55:4F,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
runningcpu: qemu64,+aes,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+pni,+popcnt,+sse4.1,+sse4.2,+ssse3
runningmachine: pc-q35-9.0+pve0
scsi0: local-zfs:vm-100-disk-1,cache=writethrough,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=cf16d928-21db-43da-877f-88b9f05bd29e
snaptime: 1722329259
sockets: 1
tablet: 0
vmgenid: cae7a5bf-0063-4e33-9425-dfc0b8faf21c
vmstate: local-zfs:vm-100-state-snap-man-vm100-oddjob-2024-07-30-1147

[snap-man-vm100-oddjob-2024-07-30-1148]
#snap-man-vm100-oddjob-2024-07-30-1148 no RAM
agent: 1
balloon: 0
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 8
cpu: x86-64-v2-AES
efidisk0: local-zfs:vm-100-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:08:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 8192
meta: creation-qemu=9.0.0,ctime=1720772404
name: oddjob-workhorse-ubus-1
net0: virtio=BC:24:11:CC:55:4F,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: snap-man-vm100-oddjob-2024-07-30-1147
runningcpu: qemu64,+aes,enforce,+kvm_pv_eoi,+kvm_pv_unhalt,+pni,+popcnt,+sse4.1,+sse4.2,+ssse3
runningmachine: pc-q35-9.0+pve0
scsi0: local-zfs:vm-100-disk-1,cache=writethrough,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=cf16d928-21db-43da-877f-88b9f05bd29e
snaptime: 1722329283
sockets: 1
tablet: 0
vmgenid: cae7a5bf-0063-4e33-9425-dfc0b8faf21c
vmstate: local-zfs:vm-100-state-snap-man-vm100-oddjob-2024-07-30-1148


Here are also some screenshots of the configs:
 

Attachments

  • Screenshot 2024-08-18 at 13.41.43.png (693.7 KB)
  • Screenshot 2024-08-18 at 13.42.26.png (613 KB)
  • Screenshot 2024-08-18 at 13.42.51.png (545.8 KB)
Last edited:
Please note that it's called Proxmox VE, or PVE for short, or just Proxmox, but not ProxMox.
Are you running Proxmox on ZFS (maybe show /etc/pve/storage.cfg also)? I would not expect that to matter much if the VM and the source and destination are not on that drive. Unless Proxmox is logging a lot and that brings down the QLC drive.
Unless you use VirtIO SCSI single with IO Thread on each virtual disk, the virtual network and disks share a single thread for I/O (maybe show the relevant VM configuration files also) and that might make it unresponsive.
@leesteken, sorry to bother you, but have you had a chance to look at the information I provided in the previous comment? ( :
 
I have no idea, sorry.
That's ok, I appreciate your help anyway! : )

Just to make sure I understand correctly: as far as I can tell from my config, I am using VirtIO SCSI single. Is this the better option, as it gives me one thread per virtual disk, or should I switch to the other option?
 
As long as you enable IO Thread on each virtual disk, then yes.
Awesome, thank you!
So, I have iothread=1 in /etc/pve/qemu-server/100.conf, and it has just one disk.
Do I need to find a way to enable iothread=1 for the mounted SSD disk which is attached via SATA through the SAS controller to the VM? Does it count as a virtual disk?
And do I also need to find a way to enable iothread=1 for the SMB share mounts? Do they count as virtual disks?
 
Last edited:
... Unless Proxmox is logging a lot and that brings down the QLC drive. ...

@leesteken, hi, so I tried to run it with minimal (next to no) logging from rclone, and it still crashed.
There is probably some other logging by Proxmox or the VM which is still very heavy and bringing it down; can you please point me to where to look for it and how to reduce it?
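(The only VM-side logging I know how to inspect and cap so far is the systemd journal, though I am not sure that is the logging meant here:)

Code:
# How much space journald is currently using on the VM
journalctl --disk-usage
# It can be capped by setting SystemMaxUse= in /etc/systemd/journald.conf, then:
sudo systemctl restart systemd-journald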
 
There is probably some other logging by Proxmox or the VM which is still very heavy and bringing it down; can you please point me to where to look for it and how to reduce it?
I am really out of ideas, and I don't know what you mean by this. Maybe there is some kind of hardware instability in your system?
 
I am really out of ideas, and I don't know what you mean by this. Maybe there is some kind of hardware instability in your system?
That's ok, thanks anyway; I appreciate your previous help, it has given me some ideas.

I am trying to find a way to prove what the problem is; if it's the QLC drive, then I want to find a way to prove it.

I will replace it as soon as I can, but it will take some time as it will be costly: there are actually two NVMes running in RAID managed by Proxmox, which is how I set it up when installing Proxmox.
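(For reference, the pool layout and the exact drive models should show up on the host with something like this; the pool name rpool is taken from the storage.cfg above:)

Code:
# On the PVE host: health and layout of the mirrored pool, and per-device details
zpool status rpool
zpool list -v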
 
