VM-hosted SMB/CIFS storage causes ISO/template download corruption

pvefanmark

New Member
Jan 29, 2024
Hi there,

I have an Ubuntu VM in PVE (pve1) that serves as a Samba server. Let's call it fs (file server). In PVE/Storage, I created an SMB/CIFS repository called fsRepo that uses the Samba share. fsRepo stores backups, ISOs, and templates. Backup/restore all work well.

However, I noticed that if I use the PVE GUI to download an ISO or container template into the SMB/CIFS share, the image is frequently corrupted (about 90% of the time): checksums do not match even though the file size does. I further found I can reproduce the corruption by executing wget in the PVE shell with the Samba share as the target.

It doesn't matter which ISO, which template, or which URL. There are no suspicious errors in the Samba server log. Even files as small as 100-200MB get corrupted.

I did some experiments to troubleshoot the issue. I would like to hear suggestions, and whether I should open a case in the bug tracker.

The environment:
  1. PVE version: pve-manager/8.1.3/b46aac3b42da5d15 (running kernel: 6.5.11-4-pve)
  2. Installed from the 8.1.3 PVE ISO, no special customization or updates done.
To summarize the experiments:
  1. If wget is executed in the PVE host's root shell with the SMB/CIFS mount as the target, corruption happens unless I limit the download rate to 5MB/s.
  2. If wget is executed in one VM with the SMB/CIFS mount as the target, and the Samba share is served by another VM on the same PVE host, the download does not corrupt even at high download rates.
  3. #1 and #2 can be reproduced independently on two PVE hosts, one a Mac Mini and the other a Zotac mini PC.
  4. If wget is executed on one PVE host and the Samba server is on another PVE host, corruption does not happen.
  5. Even if the Samba server shares a ramdisk (e.g. /tmp/sharedFolder), the same pattern as #1 and #2 holds.
It appears that data corruption happens only when wget runs on the PVE host with a target in a Samba share hosted on the same PVE host. The corruption can be eliminated by restricting the download rate (and thus the rate of writes to Samba).
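For reference, a minimal version of the failing and passing cases looks like this (the URL is just an example, any large ISO triggers it; /mnt/pve/fsRepo is where PVE mounts the storage by default):
Code:
# download straight into the SMB/CIFS mount -- corrupts ~90% of the time for me
wget -O /mnt/pve/fsRepo/template/iso/test.iso https://example.com/some-large.iso
sha256sum /mnt/pve/fsRepo/template/iso/test.iso   # does not match the published checksum

# same download capped at 5MB/s -- no corruption
wget --limit-rate=5m -O /mnt/pve/fsRepo/template/iso/test.iso https://example.com/some-large.iso
sha256sum /mnt/pve/fsRepo/template/iso/test.iso   # matches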

The Samba server is an Ubuntu VM. I have been using it for the past 6-7 years for all file storage/access, and there have been no issues. To rule it out, I plan to create a Samba share with a TurnKey file server LXC. If the pattern can be reproduced with the LXC instance, it would rule out the Samba server.

Thanks.
 
One possibility is that you have low-end consumer disks. When you pull data on the same server, the reads and writes go to the same disk. Maybe it's overwhelmed and is lying about completion. Maybe there is also some sort of caching enabled along the path.
Just a guess. IMHO, it's unlikely to be either SMB/VM or the OS/wget. Most likely disk related. You can do a controlled test by attaching another disk to node1 and pulling the ISO from disk1 to disk2. Or, if you have enough RAM, create a ramdisk for temporary storage.

Again, just a guess.
Good luck


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Good suggestion. I forgot to document that I did exactly that. Let me add it to the original post.

One of the tests used a Samba server backed by a ramdisk (/tmp/bugRepo). The corruption still occurs with the same pattern, i.e. wget from the PVE host corrupts, wget from a VM on the PVE host does not.
 
A few points:
- "Corruption" is a pretty vague problem description. I don't necessarily need to know every detail, but it'd be good to understand what exactly you mean by that word.
- There are no other reports similar to yours, so it's more likely than not that it's limited to your environment. The causes could range from a bad disk, to bad memory, to the motherboard, to something completely different.
- Since the issue is scoped to your specific environment (running consumer hardware), it's up to you to do the legwork of trying to narrow down the problem.
- You can report the issue at bugzilla.proxmox.com. However, without a reliable reproduction procedure, and without being entitled to support, the issue is unlikely to receive much attention.
- If I were you and felt very strongly about finding the cause, things that can be done (a rough sketch of a) and b) follows below):
a) use fio to create files of various sizes (100m/200m/500m/1g/etc.) with a predictable pattern. Try to determine at what point the "corruption" occurs
b) run tcpdump on each side to determine whether good data is received by the target and changed on write, or already changed when sent
c) use other protocols (ftp, nfs, ssh) to determine any commonality in behavior
d) use fio to do extensive read/write/verify operations on the hypervisor, the VM, and between the two
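For a) and b), something along these lines (the mount path, interface, and server address are examples; adjust to your setup):
Code:
# a) write files with a verifiable pattern onto the share, then read back and verify
fio --name=smbverify --directory=/mnt/pve/fsRepo --size=500m --bs=1m \
    --rw=write --verify=crc32c --do_verify=1 --end_fsync=1
# repeat with --size=100m/200m/1g to find where it starts failing

# b) capture SMB traffic on the client while reproducing (run a matching capture on the server)
tcpdump -i vmbr0 -w /tmp/smb-client.pcap host 192.168.1.50 and port 445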

Good luck.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Thank you for the suggestion. I will dig more.
 
I have some updates on this. The problem now occurs on a 3rd, brand-new mini PC as well, with a brand-new Samba VM built from a fresh Ubuntu 22.04 install.

1. I have used vdiff to compare the corrupted ISO file with the clean one. The corruption often starts after roughly 100MB of clean bytes, then a stretch of corrupt data follows. There is no definitive pattern.
2. I can reproduce the corruption using wget in the PVE host shell with a target directory on the SMB share. If I use aria2c to download instead, the corruption doesn't happen.
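For locating the corrupt stretch, cmp against a known-good copy does the job (file names are examples):
Code:
cmp good.iso corrupt.iso              # prints the offset of the first differing byte
cmp -l good.iso corrupt.iso | head    # lists differing bytes (offset, octal values)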

On the 3rd brand-new mini PC where I deployed PVE, the same problem also occurs, and it is even more peculiar:

1. I created a brand-new SMB server using the latest Ubuntu 22.04 install.
2. The PVE built-in ISO download feature corrupts the ISO when the target is the new SMB share, AND the download passes the built-in SHA256 check, even though the file finally stored on the SMB share is corrupt.
3. On the PVE host, wget of the ISO with a target directory on the Samba share also corrupts.

I initially reported in this thread that the corruption only happens when the PVE host downloads into a Samba share served by a VM hosted on the same PVE host; if wget is issued from another VM on the same host, or from another PVE host's shell, the corruption doesn't happen, and purposely slowing down the download speed (via a wget option) also prevents corruption.

This is no longer true: I can now get corruption when the Samba VM is on machine #3 (which is 30x faster than #1 or #2) but the download runs on another PVE host (wget or built-in download). I suspect it's because the Samba VM is much faster now.

Sorry, I haven't found any new insight other than that the download corrupts on all 3 of my PVE hosts, as long as I use a Samba share as the storage.
 
is this for a specific file, or all files?
-edit- if the download passes the checksum, you're getting the same file as was provided. maybe the file is corrupt at the source?
Practically all ISO files, as long as they're not too small (i.e. <100MB).

On the second point, I am scratching my head as well. The file is not corrupt at the source. In this case, on my 3rd mini PC, I downloaded TrueNAS Core, which has a sha256sum that I must enter into the PVE download UI so it is verified after the download. It passes that check! And yet the resulting file is corrupt:

1. the TrueNAS Core installer booted from the ISO on the SMB share complains about file corruption.
2. if I log in to the Samba server and run sha256sum locally, it doesn't match. No wonder the install fails.

I even purposely entered a wrong sha256sum on the PVE downloader page to make sure it actually checks. Yes, it catches that the checksum doesn't match.

I can't imagine a scenario where the PVE built-in downloader passes the checksum but the actual file on disk doesn't match. Unless the Samba server or Samba client caches the data, and what it serves back to the downloader differs from what is on the disk?
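One way I can think of to test that caching theory (the paths are examples from my setup):
Code:
# on the samba server VM: flush the page cache, then hash what is actually on disk
sync
echo 3 > /proc/sys/vm/drop_caches
sha256sum /srv/share/template/iso/test.iso

# on the PVE host: hash the same file through the CIFS mount and compare
sha256sum /mnt/pve/fsRepo/template/iso/test.iso
If the two hashes disagree, something between the CIFS client and the disk is serving stale data.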

I have also tested copying the ISO out of the Samba share to a Windows machine. That copy never corrupts. I certainly hope so, because I have been using this Samba server/VM for the past 6-7 years with a large quantity of personal data. It would be a nightmare if the corruption were frequent.
 
Maybe you're back to "your disk is lying to you" (or the application is).


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
Yeah. Except that I tested the Samba server using a RAM disk as the storage (/tmp/isoRepo), and it corrupts too.

I'm starting to suspect that something in PVE's Samba client is at fault. The Samba client mounts the share when the SMB storage is added via the PVE/Storage menu.
This would explain the inexplicable case where the built-in downloader passes the checksum test and yet the file on disk is corrupt.

I've checked the PVE wget version; it's the same up-to-date version as the wget I used in the other tests (wget in a VM to the Samba share).
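To see what the PVE-managed mount actually looks like (options, SMB dialect, cache mode), I can run this on the host:
Code:
findmnt -t cifs       # PVE-managed CIFS mounts and their options
mount | grep cifs     # same information in the classic format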
 
I'm starting to suspect that something in PVE's Samba client is at fault.
there is nothing "special" about the samba client on pve; it's the generic client provided by the Debian samba package, and it is a stable and mature product.

what is providing the target samba server (hardware/software)? is it experiencing the same type of behavior if you use nfs/sshfs instead of smb? the hardware part is of particular interest, especially the nic.
 
The 3 machines in question are:

1. zotac ID81 mini PC
2. Mac Mini 2009 edition
3. Beelink SER5 5560U model (new machine).

They all run the same (latest) PVE version with no special customization done. I assume you want to know the NIC brands/models?

I should add one more piece of new information (I have edited the last post to include it):

I initially reported that the corruption only happens when the PVE host downloads into a Samba share served by a VM hosted on the same PVE host; if wget is issued from another VM on the same host, or from another host, the corruption doesn't happen, and purposely slowing down the download speed (via a wget option) also prevents corruption.

This is no longer true. I can now get corruption when the Samba VM is on machine #3 (which is 30x faster than #1 or #2) but the download runs on another PVE host (wget or built-in download). I suspect it's because the Samba VM is much faster now.

In summary, it appears that as long as the download client is on a PVE host (wget in the host shell, or the built-in ISO download feature in the web GUI) and a Samba share is the target, the corruption happens.

I haven't tried NFS or SSHFS. I may try those, but I suspect the problem won't appear there. As I said, the old Samba VM (Ubuntu 16) has moved huge amounts of personal data problem-free over the past 7 years, and the new Samba VM (freshly installed Ubuntu 22.04 LTS, with Samba minimally configured) also shows the corruption issue. The common thread seems to be wget on the PVE host; aria2c on the PVE host doesn't corrupt the download with all other factors kept the same as in the corrupting wget tests.

I will do some experiments with SSHFS.
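Roughly like this, I think (the hostname, user, and paths are placeholders):
Code:
# mount the same folder over SSHFS instead of SMB, then repeat a failing download
apt install sshfs
mkdir -p /mnt/sshfs-test
sshfs someuser@fs:/srv/share /mnt/sshfs-test
wget -O /mnt/sshfs-test/test.iso https://example.com/some-large.iso
sha256sum /mnt/sshfs-test/test.iso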
 
Here is my situation:

I run proxmox on two machines, let's call them "H" and "D", where H is a "home server" and D is a "desktop with multi-seat".

- H runs a TrueNAS VM, let's call this "VMhNAS".
- D runs my Linux desktop VM (with GPU passthrough) as well as a Windows desktop VM (again with GPU passthrough, it has two GPUs).

Now, D has mounted SMB/CIFS storage from VMhNAS that runs on H. I have ISOs for various Linux distros there.

I found a huge ISO (6GB) to perform testing here: the Qubes OS image.

When I try to download a new ISO image from the Proxmox UI of D and target storage on VMhNAS, the file becomes corrupted.

  • Downloading from UI: checksum simply fails. In this case D is downloading from the internet, then sending the data to VMhNAS which is running on H.
So I broke up the process to see where the corruption could be coming from:
  • First, I downloaded straight to VMhNAS by logging into TrueNAS. Here VMhNAS is running on H, downloading from the internet and storing to disks attached to H. This is much faster as there is one less indirection:
Code:
root@NAS:/mnt/family-tank/Incoming/pve_storage/template/iso# wget https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
--2024-04-07 14:07:47--  https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
Resolving ftp.qubes-os.org (ftp.qubes-os.org)... 147.75.102.29, 2604:1380:4601:c500::1
Connecting to ftp.qubes-os.org (ftp.qubes-os.org)|147.75.102.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6628073472 (6.2G) [application/octet-stream]
Saving to: ‘Qubes-R4.2.1-x86_64.iso’

Qubes-R4.2.1-x86_64.iso                        100%[=================================================================================================>]   6.17G  97.9MB/s    in 64s     

2024-04-07 14:08:51 (98.8 MB/s) - ‘Qubes-R4.2.1-x86_64.iso’ saved [6628073472/6628073472]

root@NAS:/mnt/family-tank/Incoming/pve_storage/template/iso# sha512sum Qubes-R4.2.1-x86_64.iso                       
f4315893e7189782f56653197b4a2ab4be163900f31a4a2506bf157e2f318447bca09d6acb6f4abb13fb91d7bfa0687af333c20854085a2cc9490fe0f3e07784  Qubes-R4.2.1-x86_64.iso

So I can download and write to the disks at a speed of nearly 100MB/s with no issues. The checksum above is correct.

  • My next experiment was to download to D from the command line, but store to local storage. Here D is downloading from the internet but storing locally (as opposed to writing to the remote CIFS share on VMhNAS). This also works perfectly fine, even though it is much slower (I have a 1Gbps link, but the internal link from H to D goes via WiFi 6, so it is slower):
Code:
root@pve:~# wget https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso^C
root@pve:~# mkdir t
root@pve:~# cd t
root@pve:~/t# wget https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
--2024-04-07 14:55:08--  https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
Resolving ftp.qubes-os.org (ftp.qubes-os.org)... 147.75.102.29, 2604:1380:4601:c500::1
Connecting to ftp.qubes-os.org (ftp.qubes-os.org)|147.75.102.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6628073472 (6.2G) [application/octet-stream]
Saving to: ‘Qubes-R4.2.1-x86_64.iso’

Qubes-R4.2.1-x86_64.iso                        100%[=================================================================================================>]   6.17G  32.0MB/s    in 3m 40s 

2024-04-07 14:58:48 (28.7 MB/s) - ‘Qubes-R4.2.1-x86_64.iso’ saved [6628073472/6628073472]

root@pve:~/t# sha512sum Qubes-R4.2.1-x86_64.iso
f4315893e7189782f56653197b4a2ab4be163900f31a4a2506bf157e2f318447bca09d6acb6f4abb13fb91d7bfa0687af333c20854085a2cc9490fe0f3e07784  Qubes-R4.2.1-x86_64.iso

Still no corruption as long as only one host (D) is involved.
  • So the next experiment is to download on D from the command line and store on VMhNAS. This should be similar to what the Proxmox UI does on D, where it downloads from the internet but then stores to remote storage on VMhNAS. Indeed, this produced a file with an incorrect checksum:
Code:
root@pve:/mnt/pve/pve_storage/template/iso# mkdir t
root@pve:/mnt/pve/pve_storage/template/iso# cd t
root@pve:/mnt/pve/pve_storage/template/iso/t# wget https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
--2024-04-07 14:10:33--  https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
Resolving ftp.qubes-os.org (ftp.qubes-os.org)... 147.75.102.29, 2604:1380:4601:c500::1
Connecting to ftp.qubes-os.org (ftp.qubes-os.org)|147.75.102.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6628073472 (6.2G) [application/octet-stream]
Saving to: ‘Qubes-R4.2.1-x86_64.iso’

Qubes-R4.2.1-x86_64.iso                        100%[=================================================================================================>]   6.17G  27.4MB/s    in 7m 53s 

2024-04-07 14:18:42 (13.4 MB/s) - ‘Qubes-R4.2.1-x86_64.iso’ saved [6628073472/6628073472]

root@pve:/mnt/pve/pve_storage/template/iso/t# sha512sum Qubes-R4.2.1-x86_64.iso
0bee5cc684a9c62c5ea0ca169daeb15b8da51a90eb4d1875064d0218b41fdb4e88a55091fe595e1998a02645d25d6ef06fc6e686efa1c2b207c0b6679905042a  Qubes-R4.2.1-x86_64.iso

As you can see, the checksum now differs from the two above. I can download straight to VMhNAS on H, or straight to D on local storage, but I cannot download on D and store to VMhNAS.

  • My next experiment was a simple command-line copy from the local storage on D to the remote storage folder. So here I am taking the known-good image I downloaded previously to D's local storage in `/root/t` and sending it to VMhNAS storage on H: there is no "internet downloading" part. Here is the result: the new file, which I copied to the same location as the corrupt one, is perfectly fine:

Code:
root@pve:/mnt/pve/pve_storage/template/iso/t# cp /root/t/Qubes-R4.2.1-x86_64.iso Qubes-R4.2.1-x86_64.iso.CORRECT
root@pve:/mnt/pve/pve_storage/template/iso/t# sha512sum *
0bee5cc684a9c62c5ea0ca169daeb15b8da51a90eb4d1875064d0218b41fdb4e88a55091fe595e1998a02645d25d6ef06fc6e686efa1c2b207c0b6679905042a  Qubes-R4.2.1-x86_64.iso
f4315893e7189782f56653197b4a2ab4be163900f31a4a2506bf157e2f318447bca09d6acb6f4abb13fb91d7bfa0687af333c20854085a2cc9490fe0f3e07784  Qubes-R4.2.1-x86_64.iso.CORRECT

So it is the combination of downloading from remote AND saving to remote that does not work, even though the disk I/O is slower in this case...

  • Finally, just for completeness, I used the Proxmox GUI on H to download to the same storage shared by VMhNAS. Here H is downloading from the internet and then storing to the "remote" storage of VMhNAS, though this VM is running on the very same host H. So there is a level of indirection, but no WiFi network involved. This works OK:
Code:
...
6422528K ........ ........ ........ ........ 99% 99.9M 0s
6455296K ........ ........ .                100%  106M=65s
2024-04-07 16:40:07 (98.0 MB/s) - '/mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.tmp_dwnl.4023825' saved [6628073472/6628073472]
download of 'https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso' to '/mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso' finished
TASK OK

As you can see, the problem seems to happen in my case when the WiFi link gets saturated?
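One more test I might try: take the internet out of the loop but keep both network legs, by streaming a known-good copy from another machine straight into the CIFS mount (the source host and path are placeholders):
Code:
# network-in plus network-out, with no internet download involved
ssh someuser@otherhost 'cat /path/to/Qubes-R4.2.1-x86_64.iso' \
    > /mnt/pve/pve_storage/template/iso/t/Qubes.iso.sshtest
sha512sum /mnt/pve/pve_storage/template/iso/t/Qubes.iso.sshtest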

Does anyone have any ideas on what to check? Anything else I could do to provide more information?
 
Hello,

I seem to be experiencing the same issue. Did you ever get to the bottom of this?
Hi, I have not found the root cause. However, I have done a lot of experiments and concluded that as long as I use SMB/CIFS as storage for PVE, downloads will either fail the checksum or end up corrupt on disk.

Your experiments are interesting. I still suspect there is something wrong with the built-in SMB client on the PVE host.

In your last experiment, your SMB client and SMB server are on the same PVE host, and you didn't experience corruption. I wouldn't draw firm conclusions too quickly from this one experiment. I suspect there are race conditions, and changing one small thing can make the issue disappear. For example, as documented in this thread, the corruption was initially reported to occur only when the SMB client and SMB server are co-hosted on the same PVE host, but later I experienced the same corruption with the SMB client and SMB server on different PVE hosts. It seems a slower download speed makes it less likely to occur; for example, if I intentionally add a speed limit in wget, the problem disappears.

Later I found that the aria2c downloader does not cause corruption, and this is consistent. It doesn't mean wget is at fault; I think aria2c likely introduces an element of change that makes the corruption go away (for instance, aria2c by default preallocates the file and writes downloaded segments at different offsets, so its write pattern to the share differs from wget's sequential stream). Because wget is in wide use and the version shipped with the PVE host is up to date, it's unlikely that wget itself causes the corruption.

I'd love to see this problem get reported and fixed. It's hard to believe this is a very rare bug. What may be rare is that you and I use CIFS as storage for PVE ISO images. In your case you use TrueNAS, and I use vanilla Ubuntu 16 or 22 LTS as the CIFS server, and both of us encounter the same issue. So I'd think the problem is on the PVE host (SMB client) side; that is my speculation.
 
...In your last experiment, your SMB client and SMB server are on the same PVE host, and you didn't experience corruption. I wouldn't draw firm conclusions too quickly from this one experiment. I suspect there are race conditions, and changing one small thing can make the issue disappear.

Well, here's another experiment that might surprise you. I used both wget and curl to download the file just now from the command line. Both were run on host D and wrote remotely to the Samba share mounted from VMhNAS. They both worked:

Code:
root@pve:~# curl -o /mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.curl https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6321M  100 6321M    0     0  17.0M      0  0:06:10  0:06:10 --:--:-- 9367k
root@pve:~# sha512sum /mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.curl
f4315893e7189782f56653197b4a2ab4be163900f31a4a2506bf157e2f318447bca09d6acb6f4abb13fb91d7bfa0687af333c20854085a2cc9490fe0f3e07784  /mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.curl


root@pve:~# wget --progress=dot:giga -O /mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.wget https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
--2024-04-08 00:15:58--  https://ftp.qubes-os.org/iso/Qubes-R4.2.1-x86_64.iso
Resolving ftp.qubes-os.org (ftp.qubes-os.org)... 147.75.102.29, 2604:1380:4601:c500::1
Connecting to ftp.qubes-os.org (ftp.qubes-os.org)|147.75.102.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6628073472 (6.2G) [application/octet-stream]
Saving to: ‘/mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.wget’

     0K ........ ........ ........ ........  0% 53.4M 1m58s
 32768K ........ ........ ........ ........  1% 48.3M 2m3s
 ...
 6422528K ........ ........ ........ ........ 99% 16.2M 1s
6455296K ........ ........ .                100% 19.9M=4m56s

2024-04-08 00:20:54 (21.3 MB/s) - ‘/mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.wget’ saved [6628073472/6628073472]

root@pve:~# sha512sum /mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.wget
f4315893e7189782f56653197b4a2ab4be163900f31a4a2506bf157e2f318447bca09d6acb6f4abb13fb91d7bfa0687af333c20854085a2cc9490fe0f3e07784  /mnt/pve/pve_storage/template/iso/Qubes-R4.2.1-x86_64.iso.wget

Both curl and wget were able to download successfully...

Very weird bug.
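Given how intermittent this is, a small loop to repeat the download and log the checksums might help characterize it (the URL is a placeholder):
Code:
# repeat the download several times; any checksum mismatch between runs flags corruption
URL=https://example.com/some-large.iso
for i in 1 2 3 4 5; do
    wget -q -O /mnt/pve/pve_storage/template/iso/run$i.iso "$URL"
    sha512sum /mnt/pve/pve_storage/template/iso/run$i.iso
done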
 
Just an update. I have recently upgraded from PVE 8.1 to 8.2.4. After the upgrade, the download corruption bug persists: downloading anything large (say, over 100MB) into Samba storage still corrupts the file and fails the checksum.
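My next idea is to rule the client-side cache in or out by mounting the share by hand with caching disabled and repeating a failing download (just a sketch; PVE normally manages this mount itself, and the share, user, and URL below are placeholders):
Code:
# manual CIFS mount with client-side caching off (you will be prompted for the password)
mkdir -p /mnt/cifs-test
mount -t cifs //fs/isoRepo /mnt/cifs-test -o username=someuser,cache=none
wget -O /mnt/cifs-test/test.iso https://example.com/some-large.iso
sha256sum /mnt/cifs-test/test.iso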
 
