At our university, we operate a large standalone Ceph cluster that provides storage for our virtual machines. We use this cluster for several purposes: virtual machine storage, S3 object storage, and CephFS for Samba.
Overall, the performance is stable, and the cluster operates smoothly for most workloads. However, we are encountering significant issues with Windows virtual machines when using RBD as the primary disk storage (both for boot and data disks).
Issue Description
During high I/O workloads on Windows VMs, we experience severe performance degradation accompanied by numerous errors in the PVE kernel logs. The errors appear as:
Code:
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000d9278a57 data crc 623151463 != exp. 3643241286
[Thu Oct 10 13:23:42 2024] libceph: osd39 (1)158.*.*.*:7049 bad crc/signature
[Thu Oct 10 13:23:42 2024] libceph: osd76 (1)158.*.*.*:6978 bad crc/signature
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000eeebfe2b data crc 3093401367 != exp. 937192709
[Thu Oct 10 13:23:42 2024] libceph: read_partial_message 00000000113dffc8 data crc 830701436 != exp. 4041960045
[Thu Oct 10 13:23:42 2024] libceph: osd80 (1)158.*.*.*:6805 bad crc/signature
These errors typically occur during I/O-intensive operations (e.g., large file transfers or database workloads) on Windows VMs. Interestingly, Linux VMs do not exhibit the same problem even under similar workloads, indicating a Windows-specific issue.
Root Cause Analysis
After investigating Ceph's RBD documentation and reviewing the kernel logs, we identified that these errors are likely related to how Windows handles large I/O operations. Specifically, Windows may map a temporary "dummy" page into the destination buffer to generate a single large I/O, so the buffer is not stable while data is being received. Because of this, the CRC that libceph computes over the received data no longer matches the expected value, resulting in bad crc/signature errors and performance degradation.
Solution: Enable rxbounce
According to the Ceph RBD documentation, enabling the rxbounce map option can resolve these issues. The rxbounce option forces the kernel RBD client to receive data into a bounce buffer, which keeps the destination buffer stable during the transfer. This is particularly necessary for Windows I/O behavior.
Source: Ceph RBD documentation
rxbounce: Use a bounce buffer when receiving data (introduced in kernel version 5.17). The default behavior is to read directly into the destination buffer, but this can cause problems if the destination buffer isn't stable. A bounce buffer is needed if the destination buffer may change during read operations, such as when Windows maps a temporary page to generate a single large I/O. Otherwise, libceph: bad crc/signature or libceph: integrity error messages can occur, along with performance degradation.
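Before changing anything in Proxmox itself, the effect can be sanity-checked by mapping a scratch RBD image with the option by hand. Below is a minimal Perl sketch of such a test; the pool and image names are placeholders, and the image must already exist:
Perl:
#!/usr/bin/perl
# Minimal test sketch: map an existing scratch RBD image with the rxbounce
# kernel map option, so a Windows VM (or another I/O test) can be pointed at
# it before patching RBDPlugin.pm. Pool/image names below are placeholders.
use strict;
use warnings;

my ($pool, $image) = ('testpool', 'rxbounce-test');

# Same invocation the storage plugin builds internally:
#   rbd map --options rxbounce <pool>/<image>
my @cmd = ('rbd', 'map', '--options', 'rxbounce', "$pool/$image");
system(@cmd) == 0
    or die "rbd map with rxbounce failed (exit code: " . ($? >> 8) . ")\n";

print "Mapped $pool/$image with rxbounce; unmap later with: rbd unmap $pool/$image\n";
If the bad crc/signature messages stay away under the same workload on the test mapping, the plugin change below is worth pursuing.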
Implementation
To enable this option, I modified RBDPlugin.pm to pass the rxbounce parameter via the @options variable in map_volume. After applying the change, the errors stopped appearing, and the performance of Windows VMs improved significantly. Here is a brief overview of the change:
Perl:
sub map_volume {
    my ($class, $storeid, $scfg, $volname, $snapname) = @_;

    my ($vtype, $img_name, $vmid) = $class->parse_volname($volname);

    my $name = $img_name;
    $name .= '@'.$snapname if $snapname;

    my $kerneldev = get_rbd_dev_path($scfg, $storeid, $name);

    return $kerneldev if -b $kerneldev; # already mapped

    # features can only be enabled/disabled for image, not for snapshot!
    $krbd_feature_update->($scfg, $storeid, $img_name);

    # added this @options variable as a proof of concept
    my @options = (
        '--options', 'rxbounce',
    );

    my $cmd = $rbd_cmd->($scfg, $storeid, 'map', @options, $name);
    run_rbd_command($cmd, errmsg => "can't map rbd volume $name");
    return $kerneldev;
}
Outcome
With rxbounce enabled, the performance of Windows VMs is now on par with Linux VMs, and the kernel logs are free of the bad crc/signature errors. This change has stabilized our Windows virtual machine workloads and made the Ceph cluster reliable for mixed-OS environments.
Suggestion
It would be useful to have a field in the RBD storage configuration for adding custom map options, rather than patching RBDPlugin.pm by hand; a rough sketch of what that could look like follows.
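As an illustration only (untested, and the property name krbd-map-options is purely hypothetical, not an existing Proxmox option), RBDPlugin.pm could expose such a value through its storage properties and hand it to map_volume:
Perl:
# Untested sketch - 'krbd-map-options' is a hypothetical property name.
# 1) Expose the field in the RBD storage schema (it would presumably also
#    need to be registered in the plugin's options() list, omitted here):
sub properties {
    return {
        # ... existing RBD properties (monhost, pool, krbd, ...) ...
        'krbd-map-options' => {
            description => "Additional options passed to 'rbd map' (e.g. 'rxbounce').",
            type => 'string',
        },
    };
}

# 2) In map_volume(), build @options from the storage configuration instead
#    of hard-coding rxbounce:
my @options = defined($scfg->{'krbd-map-options'})
    ? ('--options', $scfg->{'krbd-map-options'})
    : ();
my $cmd = $rbd_cmd->($scfg, $storeid, 'map', @options, $name);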
PVE version
Code:
proxmox-ve: 8.2.0 (running kernel: 6.5.11-8-pve)
pve-manager: 8.2.7 (running version: 8.2.7/3e0176e6bb2ade3b)
proxmox-kernel-helper: 8.1.0
pve-kernel-5.15: 7.4-9
proxmox-kernel-6.8: 6.8.12-2
proxmox-kernel-6.8.12-2-pve-signed: 6.8.12-2
proxmox-kernel-6.8.8-4-pve-signed: 6.8.8-4
proxmox-kernel-6.5.13-6-pve-signed: 6.5.13-6
proxmox-kernel-6.5: 6.5.13-6
proxmox-kernel-6.5.11-8-pve-signed: 6.5.11-8
pve-kernel-5.15.131-2-pve: 5.15.131-3
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx9
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.4
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.7
libpve-cluster-perl: 8.0.7
libpve-common-perl: 8.2.3
libpve-guest-common-perl: 5.1.4
libpve-http-server-perl: 5.1.1
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.10
libpve-storage-perl: 8.2.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-4
proxmox-backup-client: 3.2.7-1
proxmox-backup-file-restore: 3.2.7-1
proxmox-firewall: 0.5.0
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.7
pve-container: 5.2.0
pve-docs: 8.2.3
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.0.7
pve-firmware: 3.13-2
pve-ha-manager: 4.0.5
pve-i18n: 3.2.3
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.4
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.6-pve1