Strange disk behaviour

lifeboy

Renowned Member
We're experiencing a problem with a FreeBSD KVM guest that works 100% on installation, but after a while starts complaining that it can't write to the disk anymore. What we have done so far:
  1. Moved the disk image off ceph to a lvm-thin volume
  2. Changed the disk from Virtio-SCSI to SATA and also IDE as a test
  3. Tried various disk options (SSD emulation on/off, discard on/off, async_io changes)
None of these make any difference.

Then I noticed that the storage usage indicator in the Proxmox GUI shows 139.59GB used by a disk that is set to 130GB.
[screenshot: Proxmox GUI storage usage for the VM disk]

Inside the guest there is ample space.

Code:
[iris@simba.xxxxxx /usr/local/iris]$ df -h
Filesystem        Size    Used   Avail Capacity  Mounted on
/dev/gpt/data0    110G     62G     39G    62%    /
devfs             1.0K    1.0K      0B   100%    /dev
[iris@simba.xxxxxx /usr/local/iris]$

The volume itself:

Code:
# lvdisplay /dev/pve/vm-199-disk-0
  --- Logical volume ---
  LV Path                /dev/pve/vm-199-disk-0
  LV Name                vm-199-disk-0
  VG Name                pve
  LV UUID                eQjtqp-fRIm-t4cP-rbiL-3mzQ-Gvy0-AkHEUA
  LV Write Access        read/write
  LV Creation host, time FT1-NodeD, 2023-09-14 14:03:04 +0200
  LV Pool name           data
  LV Status              available
  # open                 0
  LV Size                130.00 GiB
  Mapped size            84.29%
  Current LE             33280
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:20

What is going on here? Something doesn't add up....
 
Hi,
please check the size as indicated in the VM config, qm config <VMID>. How did you move the disk from Ceph to lvm-thin? Was this done via the WebUI? What is the exact error message you get inside the VM? Also, double-check the partitions on the disk regarding size and run an fsck of the filesystem inside the VM.
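
For reference, those checks could look roughly like this (a sketch; the VMID and the GPT label are taken from the other outputs in this thread, and the fsck pass shown is read-only):

Code:
# on the Proxmox host
qm config 199

# inside the FreeBSD guest
gpart show                  # verify the partition sizes match the 130G disk
fsck -n /dev/gpt/data0      # read-only check; run from single-user mode for a definitive result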
 
The VM config shows a 130GB allocation for the disk.

Code:
sata0: speedy:vm-199-disk-0,discard=on,size=130G,ssd=1

The guest is FreeBSD; fsck has been run very often, so that's not the issue.

The problem is that processes end up in a "D" state (waiting on disk). Eventually nothing on the machine makes progress anymore: every process that writes to disk is stuck waiting.

When the VM is reinstalled from scratch everything runs just fine, until at some point, weeks or months later, the problem recurs.

The disk allocation inside the VM is shown in the original post.
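
When it gets into that state, the stuck processes and what they are waiting on can be captured from inside the guest with stock FreeBSD tools (a sketch; <PID> stands for one of the stuck processes):

Code:
ps ax -o pid,state,wchan,command | awk '$2 ~ /^D/'   # list processes in disk wait
procstat -kk <PID>                                   # kernel stack of a stuck process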
 
Oh, I overlooked in the first post that the value from the WebUI is given in GB, while the lvdisplay output is given in GiB; once you convert the values, everything matches up, so no issue there.

Please post the systemd journal of the host from around the time when the IO stalls in the VM appear, journalctl --since <DATETIME> --until <DATETIME>, as well as your pveversion -v and the VM config qm config <VMID>. Is there a backup job running when the issue appears? Also, I would suggest sticking to virtio-scsi. Do you notice high IO delay on the host when the issue happens?
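
For reference, converting the LV size makes it explicit that the two numbers describe the same quantity:

Code:
130 GiB = 130 x 1024^3 bytes = 139,586,437,120 bytes
        = 139,586,437,120 / 10^9 GB ≈ 139.59 GB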
 

As to the size display issue: can we have this flagged as an inconsistency to be fixed in an upcoming version? Either the aim should be everything in GiB (preferred?) or else everything in GB.

As to the host logs: I'll have to wait for it to happen again and will post it then.
 
OK, the user started his machine again just after 15:00.

The syslog has been attached.

Code:
# pveversion -v
proxmox-ve: 7.4-1 (running kernel: 5.15.108-1-pve)
pve-manager: 7.4-16 (running version: 7.4-16/0f39f621)
pve-kernel-5.15: 7.4-4
pve-kernel-5.13: 7.1-9
pve-kernel-5.3: 6.1-6
pve-kernel-5.15.108-1-pve: 5.15.108-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.4.162-1-pve: 5.4.162-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 17.2.6-pve1
ceph-fuse: 17.2.6-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-network-perl: 0.7.3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.3-1
proxmox-backup-file-restore: 2.4.3-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.2
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-6
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-5
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve1

Code:
# qm config 199
agent: 1
balloon: 4096
bios: ovmf
boot: order=sata0
cores: 9
cpu: qemu64
ide2: cephfs:iso/FreeBSD-12.4-RELEASE-amd64-disc1.iso,media=cdrom,size=982436K
memory: 24576
meta: creation-qemu=7.1.0,ctime=1691437349
name: VO-IRIS-Poller-2
net0: virtio=AA:21:C9:74:4D:AD,bridge=vmbr2,firewall=1
numa: 0
onboot: 1
ostype: other
sata0: speedy:vm-199-disk-0,discard=on,size=140G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=f2484309-298e-4b0c-9b4d-6a25288051f7
sockets: 2
tags: freebsd
unused0: local-lvm:vm-199-disk-0
vmgenid: f179581c-bdb1-4bf8-9d14-ab22d395ffa6

We can change back to scsi now I suppose, just haven't done it yet.
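
For the record, the switch back would roughly be the following, with the VM shut down (a sketch; since the guest mounts by GPT label, /dev/gpt/data0, the fstab should survive the controller change):

Code:
qm set 199 --delete sata0                                  # detach; the disk shows up as unusedX
qm set 199 --scsi0 speedy:vm-199-disk-0,discard=on,ssd=1   # reattach the same volume on virtio-scsi
qm set 199 --boot order=scsi0                              # adjust the boot order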

There is disk IO happening on this machine all the time. Here's the hourly maximum graph.

[graph: hourly maximum disk IO]

The CPUs are also in use all the time.

[graph: CPU usage]
 

Attachments

  • syslog.txt
I changed the thread name to better describe the issue.
Is the storage called speedy related to the disk nvme1 of your host system? It seems that you might have issues with this device; from your journal we see:
Device: /dev/nvme1, Critical Warning (0x04): Reliability
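
To double-check that device, its health data can be read directly on the host (a sketch; smartmontools is installed per your pveversion output, nvme-cli may need to be installed separately):

Code:
smartctl -a /dev/nvme1        # full SMART/health report
nvme smart-log /dev/nvme1     # raw NVMe health log (requires nvme-cli)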
 
Reading the whole thread will be helpful. The nvme1 is part of a Ceph pool, but the problem occurs regardless of which pool the VM uses; even if I move it to a local ext4 lvm-thin volume, the problem still occurs. Many other virtual machines are using that pool and they don't have any issues.
 
Reading the whole thread will be helpful
I am not sure what you are referring to, but nowhere in the thread was it mentioned which block device is related to which storage. That the issue shows up independently of the underlying storage does seem clear.

Regarding the issue at hand:
  • Is this VM the same VM you already had trouble with in the past [0]?
  • How long does it take for the VM to run into the IO issues?
  • What process is using up all the CPU resources? Can you identify a process inside the VM which is causing the high cpu load, or is this the qemu process on the host itself?
  • There were issues related to KSM and memory ballooning in the past [1], probably not related but please try to see if disabling ballooning has any effect.
  • You can try to attach strace to the QEMU process and generate a backtrace using gdb as described in [2]; a command sketch is included below the links.

[0] https://forum.proxmox.com/threads/disk-errors-on-freebsd-12-2-guest.105753/
[1] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/
[2] https://forum.proxmox.com/threads/vms-freeze-with-100-cpu.127459/post-561792
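
A rough sketch of the commands from [2], assuming the usual Proxmox pidfile location /var/run/qemu-server/<VMID>.pid:

Code:
PID=$(cat /var/run/qemu-server/199.pid)
strace -c -p "$PID"                              # let it run for a while, stop with Ctrl-C for the summary
gdb --batch -ex 'thread apply all bt' -p "$PID"  # backtrace of all QEMU threads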
 
As to the size display issue: can we have this flagged as an inconsistency to be fixed in an upcoming version? Either the aim should be everything in GiB (preferred?) or else everything in GB.
For storage size information, SI units (base 10) are used in Proxmox VE, as most of the storage providers give their sizes as such. I sent a patch to fix an inconsistency where HD usage was displayed in IEC units instead of SI units: https://lists.proxmox.com/pipermail/pve-devel/2023-November/060579.html
 
  • Is this VM the same VM you already had trouble with in the past [0]?
Yes, this has been an ongoing problem. When we moved the storage to non-Ceph lvm-thin storage, the problem seemed to go away. However, after a couple of weeks it started recurring in exactly the same way as it did on Ceph RBD storage.

  • How long does it take for the VM to run into the IO issues?
There's no specific time. We have done a complete re-install in the past and it took only days to start happening again. When we moved off Ceph RBD to lvm-thin, it took more than a month before it happened again.
  • What process is using up all the CPU resources? Can you identify a process inside the VM which is causing the high cpu load, or is this the qemu process on the host itself?
The system runs an IRIS poller which polls about 2000 devices all over the country at a regular interval. The CPU resources used are in line with other pollers doing similar work.

  • There were issues related to KSM and memory ballooning in the past [1], probably not related but please try to see if disabling ballooning has any effect.

We have tried that, but it makes no difference.
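
For reference, disabling the two mechanisms roughly comes down to this on the host (a sketch; the balloon change only fully applies after a VM restart):

Code:
qm set 199 --balloon 0               # fixed memory, no ballooning
systemctl disable --now ksmtuned     # stop KSM tuning
echo 2 > /sys/kernel/mm/ksm/run      # unmerge already-shared pages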

  • You can try to attach via strace to the qemu process and generate a backtrace using gdb as described in [2].

Will have to look into that and will report back. For now we have moved the storage to a different Ceph RBD pool (spinners with SSD WAL and RocksDB storage) and it seems to be running fine. I'm awaiting confirmation of this though.

 
We have done some more experiments with settings. If I increase the CPUs on this machine to 30, the problem of "D" state processes waiting for the disk practically goes away. However, while this may be a partial workaround, the problem remains that the CPU usage is way too high.

The process on FreeBSD essentially sends out an SNMP probe to a device, waits for a response, creates a .json file and writes it to disk. There are about 2000 devices that get polled like that every 5 minutes. The json files are small.

I see the disk (an Intel SSD) has a physical block size of 4096 and a logical sector size of 512. The volume pve-vm-199-disk-0 reports the following:

Code:
MIN-IO  OPT-IO  PHY-SEC  LOG-SEC
 65536   65536     4096      512

Does that mean that if I change the volume to have a MIN-IO of 4096, the writes will be 16x smaller and better matched to the disk? If so, how can I change that?
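
As far as I understand it, the 64K MIN-IO/OPT-IO reported for the thin LV comes from the thin pool's chunk size (64 KiB by default) rather than from the SSD itself, and that chunk size is fixed when the pool is created. A sketch for checking this, with the pool name pve/data taken from the lvdisplay output above:

Code:
lvs -o lv_name,chunk_size pve/data            # thin pool chunk size
lsblk -t /dev/mapper/pve-vm--199--disk--0     # MIN-IO/OPT-IO reported for the thin volume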
 

More than that, can I create a Ceph RBD pool that has a 4096 block size as well, for this type of virtual machine? I don't see any parameter in the pool creation process that would allow me to set that.
I do have this in my ceph.conf:

Code:
[osd]
bluestore_min_alloc_size = 4096
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096

However:
Code:
# rbd info --pool standard vm-199-disk-0
rbd image 'vm-199-disk-0':
    size 140 GiB in 35840 objects
    order 22 (4 MiB objects)

which indicates that I'm writing 4 MiB chunks at a time. Can I change that volume to only write 4096 bytes at a time?
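
For reference, the "order" shown by rbd info is just the object size expressed as a power of two:

Code:
order 22  ->  2^22 bytes = 4 MiB objects   (the image above)
order 12  ->  2^12 bytes = 4 KiB objects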
 
Can I change that volume to only write 4096 bytes at a time?
I found rbd migration prepare. However

Code:
# rbd migration prepare --object-size 4K --stripe-unit 64K --stripe-count 2 standard/vm-199-disk-0 standard/vm-199-disk-1

gives me an error:
Code:
2023-11-22T13:04:18.177+0200 7fd9fe1244c0 -1 librbd::image::CreateRequest: validate_striping: stripe unit is not a factor of the object size
2023-11-22T13:04:18.177+0200 7fd9fe1244c0 -1 librbd::Migration: create_dst_image: header creation failed: (22) Invalid argument
rbd: preparing migration failed: (22) Invalid argument

I've even looked at the source code to try to see what this means, since 64K is 16 x 4K... Can someone shed some light on this, please?

Here's the source code that does the validation:
Code:
int validate_striping(CephContext *cct, uint8_t order, uint64_t stripe_unit,
                      uint64_t stripe_count) {
  if ((stripe_unit && !stripe_count) ||
      (!stripe_unit && stripe_count)) {
    lderr(cct) << "must specify both (or neither) of stripe-unit and "
               << "stripe-count" << dendl;
    return -EINVAL;
  } else if (stripe_unit && ((1ull << order) % stripe_unit || stripe_unit > (1ull << order))) {
    lderr(cct) << "stripe unit is not a factor of the object size" << dendl;
    return -EINVAL;
  } else if (stripe_unit != 0 && stripe_unit < 512) {
    lderr(cct) << "stripe unit must be at least 512 bytes" << dendl;
    return -EINVAL;
  }
  return 0;
}
 
I see the disk (an Intel SSD) has a physical block size of 4096 and a logical sector size of 512.
Note that most SSDs do not report their actual page size, so these values do not have the meaning you might expect; see https://wiki.archlinux.org/title/Advanced_Format#NVMe_solid_state_drives

writing 4 MiB chunks at a time
The 4M is the default object size used by RBD; I would not recommend changing this, see https://docs.ceph.com/en/quincy/man/8/rbd/#cmdoption-rbd-object-size

validate_striping: stripe unit is not a factor of the object size
As the error states, the stripe unit must be a factor of the object size, but you set 64K as stripe unit.
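
Concretely: --object-size 4K corresponds to order 12, i.e. 4096-byte objects, and your 64K (65536-byte) stripe unit is larger than that, so the second branch of the check fires. A combination that at least passes the validation keeps the stripe unit no larger than, and a divisor of, the object size, for example (a sketch only; whether striping helps your workload at all is a separate question):

Code:
rbd migration prepare --object-size 64K --stripe-unit 4K --stripe-count 16 \
    standard/vm-199-disk-0 standard/vm-199-disk-1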

Are you sure your Ceph storage is performant enough to handle your workload? Does this VM perform a lot of sync writes?

Maybe you should check your baseline Ceph performance to see which configuration changes actually give a performance improvement. You can find example performance tests for Ceph in this document https://www.proxmox.com/images/download/pve/docs/Proxmox-VE_Ceph-Benchmark-202009-rev2.pdf.
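
A minimal baseline along the lines of that document could look like this, run against the pool the VM lives on (a sketch; preferably outside production hours):

Code:
rados bench -p speedy 60 write -b 4096 -t 16 --no-cleanup   # 4K writes, 16 parallel ops
rados bench -p speedy 60 rand -t 16                         # random reads of the objects just written
rados -p speedy cleanup                                     # remove the benchmark objects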

As Jakob Bohm stated on the qemu-discuss list, this is not such a simple issue, so you will need at least some baseline performance metrics to investigate further: https://lists.nongnu.org/archive/html/qemu-discuss/2023-09/msg00031.html

Edit: Please also specify which SSD models you are using as OSDs in your Ceph cluster.
 
Are you sure your Ceph storage is performant enough to handle your workload? Does this VM perform a lot of sync writes?

If the problem only occurred with Ceph storage, then I would suspect that my Ceph may not be able to handle it. But the Intel SSD is not a Ceph volume, and it happens there as much as it does on Ceph storage.

The poller writes many small files quite often. I'll forward some samples and a directory listing as soon as I receive them.
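
When it happens again, it might also help to capture how busy the virtual disk is from inside the guest (a sketch using stock FreeBSD tools):

Code:
top -m io -o total     # per-process IO, sorted by total operations
gstat -p               # per-disk busy%/latency, physical providers only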


Edit: Please also specify which SSD models you are using as OSDs in your Ceph cluster.

I have one pool (speedy) with Intel 1TB NVMe drives (SSDPEKKA010T8). They are split into 3 LVs each, plus 2 more LVs per drive for the spinners' WAL and RocksDB.
The spinners are in a separate pool.
Code:
nvme0n1                                                                                               259:0    0 953.9G  0 disk
├─NodeD--nvme1-NodeD--nvme--LV--data1                                                                 253:0    0   290G  0 lvm 
├─NodeD--nvme1-NodeD--nvme--LV--data2                                                                 253:1    0   290G  0 lvm 
├─NodeD--nvme1-NodeD--nvme--LV--data3                                                                 253:2    0   290G  0 lvm 
├─NodeD--nvme1-NodeD--nvme--LV--RocksDB1                                                              253:4    0  41.9G  0 lvm 
└─NodeD--nvme1-NodeD--nvme--LV--RocksDB2                                                              253:6    0  41.9G  0 lvm 
nvme1n1                                                                                               259:1    0 953.9G  0 disk
├─NodeD--nvme2-NodeD--nvme--LV--data1                                                                 253:3    0   290G  0 lvm 
├─NodeD--nvme2-NodeD--nvme--LV--data2                                                                 253:5    0   290G  0 lvm 
├─NodeD--nvme2-NodeD--nvme--LV--data3                                                                 253:7    0   290G  0 lvm 
├─NodeD--nvme2-NodeD--nvme--LV--RocksDB1                                                              253:8    0  41.9G  0 lvm 
└─NodeD--nvme2-NodeD--nvme--LV--RocksDB2                                                              253:9    0  41.9G  0 lvm

The SSD that I have tested with is an SSDSC2KB240G8 drive.

Currently the machine boots from a volume on the single non-ceph ssd drive and the json files are logged to a ceph volume I created for this purpose.

Code:
# rbd info -p speedy vm-199-disk-2
rbd image 'vm-199-disk-2':
    size 20 GiB in 5242880 objects
    order 12 (4 KiB objects)
    snapshot_count: 0
    id: e97509b78c6a0d
    block_name_prefix: rbd_data.e97509b78c6a0d
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags:
    create_timestamp: Wed Nov 22 13:18:30 2023
    access_timestamp: Wed Nov 22 13:18:30 2023
    modify_timestamp: Wed Nov 22 13:18:30 2023
root@FT1-NodeD:~# rbd du -p speedy vm-199-disk-2
NAME           PROVISIONED  USED   
vm-199-disk-2       20 GiB  4.1 GiB
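
For context, the 4 KiB object size makes the object count explode compared to the default, which means more metadata and PG work for Ceph:

Code:
20 GiB / 4 KiB objects = 5,242,880 objects   (matches the rbd info above)
20 GiB / 4 MiB objects =     5,120 objects   (default object size)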
 
Note that most SSDs do not report their actual page size, so these values do not have the meaning you might expect; see https://wiki.archlinux.org/title/Advanced_Format#NVMe_solid_state_drives
Here's what my drives report:

Code:
# nvme id-ns -H /dev/nvme0n1 | grep "Relative Performance"
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)

The 4M is the default object size used by RBD; I would not recommend changing this, see https://docs.ceph.com/en/quincy/man/8/rbd/#cmdoption-rbd-object-size

I used that man page to create a special rbd volume for small writes to see if it improves the situation...

As the error states, the stripe unit must be a factor of the object size, but you set 64K as stripe unit.

Yes, I see now: the stripe unit should be no larger than, and a factor of, the object size. So if my object size is 4K I could at most have a 4K stripe unit, but it would make more sense to have a 16K object size and a stripe unit of 4K then, correct?

Are you sure your Ceph storage is performant enough to handle your workload? Does this VM perform a lot of sync writes?

I'll do some benchmarks using 4K writes on both the 4M volumes and the 4K volumes and see what I get.
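
Something along these lines with fio (available as a FreeBSD package) should do for the in-VM side, using 4K sync writes (a sketch; the test file path is just an example on the data filesystem):

Code:
pkg install fio
fio --name=4kwrite --filename=/usr/local/iris/fio.test --size=1G \
    --rw=randwrite --bs=4k --ioengine=psync --fsync=1 \
    --runtime=60 --time_based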
