VM Freezes on vi but not echo - 100% I/O Wait on Ceph RBD (Proxmox)

emp_c · Jun 27, 2025

I am facing an I/O freeze in a VM, triggered by specific operations, on Proxmox with Ceph RBD storage. It has been occurring for more than a year and is severely affecting system reliability.

Environment:

Proxmox (8.1.3) with Ceph (17.2.7)
VM: RHEL 8
Storage: Ceph RBD (block device) → VM as XFS and ext4 disk
Mount: fstab with defaults
VM Config: VirtIO-SCSI single, discard=on, SSD emulation

Trigger:

With a normal VM usage pattern, the I/O freeze occurs randomly, about once every 1 to 6 months.
Under stress (fio, hourly backups, and repeated memory allocation and release), the VM hangs within 1 to 7 days.

Symptoms:

1. High I/O wait: iostat shows, for example:
avg-cpu: %iowait 43.71%, %idle 56.22%
sdX: %util 100.00%
But all I/O metrics are 0: r/s=0, w/s=0, rMB/s=0, wMB/s=0, aqu-sz=0

2. Operations:

echo "abc" > new_file.txt (works)

echo "abc" >> existing_file.txt (works)

vi any_file.txt (VM freezes indefinitely)

cp old_file.txt new_file.txt (VM freezes indefinitely)

3. Workaround:
Live-migrating the VM to another Proxmox node temporarily resolves the freeze,
as does rebooting the VM.
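
The iostat pattern in symptom 1 (device fully busy while no requests complete) can also be checked from inside the guest without iostat. The following is a minimal sketch, assuming the standard /proc/diskstats field layout; the function names and the 0.9 busy threshold are illustrative choices, not something from this thread. It takes two samples and flags devices whose completed-I/O counters did not move while the device stayed busy or still had requests in flight.

```python
import time


def parse_diskstats(text):
    """Map device name -> (ios_completed, in_flight, io_ticks_ms)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip malformed or truncated lines
        name = fields[2]
        reads_completed = int(fields[3])
        writes_completed = int(fields[7])
        in_flight = int(fields[11])    # I/Os currently in progress
        io_ticks_ms = int(fields[12])  # total ms the device was busy
        stats[name] = (reads_completed + writes_completed, in_flight, io_ticks_ms)
    return stats


def stalled_devices(before, after, interval_ms, busy_ratio=0.9):
    """Devices with no completed I/O between samples that were busy or had I/O in flight."""
    stalled = []
    for name, (done_after, in_flight, ticks_after) in after.items():
        if name not in before:
            continue
        done_before, _, ticks_before = before[name]
        no_progress = done_after == done_before
        busy = (ticks_after - ticks_before) >= busy_ratio * interval_ms
        if no_progress and (busy or in_flight > 0):
            stalled.append(name)
    return sorted(stalled)


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two samples interval_s apart and return the stalled device names."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return stalled_devices(before, after, interval_s * 1000)
```

Calling sample() periodically from a process that is already running would give a timestamp for when a device first stops making progress. Whether a non-zero in-flight count is actually visible during this particular freeze is an open question, since aqu-sz is reported as 0 above.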

Other settings I tested, none of which helped:

1. Disabling KSM on the Proxmox node
2. Async IO: native / io_uring / threads
3. SCSI controller: VirtIO SCSI single / VirtIO SCSI
4. Disabling fs-freeze
5. Disabling the QEMU guest agent

Question:

Any ideas on how to investigate and fix this issue? Is it a known deadlock with the current settings?

Thanks for the help.

LnxBil · Jun 27, 2025

emp_c said:
random IO freeze error happening

Is there an error message, or do you only experience the hang?

Can you try to strace the cp command that freezes?

fba · Jun 27, 2025

As you're using Ceph as storage, have a look at its status with ceph status and ceph health detail.
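
Because the freeze is intermittent, it may help to record the cluster health at the moment of the hang rather than afterwards. A minimal sketch, assuming the ceph CLI and a usable keyring are present on the node where it runs; the function names are illustrative. It reads the JSON form of ceph status and extracts the overall health plus the summary message of any active health check.

```python
import json
import subprocess


def summarize_health(status_json):
    """Return (overall_status, {check_name: summary_message}) from `ceph status` JSON."""
    status = json.loads(status_json)
    health = status.get("health", {})
    checks = health.get("checks", {})
    messages = {
        name: check.get("summary", {}).get("message", "")
        for name, check in checks.items()
    }
    return health.get("status", "UNKNOWN"), messages


def fetch_status():
    """Run `ceph status --format json` and return its raw output."""
    # Assumption: the `ceph` binary is on PATH and can reach the cluster.
    result = subprocess.run(
        ["ceph", "status", "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Logging summarize_health(fetch_status()) with a timestamp every minute or so would show whether any health check, such as slow operations, was raised around the time of a freeze.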

emp_c · Jun 30, 2025

LnxBil said:
Is there an error message, or do you only experience the hang?

Can you try to strace the cp command that freezes?

Thank you for the input. During the hang, even strace cannot be used.

Another severe hang has now occurred, with the following observations:
1. iostat shows:
%util 100.00%
But all I/O metrics are 0: r/s=0, w/s=0, rMB/s=0, wMB/s=0, aqu-sz=0

2. strace -T -ttt -f -yy -o strace_normal.log cp test.txt test2.txt
succeeded once; after that, every cp command hung.

3. timeout 5 strace -T -ttt -f -yy -o strace_cp2.log cp test.txt test3.txt
produced no log file and no output; it also hung, and timeout did not terminate it.

The last "normal" strace in step 2 above is attached.
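
When strace itself hangs, the kernel's view of the blocked processes can still be read from /proc, which does not go through the frozen disk. A minimal sketch, assuming a Linux guest; the function names are illustrative. It lists processes in uninterruptible sleep (state D) together with the kernel symbol they are waiting in, as reported by /proc/<pid>/wchan.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_text.index("(")
    rparen = stat_text.rindex(")")
    pid = int(stat_text[:lparen])
    comm = stat_text[lparen + 1:rparen]
    state = stat_text[rparen + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Read a small /proc file, returning '' if the process has gone away."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return ""


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm, wchan) tuples for processes in state D."""
    tasks = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        stat_text = read_text(os.path.join(proc_root, entry, "stat"))
        if not stat_text:
            continue
        pid, comm, state = parse_stat(stat_text)
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            tasks.append((pid, comm, wchan))
    return sorted(tasks)
```

Starting a new interpreter during the freeze may itself block if it has to read from the affected disk, so this is more likely to work from a process that is already running, for example a small loop that calls blocked_tasks() periodically and prints the result to a console.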

---
For ceph status:
ceph -s
cluster:
id: 3151d9c6-878b-4e2b-95cd-df771eb6479e
health: HEALTH_OK

services:
mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 27h)
mgr: ceph1(active, since 28h), standbys: ceph3, ceph2
osd: 12 osds: 12 up (since 2d), 12 in (since 2d)

data:
pools: 4 pools, 193 pgs
objects: 2.70M objects, 10 TiB
usage: 30 TiB used, 34 TiB / 64 TiB avail
pgs: 191 active+clean
2 active+clean+scrubbing+deep

io:
client: 11 KiB/s rd, 752 KiB/s wr, 3 op/s rd, 75 op/s wr

emp_c · Jul 3, 2025

The VM froze and /var/log/messages stopped being written. Logging resumed after a live migration was performed.

One more observation: we are performing backups every 2 hours on the testing site. Several of the freezes occurred shortly after the Proxmox Backup Server ran its regular snapshot backup. Is there a possibility that the backup could create a deadlock?
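
The suspected link to the backup schedule can be checked more systematically than by eye. A minimal sketch, assuming two hypothetical inputs that are not part of this thread: a list of freeze times (for example, the timestamp of the last line in /var/log/messages before each gap) and a list of backup start times taken from the backup task logs. The 15-minute window is an arbitrary illustrative default.

```python
from datetime import datetime, timedelta


def freezes_after_backups(freeze_times, backup_times, window=timedelta(minutes=15)):
    """Pair each freeze with the latest backup that started at most `window` before it."""
    backups = sorted(backup_times)
    matches = []
    for freeze in sorted(freeze_times):
        preceding = [b for b in backups if b <= freeze]
        if preceding and freeze - preceding[-1] <= window:
            matches.append((freeze, preceding[-1]))
    return matches
```

With backups every 2 hours, a 15-minute window covers 12.5% of the time, so unrelated freezes would land inside it only about one time in eight. If most recorded freezes fall inside the window, that would support the backup as a trigger.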
