VM Freezes on vi but not echo - 100% I/O Wait on Ceph RBD (Proxmox)

emp_c

New Member
Jun 27, 2025
I am facing an issue where VM I/O freezes on specific operations with Ceph RBD on Proxmox. The problem has been occurring for more than a year and is severely affecting system reliability.

Environment:
  • Proxmox (8.1.3) with Ceph (17.2.7)
  • VM: RHEL 8
  • Storage: Ceph RBD (block device) → VM as XFS and ext4 disk
  • Mount: fstab with defaults
  • VM Config: VirtIO-SCSI single, discard=on, SSD emulation

Trigger:
  • Under a normal VM usage pattern, the I/O freeze occurs randomly, roughly once every 1 to 6 months
  • Under fio stress, hourly backups, and memory allocate-and-release stress, the VM hangs within 1 to 7 days

Symptoms:

1. High I/O wait: iostat shows, for example:
avg-cpu: %iowait 43.71%, %idle 56.22%
sdX: %util 100.00%
But all I/O metrics are 0: r/s=0, w/s=0, rMB/s=0, wMB/s=0, aqu-sz=0

2. Operations:
✅ echo "abc" > new_file.txt (works)
✅ echo "abc" >> existing_file.txt (works)
❌ vi any_file.txt (VM freezes indefinitely)
❌ cp old_file.txt new_file.txt (VM freezes indefinitely)

3. Workaround:
Live-migrating the VM to another Proxmox node temporarily resolves the freeze.
Rebooting the VM also resolves it.
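
Symptom 1 (100% utilization with zero completed I/O) can also be checked directly from /proc/diskstats, which is where iostat gets its numbers. Below is a minimal Python sketch, assuming the usual Linux /proc/diskstats column order (see the index comments in the code). The function names and the 0.9 busy threshold are illustrative choices, not part of any existing tool.

```python
import time


def parse_diskstats(text):
    """Return {device: (reads_completed, writes_completed, in_flight, io_ticks_ms)}.

    Assumed column layout after splitting on whitespace:
    [0] major, [1] minor, [2] device name, [3] reads completed,
    [7] writes completed, [11] I/Os currently in progress,
    [12] milliseconds spent doing I/O (the basis of iostat's %util).
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # skip malformed or truncated lines
        stats[parts[2]] = (
            int(parts[3]),
            int(parts[7]),
            int(parts[11]),
            int(parts[12]),
        )
    return stats


def stalled_devices(before, after, interval_ms, busy_ratio=0.9):
    """Devices that were busy for most of the interval yet completed no I/O."""
    stalled = []
    for dev, (r1, w1, in_flight, ticks1) in after.items():
        if dev not in before:
            continue
        r0, w0, _, ticks0 = before[dev]
        completed = (r1 - r0) + (w1 - w0)
        busy = (ticks1 - ticks0) / interval_ms
        if completed == 0 and busy >= busy_ratio:
            stalled.append((dev, in_flight, round(busy, 2)))
    return stalled


def sample_and_report(interval_s=5, path="/proc/diskstats"):
    """Take two samples interval_s seconds apart and print stalled devices."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = parse_diskstats(f.read())
    for dev, in_flight, busy in stalled_devices(first, second, interval_s * 1000):
        print(f"{dev}: busy {busy:.0%} of interval, 0 completions, {in_flight} in flight")
```

Calling sample_and_report() inside the guest would print any device that matches the pattern. Whether the script can even start during a hang depends on which disk is stalled, so it may need to be running beforehand. A non-zero in-flight count on a stalled device would suggest that a request was submitted and never completed.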

Other settings tested, none of which fixed the issue:

1. Disabling KSM on the Proxmox node
2. Async IO = native / io_uring / threads
3. SCSI controller: VirtIO SCSI single / VirtIO SCSI
4. Disabling fs-freeze
5. Disabling the QEMU guest agent

Question:

Does anyone have an idea how to investigate and fix this issue? Is this a known deadlock with the current settings?

Thanks for the help.
 

Attachments

  • image.png
Is there an error message, or do you only experience the hang?

Can you try to strace the cp command that freezes?
Thank you for the input. During the hang, even strace cannot be used.

Now another severe hang occurs when:
1. iostat shows:
%util 100.00%
But all I/O metrics are 0: r/s=0, w/s=0, rMB/s=0, wMB/s=0, aqu-sz=0

2. strace -T -ttt -f -yy -o strace_normal.log cp test.txt test2.txt
succeeds only once; afterwards, every cp command hangs

3. timeout 5 strace -T -ttt -f -yy -o strace_cp2.log cp test.txt test3.txt
produces no log file and no output; it also hangs, and timeout fails to terminate it
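
One possible explanation for timeout being unable to end the command, not confirmed in this thread, is that the process is in uninterruptible sleep (state D), where signals are not delivered until the blocked I/O returns. The following Python sketch lists such processes with the kernel function they are waiting in. It reads only /proc/<pid>/stat and /proc/<pid>/wchan; the function names are illustrative.

```python
import os


def parse_stat_state(stat_text):
    """Extract (comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' instead of on whitespace.
    """
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def list_uninterruptible(proc_root="/proc"):
    """Return [(pid, comm, wchan)] for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"  # wchan may be unreadable without sufficient privileges
        blocked.append((int(entry), comm, wchan))
    return blocked
```

Running list_uninterruptible() from a shell that was already open before the hang (for example `python3 -c` with the functions pasted in) would show whether cp and vi are stuck in D state and, through wchan, roughly where in the kernel they are waiting.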

The last "normal" strace in step 2 above is attached.

---
For ceph status:
ceph -s
  cluster:
    id:     3151d9c6-878b-4e2b-95cd-df771eb6479e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 27h)
    mgr: ceph1(active, since 28h), standbys: ceph3, ceph2
    osd: 12 osds: 12 up (since 2d), 12 in (since 2d)

  data:
    pools:   4 pools, 193 pgs
    objects: 2.70M objects, 10 TiB
    usage:   30 TiB used, 34 TiB / 64 TiB avail
    pgs:     191 active+clean
             2   active+clean+scrubbing+deep

  io:
    client:   11 KiB/s rd, 752 KiB/s wr, 3 op/s rd, 75 op/s wr
 

Attachments

1751510982246.png
When the VM freezes, logging to /var/log/messages stops. Logging resumes after a live migration is performed.

One more observation: on the test site we run backups every 2 hours. Several of the freezes occurred shortly after Proxmox Backup Server ran its regular snapshot backup. Is it possible that the backup causes a deadlock?
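
To make the backup correlation measurable, the freeze times (for instance, the last /var/log/messages timestamp before each gap) could be compared against the backup start times. Below is a minimal Python sketch; the function name, the 30-minute window, and the timestamps in the usage lines are hypothetical placeholders, not data from this system.

```python
from datetime import datetime, timedelta


def freezes_after_backup(backup_starts, freeze_times, window=timedelta(minutes=30)):
    """Return [(freeze_time, delay)] for freezes that began within `window`
    after the most recent backup start."""
    backups = sorted(backup_starts)
    hits = []
    for freeze in sorted(freeze_times):
        earlier = [b for b in backups if b <= freeze]
        if not earlier:
            continue  # no backup had started before this freeze
        delay = freeze - earlier[-1]
        if delay <= window:
            hits.append((freeze, delay))
    return hits


# Placeholder timestamps for illustration only.
example_backups = [datetime(2025, 1, 1, h, 0) for h in (0, 2, 4)]
example_freezes = [datetime(2025, 1, 1, 2, 10), datetime(2025, 1, 1, 3, 30)]
example_hits = freezes_after_backup(example_backups, example_freezes)
```

If most recorded freezes land in the window, that would support testing with backups paused. It would not by itself prove that the backup is the cause.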