VM disks going readonly

Should it happen again, you can check how many open file descriptors there are by running ls -1 /proc/$(cat /var/run/qemu-server/123.pid)/fd/ | wc -l (replace 123 with the actual ID of your VM).
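To put that count in context, here is a minimal sketch that prints the number of open descriptors next to the soft limit of the process, read from /proc/<pid>/limits. The helper name fd_usage is made up for illustration; the pid file path is the one from the command above, and a Linux host with /proc mounted is assumed.

```shell
#!/bin/sh
# fd_usage PID: print "<open descriptors> <soft limit>" for the given process.
# fd_usage is a hypothetical helper name, not part of Proxmox VE.
fd_usage() {
    pid="$1"
    # One directory entry per open file descriptor.
    open=$(ls -1 "/proc/$pid/fd/" | wc -l | tr -d ' ')
    # "Max open files" line: field 4 is the soft limit, field 5 the hard limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "$open $soft"
}

# Usage on the PVE host (replace 123 with the ID of your VM):
#   fd_usage "$(cat /var/run/qemu-server/123.pid)"
```

If the first number is close to the second, the process may be running out of file descriptors; if it is far below, as with a count around a hundred, this is probably not the cause.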
 
Hi!

How much free space do you have on the storage?

By default, ext4 reserves 5% of its blocks for root, so maybe you are hitting that limit? (Once 95% of the disk is written, non-root processes can no longer write to it.)

Code:
$> tune2fs -l /dev/sda1 | grep -ie "Block count" -ie "Reserved block count" -ie "Block size"

The limit (in GiB):
("Block count" - "Reserved block count") * "Block size" / 1024^3

Code:
$> tune2fs -l /dev/sda1 | grep -ie "Block count" -ie "Reserved block count" -ie "Block size" | awk '{print $NF}' | tr "\n" " " | awk '{print ($1 - $2) * $3 / 1024^3 " GiB" }'
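As a sketch of the same calculation, the helper below applies the formula above but matches the three fields by name instead of relying on their order in the output. The name usable_gib is hypothetical, and it assumes the usual "Field name:   value" layout of tune2fs -l.

```shell
#!/bin/sh
# usable_gib: read `tune2fs -l` output on stdin and print
# (Block count - Reserved block count) * Block size / 1024^3 in GiB.
# usable_gib is a hypothetical helper name used only for this sketch.
usable_gib() {
    awk -F': *' '
        $1 == "Block count"          { total = $2 }
        $1 == "Reserved block count" { reserved = $2 }
        $1 == "Block size"           { size = $2 }
        END { printf "%.2f GiB\n", (total - reserved) * size / (1024 * 1024 * 1024) }
    '
}

# Usage (the device name is only an example):
#   tune2fs -l /dev/sda1 | usable_gib
```

Note that this is the capacity below the reserved-block threshold, not the current free space; compare it with the used space reported by df.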
 
Thanks for your reply. It's not lack of space, there's enough space available on all these servers.
 
It happened on another VM, not sure if it's related. The output of the command above was 107. I grabbed kernel logs from the guest (attached).

This might actually be unrelated: the timestamp in the VM is 20:03 EST, and this server had a backup running that failed at around the same time due to a hardware problem with the backup server. This behaviour isn't really ideal :p (no iothread, VirtIO SCSI controller)


Code:
INFO:  69% (7.7 GiB of 11.0 GiB) in 1m 33s, read: 101.3 MiB/s, write: 101.3 MiB/s
INFO:  69% (7.7 GiB of 11.0 GiB) in 17m 15s, read: 0 B/s, write: 0 B/s
ERROR: backup write data failed: command error: write_data upload error: pipelined request failed: timed out
INFO: aborting backup job
INFO: resuming VM again
ERROR: Backup of VM 412 failed - backup write data failed: command error: write_data upload error: pipelined request failed: timed out
INFO: Failed at 2024-02-02 01:14:54
INFO: Starting Backup of VM 413 (qemu)
INFO: Backup started at 2024-02-02 01:14:54
INFO: status = running
 
The system logs show failed IO and that the filesystem is remounted read-only because of that.
When the connection to the backup target is lost or too slow, it's unfortunately expected. See my reply here: https://forum.proxmox.com/threads/i...m-in-the-vm-after-fsfreeze.141080/post-631577
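To confirm from inside the guest that a filesystem really was remounted read-only, one option is to look at the mount options in /proc/mounts. The sketch below is only an illustration: the helper name ro_mounts is made up, and it only looks at ext4 entries.

```shell
#!/bin/sh
# ro_mounts: read /proc/mounts-style lines on stdin and print the mount
# points of ext4 filesystems whose option list contains "ro".
# ro_mounts is a hypothetical helper name used only for this sketch.
ro_mounts() {
    awk '$3 == "ext4" {
        # Field 4 is the comma-separated option list; match "ro" exactly so
        # that options such as errors=remount-ro do not count.
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }'
}

# Usage inside the guest:
#   ro_mounts < /proc/mounts
```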
 
Thank you very much for your reply. Unfortunately, in most cases the guest console got spammed with systemd messages stating that the disk was read-only. I was able to grab the error right after in one or two instances. I don't have a screenshot, but it was something like this:
validate_block_bitmap comm_fstrim bad block bitmap checksum

I had assumed it might be fstrim on the host/guest, so I disabled fstrim.timer on both, but it still happened. I was also able to trigger it on two guests by running fstrim -v / on the host, but I couldn't reproduce it after that.
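Since the console gets flooded quickly, it may help to filter the guest's kernel log down to the ext4 messages so the first error is easier to spot. This is a rough sketch: the helper name ext4_errors is made up, and the two patterns are the usual prefixes of ext4 error and remount messages, which may not cover every variant.

```shell
#!/bin/sh
# ext4_errors: read kernel log text on stdin and keep only lines that
# report an ext4 error or a read-only remount.
# ext4_errors is a hypothetical helper name used only for this sketch.
ext4_errors() {
    grep -i -e 'EXT4-fs error' -e 'remounting filesystem read-only'
}

# Usage inside the guest (dmesg still works when the disk is read-only,
# because it reads the kernel ring buffer from memory):
#   dmesg | ext4_errors
```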



No backups or snapshots at all.

I will try the jq command if/when this happens again. My most recent change was to disable discard on the VM disks on PVE. I'll report back if this recurs.
Does ext4 in the guest still crash after disabling discard? We also have an Ubuntu VM with 64G of storage where ext4 crashes (the fs goes read-only) about once every two weeks, while other VMs work normally (some of them also have discard on, though). I'm also not sure if it has the same cause as this.
 
Disabling/enabling discard didn't seem to help much. It has gotten a lot less frequent recently though, not sure what has changed.
 
