Read-only filesystem after failed backup

godzilla

Member
May 20, 2021
Hi,

some of my guests go read-only after the nightly PBS snapshot backup fails. Curiously, I see the very same backup failure in the logs of other guests, even with the same guest OS, but those don't go read-only: the backup task fails and they carry on as if nothing happened.

Can you help me understand?

The disk is a plain Ceph RBD, with the VirtIO SCSI Single controller and async IO set to threads. PVE version 7.3-1, PBS 2.3-2, Ceph 16.2.11-pve1.

Full versions and logs below.

Code:
Linux proxnode07 5.19.17-2-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.17-2 (Sat, 28 Jan 2023 16:40:25  x86_64 GNU/Linux
proxmox-ve: 7.3-1 (running kernel: 5.19.17-2-pve)
pve-manager: 7.3-6 (running version: 7.3-6/723bb6ec)
pve-kernel-helper: 7.3-4
pve-kernel-5.15: 7.3-2
pve-kernel-5.19: 7.2-15
pve-kernel-5.13: 7.1-9
pve-kernel-5.19.17-2-pve: 5.19.17-2
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.83-1-pve: 5.15.83-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
ceph: 16.2.11-pve1
ceph-fuse: 16.2.11-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.3
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.3-1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.3-2
libpve-guest-common-perl: 4.2-3
libpve-http-server-perl: 4.1-5
libpve-storage-perl: 7.3-2
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-1
lxcfs: 5.0.3-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.3.3-1
proxmox-backup-file-restore: 2.3.3-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-offline-mirror-helper: 0.5.1-1
proxmox-widget-toolkit: 3.5.5
pve-cluster: 7.3-2
pve-container: 4.4-2
pve-docs: 7.3-1
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-7
pve-firmware: 3.6-3
pve-ha-manager: 3.5.1
pve-i18n: 2.8-2
pve-qemu-kvm: 7.1.0-4
pve-xtermjs: 4.16.0-1
qemu-server: 7.3-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+2
vncterm: 1.7-1
zfsutils-linux: 2.1.9-pve1

Code:
2023-03-09T23:56:41+01:00: starting new backup on datastore 'qnap01-emergenza': "vm/704/2023-03-09T22:56:33Z"
2023-03-09T23:56:41+01:00: download 'index.json.blob' from previous backup.
2023-03-10T00:00:04+01:00: register chunks in 'drive-virtio0.img.fidx' from previous backup.
2023-03-10T00:00:04+01:00: download 'drive-virtio0.img.fidx' from previous backup.
2023-03-10T00:00:05+01:00: created new fixed index 1 ("vm/704/2023-03-09T22:56:33Z/drive-virtio0.img.fidx")
2023-03-10T00:00:09+01:00: add blob "/mnt/proxqnap01/emergBackups/vm/704/2023-03-09T22:56:33Z/qemu-server.conf.blob" (480 bytes, comp: 480)
2023-03-10T00:00:09+01:00: add blob "/mnt/proxqnap01/emergBackups/vm/704/2023-03-09T22:56:33Z/fw.conf.blob" (177 bytes, comp: 177)
2023-03-10T00:00:09+01:00: backup ended and finish failed: backup ended but finished flag is not set.
2023-03-10T00:00:09+01:00: removing unfinished backup
2023-03-10T00:00:09+01:00: TASK ERROR: removing backup snapshot "/mnt/proxqnap01/emergBackups/vm/704/2023-03-09T22:56:33Z" failed - Directory not empty (os error 39)

After 00:00:04 the disk went read-only and stayed that way until I forced a restart at 10:24:06.

Code:
Mar  9 23:59:42 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:16:3c:51:8a:29:08:00 SRC=185.221.173.79 DST=255.255.255.255 LEN=112 TOS=0x00 PREC=0x00 TTL=128 ID=48783 PROTO=UDP SPT=1947 DPT=1947 LEN=92
Mar  9 23:59:45 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:16:3e:f2:73:d0:08:00 SRC=185.56.219.241 DST=255.255.255.255 LEN=162 TOS=0x00 PREC=0x00 TTL=128 ID=3054 PROTO=UDP SPT=17500 DPT=17500 LEN=142
Mar  9 23:59:45 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:16:3e:f2:73:d0:08:00 SRC=185.56.219.241 DST=185.56.219.255 LEN=162 TOS=0x00 PREC=0x00 TTL=128 ID=26563 PROTO=UDP SPT=17500 DPT=17500 LEN=142
Mar  9 23:59:46 havana pure-ftpd: (?@164.92.247.127) [INFO] New connection from 164.92.247.127
Mar  9 23:59:46 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:16:3c:d1:c4:5e:08:00 SRC=185.221.172.133 DST=255.255.255.255 LEN=142 TOS=0x00 PREC=0x00 TTL=64 ID=0 PROTO=UDP SPT=5678 DPT=5678 LEN=122
Mar  9 23:59:47 havana kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=00:16:3c:4c:65:cf:f4:de:af:ad:9b:fc:08:00 SRC=162.142.125.81 DST=185.221.172.29 LEN=44 TOS=0x00 PREC=0x00 TTL=42 ID=30257 PROTO=TCP SPT=20901 DPT=21299 WINDOW=1024 RES=0x00 SYN URGP=0
Mar  9 23:59:48 havana kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=00:16:3c:4c:65:cf:f4:de:af:ad:9b:fc:08:00 SRC=165.232.188.105 DST=185.56.219.71 LEN=44 TOS=0x00 PREC=0x00 TTL=241 ID=54321 PROTO=TCP SPT=41979 DPT=10255 WINDOW=65535 RES=0x00 SYN URGP=0
Mar  9 23:59:49 havana kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=00:16:3c:4c:65:cf:f4:de:af:ad:9b:fc:08:00 SRC=79.124.62.86 DST=185.56.219.71 LEN=44 TOS=0x00 PREC=0x00 TTL=247 ID=61845 PROTO=TCP SPT=59720 DPT=7678 WINDOW=1024 RES=0x00 SYN URGP=0
Mar  9 23:59:51 havana pure-ftpd: (?@164.92.247.127) [WARNING] Authentication failed for user [anonymous]
Mar  9 23:59:51 havana pure-ftpd: (?@164.92.247.127) [INFO] Logout.
Mar  9 23:59:53 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:16:3c:51:8a:29:08:00 SRC=185.221.173.79 DST=255.255.255.255 LEN=68 TOS=0x00 PREC=0x00 TTL=128 ID=48784 PROTO=UDP SPT=63487 DPT=1947 LEN=48
Mar  9 23:59:54 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:a2:5d:35:15:5b:19:08:00 SRC=194.76.119.70 DST=255.255.255.255 LEN=153 TOS=0x00 PREC=0x00 TTL=64 ID=0 PROTO=UDP SPT=5678 DPT=5678 LEN=133
Mar  9 23:59:54 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:08:60:6e:eb:51:fa:08:00 SRC=185.17.107.56 DST=255.255.255.255 LEN=162 TOS=0x00 PREC=0x00 TTL=128 ID=28818 PROTO=UDP SPT=17500 DPT=17500 LEN=142
Mar  9 23:59:55 havana kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=00:16:3c:4c:65:cf:f4:de:af:ad:9b:fc:08:00 SRC=185.224.128.243 DST=185.221.172.29 LEN=44 TOS=0x00 PREC=0x00 TTL=248 ID=57197 PROTO=TCP SPT=54849 DPT=5061 WINDOW=1024 RES=0x00 SYN URGP=0
Mar  9 23:59:56 havana kernel: Firewall: *UDP_IN Blocked* IN=eth0 OUT= MAC=ff:ff:ff:ff:ff:ff:00:16:3c:94:5f:71:08:00 SRC=185.17.106.137 DST=255.255.255.255 LEN=161 TOS=0x00 PREC=0x00 TTL=128 ID=26438 PROTO=UDP SPT=17500 DPT=17500 LEN=141
Mar 10 00:00:01 havana systemd: Created slice User Slice of queenclaire.
Mar 10 00:00:01 havana systemd: Started Session 18295 of user queenclaire.
Mar 10 00:00:01 havana systemd: Created slice User Slice of root.
Mar 10 00:00:01 havana systemd: Started Session 18297 of user root.
Mar 10 00:00:01 havana systemd: Created slice User Slice of taccodieci.
Mar 10 00:00:01 havana systemd: Started Session 18296 of user taccodieci.
Mar 10 00:00:01 havana systemd: Started Session 18300 of user root.
Mar 10 00:00:01 havana systemd: Started Session 18299 of user root.
Mar 10 00:00:01 havana systemd: Started Session 18298 of user taccodieci.
Mar 10 00:00:01 havana systemd: Started Session 18302 of user root.
Mar 10 00:00:01 havana systemd: Started Session 18301 of user taccodieci.
Mar 10 00:00:01 havana systemd: Removed slice User Slice of queenclaire.
Mar 10 00:00:01 havana systemd: Stopping ConfigServer Firewall & Security - lfd...
Mar 10 00:00:01 havana systemd: lfd.service: main process exited, code=killed, status=9/KILL
Mar 10 00:00:01 havana systemd: Stopped ConfigServer Firewall & Security - lfd.
Mar 10 00:00:01 havana systemd: Unit lfd.service entered failed state.
Mar 10 00:00:01 havana systemd: lfd.service failed.
Mar 10 00:00:01 havana systemd: Starting ConfigServer Firewall & Security - lfd...
Mar 10 00:00:03 havana systemd: Started ConfigServer Firewall & Security - lfd.
Mar 10 00:00:04 havana systemd: Removed slice User Slice of root.
Mar 10 10:24:06 havana journal: Runtime journal is using 8.0M (max allowed 399.0M, trying to leave 598.6M free of 3.8G available → current limit 399.0M).
Mar 10 10:24:06 havana kernel: Initializing cgroup subsys cpuset
Mar 10 10:24:06 havana kernel: Initializing cgroup subsys cpu
Mar 10 10:24:06 havana kernel: Initializing cgroup subsys cpuacct
Mar 10 10:24:06 havana kernel: Linux version 3.10.0-957.5.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Fri Feb 1 14:54:57 UTC 2019
Mar 10 10:24:06 havana kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-957.5.1.el7.x86_64 root=/dev/vda1 ro rhgb net.ifnames=0 biosdevname=0 LANG=en_US.UTF-8

Thank you!
 
Hi,
can you please also share the full backup task log from the Proxmox VE side and the output of qm config <ID> with the ID of your VM? What exactly does read-only mean, i.e. which commands did you run and what output/errors did you get to come to that conclusion? It seems a bit strange that there are no errors about that in the VM's log.

What you could try is turning on the iothread setting for the VM's disks; QEMU will then handle IO for each disk in a dedicated thread. You could also try reducing the number of workers (see here) or setting a bandwidth limit and see if that helps.
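
For example, something along these lines (a sketch only — exact option availability depends on your package versions, and <vmid>, storage and volume names are placeholders):

Code:
# give the disk a dedicated I/O thread (re-specify the whole disk property string)
qm set <vmid> --virtio0 <storage>:<volume>,iothread=1,<existing options>

# node-wide backup tuning in /etc/vzdump.conf (example values only)
# bwlimit is in KiB/s; performance/max-workers needs a recent qemu-server
bwlimit: 102400
performance: max-workers=4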
 
Hi @fiona, thanks for replying.

qm config below:

Code:
# qm config 469
agent: 1
boot: order=virtio0;net0
cores: 6
description: [redacted]
ide0: ceph_storage:vm-469-cloudinit,media=cdrom,size=4M
memory: 8192
name: [redacted]
net0: virtio=00:16:3C:C2:72:18,bridge=vmbr1,firewall=1,rate=15
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-pci
smbios1: uuid=7fdfc29e-c064-43e0-957e-0a0ed0b401ab
sockets: 1
virtio0: ceph_storage:vm-469-disk-0,iops_wr=256,iops_wr_max=512,mbps_rd=120,mbps_wr=120,size=200G
vmgenid: bf307675-c755-4e85-b00a-9ec0a2674ea8

I just turned on iothread, so the new virtio line reads as follows. Is it OK?

Code:
virtio0: ceph_storage:vm-469-disk-0,aio=threads,iops_wr=256,iops_wr_max=512,mbps_rd=120,mbps_wr=120,size=200G

I also changed the controller from VirtIO SCSI to VirtIO SCSI Single, because I read it should improve performance and stability. Do you think that's a good idea? However, I can't reboot the guest right now.

Thank you
 
Code:
virtio0: ceph_storage:vm-469-disk-0,aio=threads,iops_wr=256,iops_wr_max=512,mbps_rd=120,mbps_wr=120,size=200G
There is no iothread setting. Did you already shut down/start the VM for the change to be applied?
I also changed the controller from VirtIO SCSI to VirtIO SCSI Single, because I read it should improve performance and stability. Do you think that's a good idea? However, I can't reboot the guest right now.
Yes, you need that controller to be able to use iothread.
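Once you can restart it, the disk line with an I/O thread actually enabled would look roughly like this (iothread=1 is set on the disk itself; aio=threads is a separate knob and can stay or go):
Code:
virtio0: ceph_storage:vm-469-disk-0,aio=threads,iothread=1,iops_wr=256,iops_wr_max=512,mbps_rd=120,mbps_wr=120,size=200G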
 
OK @fiona, thanks. As said, I just enabled iothread but couldn't reboot the guest yet. I'll do it ASAP.
Make sure to reboot the guest with the Reboot button from the web interface, not from inside the guest. Otherwise, the change won't be applied.
 
In /var/log/messages? How would it log errors there when it's read-only?
Many times, the root partition has the errors=remount-ro mount option; if the file system is re-mounted read-only because of errors, there should be messages about those errors from before the re-mount occurred.
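A quick way to check this inside the guest (these are standard tools on a CentOS 7 guest like yours; dmesg reads the in-memory kernel ring buffer, so it still works while the root filesystem is read-only):
Code:
# is / mounted with errors=remount-ro, and is it currently ro or rw?
findmnt -o TARGET,FSTYPE,OPTIONS /
# kernel messages about the I/O or filesystem errors that triggered the remount
dmesg | grep -iE 'i/o error|ext4|xfs|remount'
# and, after a reboot, whatever rsyslog managed to write before the remount
grep -iE 'i/o error|ext4-fs error|xfs|remount' /var/log/messages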
 
Hi @fiona, it keeps happening, more and more often, and I really don't know what to do.

I'm attaching a log of another failed backup on another VM, where the hardware changes were already applied and the disk went read-only nonetheless.

Code:
2023-03-15T00:06:18+01:00: starting new backup on datastore 'qnap01-emergenza': "vm/549/2023-03-14T23:06:13Z"
2023-03-15T00:06:18+01:00: download 'index.json.blob' from previous backup.
2023-03-15T00:09:05+01:00: register chunks in 'drive-virtio0.img.fidx' from previous backup.
2023-03-15T00:09:05+01:00: download 'drive-virtio0.img.fidx' from previous backup.
2023-03-15T00:09:05+01:00: created new fixed index 1 ("vm/549/2023-03-14T23:06:13Z/drive-virtio0.img.fidx")
2023-03-15T00:09:09+01:00: add blob "/mnt/proxqnap01/emergBackups/vm/549/2023-03-14T23:06:13Z/qemu-server.conf.blob" (491 bytes, comp: 491)
2023-03-15T00:09:09+01:00: add blob "/mnt/proxqnap01/emergBackups/vm/549/2023-03-14T23:06:13Z/fw.conf.blob" (146 bytes, comp: 146)
2023-03-15T00:09:10+01:00: backup ended and finish failed: backup ended but finished flag is not set.
2023-03-15T00:09:10+01:00: removing unfinished backup
2023-03-15T00:09:10+01:00: TASK ERROR: removing backup snapshot "/mnt/proxqnap01/emergBackups/vm/549/2023-03-14T23:06:13Z" failed - Directory not empty (os error 39)

Code:
agent: 1
boot: order=virtio0;net0
cores: 6
cpulimit: 3
description: [redacted]
memory: 8192
name: [redacted]
net0: virtio=00:16:3C:4E:A1:FA,bridge=vmbr1,firewall=1,rate=15
numa: 0
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=f82a1e91-97ce-4f18-8564-0db90361c804
sockets: 1
virtio0: ceph_storage:vm-549-disk-0,aio=threads,iops_wr=256,iops_wr_max=512,mbps_rd=120,mbps_wr=120,size=500G
vmgenid: 76c805b3-3ae2-46ae-a4f3-b990c645149e
 
There's still some information missing that might help tell what's going on:
can you please also share the full backup task log from the Proxmox VE side

What exactly does read-only mean, i.e. which commands did you run and what output/errors did you get to come to that conclusion?

Did you already try the following?
You could also try reducing the number of workers (see here) or setting a bandwidth limit and see if that helps.

There still is no iothread setting for the disk (it's not the same as aio=threads).
Code:
virtio0: ceph_storage:vm-549-disk-0,aio=threads,iops_wr=256,iops_wr_max=512,mbps_rd=120,mbps_wr=120,size=500G

How does the load on the backup server, on the Proxmox VE host, and on the network look during the backup?
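A rough way to watch that during the backup window (iostat/vmstat come from the sysstat and procps packages; eth0 is just a placeholder for the interface carrying the backup traffic):
Code:
# on the PVE node and on the PBS host while the job runs
iostat -x 5        # per-device utilisation and latency
vmstat 5           # CPU, memory and I/O wait
# Ceph health and client I/O, from any node with the admin keyring
ceph -s
# interface counters for the backup traffic (run twice and compare)
ip -s link show eth0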
 
There still is no iothread setting for the disk (it's not the same as aio=threads).

Thanks @fiona! Oops, sorry about that, I got the two things confused. Updated, and I will reboot soon.

No anomalies on the load/network.

Anyway, I'm starting to see a pattern. It looks like all the VMs where the problem occurs have the Acronis cloud backup agent installed and running. Although PVE seems to be officially supported by Acronis, this changes everything, because the scenario now includes:
  1. kernel modules which interfere with the virtual drive and the filesystem
  2. both PBS and Acronis backups often being scheduled in the same time frame
I have no hard evidence yet, but my spider sense is tingling. Do you have any clue about similar issues?
 
Sorry, I've never heard of Acronis before, but it sounds like you've got a lead. Maybe scheduling them at different times already helps.
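If you want to stagger them, the PVE side is quick to check; on PVE 7.x the GUI-created backup jobs usually live in /etc/pve/jobs.cfg (older setups still use /etc/pve/vzdump.cron), while the Acronis schedule has to be changed from its own console. A quick sketch:
Code:
# see when the vzdump/PBS jobs run on the node
cat /etc/pve/jobs.cfg
cat /etc/pve/vzdump.cron 2>/dev/null
# then shift either the PVE job (Datacenter -> Backup -> Edit -> Schedule in the GUI)
# or the Acronis plan, so the two backup windows no longer overlap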
 
