VM IO drops to near 0

Chaosekie · Oct 28, 2021

Hi,

On a few of our VMs we see the following behaviour when a backup runs.

The backup started at 05:00:02

From the outside the VM seems fine (Proxmox summary page), but shortly after the backup started the guest OS remounted the disk read-only:

Code:

Oct 25 05:00:57 *******-SQL-02 kernel: [5294216.466540] EXT4-fs error (device sdb1): ext4_journal_check_start:61: Detected aborted journal
Oct 25 05:00:57 *******-SQL-02 kernel: [5294216.470832] EXT4-fs (sdb1): Remounting filesystem read-only
Oct 25 05:00:57 *******-SQL-02 kernel: [5294216.529257] EXT4-fs error (device sdb1) in ext4_reserve_inode_write:5875: Journal has aborted

Some time later the VM experiences extreme IO degradation:

image (a43584ab-63bc-4162-a1f3-4565dc6e11dc).png

The IO returns to normal almost immediately after the backup is cancelled (in the case above, the backup was cancelled at 16:16:31).

This has happened a few days in a row.
Today the VM experienced the IO degradation roughly 3 hours after the backup started.

The machine has the following disks mounted via sata:
70G
1050G
1050G
200G
150G

The usual symptom is the disk entering read-only mode, as in the logs above. The last two days have only seen IO degradation, without read-only mode.
On very rare occasions it leads to the VM corrupting its partition table (this happened to a different VM yesterday).

I read somewhere that a disk's sectors get locked when performing a backup and unlocked once processed. Is this true, and could this be the cause?
Where can we start to investigate this issue?

Backups run to PBS (although this particular VM changes most of its disk every day, so the incremental backups don't shave much off the time it takes).
The filesystem on the hosts is ZFS on Linux.
The hosts have 10GbE.

Thanks for any help

aaron · Oct 28, 2021

Chaosekie said:
The filesystem on the hosts is ZFS on Linux.

Do you have more details? What disks and how is the pool layout?

Can you please provide the output of zpool status in [code][/code] tags?

Chaosekie · Oct 28, 2021

Happy to provide any details you require

Code:

  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 2 days 20:32:04 with 0 errors on Tue Oct 12 20:56:07 2021
config:

        NAME                                                     STATE     READ WRITE CKSUM
        rpool                                                    ONLINE       0     0     0
          raidz3-0                                               ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N901880-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N608544-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N901806-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N608543-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N901881-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N608534-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N901805-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0MB00250-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N902052-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N901802-part3  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KH1T9HAJR-00005_S47PNA0N901807-part3  ONLINE       0     0     0

errors: No known data errors

  pool: rpool-2
 state: ONLINE
  scan: scrub repaired 0B in 0 days 17:49:58 with 0 errors on Sun Oct 10 18:14:02 2021
config:

        NAME                                               STATE     READ WRITE CKSUM
        rpool-2                                            ONLINE       0     0     0
          raidz3-0                                         ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0J400744  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0J401132  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H507473  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H507470  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H507481  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNXAH310938  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H507250  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNXAH311477  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H507259  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNXRH500430  ONLINE       0     0     0
            ata-SAMSUNG_MZ7KM960HAHP-00005_S2HTNX0H507265  ONLINE       0     0     0

errors: No known data errors

Chaosekie · Oct 28, 2021

In journalctl I'm seeing this when the backup started
(I've removed the items from the log that I think aren't useful such as "Starting PVESR..")

adoII · Oct 28, 2021

Same here.
No problems with fast backup server hardware and fast servers, but on a Proxmox environment at a hoster I see the same.
It seems that during backups blocks are locked for writing until they are completely written to the backup server. If this takes too long, the VM will crash.
On the VMs I see these errors in dmesg during the backups:

Code:

[4740485.032005] INFO: rcu_sched detected stalls on CPUs/tasks:
[4740486.738901]     0-...: (1 GPs behind) idle=4c7/2/0 softirq=152201674/152201675 fqs=0
[4740486.738901]     (detected by 0, t=1 jiffies, g=145639170, c=145639169, q=6562)
[4740486.738901] Task dump for CPU 0:
[4740486.738901] swapper/0       R  running task        0     0      0 0x00000008
[4740486.738901]  ffffffff85319b40 ffffffff846a889b 0000000000000000 ffffffff85319b40
[4740486.738901]  ffffffff84c1252a ffff9de1bfc19740 ffffffff8524fe00 0000000000000000
[4740486.738901]  ffffffff85319b40 00000000ffffffff ffffffff846e4835 00000000003d0900
[4740486.738901] Call Trace:
[4740486.738901]  <IRQ>
[4740486.738901]  [<ffffffff846a889b>] ? sched_show_task+0xcb/0x130
[4740486.738901]  [<ffffffff84c1252a>] ? rcu_dump_cpu_stacks+0x92/0xb2
[4740486.738901]  [<ffffffff846e4835>] ? rcu_check_callbacks+0x875/0x8d0
[4740486.738901]  [<ffffffff846fad20>] ? tick_sched_do_timer+0x30/0x30
[4740486.738901]  [<ffffffff846eb308>] ? update_process_times+0x28/0x50
[4740486.738901]  [<ffffffff846fa720>] ? tick_sched_handle.isra.12+0x20/0x50
[4740486.738901]  [<ffffffff846fad58>] ? tick_sched_timer+0x38/0x70
[4740486.738901]  [<ffffffff846ebdde>] ? __hrtimer_run_queues+0xde/0x250
[4740486.738901]  [<ffffffff846ec4bc>] ? hrtimer_interrupt+0x9c/0x1a0

itNGO · Oct 29, 2021

Hi,
When using large SATA disks in a RAIDZ pool on your backup server, I would recommend adding some SSDs for cache and journal.
This fixed almost all performance issues with backups at our datacenter.

aaron · Oct 29, 2021

@Chaosekie How is the PBS connected, and how is it performing during the backups?

Is the zpool status that you posted from the PVE node or the PBS server?

If it is from the PVE server, then I would recommend changing it whenever possible, as any raidz vdevs are not ideal for many VMs. A raid 10 like pool made up of mirror VDEVs will give you much better performance for VMs. For more details you can check out the PVE admin guide and the section about ZFS RAID level considerations.
Yes, those disks are SSDs, but not the fastest ones, and they are still connected using only SATA, so it might still be a bottleneck in a raidz config if there is enough happening with many VMs.

If those pools are from the PBS, then it should be okay and more info about how it is performing during backups might give us more insights.

adoII · Oct 29, 2021

I have a setup at Hetzner with a PBS, an 8-drive ZFS RAID 10.
Here, every time a VM is backed up to the backup server, the VM crashes with IO errors and CPU stalls.
If you need an environment to reproduce the errors, I can provide access to the setup.

Angelo · Oct 29, 2021

aaron said:
@Chaosekie How is the PBS connected, and how is it performing during the backups?

Is the zpool status that you posted from the PVE node or the PBS server?

If it is from the PVE server, then I would recommend changing it whenever possible, as any raidz vdevs are not ideal for many VMs. A raid 10 like pool made up of mirror VDEVs will give you much better performance for VMs. For more details you can check out the PVE admin guide and the section about ZFS RAID level considerations.
Yes, those disks are SSDs, but not the fastest ones, and they are still connected using only SATA, so it might still be a bottleneck in a raidz config if there is enough happening with many VMs.

If those pools are from the PBS, then it should be okay and more info about how it is performing during backups might give us more insights.

Hi Aaron,

I work with @Chaosekie so I'm adding my bit on this.

The zpool status is from the PVE host - whilst your comments are valid, we don't have performance issues on the PVE hosts using RAIDZ2/3 pools - and data resilience is critical for us. We limit default VM disk IO to 80 MB/s / 5000 IOPS (both read and write) so that we manage noisy neighbours and limit the chance that a host is overwhelmed with IO demand - where required, we'll adjust this for specific VMs. We run about 1200 VMs on this basis across about 20 hosts and PVE has proven to be a very robust platform for us.

The issue described by @Chaosekie is happening VERY sporadically to some VMs on backup - when this is happening (i.e. the VM being backed up starts experiencing severely limited disk IO/throughput), performance on the host is good (i.e. all other VMs are performing just fine). It is ONLY the specific VM being backed up that is experiencing any issues. Host IO delay is less than 2% throughout. On the vast majority of VM backups, though, we have no issues at all. And they're all being backed up to the same storage server on which PBS resides.

One of the things that's not clear to me - and I've seen the earlier response from adoII about disk blocks being 'locked' (I recall coming across this somewhere as well) - is this: the mode selected for backup is 'snapshot', so I don't understand why a disk block would be 'locked' until it has been backed up?

Kind regards,

Angelo.

adoII · Oct 29, 2021

What's also new in my case is that I have my images on local-btrfs.
@aaron, do you also use btrfs ?

aaron · Nov 4, 2021

Okay, after some days in between and with a fresh mind

@Chaosekie @Angelo how is the PBS performing when you run into the problem? Is the IO Wait graph in the Administration panel high? Maybe higher than usual?

The one thing that could cause problems is if the VM wants to write a lot of data while the PBS is a bit slower at the same time.

The reason is how VMs are backed up while they are running. This will most likely also answer your questions regarding "locked" disks.
When a VM is backed up, the backup job sends fsfreeze and thaw commands to the guest via the guest agent. This causes a very short stop of all disk IO, during which the backup job starts to intercept any write operations. At this point, we have the "snapshot". The backup job then starts reading the disk from the beginning. If the VM now wants to write data to a part of the disk that has not been backed up yet, the backup job first backs up that part of the disk out of order before it allows the write operation to continue.

With this mechanism we can ensure that the backup contains the data on disk as it was at the start of the backup. The downside is that write IO to parts of the disk that have not yet been backed up is slower. I can imagine that if the VM is writing a lot, and if anything on the way to the stored backup data is too slow (network, PBS itself, ...), you might run into the problem that you are experiencing.
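
To make the ordering described above concrete, here is a minimal toy sketch in Python. It is not the actual QEMU/PVE implementation; the class name `CopyBeforeWriteBackup` and all of its methods are made up for illustration, and the "disk" is just a list of blocks. It only shows why a guest write to a block that has not been backed up yet has to wait for that block to be saved first.

```python
class CopyBeforeWriteBackup:
    """Toy model (hypothetical, for illustration only): guest writes to blocks
    that are not yet backed up must wait until the old content is saved."""

    def __init__(self, disk):
        self.disk = disk                  # live guest disk, one entry per block
        self.backup = [None] * len(disk)  # backup target, filled block by block
        self.done = [False] * len(disk)   # which blocks have been saved already
        self.cursor = 0                   # next block for the sequential reader
        self.delayed_writes = 0           # guest writes that had to wait

    def _save_block(self, i):
        # In a real setup this is the slow part (network, backup storage).
        if not self.done[i]:
            self.backup[i] = self.disk[i]
            self.done[i] = True

    def backup_step(self):
        """Sequential pass: save the next unsaved block. False when finished."""
        while self.cursor < len(self.disk) and self.done[self.cursor]:
            self.cursor += 1
        if self.cursor >= len(self.disk):
            return False
        self._save_block(self.cursor)
        return True

    def guest_write(self, i, data):
        """A guest write: the old block is saved out of order first if needed."""
        if not self.done[i]:
            self._save_block(i)           # the write is held until this completes
            self.delayed_writes += 1
        self.disk[i] = data


disk = ["a", "b", "c", "d"]
job = CopyBeforeWriteBackup(disk)
job.backup_step()           # sequential reader saves block 0
job.guest_write(3, "X")     # block 3 not saved yet: saved first, then overwritten
job.guest_write(0, "Y")     # block 0 already saved: write is not delayed
while job.backup_step():
    pass

print(job.backup)           # ['a', 'b', 'c', 'd']
print(disk)                 # ['Y', 'b', 'c', 'X']
print(job.delayed_writes)   # 1
```

The backup ends up with the state from the start of the job, while the live disk keeps the new data. In this model the cost of every delayed write is one `_save_block` call, so the slower the path to the backup storage, the longer a write-heavy guest waits.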

aaron · Nov 4, 2021

adoII said:
@aaron, do you also use btrfs ?

I am a ZFS person and have only very limited practical experience with BTRFS.

drnoelkelly · Nov 15, 2021

We have this same problem. PVE 7.0-13.

Last weekend we lost the partition table on a small Linux guest, and this weekend, on a similarly small guest, we had ext4 corruption as per the original post (i.e. 'rcu_sched detected stalls on CPUs/tasks' and the guest file system gets remounted RO by the kernel). On reboot the file system is found to be dirty and fsck is required to fix the disk errors (which thankfully appears to have worked).

We are running a PBS but it was not involved in the issue this weekend. The snapshot backup was running to a SATA drive in the same Proxmox server. We are not using ZFS on this host. All disks are ext4 and hardware RAID1.

We only recently upgraded from PVE v4.4, and prior to this all operations had been faultless for 3+ years.

Neither of the VMs that had these issues was running qemu-guest-agent, and the option was disabled. I have now installed the qemu-guest-agent, enabled the option and cold booted.

Is there anything else we should do to mitigate this? This corruption of the guest file system is pretty concerning.

I am attaching some screenshots of the guest console and below is the start of the backup task log (VM ID 100) which appears normal of course.

Thanks

====

INFO: starting new backup job: vzdump 100 103 150 107 109 1000 --compress lzo --mailnotification failure --storage Sata10TB2 --quiet 1 --mode snapshot --mailto sysadmin@tarsus.co.uk
INFO: Starting Backup of VM 100 (qemu)
INFO: Backup started at 2021-11-15 01:00:02
INFO: status = running
INFO: VM Name: Sauk
INFO: include disk 'sata0' 'Sata:100/vm-100-disk-1.qcow2' 76G
INFO: backup mode: snapshot
INFO: ionice priority: 7
INFO: creating vzdump archive '/mnt/sata10tb2/dump/vzdump-qemu-100-2021_11_15-01_00_02.vma.lzo'
INFO: started backup task 'ae8fff60-e507-41f5-9db3-21d7451db253'
INFO: resuming VM again
INFO: 0% (177.6 MiB of 76.0 GiB) in 3s, read: 59.2 MiB/s, write: 54.5 MiB/s
INFO: 1% (848.2 MiB of 76.0 GiB) in 10s, read: 95.8 MiB/s, write: 94.3 MiB/s
INFO: 2% (1.6 GiB of 76.0 GiB) in 16s, read: 133.1 MiB/s, write: 131.4 MiB/s
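
One way to spot a backup that is slowing down is to extrapolate from progress lines like the ones above. The following Python sketch is only an illustration: the function name `estimate_remaining_seconds` is made up, the regular expression is written against the line format shown in this log and may not match other versions, and it assumes the "in Ns" value is the time elapsed since the backup started.

```python
import re

# Matches progress lines such as:
# INFO: 2% (1.6 GiB of 76.0 GiB) in 16s, read: 133.1 MiB/s, write: 131.4 MiB/s
PROGRESS = re.compile(
    r"INFO:\s+(\d+)% \(([\d.]+) (MiB|GiB) of ([\d.]+) (MiB|GiB)\) in (\d+)s"
)
TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def estimate_remaining_seconds(line):
    """Linear estimate of the remaining time from one progress line.

    Returns None if the line is not a progress line or has no usable data.
    Assumption: the average rate so far continues for the rest of the disk.
    """
    m = PROGRESS.search(line)
    if m is None:
        return None
    done = float(m.group(2)) * TO_MIB[m.group(3)]
    total = float(m.group(4)) * TO_MIB[m.group(5)]
    elapsed = int(m.group(6))
    if elapsed == 0 or done == 0:
        return None
    rate = done / elapsed               # average MiB/s since the start
    return (total - done) / rate


line = "INFO: 2% (1.6 GiB of 76.0 GiB) in 16s, read: 133.1 MiB/s, write: 131.4 MiB/s"
eta = estimate_remaining_seconds(line)
print(round(eta))                       # 744
```

If the estimate keeps growing from one progress line to the next, the backup is getting slower, which would be the window in which guest writes are most likely to be held up.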
