Proxmox 8.0.9 Sudden IO spikes

Anfernee12345

Hello,

First, my specs:
Supermicro H11DSI-NT v2
Dual EPYC 7601
512 GB RAM
Proxmox itself runs off a 980 Pro attached directly to the motherboard.
For the VMs I have a Sabrent 4 Drive NVMe SSD to PCIe 4.0 x16 Active Cooling Adapter Card with 4 x 980 Pro 1 TB.
In the BIOS, the PCIe slot is set to 4x4x4x4 bifurcation.

I'm having trouble figuring out what is causing these sudden IO spikes on my VMs.
If the spike hits only a few machines, it goes away within minutes. But usually many machines spike at the same time, and that is when it lasts for an hour and a half.
The attached pictures show the Day timeframe (at that moment I had about 120 VMs running). One is from PVE and one is from a VM. The spike lasts for about an hour and a half, and the spikes occur at least once, twice, or even three times a day.
All the VMs are basically the same. They have only a Chromium-based browser installed and they surf the web. All VMs do the same thing.
Once I happened to be at the PC when it occurred, and iotop showed that the browser was making the writes, but how and why?
Is that normal, and can I disable it somehow?
I have tried playing with the VM settings, but I haven't seen a change. Whatever I do, it always ends up the same way.

1. It doesn't matter which version of Ubuntu/Xubuntu I'm using: 16.04, 22.04, or anything in between, they all behave the same.
2. I have tried placing the VMs on LVM, LVM-thin, and a directory (XFS); still nothing changes.

If anyone has any idea what to look for and where, or what else I can try, I would be very thankful.
 

Attachments

  • Screenshot 2023-11-24 094539.png
  • Screenshot 2023-11-24 101955.png
Last edited:
Do you use a swap file/partition in the VM?
Do you see any related messages in the journal on the host as well as the VMs?
 
1. Conky shows 0/975 MB of swap used, so I assume that's a yes :)
2. About the journal: I'm not that experienced with the in-depth parts of Linux. How do I check it?

I have now also included a journalctl log file for the past 24 h from the VM. The spike started at 4:30 and ended around 5:30.
I didn't find anything in there, but maybe you can. The spike went up to 4.9 M from the usual 181 K.
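For the swap question, a small sketch that reads the numbers straight from the kernel instead of relying on a desktop widget may help. It assumes a Linux guest with `/proc/meminfo`; the function name `swap_used_kb` is made up for this example and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print the amount of swap currently in use, in kB.
# Reads a meminfo-style file; defaults to /proc/meminfo on the running system.
swap_used_kb() {
    awk '/^SwapTotal:/ {total = $2}
         /^SwapFree:/  {free = $2}
         END {print total - free}' "${1:-/proc/meminfo}"
}

# Usage inside a VM:
#   swap_used_kb
```

A result that stays at or near 0 while an IO spike is in progress would suggest the writes are not swap traffic.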
 

Attachments

  • New Text Document.txt
Last edited:
Please run the following command on the host and on a VM and upload the resulting files.

Bash:
journalctl --since "2023-11-23" --until "2023-11-24" >| $(hostname)-journal.txt

To see what happens when the issue persists for 1.5h, please adapt the dates accordingly
 
On the VM there were two spikes, from 00:00-01:00 and from 05:00-06:00.
On PVE the big spike was from 05:00-06:30.
 

Attachments

  • pve-journal.txt
  • VM-journal.txt
Turning off swap didn't make any difference.
One of the VMs I turned it off for went from the regular 200 K to 4.5 M.
It lasted for 3 minutes.
 
There is no clear indication in the journals why this is happening.
Please post your Storage and VM config. Please replace <<vmid>> with a VMID of an affected VM.

Bash:
head -n -1 /etc/pve/storage.cfg /etc/pve/qemu-server/<<vmid>>.conf
 
==> /etc/pve/storage.cfg <==
dir: local
        path /var/lib/vz
        content vztmpl,backup,iso

lvmthin: local-lvm
        thinpool data
        vgname pve
        content images,rootdir

lvm: corsair-2tb
        vgname corsair-2tb
        content rootdir,images
        nodes pve
        shared 0

dir: Pro1
        path /mnt/pve/Pro1
        content iso,backup,images,snippets,vztmpl,rootdir
        is_mountpoint 1
        nodes pve

dir: Pro2
        path /mnt/pve/Pro2
        content images,iso,backup,vztmpl,rootdir,snippets
        is_mountpoint 1
        nodes pve

dir: Pro3
        path /mnt/pve/Pro3
        content rootdir,vztmpl,snippets,images,backup,iso
        is_mountpoint 1
        nodes pve

dir: Pro4
        path /mnt/pve/Pro4
        content images,vztmpl,rootdir,iso,snippets,backup
        is_mountpoint 1
        nodes pve

==> /etc/pve/qemu-server/425.conf <==
#Xubuntu-16.04.6-IDE-names
agent: 1
boot: order=ide0;ide2;net0
cores: 1
cpu: x86-64-v2-AES
ide0: Pro1:425/vm-425-disk-0.qcow2,size=12G
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=8.1.2,ctime=1700597497
name: 22.11.23-25
net0: e1000=BC:24:11:4C:3B:54,bridge=vmbr1,firewall=1,tag=106
numa: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=edde09ec-7580-4198-81e5-6b99bbd64e43
sockets: 2
vga: qxl,memory=32
 
Picture 1 shows a VM as seen from the Proxmox web GUI; while the spike was happening I also managed to take a screenshot of iotop inside the VM, which is picture 2. It's a Chromium-based browser. What is causing this, and what could or should I test next?

It's not bad when 5 machines do it simultaneously, but when it's more, like 20 or even more, my server load jumps to 200 from its usual 35.
 

Attachments

  • 1.png
  • 2.png
One other thing I have noticed is that the big spikes occur once a day and they always start at 05:00 AM.
Any ideas what to search for and where? I'm thinking that this comes from inside the VMs.
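Since the big spike starts at a fixed time, one thing worth ruling out is a scheduled job inside the guests. The sketch below is only an illustration: `cron_entries_for_hour` is a made-up helper that lists crontab lines whose hour field is the given hour or `*`. It does not understand ranges or steps such as `4-6` or `*/2`, so treat its output as a starting point.

```shell
#!/bin/sh
# Hypothetical helper: list crontab entries that can fire during a given hour.
# Only an exact hour or "*" in the hour field (2nd column) is recognised.
cron_entries_for_hour() {
    hour=$1; shift
    awk -v h="$hour" '
        NF == 0 || $1 ~ /^#/  {next}   # skip blank lines and comments
        $1 ~ /^[A-Za-z_]+=/   {next}   # skip VAR=value assignments
        $2 == h || $2 == "*"  {print FILENAME ": " $0}
    ' "$@"
}

# Usage inside a VM (always pass at least one file):
#   cron_entries_for_hour 5 /etc/crontab /etc/cron.d/*
```

On systemd-based guests, `systemctl list-timers --all` shows the timer units and their next run times, which covers jobs that are not in cron at all.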
 
It seems to be jbd2's fault, doesn't it?
It looks like an ext4 journal commit update.
If it's an fsync IO operation, consumer SSDs can't sustain many of them.
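To get a rough feel for how a storage path copes with synchronous writes, a small probe like the following could be run once on the host and once inside a VM. It is a sketch under assumptions: it relies on GNU `dd` supporting `oflag=dsync`, the name `sync_write_probe` is invented here, and forcing each 4 KiB block to stable storage only approximates what an fsync-heavy journal commit does.

```shell
#!/bin/sh
# Hypothetical probe: write COUNT blocks of 4 KiB, each forced to stable
# storage (O_DSYNC), and print dd's summary line with the achieved throughput.
sync_write_probe() {
    dir=${1:-.}
    count=${2:-1000}
    file="$dir/sync-write-probe.$$"
    dd if=/dev/zero of="$file" bs=4k count="$count" oflag=dsync 2>&1 | tail -n 1
    rm -f "$file"
}

# Usage: probe one of the VM storage mount points from the thread
#   sync_write_probe /mnt/pve/Pro1 1000
```

A very low figure compared with the drive's ordinary sequential speed would be consistent with the point about consumer SSDs and sync-heavy workloads, but it does not prove that jbd2 is the cause.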
 
I might be a bit of a dummy about this, but doesn't jbd2 show 0 KB/s for both reads and writes?
And if this is the cause, can I somehow disable it on a few machines just to try it out?
Or are there any other suggestions for what I could try?
 
Any ideas how I could disable or override it?
I now see that when a VM is running normally, jbd2 isn't using any IO. When these peaks happen, it is using most of the IO, so it is probably the cause.
These are VMs for my own use, so if I break them I just make new ones and no harm is done.
At this point I would be willing to try anything, because I have been looking through the internet for a while now and haven't found a single solution that makes any difference. Any kind of help would be more than appreciated.
 
I see that you're attaching your disk as IDE. Try attaching it via SCSI. To do this, go to Hardware in the GUI, select the disk, press Detach, and then attach it again by pressing Edit on the unassigned disk.
You should also select Discard as well as IO thread.

You mentioned that the issue typically happens at 5 am. Do you maybe have a backup job running at that time?

Also, please post the output of findmnt on the host.
 
1. I made an IDE one just to see whether it changes anything. I have the same number of SCSI ones as well, and they all behave the same. I don't have Discard enabled, but IO thread is enabled.
2. Yes, the big issue happens at 5:00 every morning. These problems also happen throughout the day, but on a smaller scale, with just a few machines at a time. At 5:00 it goes crazy, with a lot more VMs, and the server load spikes to around 160 with the number of machines I have right now. As far as I know I don't have any backup jobs configured; at least I haven't created any :)

3. The findmnt output is attached.

4. This picture shows well what happens at 5:00 every day :)
I guess that the more VMs I make, the longer it lasts, because more machines are affected by it, whatever "it" is.
 

Attachments

  • findmnt.txt
  • Screenshot 2023-11-27 144755.png
Last edited:
Enable the "discard" and "ssd" emulation options on each vdisk.
Should I do it for only 10 or 20 VMs for testing? It's a lot of work to do it for all 160 just for testing, but if it is needed I will do it.
In your opinion, could that fix the problem?
 
Yes, if TRIM isn't used, the SSD will slow down.
Guests do defrag instead of trim/discard operations.
The "sed" command can make the replacement in all .conf files at once.
 
I made the changes for 10 VMs for starters.
Whatever is happening to the other VMs shouldn't happen to these 10 now, if I understand correctly?

In the future, please talk to me like I'm a 5-year-old when it comes to Linux specifics like this :)
I know some things, but not in that much depth.
Can you please write the sed command that would make the changes for all the VMs at once?
And I have to restart the machines anyway, because the changes won't take effect before that, right?
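One possible shape for such a sed command is sketched below. It is not taken from the thread: the function name `enable_discard_ssd` and the backup directory are invented for this example. It appends `discard=on,ssd=1` to every `ideN:`, `sataN:` and `scsiN:` disk line that is not a CD-ROM and does not already set `discard=`. VirtIO Block disks (`virtioN:`) are left alone on the assumption that they do not accept the `ssd` flag. Lines inside snapshot sections of a config file are changed as well, because the sketch does not tell them apart.

```shell
#!/bin/sh
# Hypothetical helper: enable discard and SSD emulation in Proxmox VM configs.
# Usage: enable_discard_ssd BACKUP_DIR CONFIG_FILE...
# Every file is first copied to BACKUP_DIR, then rewritten from that copy.
enable_discard_ssd() {
    backup_dir=$1; shift
    mkdir -p "$backup_dir" || return 1
    for conf in "$@"; do
        backup="$backup_dir/$(basename "$conf")"
        cp "$conf" "$backup" || return 1
        # Match ideN:/sataN:/scsiN: lines, skip CD-ROMs and lines that
        # already carry a discard= setting, append the two options otherwise.
        sed -E '/^(ide|sata|scsi)[0-9]+:/{/media=cdrom|discard=/!s/$/,discard=on,ssd=1/;}' \
            "$backup" > "$conf" || return 1
    done
}

# Usage on the host (the backup path is only an example):
#   enable_discard_ssd /root/vm-conf-backup /etc/pve/qemu-server/*.conf
```

As far as I know, a running VM only picks up such disk options after a full stop and start from the host, not after a reboot from inside the guest, and the guest still has to issue TRIM itself, for example with `fstrim -av`. Trying it on a handful of configs first and comparing them against the backups seems the safer order.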
 
