I/O scheduler

tog

Has anybody else noticed that the I/O scheduler seems to perform very "unfairly"?

During a vzdump I pretty much see the entire system's load average, inside the OpenVZ containers and even on the host, skyrocket to 5-20 with no actual load on the system. Apparently everything that is not the vzdump/tar is just waiting and waiting and waiting on I/O. vzctl set 0 --ioprio 2 and vzctl set 102 --ioprio 6 help a little, but not much; anything inside an OpenVZ container can pretty much forget about getting any I/O time. Even the KVM guest suffers: you stop being able to ping it for 10 seconds and then all the pings come back at once.
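For reference, these are the exact commands I used, in case anyone wants to try the same tuning (102 is my Plesk container; the --save flag is the only thing added here, it just makes the setting survive a restart):

Code:
# lower the I/O priority of VE 0 (the host, where vzdump/tar runs)
vzctl set 0 --ioprio 2 --save
# raise the I/O priority of the Plesk container (range is 0-7, default 4)
vzctl set 102 --ioprio 6 --save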
 
A quick follow-up: I did some reading on the subject.

I tried out the anticipatory scheduler as well and got about the same results: ridiculous wait times for occasional I/O requests, not quite as long as with the now-default CFQ scheduler, but pretty nearly.

Then I tried the deadline scheduler and all is good, like I'm using Linux 2.4 again. I can do a tar without other processes waiting over 30 seconds for small occasional I/O requests. Everything that is not the tar process still remains snappy and happy.
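If anyone wants to flip schedulers on a live box to compare, it is just a sysfs write; sda below stands in for whatever your actual device is:

Code:
# list the available schedulers; the active one is shown in [brackets]
cat /sys/block/sda/queue/scheduler
# switch to deadline on the fly, no reboot needed
echo deadline > /sys/block/sda/queue/scheduler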
 
I have also noticed that on older hardware (slow disks), e.g. when making LVM snapshots.

On up-to-date systems with fast disks and HW RAID everything works fine (or maybe the problem just doesn't show up).

So what HW do you use?

- Dietmar
 
Considering the type of job this box is performing, nothing special. Two 500GB Seagate 7200RPM "ES" drives on a 3ware 8006-LP2 in a simple mirror.

Since I don't follow along real closely with Linux (everything else is BSD around here), having a simple tar grind everything else on the system to a halt was quite a shock to me.

Oh well, now I know not to use the two fancy new I/O schedulers.

I'll probably put two 10k RPM drives in a mirror for the next Proxmox box and see if the CFQ scheduler still acts the same. Despite my current low opinion of it, somebody must think it's awfully fancy to have made it the default I/O scheduler as of 2.6.18.
 
Well, I tested here and backing up a 2GB VM using vzdump with snapshot works quite well (2 minutes, load < 0.4).

Do you know the difference between the CFQ and deadline schedulers - is there some documentation around about the differences/advantages?

- Dietmar
 
So if I have similar poor results on a 10k RPM drive, would you consider changing the default Proxmox kernel config to:

CONFIG_DEFAULT_IOSCHED="deadline"

So future new Proxmox users don't get a nasty surprise when they go to do backups? Proxmox is obviously more of a virtualization server to be placed in a rack than a desktop environment to sit down at and use Firefox on, so avoiding the two newer I/O schedulers, which seem to grind everything else to a halt, might be the better default.
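In the meantime, the same thing can be done per box without rebuilding the kernel by passing the elevator option at boot; something like this in GRUB (just a sketch, the actual kernel line on your box will differ):

Code:
# /boot/grub/menu.lst -- append elevator=deadline to the existing kernel line
kernel /boot/vmlinuz-2.6.x-pve root=/dev/pve/root ro elevator=deadline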
 
Please can you post the output of:

# pveperf

On my box I get:

CPU BOGOMIPS: 17027.41
REGEX/SECOND: 737837
HD SIZE: 94.49 GB (/dev/pve/root)
BUFFERED READS: 179.22 MB/sec
AVERAGE SEEK TIME: 9.20 ms
FSYNCS/SECOND: 1329.84
DNS EXT: 30.98 ms
DNS INT: 0.86 ms

- Dietmar
 
Well, the only way you can reproduce what I'm seeing is to have other tasks trying to do constant little bits of I/O in the background. For example, I have a Plesk server in a container and it receives about one piece of mail every 1-3 seconds. Not very heavy, but if you telnet to the Plesk server on port 25, the SMTP banner takes 20-60 seconds to come up with the two newer schedulers, yet it is fine and normal with deadline during a tar.

As for documentation on the differences: here's what I was reading earlier. I was more interested in simply figuring out what had happened lately in Linux 2.6 than in finding specific complaints or similar experiences. Everybody can say what a fantastic idea the new schedulers are, but if I can't get a response from my SMTP server for 60 seconds during a tar, I'm going to draw my own conclusions.

http://www.linuxinsight.com/cfq_to_become_the_default_i_o_scheduler.html
http://www.linuxjournal.com/article/6931
http://lwn.net/Articles/114770/
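For the curious, deadline's knobs are also visible under sysfs, which makes the latency bounds it enforces fairly obvious (sda again stands in for your device, and the values in the comments are, as far as I know, the stock 2.6 defaults):

Code:
# per-device deadline tunables (only present while deadline is the active scheduler)
ls /sys/block/sda/queue/iosched/
# fifo_batch  front_merges  read_expire  write_expire  writes_starved
cat /sys/block/sda/queue/iosched/read_expire    # 500 ms before a queued read is force-dispatched
cat /sys/block/sda/queue/iosched/write_expire   # 5000 ms for writes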

The other thing you can do to keep the entire rest of the system from grinding to a halt during a simple tar under CFQ is to run the tar with nice or ionice, but I think I'd rather just use a scheduler that never lets I/O requests go unserviced for 20-60 seconds.
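For anyone who wants to stay on CFQ, that workaround looks roughly like this (the container ID and paths are just placeholders, and the ionice classes only take effect under CFQ):

Code:
# run the backup in the idle I/O class: it only gets the disk when nothing else wants it
ionice -c3 tar czf /backup/102.tar.gz /var/lib/vz/private/102
# or keep it best-effort but at the lowest priority, and go easy on the CPU too
nice -n 19 ionice -c2 -n7 vzdump --snapshot 102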
 
Nice disks you've got there. Mine is pretty much what I'd expect out of a simple 7200 RPM single-disk setup (which is actually a mirror on a 3ware controller):

Code:
# pveperf
CPU BOGOMIPS:      37243.82
REGEX/SECOND:      770397
HD SIZE:           94.49 GB (/dev/pve/root)
BUFFERED READS:    47.83 MB/sec
AVERAGE SEEK TIME: 10.93 ms
FSYNCS/SECOND:     629.16
DNS EXT:           83.94 ms
DNS INT:           1.10 ms
 
Sure, but first we need to understand why it only happens on your box.

- Dietmar

Best guess is because I have half the I/O-per-second performance you do, given that my disk setup is incredibly basic.

But, really, does one expect to need four 10k RPM drives in a RAID10 just to run a low-traffic Plesk server with almost no I/O?
 
Thank you very much for your consideration and responsiveness.

I can't find it right now, but I saw a couple of things via Google where people mention that they use the fancy new I/O schedulers everywhere except on "busy servers" for exactly this reason. It seems like the new default CFQ is designed for maximum "foreground process" performance and everything else can just sit there and rot.
 
No, I am using a RAID 10.

But in the Oracle link they mention a bug in CFQ - maybe that is what gets triggered here.

Unfortunately I can't find more info about that bug.

- Dietmar
 
RAID10 on what kind of controller?

Perhaps then you don't need to change the default scheduler, so long as the information that this can happen to some setups, and how to fix it, is readily accessible, perhaps in the wiki.
 
I can confirm this (currently only on OpenVZ systems with the stock OpenVZ kernel, but I expect the same for Proxmox).

I have a nightly cronjob running which backs up all VEs with vzdump / LVM online.
The attached picture shows day 1 with cfq and day 2 with deadline/noop.

If vzdump is called directly from the command line, the load average is higher (peaks up to 15-20, really unresponsive system).

The box should be powerful enough:
2x quad-core Xeon / LSI MegaRAID 8308ELP
2x RAID 5: 3x 74 GB 15k SAS / 3x 500 GB SATA

Vzdump dumps from SAS -> SATA, so the source and destination aren't even on the same disc group.
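For completeness, the nightly job is roughly of this shape (a sketch with placeholder schedule and paths, not my exact crontab entry):

Code:
# /etc/cron.d/vzdump-nightly
# dump all VEs via LVM snapshot, from the SAS group to the SATA group
30 2 * * * root vzdump --all --snapshot --dumpdir /backup/vzdump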

Regards
Matthias
 

Attachments

graph_image.php.png (load graph: day 1 with cfq, day 2 with deadline/noop)

Please can you also post the output of:

# pveperf

(the script is contained in the pve-manager package)

- Dietmar
 
I have varying results at the moment. I'll redo the test tonight with OpenVZ shut down.

But it may actually be interesting to test the system under load:

noop:
CPU BOGOMIPS: 25537.48
REGEX/SECOND: 576110
HD SIZE: 2.75 GB (/dev/sda1)
BUFFERED READS: 141.22 MB/sec
AVERAGE SEEK TIME: 4.43 ms
FSYNCS/SECOND: 1639.22
DNS EXT: 96.64 ms
DNS INT: 3.67 ms (mtc-server.de)

cfq:
CPU BOGOMIPS: 25537.48
REGEX/SECOND: 581994
HD SIZE: 2.75 GB (/dev/sda1)
BUFFERED READS: 135.77 MB/sec
AVERAGE SEEK TIME: 16.51 ms
FSYNCS/SECOND: 111.26
DNS EXT: 110.50 ms
DNS INT: 3.90 ms (mtc-server.de)


With cfq, 90% of the FSYNCS/SECOND values are below 200; with noop, the worst is 1100 and most are above 1500.
 
