Proxmox: high IO delay for no apparent reason

AxelTwin

Hi everybody,
For the past few days the IO delay on my Proxmox server has averaged 30%, making all containers run very slowly.
The disks are enterprise-grade SSDs in a brand-new Dell PowerEdge 740.
There are only 2 containers (1 Zimbra, 1 Samba Active Directory) and 1 VM, with 35 users in total.
There are 4 disks in a raidz1 pool.
Everything looks fine; I don't know where to look.
Any suggestions would be appreciated.
 

Attachment: Capture.PNG (138.9 KB)

Only a loose thought:

So sdc, sdd and sde in your screenshot are all part of the mentioned raidZ?
It would also be interesting to see the fourth disk alongside the others in atop, as well as the output of zpool status.

Only one of the three disks shown (the fourth is unknown) is critically busy, yet in a raidz essentially every IO operation involves all disks in the pool. That makes me suspect a problem with that specific disk, most likely a hardware one and possibly shortly before failure, rather than a higher-level problem such as the software.

Did you already check the SMART status, the Dell hardware status/diagnostics (iDRAC? Lifecycle Controller?) and especially the syslog?
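
One way to put numbers on "one disk busy, the others not" without atop is to diff two snapshots of /proc/diskstats. The following is only a rough sketch: disk_busy is a made-up helper name, the /tmp paths in the usage comment are placeholders, and it assumes the usual Linux layout in which field 3 is the device name and field 13 is the cumulative time spent doing I/O, in milliseconds.

```shell
# Hypothetical helper: compare two /proc/diskstats snapshots and print, for
# each whole sdX disk, the percentage of the interval it spent doing I/O.
disk_busy() {
    # $1 = first snapshot file, $2 = second snapshot file, $3 = interval in seconds
    awk -v secs="$3" '
        NR == FNR { before[$3] = $13; next }       # first file: remember the counters
        $3 ~ /^sd[a-z]+$/ && ($3 in before) {      # whole disks only, skip partitions
            # busy ms / (secs * 1000 ms) * 100 == busy ms / (secs * 10)
            printf "%s %.1f\n", $3, ($13 - before[$3]) / (secs * 10)
        }
    ' "$1" "$2"
}

# Usage on the host (5-second sample):
#   cat /proc/diskstats > /tmp/a; sleep 5; cat /proc/diskstats > /tmp/b
#   disk_busy /tmp/a /tmp/b 5
```

If one raidz member sits near 100% while its siblings stay low under the same workload, that supports the single-disk theory above.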
 
Thanks for your reply, Neobin.
Yes, sdc, sdd, sde and sdf are all in the raidz.
I checked the SMART status and iDRAC; both report that everything is OK.
In atop you never see all 4 disks; it always shows 3 of them, alternating, with sdc being always on and busy (normal behaviour?).
I have taken sdc offline from the pool and things are back to normal.
I am not sure what to conclude from that...

Code:
root@hyperviser:~# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:31 with 0 errors on Sun Aug 14 00:24:32 2022
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            ata-HFS480G3H2X069N_BNA9N7194I280AA1Z-part3  ONLINE       0     0     0
            ata-SSDSC2KB480G8R_BTYF138204M5480BGN-part3  ONLINE       0     0     0

errors: No known data errors

  pool: storage
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 33.7M in 00:08:03 with 0 errors on Tue Sep  6 18:41:36 2022
config:

        NAME        STATE     READ WRITE CKSUM
        storage     DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdc     OFFLINE      0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
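
With more pools or more disks, the interesting rows of a listing like the one above can be pulled out with a small filter. This is a sketch only: not_online is a hypothetical helper name, and it assumes the config table layout shown above (a NAME/STATE header row, one row per pool, vdev or device, ended by a blank line).

```shell
# Hypothetical helper: read `zpool status` text on stdin and print the name
# and state of every config-table row whose STATE column is not ONLINE.
not_online() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }   # config table header
        NF == 0 || $1 == "errors:"    { in_cfg = 0 }          # table has ended
        in_cfg && $2 != "ONLINE"      { print $1, $2 }
    '
}

# Usage: zpool status | not_online
```

For the output above this would list storage and raidz1-0 as DEGRADED and sdc as OFFLINE, and nothing from rpool.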
 
with sdc being always on and busy (normal behaviour?).

I would say that is definitely not normal behavior.

I have taken sdc offline from the pool and things are back to normal.

That points even more strongly to a problem with this specific drive.

You could check whether the drive's manufacturer provides a diagnostic tool and run it against the drive. (But even if it reports "All OK", that is not necessarily 100% proof!)

I would get a known-good replacement drive, put it into the pool in place of sdc and see how it behaves. If everything runs well over a reasonable period, it should be safe to say that the old drive is defective.

PS: Another thing I forgot to ask before: did you check whether firmware updates are available for those SSDs, especially the problematic one?
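
For the swap itself, the zpool status output earlier in the thread already points at 'zpool replace'. The sketch below only prints the commands instead of running them, so the order can be reviewed first. replace_plan is a hypothetical helper name and NEW-DISK-ID is a placeholder for whatever identifier the replacement drive really gets.

```shell
# Hypothetical dry-run helper: print, but do not execute, the commands for
# replacing one member of a pool. All three arguments come from the caller.
replace_plan() {
    pool=$1 old=$2 new=$3
    printf 'zpool replace %s %s %s\n' "$pool" "$old" "$new"   # starts the resilver
    printf 'zpool status %s\n' "$pool"                        # then watch its progress
}

# NEW-DISK-ID is a placeholder, not a real device name.
replace_plan storage sdc /dev/disk/by-id/NEW-DISK-ID
```

Until the resilver has finished the raidz1 has no redundancy left, so this is best done with current backups at hand.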
 
I didn't think about a firmware update, smart answer!
I'll check that first and replace the disk if it doesn't resolve the issue.
 
To add for completeness:
Of course, more parts are involved in the chain, such as the cable(s), ports, the controller and maybe a backplane.
So to investigate further, you could for example swap the problematic drive with one of the others with respect to backplane slot / cable / controller port. The usual drive troubleshooting steps.
But for those tests, and especially with a redundancy of only one drive (raidz1), the pool (or even better, the whole server) should definitely not be in production.
In any case, make sure you have recent and functional backups!
 