[SOLVED] I/O delay problem after RAID rebuild

vigilian

Hi,

I have a Proxmox server hosted at OVH, and today I had the unpleasant surprise of finding a consistency problem in my software RAID 1 (mdadm). Apparently (judging from the syslog), the system stopped immediately and rebooted to rebuild the array:
Mar 8 13:01:36 ns339453 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="937" x-info="http://www.rsyslog.com"] start
Mar 8 13:01:36 ns339453 systemd-modules-load[272]: Module 'fuse' is builtin
Mar 8 13:01:36 ns339453 systemd-modules-load[272]: Inserted module 'vhost_net'
Mar 8 13:01:36 ns339453 hdparm[311]: RAID status not OK. Exiting. ... failed!
Mar 8 13:01:36 ns339453 mdadm-raid[310]: Generating udev events for MD arrays...done.
Mar 8 13:01:36 ns339453 lvm[421]: 1 logical volume(s) in volume group "pve" now active
Mar 8 13:01:36 ns339453 lvm[433]: 1 logical volume(s) in volume group "pve" now active
Mar 8 13:01:36 ns339453 lvm[439]: 1 logical volume(s) in volume group "pve" monitored
Mar 8 13:01:36 ns339453 systemd-fsck[435]: /var/lib/vz : récupération du journal
Mar 8 13:01:36 ns339453 kernel: [ 0.000000] Initializing cgroup subsys cpuset
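
For readers who want to confirm the array state after such a reboot, the kernel exposes it in /proc/mdstat. Below is a minimal sketch, not taken from this server, that parses that file and flags degraded arrays. The function name `parse_mdstat` is hypothetical, and the parsing assumes the usual `[expected/active] [UU]` status format.

```python
import os
import re


def parse_mdstat(text):
    """Parse /proc/mdstat-style text into {array: status dict}.

    Assumes each array has a header line such as 'md2 : active raid1 ...'
    followed by a line containing '[2/2] [UU]' (expected/active members).
    """
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\w+)\s*:\s*(\w+)", line)
        if header:
            current = header.group(1)
            arrays[current] = {
                "state": header.group(2),
                "expected": None,
                "active": None,
                "degraded": None,
            }
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected = int(status.group(1))
            active = int(status.group(2))
            # '_' marks a missing or failed member in the per-device map.
            degraded = active < expected or "_" in status.group(3)
            arrays[current].update(
                expected=expected, active=active, degraded=degraded
            )
    return arrays


if __name__ == "__main__":
    # Only meaningful on a Linux host that uses md software RAID.
    if os.path.exists("/proc/mdstat"):
        with open("/proc/mdstat") as handle:
            for name, info in parse_mdstat(handle.read()).items():
                print(name, info)
```

An array reported with `degraded` set to true is still running on fewer members than expected, or is in the middle of a rebuild.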



Since this incident, or maybe before without my noticing, I have a high I/O delay, though not a constant one. With no VM running, I see a spike in I/O delay of 0.22% up to 1 or 2% roughly every 1 or 2 seconds. I know this shows that there is a problem, but I honestly don't know what it is. The CPU spikes come from the pveproxy worker process, which uses about 1 or 2% of the CPU (or of one core, I presume) every 1 or 2 seconds. But according to the graph, I regularly have an I/O delay of 1.2% to 2% without any real CPU spike.
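
To cross-check the graph outside the web interface, the same kind of figure can be computed from two samples of /proc/stat. This is a minimal sketch that assumes the I/O delay shown by Proxmox corresponds to the kernel's iowait share of CPU time; the function names are hypothetical.

```python
def cpu_times(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two /proc/stat samples.

    Field order on the 'cpu' line: user, nice, system, idle, iowait,
    irq, softirq, steal (guest fields are already counted in user/nice).
    """
    first = cpu_times(before)[:8]
    second = cpu_times(after)[:8]
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    import time

    with open("/proc/stat") as handle:
        sample_a = handle.read()
    time.sleep(2)  # sampling interval, chosen arbitrarily
    with open("/proc/stat") as handle:
        sample_b = handle.read()
    print(f"iowait over 2 s: {iowait_percent(sample_a, sample_b):.2f}%")
```

Sampling every second or two should make short spikes like the ones described above visible on an otherwise idle host.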

I have run short SMART self-tests on the disks, but nothing showed up. I am now running the long SMART self-test.


I have also noticed that some files have apparently been damaged in one of my VMs. All the VMs on this server use qcow2 images. Is that possible even with RAID 1, or should I assume it is something else, such as a break-in?
Another VM, which only hosts a WordPress site, is really slow, and when I start it the I/O delay goes up to around 8%.
Could it be a hard disk failure? A controller failure? Something else? Or is the problem deeper, and should I reinstall everything? What should I do?
 

Attachments

  • 2016-03-09.png
So, after extensive disk tests that revealed no errors, a replacement of all hardware except the disks (motherboard, RAM, controller, CPU), and a reinstallation of the system, the I/O delay problem is gone and the value is back in the normal range. So it was most likely a corrupted system file.
Dev team, could you please find a better way to check the contents of the file system? Something like a hash of all critical system files?
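
To illustrate what such a check could look like, here is a minimal sketch that records SHA-256 hashes of every regular file under a directory and later compares a fresh scan against that baseline. It is not a Proxmox feature; the function names are hypothetical, and the choice of which directories count as "critical" is left to the reader. On Debian-based systems, existing tools such as debsums already verify installed package files against their recorded checksums.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its SHA-256."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def diff_manifests(baseline, current):
    """Report files that changed, disappeared, or appeared since baseline."""
    return {
        "changed": sorted(
            p for p in baseline if p in current and baseline[p] != current[p]
        ),
        "missing": sorted(p for p in baseline if p not in current),
        "added": sorted(p for p in current if p not in baseline),
    }
```

The baseline would have to be stored somewhere the running system cannot silently alter, for example on another machine, otherwise a corrupted or compromised host could not be trusted to check itself.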
 
