Hi,
I have a Proxmox server hosted at OVH, and today I had the unpleasant surprise of finding an inconsistency in my software RAID 1 (mdadm). Apparently (judging from the syslog), the system stopped abruptly and rebooted to rebuild the array.
Mar 8 13:01:36 ns339453 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="937" x-info="http://www.rsyslog.com"] start
Mar 8 13:01:36 ns339453 systemd-modules-load[272]: Module 'fuse' is builtin
Mar 8 13:01:36 ns339453 systemd-modules-load[272]: Inserted module 'vhost_net'
Mar 8 13:01:36 ns339453 hdparm[311]: RAID status not OK. Exiting. ... failed!
Mar 8 13:01:36 ns339453 mdadm-raid[310]: Generating udev events for MD arrays...done.
Mar 8 13:01:36 ns339453 lvm[421]: 1 logical volume(s) in volume group "pve" now active
Mar 8 13:01:36 ns339453 lvm[433]: 1 logical volume(s) in volume group "pve" now active
Mar 8 13:01:36 ns339453 lvm[439]: 1 logical volume(s) in volume group "pve" monitored
Mar 8 13:01:36 ns339453 systemd-fsck[435]: /var/lib/vz : récupération du journal
Mar 8 13:01:36 ns339453 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Since this incident (or maybe before, but I had not noticed it), I have a high but intermittent I/O delay. Roughly every 1 or 2 seconds, with no VM running, I see a spike in I/O delay from 0.22% up to 1 or 2%. I know this indicates a problem, but I honestly don't know what it is. The CPU spike comes from the pveproxy worker process, which every 1 or 2 seconds uses something like 1 or 2% of the CPU (or of one core, I presume). But according to the graph, I regularly have an I/O delay of 1.2% to 2% without any real CPU spike.
I have run some short SMART self-tests on the disk, but nothing shows up. I am now launching a long SMART self-test.
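Alongside the SMART tests, the state of the mdadm array itself can be read from /proc/mdstat. Below is a minimal Python sketch of that check, not something I have validated on this server: the function name `degraded_arrays` is hypothetical, and the `SAMPLE` text is made-up illustrative data, not output from my machine. It assumes the usual mdstat layout, where each array has a member-status field such as `[UU]` (all members present) or `[U_]` (one member missing).

```python
import re


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status field (e.g. [U_]) shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # A line such as "md2 : active raid1 sda2[0] sdb2[1]" starts a new array entry.
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status field contains only 'U' (up) and '_' (missing) characters.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Made-up sample for illustration only (NOT taken from the server in this post).
SAMPLE = """\
Personalities : [raid1]
md2 : active raid1 sda2[0] sdb2[1]
      20478912 blocks [2/2] [UU]
md3 : active raid1 sda3[0]
      1932506048 blocks [2/1] [U_]
unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real host, the same function could be fed the contents of /proc/mdstat, for example `degraded_arrays(open("/proc/mdstat").read())`. An empty list would only mean that no member is marked missing; it says nothing about whether the two mirrors actually hold identical data.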
I have also noticed that some files have apparently been damaged in one of my VMs. All my VMs on this server use qcow2 images. Is that possible even with RAID 1? Or should I assume it's something else, like a hack?
Another VM, which only hosts a WordPress site, is really slow, and when I start it the I/O delay goes up to about 8%.
Could it be a hard disk failure? A controller failure? Something else? Or does the problem run deeper, and should I reinstall everything? What should I do?