Error - 'Module 'devicehealth' has failed:'

Hi,

I have a PVE cluster with 3 nodes and Ceph installed.
After I configured Ceph on every node, I deleted the device_health_metrics pool by mistake and then recreated it. Any idea what the problem might be?
The containers installed on the pool work fine.

The error is the following.

[Screenshot: "Module 'devicehealth' has failed" warning in the Proxmox VE web UI]
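The failure details can be inspected from a node shell with the standard Ceph CLI; something like this should show what the module complained about (<crash-id> is a placeholder taken from the crash list):

Code:
ceph health detail          # shows why the devicehealth module is marked as failed
ceph crash ls               # list recorded manager crashes, if any
ceph crash info <crash-id>  # details of a specific crash from the list above
ceph osd pool ls            # check whether device_health_metrics still exists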

Package version output:

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
I just did the same thing. I discovered that device_health_metrics appears to be created by the manager, so:

1 - Create a new manager; if you already have a second manager, go to step 2.
2 - Delete the first manager (there is no data loss here) and wait for the standby one to become active.
3 - Recreate the initial manager; the pool is back.

I re-deleted the device_health_metrics pool just to confirm, and the problem re-appeared; I solved it the same way (see the command sketch below).
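For reference, on a PVE node the same procedure maps roughly onto these commands (node names are placeholders, check ceph -s between steps):

Code:
# on a second node, if you only have one manager so far
pveceph mgr create
# on the node with the original manager: destroy it (no data loss),
# then watch "ceph -s" until the standby manager becomes active
pveceph mgr destroy <node-name>
ceph -s
# back on the original node: recreate the manager; the pool reappears
pveceph mgr create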
 
Thank you for the answer. I'm going to test it the next time I set up another Ceph cluster.
 
Great, thanks!
 
+1 this solved the issue for me too.

pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)
 
Did any of you find a cause / long-term fix?

I have a 3-node Ceph cluster on a fresh Proxmox 8 install and just hit this error - very scary, lol.

I tried the process of deleting and recreating the manager - so far it hasn't recreated the device_health_metrics pool, but it did recreate a .mgr pool, and that seems to have cleared the issue for me.
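As far as I know that is expected: on Ceph Quincy and later (which Proxmox 8 ships), the manager's internal health pool is named .mgr rather than device_health_metrics. Listing the pools shows which one is present:

Code:
ceph osd pool ls     # look for ".mgr" (Quincy and later) or "device_health_metrics"
ceph health detail   # the devicehealth warning should be gone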
 
I encountered this issue after my active manager crashed. The active manager apparently restarted and continued being active, but I guess it wasn't fully healthy. Stopping the active manager and waiting for the standby to take over resolved this problem. After that, I started the original active manager and the error did not return.
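In other words, a plain manager failover was enough; nothing had to be destroyed. Something along these lines should do it (<node> is a placeholder for the host running the active manager):

Code:
# on the node running the active manager
systemctl stop ceph-mgr@<node>.service
ceph -s        # wait until a standby manager shows as active
systemctl start ceph-mgr@<node>.service
# alternative without touching systemd:
# ceph mgr fail <active-mgr-name>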
 
