Error - 'Module 'devicehealth' has failed:'

da-alb

Jan 18, 2021
Hi,

I have a PVE cluster with 3 nodes and Ceph installed.
After I configured Ceph on every node, I deleted the device_health_metrics pool by mistake and recreated it afterwards. Any idea what might be causing this?
The containers installed on the pool work fine.

The error is the following.

[Screenshot: Screenshot_2021-01-18 pm-80 - Proxmox Virtual Environment.png, showing the "Module 'devicehealth' has failed" health warning]
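
For reference, the same warning can be checked from the CLI on any node (these are standard Ceph commands, nothing specific to my setup):

Code:
# show health detail, including which mgr module failed and why
ceph health detail
# confirm which mgr daemon is active and whether a standby exists
ceph -s | grep mgr
# list pools to check whether device_health_metrics is present
ceph osd pool ls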

Package version output:

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.78-1-pve: 5.4.78-1
pve-kernel-5.4.73-1-pve: 5.4.73-1
ceph: 15.2.8-pve2
ceph-fuse: 15.2.8-pve2
corosync: 3.0.4-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: not correctly installed
ifupdown2: 3.0.0-1+pve3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.16-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-2
libpve-guest-common-perl: 3.1-3
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.3-1
lxcfs: 4.0.3-pve3
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.6-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-3
pve-cluster: 6.2-1
pve-container: 3.3-2
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-7
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-2
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1
 
Just did the same thing. I discovered that the device_health_metrics pool appears to be created by the manager, so:

1 - Create a new manager (if you already have a second manager, go to step 2)
2 - Delete the first manager (there is no data loss here) and wait for the standby one to become active
3 - Recreate the initial manager; the pool is back

I re-deleted the device_health_metrics pool just to confirm, and the problem re-appeared; I solved it the same way.
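
Roughly the same steps from the CLI, in case it helps (the node name pve1 below is just an example, and the GUI works just as well):

Code:
# run on the node that should host the additional manager
pveceph mgr create
# check which mgr is active and that a standby now exists
ceph -s | grep mgr
# destroy the original manager (no data loss, the mgr keeps no local state)
pveceph mgr destroy pve1        # pve1 = example id of the original manager
# wait for the standby to become active, then recreate the manager
pveceph mgr create              # run again on the original node
# the pool should be back
ceph osd pool ls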
 
Thank you for the answer. I'm going to test it next time I set up another Ceph cluster.
 
Great! Thank you.
 
+1 this solved the issue for me too.

pve-manager/7.3-4/d69b70d4 (running kernel: 5.15.83-1-pve)
 
Did any of you find a cause / long-term fix?

I have a 3-node Ceph cluster on a fresh Proxmox 8 install and just hit this error. Very scary, lol.

I tried the process of deleting and recreating the manager. So far it hasn't recreated the device_health_metrics pool, but it did recreate a .mgr pool, and that seems to have cleared the issue for me.
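
In case it helps anyone else: as far as I know, on the Ceph versions shipped with Proxmox 8 (Quincy and later) the device health data moved from device_health_metrics into the .mgr pool, so seeing .mgr come back instead is expected. You can check with:

Code:
# on Quincy/Reef the mgr store is the '.mgr' pool
ceph osd pool ls detail
# confirm the devicehealth module is no longer reported as failed
ceph health detail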
 
I encountered this issue after my active manager crashed. The manager apparently restarted and continued as the active one, but I guess it wasn't entirely healthy. Stopping the active manager and waiting for the standby to take over resolved the problem. After that, I started the original manager again and the error did not return.
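
For reference, the same failover can also be triggered from the Ceph CLI instead of stopping the service by hand (the mgr id pve1 below is just an example):

Code:
# show the current active mgr and its standbys
ceph mgr stat
# ask the named (active) mgr to step down so a standby takes over
ceph mgr fail pve1
# or restart the mgr service on the node itself
systemctl restart ceph-mgr@pve1.service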