OSD failed but no email notification

Ting · Nov 20, 2021

Hey,

Yesterday I had a failed OSD. It shows down/out, but I got no email notification. How do I set up an email notification for Ceph OSD failures? Thanks.

mgibbons · Dec 10, 2023

So I came here today to ask exactly the same question!

Failed OSD and no email!

Crazy

Krs

Mark

_gabriel · Dec 10, 2023

PVE version?

CelticWebs · Dec 10, 2023

We had OSDs failing a week or so ago. Unfortunately we hadn't been checking properly, so we completely missed it until everything became write only. I'd be very interested in how to get email notifications for this!

mgibbons · Dec 10, 2023

_gabriel said:
pve version ?

Code:

proxmox-ve: 8.1.0 (running kernel: 6.5.11-6-pve)
pve-manager: 8.1.3 (running version: 8.1.3/b46aac3b42da5d15)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.5: 6.5.11-6
proxmox-kernel-6.5.11-6-pve-signed: 6.5.11-6
proxmox-kernel-6.5.11-4-pve-signed: 6.5.11-4
proxmox-kernel-6.2.16-19-pve: 6.2.16-19
proxmox-kernel-6.2: 6.2.16-19
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph: 18.2.0-pve2
ceph-fuse: 18.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx7
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.0
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.1
libpve-access-control: 8.0.7
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.1.0
libpve-guest-common-perl: 5.0.6
libpve-http-server-perl: 5.0.5
libpve-network-perl: 0.9.5
libpve-rs-perl: 0.8.7
libpve-storage-perl: 8.0.5
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.1.2-1
proxmox-backup-file-restore: 3.1.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.2
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.1.3
pve-cluster: 8.0.5
pve-container: 5.0.8
pve-docs: 8.1.3
pve-edk2-firmware: 4.2023.08-2
pve-firewall: 5.0.3
pve-firmware: 3.9-1
pve-ha-manager: 4.0.3
pve-i18n: 3.1.4
pve-qemu-kvm: 8.1.2-4
pve-xtermjs: 5.3.0-2
qemu-server: 8.0.10
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.0-pve4

sb-jw · Dec 10, 2023

Interesting expectations from a hypervisor. First of all, I don't see it as a core task of a hypervisor to monitor third-party storage. What about your monitoring and metrics in Grafana? Don't you have that? Then please don't be surprised that you have no idea about the status of your systems. Personally, I don't expect monitoring from a hypervisor. I prefer to take care of it myself; then I know that it exists and how it alerts me. Blindly relying on it according to the motto "something will definitely happen somehow and somewhere" is very risky.

mgibbons · Dec 10, 2023

Hmm..

I do understand where you are coming from....

BUT Proxmox is a management wrapper around many components: networking, QEMU and storage.

You set up the notifications system to alert you of system status. It should also turn on the alerts module in ceph as part of that setup.
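For reference, switching that module on by hand is only a few commands. This is a minimal sketch, assuming an SMTP relay the manager nodes can reach. The function name and its three arguments are made up for illustration; the `ceph` subcommands and the `mgr/alerts/*` keys are the ones described in the Ceph manager documentation for the alerts module. It is not something PVE sets up for you.

```shell
# Sketch only: enable the Ceph mgr "alerts" module and point it at an
# SMTP relay. The three arguments are placeholders for your own values.
enable_ceph_alerts() {
    smtp_host=$1
    sender=$2
    destination=$3

    ceph mgr module enable alerts
    ceph config set mgr mgr/alerts/smtp_host "$smtp_host"
    ceph config set mgr mgr/alerts/smtp_sender "$sender"
    ceph config set mgr mgr/alerts/smtp_destination "$destination"

    # The module defaults to SSL on port 465. For a plain relay,
    # uncomment and adjust these two lines:
    # ceph config set mgr mgr/alerts/smtp_ssl false
    # ceph config set mgr mgr/alerts/smtp_port 25
}
```

A call would look like `enable_ceph_alerts smtp.example.com ceph@example.com admin@example.com` (example addresses only). Afterwards `ceph alerts send` should trigger a mail immediately, which is a quick way to check that the relay accepts it.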

So whilst external monitoring is great, at version 8.1, when CEPH is promoted as an integral component, I am "surprised" that this part is not complete.

That is all.

Krs

Mark

sb-jw · Dec 11, 2023

mgibbons said:
You set up the notifications system to alert you of system status. It should also turn on the alerts module in ceph as part of that setup.

So whilst external monitoring is great, at version 8.1, when CEPH is promoted as an integral component, I am "surprised" that this part is not complete.

You can interpret it however you want, I still don't see it as part of it.

Take a look at what is involved in PVE. If you argue like that, then you also expect PVE to read the status of the hardware via IPMI and report a defective power supply to you. Then you also want your BGP sessions to be monitored in the SDN, your Synology NAS, which is connected via NFS, or your SAS SAN because it is connected via iSCSI.

Just because Proxmox offers many options or promotes a feature such as CEPH does not mean to me that Proxmox has to offer extensive metrics collection and monitoring for this.

It is still clear to me that monitoring is the job of every admin. I don't want to receive reports from different sources via different channels in a landscape - something like that should be centralized, for example via Checkmk (which, by the way, has ready-made checks for Proxmox and CEPH). There I can define my threshold values, when I want to be informed or I can determine who receives a push message via SMS and for what purpose.
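To show how little is involved in such a check, here is a rough shell sketch of the idea. It is not how Checkmk does it. It assumes the plain-text output of `ceph osd tree`, where the status column comes directly after the OSD name; verify that on your own Ceph release. The function names `down_osds` and `check_osds` are made up for this example.

```shell
#!/bin/sh
# Sketch only: list OSDs that `ceph osd tree` reports as down.
# Reads the plain-text tree on stdin and prints one OSD name per line.
# Assumes the status field directly follows the osd.N name.
down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Hypothetical wrapper: prints a one-line summary and returns 1 if any
# OSD is down, so cron or a monitoring agent can act on the result.
check_osds() {
    down=$(ceph osd tree | down_osds | tr '\n' ' ')
    if [ -n "$down" ]; then
        echo "OSDs down: $down"
        return 1
    fi
    echo "all OSDs up"
}
```

Run from cron, the output of `check_osds` could be piped to whatever mail or push tool is already in place. A proper monitoring system adds what this sketch lacks: thresholds, escalation and a single place for all alerts.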

And if you run a PVE node, you can also set up Checkmk in a Docker container - it's absolutely not rocket science ;-)

But anyway, I see it exactly as described and will never understand it any other way. I am responsible for the environment and therefore I care about this responsibility and transparency of the current state.

Just my 2 cents.
