[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error

tungpm

New Member
Mar 15, 2025
Hi everyone,

I'm running into an issue with my Ceph cluster (version 18.2.4 Reef, stable) on `ceph-node1`. The `ceph-mgr` service is throwing an unhandled exception in the `devicehealth` module with a `disk I/O error`. The error `[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error` only appears on ceph-node1 when ceph-node2 is powered on and connected to the cluster. When I tested with Ceph 19.2.1 the error didn't occur, which suggests it may be specific to 18.2.4.
Here's the catch: I'm planning to deploy a Rook external cluster, and the Ceph image Rook supports only goes up to 18.2.4, so I'm stuck on this version for now. The error shows up in the logs on ceph-node1 shortly after restarting the ceph-mgr service while node2 is active (e.g., Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: sqlite3.OperationalError: disk I/O error).
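For reference, this is how I trigger and observe it (standard systemd/Ceph CLI commands, nothing custom):

sudo systemctl restart ceph-mgr@ceph-node1
sudo journalctl -u ceph-mgr@ceph-node1.service -f   # the devicehealth traceback shows up within a few seconds
ceph health detail                                  # reports: Module 'devicehealth' has failed: disk I/O error (MGR_MODULE_ERROR)
ceph mgr module ls                                  # devicehealth is listed under the always-on modules, so I can't simply disable it from here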
Here's the relevant info:

Logs from `journalctl -u ceph-mgr@ceph-node1.service` (Mar 13, the first time I hit the error):

tungpm@ceph-node1:~$ sudo journalctl -u ceph-mgr@ceph-node1.service
Mar 13 18:55:23 ceph-node1 systemd[1]: Started Ceph cluster manager daemon.
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: /lib/python3/dist-packages/scipy/__init__.py:67: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-interpreters. This will likely work for >
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: Improvements in the case of bugs are welcome, but is not on the NumPy roadmap, and full support may require significant effort to achieve.
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: from numpy import show_config as show_numpy_config
Mar 13 18:55:28 ceph-node1 ceph-mgr[7092]: 2025-03-13T18:55:28.018+0000 7ffafa064640 -1 mgr.server handle_report got status from non-daemon mon.ceph-node1
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 devicehealth.serve:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 Traceback (most recent call last):
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: return func(self, *args, **kwargs)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 355, in _do_serve
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: if self.db_ready() and self.enable_monitoring:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: return self.db is not None
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: self._db = self.open_db()
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: sqlite3.OperationalError: disk I/O error
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: During handling of the above exception, another exception occurred:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: Traceback (most recent call last):
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 399, in serve
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: self._do_serve()
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: self.open_db();
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: sqlite3.OperationalError: disk I/O error
Mar 13 19:16:41 ceph-node1 systemd[1]: Stopping Ceph cluster manager daemon...
Mar 13 19:16:41 ceph-node1 systemd[1]: ceph-mgr@ceph-node1.service: Deactivated successfully.
Mar 13 19:16:41 ceph-node1 systemd[1]: Stopped Ceph cluster manager daemon.
Mar 13 19:16:41 ceph-node1 systemd[1]: ceph-mgr@ceph-node1.service: Consumed 6.607s CPU time.
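From the traceback, the failure is inside sqlite3.connect() when the mgr opens its internal database. As far as I understand, in recent Ceph releases that database lives in the .mgr pool via libcephsqlite (the ceph VFS), not on a local disk, so I also look at the cluster and pool state around the time node2 comes back (standard commands, nothing exotic):

ceph -s                    # overall health and whether PGs go active+clean when node2 rejoins
ceph osd tree              # up/in state of the OSDs on node1 and node2
ceph osd pool ls detail    # confirm the .mgr pool exists and check its size/replication settings
ceph pg stat               # quick view of any inactive or degraded PGs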
 
Here are the logs I collected on Mar 15, right after restarting ceph-mgr on node1 while node2 was up:
journalctl on node1
Mar 15 03:18:24 ceph-node1 sudo[36701]: tungpm : TTY=pts/2 ; PWD=/home/tungpm ; USER=root ; COMMAND=/usr/bin/systemctl restart ceph-mgr@ceph-node1.
Mar 15 03:18:24 ceph-node1 systemd[1]: Started Ceph cluster manager daemon.
Mar 15 03:18:33 ceph-node1 ceph-mgr[36707]: /lib/python3/dist-packages/scipy/__init__.py:67: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-interpreters. This will likely work for most users but might cause hard to track down issues or subtle bugs. A common user of the rare sub-interpreter feature is wsgi which also allows single-interpreter mode.
Mar 15 03:18:33 ceph-node1 ceph-mgr[36707]: Improvements in the case of bugs are welcome, but is not on the NumPy roadmap, and full support may require significant effort to achieve.
Mar 15 03:18:33 ceph-node1 ceph-mgr[36707]: from numpy import show_config as show_numpy_config
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: 2025-03-15T03:18:36.359+0000 7fa90fee6640 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: 2025-03-15T03:18:36.359+0000 7fa90fee6640 -1 devicehealth.serve:
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: 2025-03-15T03:18:36.359+0000 7fa90fee6640 -1 Traceback (most recent call last):
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: return func(self, *args, **kwargs)
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 355, in _do_serve
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: if self.db_ready() and self.enable_monitoring:
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: return self.db is not None
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: self._db = self.open_db()
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: sqlite3.OperationalError: disk I/O error
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: During handling of the above exception, another exception occurred:
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: Traceback (most recent call last):
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/devicehealth/module.py", line 399, in serve
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: self._do_serve()
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: self.open_db();
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: sqlite3.OperationalError: disk I/O error
Mar 15 03:18:38 ceph-node1 ceph-mon[36661]: 2025-03-15T03:18:38.337+0000 7fb1c6ba8640 -1 log_channel(cluster) log [ERR] : Health check failed: Module 'devicehealth' has failed: disk I/O error (MGR_MODULE_ERROR)
Mar 15 03:20:09 ceph-node1 ceph-crash[6299]: WARNING:ceph-crash:post /var/lib/ceph/crash/2025-03-14T20:04:06.347095Z_c15c693b-f88d-4834-a299-78cd63e1ec7b as client.crash.ceph-node1 failed: 2025-03-15T03:20:09.225+0000 7fdac8c44640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.crash.ceph-node1.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
Mar 15 03:20:09 ceph-node1 ceph-crash[6299]: 2025-03-15T03:20:09.225+0000 7fdac8c44640 -1 AuthRegistry(0x7fdac4063f38) no keyring found at /etc/ceph/ceph.client.crash.ceph-node1.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
Mar 15 03:20:09 ceph-node1 ceph-crash[6299]: 2025-03-15T03:20:09.229+0000 7fdac8c44640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.crash.ceph-node1.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
Mar 15 03:20:09 ceph-node1 ceph-crash[6299]: 2025-03-15T03:20:09.229+0000 7fdac8c44640 -1 AuthRegistry(0x7fdac4068730) no keyring found at /etc/ceph/ceph.client.crash.ceph-node1.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx
Mar 15 03:20:09 ceph-node1 ceph-crash[6299]: 2025-03-15T03:20:09.234+0000 7fdac8c44640 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.crash.ceph-node1.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
Mar 15 03:20:09 ceph-node1 ceph-crash[6299]: [errno 2] RADOS object not found (error connecting to the cluster)
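The ceph-crash warnings at the end look like a separate issue: the client.crash.ceph-node1 keyring simply isn't present under /etc/ceph on this host. If it's relevant, this is roughly how I'd check and create it (adapted from the standard ceph-crash setup; the caps below are my assumption):

ceph auth get client.crash.ceph-node1    # does the key exist in the cluster at all?
ceph auth get-or-create client.crash.ceph-node1 mon 'profile crash' mgr 'profile crash' | sudo tee /etc/ceph/ceph.client.crash.ceph-node1.keyring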

journalctl on node2
Mar 15 03:18:15 ceph-node2 ceph-osd[5756]: 2025-03-15T03:18:15.279+0000 7f1519e1a640 -1 received signal: Hangup from killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror (PID: 8093) UID: 0
Mar 15 03:18:15 ceph-node2 ceph-osd[5756]: 2025-03-15T03:18:15.367+0000 7f1519e1a640 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror|cephfs-mirror (PID: 8097) UID: 0
Mar 15 03:18:15 ceph-node2 systemd[1]: logrotate.service: Deactivated successfully.
Mar 15 03:18:15 ceph-node2 systemd[1]: Finished Rotate log files.
Mar 15 03:18:15 ceph-node2 ceph-osd[5756]: 2025-03-15T03:18:15.559+0000 7f151b61d640 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2025-03-15T02:18:15.563244+0000)
Mar 15 03:18:15 ceph-node2 systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
Mar 15 03:18:15 ceph-node2 ceph-osd[5756]: 2025-03-15T03:18:15.983+0000 7f1512438640 -1 osd.1 53 set_numa_affinity unable to identify public interface '' numa node: (2) No such file or directory
Mar 15 03:18:17 ceph-node2 systemd-networkd[8064]: ens33: Gained IPv6LL
Mar 15 03:18:18 ceph-node2 systemd-resolved[868]: Clock change detected. Flushing caches.
Mar 15 03:18:18 ceph-node2 systemd-timesyncd[740]: Initial synchronization to time server 185.125.190.57:123 (ntp.ubuntu.com).
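Given the `_check_auth_rotating possible clock skew` warning and the clock-change/timesync lines on node2, I also want to rule out time drift between the nodes (standard checks; I haven't confirmed skew is actually a factor):

timedatectl status                    # run on both nodes: NTP sync state and current time
ceph time-sync-status                 # the monitors' view of clock skew across the cluster
ceph health detail | grep -i clock    # any MON_CLOCK_SKEW warnings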

sudo dmesg -T on node1
[Sat Mar 15 02:23:46 2025] systemd[1]: Started OpenBSD Secure Shell server.
[Sat Mar 15 02:23:46 2025] systemd[1]: Started Journal Service.
[Sat Mar 15 03:12:58 2025] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[Sat Mar 15 03:12:58 2025] IPv6: ADDRCONF(NETDEV_CHANGE): ens33: link becomes ready

sudo dmesg -T on node2
[Sat Mar 15 02:04:15 2025] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[Sat Mar 15 02:04:15 2025] IPv6: ADDRCONF(NETDEV_CHANGE): ens33: link becomes ready
[Sat Mar 15 03:18:13 2025] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[Sat Mar 15 03:18:13 2025] IPv6: ADDRCONF(NETDEV_CHANGE): ens33: link becomes ready
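The only temporary workaround I can think of while staying on 18.2.4 is to turn device monitoring off so the devicehealth module stops touching its database (these are the standard devicehealth switches as far as I know, and I'm not sure it's advisable):

ceph device monitoring off                                      # stop the devicehealth module's SMART scraping
ceph config set mgr mgr/devicehealth/enable_monitoring false    # the equivalent module option, I believe

Has anyone seen this on 18.2.4, or can explain what the devicehealth module needs from node2 that makes the mgr on node1 fail this way? Any pointers are appreciated.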