Hi everyone,
I'm running into an issue with my Ceph cluster (version 18.2.4 Reef, stable) on `ceph-node1`. The `ceph-mgr` service is throwing an unhandled exception in the `devicehealth` module with a `disk I/O error`. The error (`[ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error`) appears on ceph-node1 only when ceph-node2 is powered on and connected to the cluster. When I tested with Ceph version 19.2.1, the error didn't occur, which suggests it may be specific to 18.2.4.
Here's the catch: I'm planning to deploy a Rook external cluster, and the Ceph image in Rook only supports up to version 18.2.4, so I'm stuck on this version for now. The error shows up in the logs on ceph-node1 shortly after restarting the ceph-mgr service while node2 is active (e.g., `Mar 15 03:18:36 ceph-node1 ceph-mgr[36707]: sqlite3.OperationalError: disk I/O error`).
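For context on what is actually failing: per the traceback below, `open_db()` in `mgr_module.py` is just a `sqlite3.connect()` call over a `file:` URI whose storage is the mgr's database (backed by libcephsqlite in the cluster, not a local file). The same exception class can be reproduced locally with an unreachable database path; this is only a minimal illustration of the failure mode, not the libcephsqlite path itself, and the URI below is made up:

```python
import sqlite3

# mgr_module.py's open_db() does essentially:
#   sqlite3.connect(uri, check_same_thread=False, uri=True)
# When the backing storage can't be opened, sqlite3 surfaces it as
# OperationalError, which is what the devicehealth traceback shows.

# Hypothetical unreachable path, used only to trigger the same exception class:
uri = "file:/nonexistent-dir/devicehealth.db?mode=rw"

try:
    db = sqlite3.connect(uri, check_same_thread=False, uri=True)
except sqlite3.OperationalError as e:
    # In the mgr the message is "disk I/O error"; locally the message
    # differs, but the exception type is the same.
    print(f"sqlite3.OperationalError: {e}")
```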
Here's the relevant info:
Logs from `journalctl -u ceph-mgr@ceph-node1.service`
```
tungpm@ceph-node1:~$ sudo journalctl -u ceph-mgr@ceph-node1.service
Mar 13 18:55:23 ceph-node1 systemd[1]: Started Ceph cluster manager daemon.
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: /lib/python3/dist-packages/scipy/__init__.py:67: UserWarning: NumPy was imported from a Python sub-interpreter but NumPy does not properly support sub-interpreters. This will likely work for >
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: Improvements in the case of bugs are welcome, but is not on the NumPy roadmap, and full support may require significant effort to achieve.
Mar 13 18:55:26 ceph-node1 ceph-mgr[7092]: from numpy import show_config as show_numpy_config
Mar 13 18:55:28 ceph-node1 ceph-mgr[7092]: 2025-03-13T18:55:28.018+0000 7ffafa064640 -1 mgr.server handle_report got status from non-daemon mon.ceph-node1
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'devicehealth' while running on mgr.ceph-node1: disk I/O error
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 devicehealth.serve:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: 2025-03-13T19:10:39.025+0000 7ffaf2855640 -1 Traceback (most recent call last):
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     return func(self, *args, **kwargs)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/devicehealth/module.py", line 355, in _do_serve
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     if self.db_ready() and self.enable_monitoring:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     return self.db is not None
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     self._db = self.open_db()
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: sqlite3.OperationalError: disk I/O error
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: During handling of the above exception, another exception occurred:
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: Traceback (most recent call last):
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/devicehealth/module.py", line 399, in serve
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     self._do_serve()
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     self.open_db();
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:   File "/usr/share/ceph/mgr/mgr_module.py", line 1256, in open_db
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]:     db = sqlite3.connect(uri, check_same_thread=False, uri=True)
Mar 13 19:10:39 ceph-node1 ceph-mgr[7092]: sqlite3.OperationalError: disk I/O error
Mar 13 19:16:41 ceph-node1 systemd[1]: Stopping Ceph cluster manager daemon...
Mar 13 19:16:41 ceph-node1 systemd[1]: ceph-mgr@ceph-node1.service: Deactivated successfully.
Mar 13 19:16:41 ceph-node1 systemd[1]: Stopped Ceph cluster manager daemon.
Mar 13 19:16:41 ceph-node1 systemd[1]: ceph-mgr@ceph-node1.service: Consumed 6.607s CPU time.
```