[Fixed][ceph][mgr][snap_schedule] : sqlite3.OperationalError: unable to open database file

francoisd

Renowned Member
Sep 10, 2009
On every reboot or power loss, my Ceph managers crash, and the CephFS snap_schedule module has not been working since the 2023-02-05-18 snapshot.
The ceph-mgr starts anyway, but generates a crash report that puts the Ceph cluster into HEALTH_WARN status.
I have the issue on every node (3-node cluster), probably since the Quincy update.

Does anyone observe the same problem?
Do you have any recommendations or fixes?

snap_schedule unavailable
Code:
root@pve3:~# ceph fs snap-schedule status / | jq
Error ENOENT: Module 'snap_schedule' is not available
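
For reference, the HEALTH_WARN mentioned above can be inspected with the usual crash/health commands (output omitted here; the crash IDs are listed further below):
Code:
root@pve3:~# ceph health detail    # shows the recent mgr module crash warning
root@pve3:~# ceph crash ls-new     # lists only the crashes not yet archived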

The crash info:
JSON:
root@pve3:~# ceph crash info '2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773'
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/snap_schedule/module.py\", line 38, in __init__\n    self.client = SnapSchedClient(self)",
        "  File \"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py\", line 169, in __init__\n    with self.get_schedule_db(fs_name) as conn_mgr:",
        "  File \"/usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py\", line 203, in get_schedule_db\n    db.executescript(dump)",
        "sqlite3.OperationalError: unable to open database file"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773",
    "entity_name": "mgr.pve3",
    "mgr_module": "snap_schedule",
    "mgr_module_caller": "ActivePyModule::load",
    "mgr_python_exception": "OperationalError",
    "os_id": "11",
    "os_name": "Debian GNU/Linux 11 (bullseye)",
    "os_version": "11 (bullseye)",
    "os_version_id": "11",
    "process_name": "ceph-mgr",
    "stack_sig": "2fb4f03ffef7798ee981190306cedadb7d698a3a4cd6dbb59c0400ec3f76b6ba",
    "timestamp": "2023-04-11T06:23:22.105089Z",
    "utsname_hostname": "pve3",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.102-1-pve",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PVE 5.15.102-1 (2023-03-14T13:48Z)"
}

Additional information on the Ceph setup:
Code:
root@pve3:~# ceph mgr module ls
MODULE                           
balancer           on (always on)
crash              on (always on)
devicehealth       on (always on)
orchestrator       on (always on)
pg_autoscaler      on (always on)
progress           on (always on)
rbd_support        on (always on)
status             on (always on)
telemetry          on (always on)
volumes            on (always on)
dashboard          on           
iostat             on           
nfs                on           
prometheus         on           
restful            on           
snap_schedule      on           
stats              on           
alerts             -             
influx             -             
insights           -             
localpool          -             
mirroring          -             
osd_perf_query     -             
osd_support        -             
selftest           -             
telegraf           -             
test_orchestrator  -             
zabbix             -
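
As a side note, the snap_schedule module can be temporarily disabled to stop the recurring crash reports while waiting for a fix (standard mgr module commands, not part of my outputs above):
Code:
root@pve3:~# ceph mgr module disable snap_schedule
root@pve3:~# ceph mgr module enable snap_schedule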

Code:
root@pve3:~# ceph crash ls
ID                                                                ENTITY    NEW 
2023-02-10T15:55:33.246668Z_d7bfe3b0-2647-4583-b257-60cc8bebb820  mgr.pve3       
2023-02-10T21:10:48.333710Z_fab6d271-2708-4bf4-a70b-36918d268a14  mgr.pve1       
2023-02-10T21:11:32.956340Z_d5fb4e86-1dbc-4247-845f-b9777af168c4  osd.2         
2023-02-10T21:11:34.464160Z_0b4effa2-0538-4090-9337-89259d3d78bd  osd.4         
2023-02-16T11:07:38.833449Z_6065ff7b-ef02-4d6f-bbd2-a3beb7ca7f9a  mgr.pve2       
2023-02-19T23:12:31.685803Z_9a53dab2-3817-4ebd-8a0b-f479d277d751  mgr.pve3       
2023-02-19T23:30:00.594498Z_5da1ca3c-8d1b-470f-818f-e8bde0ed01f7  mgr.pve1       
2023-03-05T16:00:29.692042Z_0f6ae1ae-8d74-4a0e-9e41-3172086f8460  mgr.pve2       
2023-03-12T11:46:48.532363Z_2ec74f76-9cbe-438c-8948-82d3732f0aea  mgr.pve3       
2023-03-12T21:32:17.037999Z_6df8796c-ec4c-42f6-b352-b273e345c22e  mgr.pve1       
2023-03-13T09:19:09.578815Z_849121da-6ad9-4036-b3b6-c556263a0f05  mgr.pve2       
2023-03-13T12:59:09.792996Z_b1de9b4f-c8a4-48ae-9dff-d3413e527e43  mgr.pve3       
2023-03-13T13:34:43.233360Z_4ed2be9e-2f07-4853-b3bf-1e0efe69afea  osd.5         
2023-03-18T09:22:35.338683Z_ab963d23-e823-40aa-b0ad-0330b5265ff7  mgr.pve1       
2023-03-24T17:39:43.801037Z_40bdd143-b099-4f15-bfd6-39a0d544ee38  mgr.pve1       
2023-04-07T14:02:36.377029Z_c1510456-3f13-4a5e-ba86-9ea7779817f2  mgr.pve3       
2023-04-07T14:03:37.643954Z_4c946518-c119-480b-87b5-2b0d48f583df  mgr.pve2       
2023-04-07T19:19:58.561573Z_e5057586-899c-4bd3-bf4e-f08d469f2013  mgr.pve1   *   
2023-04-08T17:43:39.891129Z_1770ab87-0303-4d0e-bbbb-6e62143876c0  mgr.pve3   *   
2023-04-08T18:18:12.318248Z_b8563587-a64f-4b97-a4ab-2d55049d1261  mgr.pve2   *   
2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773  mgr.pve3   *
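
Once the underlying problem is fixed (see below), the accumulated crash reports can be archived so the cluster returns to HEALTH_OK:
Code:
root@pve3:~# ceph crash archive '2023-04-11T06:23:22.105089Z_356de37b-2e16-4f44-b050-326ddad84773'
root@pve3:~# ceph crash archive-all    # or archive everything at once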
 
Marking this thread as "Fixed", since probably related problems have been discussed here:
- https://www.spinics.net/lists/ceph-users/msg74696.html
- https://tracker.ceph.com/issues/57851

The proposed fix also fixes the above problem:
https://github.com/ceph/ceph/pull/48449/commits/8d853cc4990dc4dbccdc916115b0b30e0ac9dc19

This fix will probably come in the next ceph update.

The problem seems to be caused by the migration from 16 (Pacific) to 17 (Quincy) if snap_schedule was enabled before the migration.
The SQLite DB storage has been moved into the cephfs_metadata pool, and the mgr can't migrate the old database without this minor patch.

I added the 2 lines from the patch manually to the files (the paths can be guessed from the crash report), rebooted the node (restarting the mgr might have been enough), and then made this mgr the active one. I made the change on one node only, since it is only needed for the DB migration to the new storage. The patch can then be removed if desired. A rough sketch of the steps is shown below.
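
For illustration only, this is roughly what it looks like on one node (the hostname/mgr name pve3 is from my cluster; the actual edit is the two lines from the upstream commit linked above, applied by hand):
Code:
root@pve3:~# nano /usr/share/ceph/mgr/snap_schedule/fs/schedule_client.py   # apply the 2 lines from the commit
root@pve3:~# systemctl restart ceph-mgr@pve3.service                        # instead of a full reboot
root@pve3:~# ceph mgr fail                                                  # fail over the active mgr (repeat until the patched mgr is active)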

This immediately fixed the fs snap-schedule status command:
Code:
root@pve3:~# ceph fs snap-schedule status | jq
{
  "fs": "cephfs",
  "subvol": null,
  "path": "/",
  "rel_path": "/",
  "schedule": "1h",
  "retention": {},
  "start": "2022-07-17T00:00:00",
  "created": "2022-07-17T22:44:20",
  "first": "2023-04-11T21:00:00",
  "last": "2023-04-11T21:00:00",
  "last_pruned": "2023-04-11T21:00:00",
  "created_count": 1,
  "pruned_count": 1,
  "active": true
}

And the automatic CephFS snapshots work again (now with an appended _UTC suffix):

Code:
root@pve3:~# ls -tr -1 /mnt/pve/cephfs/.snap
weekly_2022-07-10_231701
daily_2022-07-10_231701
scheduled-2023-04-11-21_00_00_UTC
scheduled-2023-02-05-18_00_00
scheduled-2023-02-05-17_00_00
scheduled-2023-02-05-16_00_00
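
Note that the pre-fix snapshots (the ones without the _UTC suffix) may no longer match the new naming scheme and thus may not get pruned by retention; if so, they can be removed by hand, since CephFS snapshots are just subdirectories of .snap (path taken from my listing above):
Code:
root@pve3:~# rmdir /mnt/pve/cephfs/.snap/scheduled-2023-02-05-18_00_00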

As you might have noticed, I lost the retention settings, so I had to re-apply them and check:
Code:
root@pve3:~# ceph fs snap-schedule retention add / m 12
ceph fs snap-schedule retention add / w 4
ceph fs snap-schedule retention add / d 7
ceph fs snap-schedule retention add / h 24
Retention added to path /
Retention added to path /
Retention added to path /
Retention added to path /
root@pve3:~# ceph fs snap-schedule status | jq
{
  "fs": "cephfs",
  "subvol": null,
  "path": "/",
  "rel_path": "/",
  "schedule": "1h",
  "retention": {
    "m": 12,
    "w": 4,
    "d": 7,
    "h": 24
  },
  "start": "2022-07-17T00:00:00",
  "created": "2022-07-17T22:44:20",
  "first": "2023-04-11T21:00:00",
  "last": "2023-04-11T21:00:00",
  "last_pruned": "2023-04-11T21:00:00",
  "created_count": 1,
  "pruned_count": 1,
  "active": true
}

The details of my CephFS snapshot setup are on my blog.

I hope this post saves a few hours for people experiencing the same issue.
 
