We have created a two-node Proxmox v4.4 cluster with a Ceph Hammer pool running on the same nodes.
For about six weeks it worked as expected, but today we had an extended power blackout in our office and both cluster nodes were powered off unexpectedly.
After we powered them back on, the Proxmox cluster was restored, but the Ceph part no longer works.
Neither the Ceph monitor daemons nor the OSD daemons come up. We see the following messages in the logs:
root@proxmox01:~# tail -f -n 50 /var/log/ceph/ceph-mon.0.log:
2017-01-05 10:32:10.626575 7fe70dec9700 0 log_channel(cluster) log [INF] : pgmap v2682162: 1024 pgs: 1024 active+clean; 2499 GB data, 2465 GB used, 3453 GB / 5918 GB avail; 1448 kB/s rd, 2205 kB/s wr, 272 op/s
2017-01-05 10:32:11.642471 7fe70dec9700 0 log_channel(cluster) log [INF] : pgmap v2682163: 1024 pgs: 1024 active+clean; 2499 GB data, 2465 GB used, 3453 GB / 5918 GB avail; 3975 kB/s rd, 9997 kB/s wr, 890 op/s
2017-01-05 14:18:49.602741 7fd862f57880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 6021
2017-01-05 14:18:49.773721 7fd862f57880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 14:36:59.388044 7f9134298880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 12133
2017-01-05 14:36:59.519656 7f9134298880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 14:45:08.909565 7f68f55fe880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 15585
2017-01-05 14:45:09.047079 7f68f55fe880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 14:52:25.379881 7fb3f8d41880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 17567
2017-01-05 14:52:25.509122 7fb3f8d41880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 15:10:59.353983 7efd917e8880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 6021
2017-01-05 15:10:59.519201 7efd917e8880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 16:05:20.255160 7f5c785e4880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 20841
2017-01-05 16:05:20.387105 7f5c785e4880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 16:46:01.535320 7f09b4e74880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 39042
2017-01-05 16:46:01.669282 7f09b4e74880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
2017-01-05 17:15:34.557752 7f0a42125880 0 ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90), process ceph-mon, pid 6016
2017-01-05 17:15:34.735091 7f0a42125880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
When we try to bring the monitors up manually, we get the following output.
On the first node:
root@proxmox01:~# /usr/bin/ceph-mon -i 0 --pid-file /var/run/ceph/mon.0.pid -c /etc/ceph/ceph.conf --debug_mon 20 --debug_ms 20 --cluster ceph -f
Corruption: checksum mismatch
2017-01-05 22:25:04.114308 7f57557d3880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-0': (22) Invalid argument
On the second node:
root@proxmox02:~# /usr/bin/ceph-mon -i 1 --pid-file /var/run/ceph/mon.1.pid -c /etc/ceph/ceph.conf --debug_mon 20 --debug_ms 20 --cluster ceph -f
Corruption: 16 missing files; e.g.: /var/lib/ceph/mon/ceph-1/store.db/607993.ldb
2017-01-05 22:28:32.279389 7f25df090880 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-1': (22) Invalid argument
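If more diagnostics would help, this is what we are planning to run next on both nodes; a minimal sketch, assuming the mon data paths from the error messages above (it only inspects the on-disk state, it does not change anything):

# List what is left of the monitor's leveldb store (path taken from the errors above)
ls -l /var/lib/ceph/mon/ceph-0/store.db/
# Confirm the filesystem holding the mon data directory came back cleanly after the outage
df -h /var/lib/ceph/mon/ceph-0
dmesg | grep -i -E 'error|corrupt' | tail -n 20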
Any help would be appreciated.