Good morning,
First I have to say that we have been using Proxmox and Ceph for some time now and have always appreciated the
stability, the great support and the community - thank you all for this!
I'm in the process of setting up a new cluster with Proxmox 5.3/Ceph Luminous and migrating our 'old' production
nodes vm1, vm2 and vm3 to it. vm4 is a new server; vm5/vm6 are interim nodes for the migration and will
be removed later, after the VMs are migrated and vm1-vm3 are reinstalled.
The problem: on the new cluster, 'pveceph status' sometimes shows "got timeout" on vm5/vm6, and sometimes on vm4 as well. When this happens seems random, e.g. one moment "pveceph status" works, a second later it doesn't.
'ceph -s' also sometimes 'thinks' for 2-4 s before showing its output, but it never times out. The cluster has quorum
and is healthy.
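As a rough way to narrow down where the delay comes from (just a sketch, using the mon addresses from ceph.conf), the commands can be timed and each monitor can be queried directly via -m:

# time the status commands and ask each monitor directly,
# to see whether one particular mon answers slowly
for mon in 192.168.40.14 192.168.40.15 192.168.40.16; do
    echo "== mon $mon =="
    time ceph -s -m "$mon:6789" >/dev/null
done
# compare with the Proxmox wrapper on the same node
time pveceph status >/dev/null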
In the GUI, the Ceph -> OSD tab is only shown for vm4, never for vm5/vm6 ("got timeout", 500).
It makes no difference on which node I use the web GUI.
According to the Chromium dev tools, XHR calls like this one:
vm5.lan.domain.tld:8006/api2/extjs/nodes/vm5/ceph/osd?_dc=1545896996438
time out and return:
{"data":null,"message":"got timeout\n","success":0,"status":500}
I have no idea how to debug this. Two days ago I reinstalled and re-joined vm6, which made it a bit better;
today I'm reinstalling vm5. But if this doesn't work, I may have to reinstall the whole cluster, which would be much more time-intensive. And I'm curious why I cannot find the reason for this behaviour, or how to debug what the pveceph vs. ceph tools are sometimes waiting for.
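One thing I'd still like to rule out before reinstalling everything is a problem with the monitors themselves (slow responses or clock skew) - a rough sketch:

# ping each monitor through the Ceph protocol and check clock sync
ceph ping mon.vm4
ceph ping mon.vm5
ceph ping mon.vm6
ceph time-sync-status
# ask a mon directly via its admin socket (run on the node hosting it)
ceph daemon mon.vm5 mon_status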
The logs show nothing that I can correlate to this; see some more info attached:
journalctl -u "ceph*" -u "coro*" -u "pve*" --since "-1d" -> 20181227_log.txt
pveceph status -> 20181227_pveceph-status.txt
crushmap.txt -> the rbd and cephfs pools are on replicated_ssd; another hdd pool will be created.
Thank you very much in advance for any ideas!
Falko
:~# ceph -s
  cluster:
    id:     97ec297a-63e2-4d6a-89af-2e5e9ee2458c
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum vm4,vm5,vm6
    mgr: vm4(active), standbys: vm5, vm6
    mds: cephfs-1/1/1 up {0=vm4=up:active}, 1 up:standby
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   3 pools, 448 pgs
    objects: 1.34M objects, 1.27TiB
    usage:   2.55TiB used, 21.2TiB / 23.8TiB avail
    pgs:     448 active+clean

  io:
    client: 3.96KiB/s wr, 0op/s rd, 0op/s wr
:~# ceph osd status
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | vm4 | 442G | 510G | 0 | 0 | 0 | 0 | exists,up |
| 1 | vm4 | 1044M | 3724G | 0 | 0 | 0 | 0 | exists,up |
| 2 | vm4 | 359G | 594G | 0 | 1638 | 0 | 0 | exists,up |
| 3 | vm4 | 1044M | 3724G | 0 | 0 | 0 | 0 | exists,up |
| 4 | vm4 | 500G | 452G | 0 | 2457 | 0 | 0 | exists,up |
| 5 | vm5 | 475G | 477G | 1 | 5734 | 0 | 0 | exists,up |
| 6 | vm5 | 1044M | 3724G | 0 | 0 | 0 | 0 | exists,up |
| 7 | vm5 | 466G | 487G | 0 | 0 | 0 | 0 | exists,up |
| 8 | vm5 | 1044M | 1861G | 0 | 0 | 0 | 0 | exists,up |
| 9 | vm5 | 363G | 590G | 0 | 0 | 0 | 0 | exists,up |
| 10 | vm6 | 1044M | 1861G | 0 | 0 | 0 | 0 | exists,up |
| 11 | vm6 | 1044M | 3724G | 0 | 0 | 0 | 0 | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       23.78134 root default
 -3       10.07138     host vm4
  1   hdd  3.63860         osd.1      up  1.00000 1.00000
  3   hdd  3.63860         osd.3      up  1.00000 1.00000
  0   ssd  0.93140         osd.0      up  1.00000 1.00000
  2   ssd  0.93140         osd.2      up  1.00000 1.00000
  4   ssd  0.93140         osd.4      up  1.00000 1.00000
 -7        8.25208     host vm5
  6   hdd  3.63860         osd.6      up  1.00000 1.00000
  8   hdd  1.81929         osd.8      up  1.00000 1.00000
  5   ssd  0.93140         osd.5      up  1.00000 1.00000
  7   ssd  0.93140         osd.7      up  1.00000 1.00000
  9   ssd  0.93140         osd.9      up  1.00000 1.00000
-10        5.45789     host vm6
 10   hdd  1.81929         osd.10     up  1.00000 1.00000
 11   hdd  3.63860         osd.11     up  1.00000 1.00000
:~# ceph df
GLOBAL:
    SIZE     AVAIL    RAW USED  %RAW USED
    23.8TiB  21.2TiB  2.55TiB   10.73
POOLS:
    NAME             ID  USED     %USED  MAX AVAIL  OBJECTS
    rbd              1   301GiB   19.87  1.19TiB    77363
    cephfs_data      3   1000GiB  45.12  1.19TiB    1151998
    cephfs_metadata  4   223MiB   0.02   1.19TiB    110891
:~# ceph osd pool stats
pool rbd id 1
client io 4.29KiB/s wr, 0op/s rd, 0op/s wr
pool cephfs_data id 3
nothing is going on
pool cephfs_metadata id 4
nothing is going on
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
bluestore block db size = 5368709120
bluestore block wal size = 5368709120
cluster network = 192.168.200.0/24
fsid = 97ec297a-63e2-4d6a-89af-2e5e9ee2458c
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
public network = 192.168.40.0/24
[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
[mds.vm5]
host = vm5
mds standby for name = pve
[mds.vm4]
host = vm4
mds standby for name = pve
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
[mon.vm5]
host = vm5
mon addr = 192.168.40.15:6789
[mon.vm4]
host = vm4
mon addr = 192.168.40.14:6789
[mon.vm6]
host = vm6
mon addr = 192.168.40.16:6789
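Since the monitors listen on the public network (192.168.40.0/24), a simple reachability check of the mon ports from each node could also help - sketch using bash's built-in /dev/tcp, so nothing extra needs to be installed:

# check from each node that all three mon ports answer on the public network
for ip in 192.168.40.14 192.168.40.15 192.168.40.16; do
    if timeout 2 bash -c "</dev/tcp/$ip/6789"; then
        echo "mon $ip:6789 reachable"
    else
        echo "mon $ip:6789 NOT reachable"
    fi
done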
:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-9-pve)
pve-manager: 5.3-6 (running version: 5.3-6/37b3c8df)
pve-kernel-4.15: 5.2-12
pve-kernel-4.15.18-9-pve: 4.15.18-30
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-43
libpve-guest-common-perl: 2.0-18
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-34
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.0.2+pve1-5
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-31
pve-container: 2.0-31
pve-docs: 5.3-1
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-16
pve-firmware: 2.0-6
pve-ha-manager: 2.0-5
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-1
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 1.0-5
qemu-server: 5.0-43
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
:~# apt-show-versions |grep ceph
ceph:amd64/stretch 12.2.10-pve1 uptodate
ceph-base:amd64/stretch 12.2.10-pve1 uptodate
ceph-common:amd64/stretch 12.2.10-pve1 uptodate
ceph-fuse:amd64/stretch 12.2.10-pve1 uptodate
ceph-mds:amd64/stretch 12.2.10-pve1 uptodate
ceph-mgr:amd64/stretch 12.2.10-pve1 uptodate
ceph-mon:amd64/stretch 12.2.10-pve1 uptodate
ceph-osd:amd64/stretch 12.2.10-pve1 uptodate
libcephfs1:amd64/stretch 10.2.11-2 uptodate
libcephfs2:amd64/stretch 12.2.10-pve1 uptodate
python-cephfs:amd64/stretch 12.2.10-pve1 uptodate