[SOLVED] ceph issue

RobFantini

After the upgrade and system restarts we have Ceph issues.


At the PVE web page:
1- VMs show no disk under Hardware.

2- Ceph status shows no mons.

During the upgrade and restarts I had noout set. I checked each system with ceph -w, one at a time, until it was back to normal before upgrading the next system.
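
For reference, the per-node sequence was roughly the following (a sketch of the standard Ceph flag/health commands, not a paste from the actual session):

Code:
# before taking the first node down: prevent down OSDs from being marked out during the reboots
ceph osd set noout

# after each node is upgraded and rebooted: watch until health is back to normal
ceph -w

# once the last node is done and all PGs are active+clean: clear the flag
ceph osd unset noout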

Zabbix just reported a 'PROBLEM: Disk I/O is overloaded on' alert for one system, which led me to check for issues.

I'll dig into this for more info and try to solve it.

Code:
# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 10.2.5-1~bpo80+1
 
More info:
Code:
# ceph -w
  cluster 63efaa45-7507-428f-9443-82a0a546b70d
  health HEALTH_WARN
  noout,sortbitwise,require_jewel_osds flag(s) set
  monmap e3: 3 mons at {0=10.2.2.21:6789/0,1=10.2.2.10:6789/0,2=10.2.2.67:6789/0}
  election epoch 28, quorum 0,1,2 1,0,2
  osdmap e66: 6 osds: 6 up, 6 in
  flags noout,sortbitwise,require_jewel_osds
  pgmap v196807: 192 pgs, 3 pools, 249 GB data, 63961 objects
  499 GB used, 2152 GB / 2651 GB avail
  192 active+clean
  client io 21814 B/s wr, 0 op/s rd, 2 op/s wr

2017-01-11 18:19:50.966762 mon.0 [INF] pgmap v196806: 192 pgs: 192 active+clean; 249 GB data, 499 GB used, 2152 GB / 2651 GB avail; 110 kB/s wr, 18 op/s
2017-01-11 18:19:51.999530 mon.0 [INF] pgmap v196807: 192 pgs: 192 active+clean; 249 GB data, 499 GB used, 2152 GB / 2651 GB avail; 21814 B/s wr, 2 op/s
 
Code:
# ceph
ceph> health
HEALTH_OK

ceph> status
  cluster 63efaa45-7507-428f-9443-82a0a546b70d
  health HEALTH_OK
  monmap e3: 3 mons at {0=10.2.2.21:6789/0,1=10.2.2.10:6789/0,2=10.2.2.67:6789/0}
  election epoch 28, quorum 0,1,2 1,0,2
  osdmap e67: 6 osds: 6 up, 6 in
  flags sortbitwise,require_jewel_osds
  pgmap v197215: 192 pgs, 3 pools, 249 GB data, 63961 objects
  499 GB used, 2152 GB / 2651 GB avail
  192 active+clean
  client io 7847 B/s wr, 0 op/s rd, 3 op/s wr

ceph> quorum_status
{"election_epoch":28,"quorum":[0,1,2],"quorum_names":["1","0","2"],"quorum_leader_name":"1","monmap":{"epoch":3,"fsid":"63efaa45-7507-428f-9443-82a0a546b70d","modified":"2017-01-08 08:47:17.540506","created":"2017-01-08 08:47:06.554026","mons":[{"rank":0,"name":"1","addr":"10.2.2.10:6789\/0"},{"rank":1,"name":"0","addr":"10.2.2.21:6789\/0"},{"rank":2,"name":"2","addr":"10.2.2.67:6789\/0"}]}}

ceph> mon_status
{"name":"0","rank":1,"state":"peon","election_epoch":28,"quorum":[0,1,2],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":3,"fsid":"63efaa45-7507-428f-9443-82a0a546b70d","modified":"2017-01-08 08:47:17.540506","created":"2017-01-08 08:47:06.554026","mons":[{"rank":0,"name":"1","addr":"10.2.2.10:6789\/0"},{"rank":1,"name":"0","addr":"10.2.2.21:6789\/0"},{"rank":2,"name":"2","addr":"10.2.2.67:6789\/0"}]}}
 
Good news - I was able to do a backup of one of the VMs that showed no disk under Hardware.

Code:
INFO: starting new backup job: vzdump 102 --storage bkup-longterm --node s020 --mode snapshot --remove 0 --compress lzo
INFO: Starting Backup of VM 102 (qemu)
INFO: status = running
INFO: update VM 102: -lock backup
INFO: VM Name: cups
INFO: include disk 'virtio0' 'ceph-kvm:vm-102-disk-1'
INFO: backup mode: snapshot
INFO: bandwidth limit: 500000 KB/s
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/bkup-longterm/dump/vzdump-qemu-102-2017_01_11-18_41_01.vma.lzo'
INFO: started backup task 'a0f462ef-e1dd-4ffa-abb5-2ed603c3df50'
INFO: status: 3% (184549376/5368709120), sparse 1% (89010176), duration 3, 61/31 MB/s
..
INFO: status: 100% (5368709120/5368709120), sparse 8% (459739136), duration 96, 74/0 MB/s
INFO: transferred 5368 MB in 96 seconds (55 MB/s)
INFO: archive file size: 2.19GB
INFO: Finished Backup of VM 102 (00:01:39)
INFO: Backup job finished successfully
TASK OK
 
Well, I just woke up and got back to working on this.

Now the status at the PVE web page is all normal.

Mons are there. VM hardware shows disks again.

Before I looked at that, I noticed this from dmesg on one node only:
Code:
[125808.425312] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[126708.759762] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[127981.347641] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[131580.421426] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[132480.758142] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[133381.095139] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[134281.394795] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[135212.676905] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[136771.957673] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)


So I do not know what caused the issue or what seemed to fix it.
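
If those messages keep coming back, here is a sketch of how osd.2 could be sanity-checked (the OSD id and IP come from the dmesg lines above; the log path is the Ceph default):

Code:
# where does osd.2 live, and does it respond?
ceph osd find 2
ceph tell osd.2 version

# on 10.2.2.21: check the daemon and its log
systemctl status ceph-osd@2
tail -n 50 /var/log/ceph/ceph-osd.2.log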
 
I will close this.

I am not a Ceph expert, but I think ceph-mon was making sure everything was in sync after the nodes rebooted before it could post stats?

If so, there should be a notice.

The issues could also have been due to something that someone semi-ignorant about Ceph could do.
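
For what it's worth, a monitor that is still catching up after a reboot goes through probing / synchronizing / electing states before rejoining quorum, and that can be checked on the node itself (a sketch, using the mon names 0/1/2 from this cluster):

Code:
# ask the local monitor for its state via the admin socket (run on the mon node)
ceph daemon mon.0 mon_status

# cluster-wide quorum view
ceph quorum_status --format json-pretty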