[SOLVED] ceph issue

RobFantini

After the upgrade and system restarts we have Ceph issues.


At the PVE web page:
1- VMs show no disks under Hardware.

2- Ceph status shows no mons.

During the upgrade and restarts I had noout set. I checked each system with ceph -w, one at a time, until it was back to normal before upgrading the next system.
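For reference, the noout handling was roughly this (generic Ceph CLI commands, not my exact session):
Code:
# before taking nodes down: keep CRUSH from marking OSDs out
# and rebalancing while they are briefly offline
ceph osd set noout

# watch cluster state after each node comes back, until it settles
ceph -w

# once every node is upgraded and the cluster is back to HEALTH_OK
ceph osd unset noout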

Zabbix just reported a 'PROBLEM: Disk I/O is overloaded on' alert for one system, which led me to check for issues.

I'll dig into this for more info and try to solve it.

Code:
# pveversion -v
proxmox-ve: 4.4-78 (running kernel: 4.4.35-2-pve)
pve-manager: 4.4-5 (running version: 4.4-5/c43015a5)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.35-2-pve: 4.4.35-78
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-102
pve-firmware: 1.1-10
libpve-common-perl: 4.0-85
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-71
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-docs: 4.4-1
pve-qemu-kvm: 2.7.1-1
pve-container: 1.0-90
pve-firewall: 2.0-33
pve-ha-manager: 1.0-38
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.6-5
lxcfs: 2.0.5-pve2
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.8-pve13~bpo80
ceph: 10.2.5-1~bpo80+1
 
More info:
Code:
# ceph -w
  cluster 63efaa45-7507-428f-9443-82a0a546b70d
  health HEALTH_WARN
  noout,sortbitwise,require_jewel_osds flag(s) set
  monmap e3: 3 mons at {0=10.2.2.21:6789/0,1=10.2.2.10:6789/0,2=10.2.2.67:6789/0}
  election epoch 28, quorum 0,1,2 1,0,2
  osdmap e66: 6 osds: 6 up, 6 in
  flags noout,sortbitwise,require_jewel_osds
  pgmap v196807: 192 pgs, 3 pools, 249 GB data, 63961 objects
  499 GB used, 2152 GB / 2651 GB avail
  192 active+clean
  client io 21814 B/s wr, 0 op/s rd, 2 op/s wr

2017-01-11 18:19:50.966762 mon.0 [INF] pgmap v196806: 192 pgs: 192 active+clean; 249 GB data, 499 GB used, 2152 GB / 2651 GB avail; 110 kB/s wr, 18 op/s
2017-01-11 18:19:51.999530 mon.0 [INF] pgmap v196807: 192 pgs: 192 active+clean; 249 GB data, 499 GB used, 2152 GB / 2651 GB avail; 21814 B/s wr, 2 op/s
 
Code:
# ceph
ceph> health
HEALTH_OK

ceph> status
  cluster 63efaa45-7507-428f-9443-82a0a546b70d
  health HEALTH_OK
  monmap e3: 3 mons at {0=10.2.2.21:6789/0,1=10.2.2.10:6789/0,2=10.2.2.67:6789/0}
  election epoch 28, quorum 0,1,2 1,0,2
  osdmap e67: 6 osds: 6 up, 6 in
  flags sortbitwise,require_jewel_osds
  pgmap v197215: 192 pgs, 3 pools, 249 GB data, 63961 objects
  499 GB used, 2152 GB / 2651 GB avail
  192 active+clean
  client io 7847 B/s wr, 0 op/s rd, 3 op/s wr

ceph> quorum_status
{"election_epoch":28,"quorum":[0,1,2],"quorum_names":["1","0","2"],"quorum_leader_name":"1","monmap":{"epoch":3,"fsid":"63efaa45-7507-428f-9443-82a0a546b70d","modified":"2017-01-08 08:47:17.540506","created":"2017-01-08 08:47:06.554026","mons":[{"rank":0,"name":"1","addr":"10.2.2.10:6789\/0"},{"rank":1,"name":"0","addr":"10.2.2.21:6789\/0"},{"rank":2,"name":"2","addr":"10.2.2.67:6789\/0"}]}}

ceph> mon_status
{"name":"0","rank":1,"state":"peon","election_epoch":28,"quorum":[0,1,2],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":3,"fsid":"63efaa45-7507-428f-9443-82a0a546b70d","modified":"2017-01-08 08:47:17.540506","created":"2017-01-08 08:47:06.554026","mons":[{"rank":0,"name":"1","addr":"10.2.2.10:6789\/0"},{"rank":1,"name":"0","addr":"10.2.2.21:6789\/0"},{"rank":2,"name":"2","addr":"10.2.2.67:6789\/0"}]}}
 
Good news - I was able to back up one of the VMs that showed no disk under Hardware.

Code:
INFO: starting new backup job: vzdump 102 --storage bkup-longterm --node s020 --mode snapshot --remove 0 --compress lzo
INFO: Starting Backup of VM 102 (qemu)
INFO: status = running
INFO: update VM 102: -lock backup
INFO: VM Name: cups
INFO: include disk 'virtio0' 'ceph-kvm:vm-102-disk-1'
INFO: backup mode: snapshot
INFO: bandwidth limit: 500000 KB/s
INFO: ionice priority: 7
INFO: creating archive '/mnt/pve/bkup-longterm/dump/vzdump-qemu-102-2017_01_11-18_41_01.vma.lzo'
INFO: started backup task 'a0f462ef-e1dd-4ffa-abb5-2ed603c3df50'
INFO: status: 3% (184549376/5368709120), sparse 1% (89010176), duration 3, 61/31 MB/s
..
INFO: status: 100% (5368709120/5368709120), sparse 8% (459739136), duration 96, 74/0 MB/s
INFO: transferred 5368 MB in 96 seconds (55 MB/s)
INFO: archive file size: 2.19GB
INFO: Finished Backup of VM 102 (00:01:39)
INFO: Backup job finished successfully
TASK OK
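So the image was clearly still there on the Ceph side even while the GUI showed no disk. A direct check of the backing RBD image would have confirmed that too, something like the following (pool name 'ceph-kvm' is assumed from the storage id and may not match the actual pool):
Code:
# list the images in the pool backing the PVE storage
rbd ls -l ceph-kvm

# show size, features and parent of the VM's disk image
rbd info ceph-kvm/vm-102-disk-1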
 
Well, I just woke up and got back to working on this.

Now the status at the PVE web page is all normal.

Mons are there. VM hardware shows disks again.

Before I looked at that, I noticed this from dmesg on one node only:
Code:
[125808.425312] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[126708.759762] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[127981.347641] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[131580.421426] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[132480.758142] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[133381.095139] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[134281.394795] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[135212.676905] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)
[136771.957673] libceph: osd2 10.2.2.21:6804 socket closed (con state OPEN)


So I do not know what caused the issue or what fixed it.
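If the osd2 socket-closed messages come back, a reasonable next step (not done here, these are just the generic commands) would be to map osd.2 back to its host and check the daemon:
Code:
# find which host and device path osd.2 lives on
ceph osd find 2

# on that host, check the OSD daemon and its recent log
systemctl status ceph-osd@2
journalctl -u ceph-osd@2 --since "1 hour ago"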
 
I will close this.

I am not a Ceph expert, but I think ceph-mon was making sure everything was in sync after the nodes rebooted before it could post stats?

If so, there should be a notice.
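For what it's worth, one way to see whether a monitor is still syncing after a reboot is to ask the daemon over its admin socket on the node that runs it; it answers even before it has rejoined quorum (mon id '0' below is just an example from the monmap, path assumes a default install):
Code:
# "state" will show e.g. synchronizing or probing instead of leader/peon
ceph daemon mon.0 mon_status

# same query via the socket path directly
ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok mon_status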

The issues could also have been caused by something someone semi-ignorant about Ceph could do.
 
