Ceph issues [mostly operator errors]

RobFantini
Hello

Yesterday we swapped out an OSD, and most of our test KVMs had disk issues.

From dmesg on two systems:
Code:
[61499.871239] sd 2:0:0:2: [sdb] abort
[61499.871297] sd 2:0:0:2: [sdb] abort
[63670.824251] sd 2:0:0:2: [sdb] abort
[63840.080115] INFO: task jbd2/sda1-8:133 blocked for more than 120 seconds.
[63840.080821]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.081199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.081861] jbd2/sda1-8     D ffff880139834668     0   133      2 0x00000000
[63840.081867]  ffff880139834210 0000000000000046 0000000000012f40 ffff880037267fd8
[63840.081869]  0000000000012f40 ffff880139834210 ffff88013fd937f0 ffff88013ffca8e0
[63840.081872]  0000000000000002 ffffffff811d9e70 ffff880037267c80 ffff880019fa32e0
[63840.081875] Call Trace:
[63840.081883]  [<ffffffff811d9e70>] ? generic_block_bmap+0x50/0x50
[63840.081888]  [<ffffffff81517959>] ? io_schedule+0x99/0x120
[63840.081891]  [<ffffffff811d9e7a>] ? sleep_on_buffer+0xa/0x10
[63840.081893]  [<ffffffff81517cdc>] ? __wait_on_bit+0x5c/0x90
[63840.081896]  [<ffffffff811d9e70>] ? generic_block_bmap+0x50/0x50
[63840.081912]  [<ffffffff81517d87>] ? out_of_line_wait_on_bit+0x77/0x90
[63840.081917]  [<ffffffff810a95f0>] ? autoremove_wake_function+0x30/0x30
[63840.081932]  [<ffffffffa0154be1>] ? jbd2_journal_commit_transaction+0xe91/0x1a30 [jbd2]
[63840.081938]  [<ffffffff810a4323>] ? pick_next_task_fair+0x3e3/0x820
[63840.081943]  [<ffffffffa0158d92>] ? kjournald2+0xb2/0x240 [jbd2]
[63840.081946]  [<ffffffff810a95c0>] ? prepare_to_wait_event+0xf0/0xf0
[63840.081950]  [<ffffffffa0158ce0>] ? commit_timeout+0x10/0x10 [jbd2]
[63840.081955]  [<ffffffff810894fd>] ? kthread+0xbd/0xe0
[63840.081958]  [<ffffffff81089440>] ? kthread_create_on_node+0x180/0x180
[63840.081961]  [<ffffffff8151ad98>] ? ret_from_fork+0x58/0x90
[63840.081964]  [<ffffffff81089440>] ? kthread_create_on_node+0x180/0x180
..
[63840.081983] INFO: task rs:main Q:Reg:629 blocked for more than 120 seconds.
[63840.082425]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.082822] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.083507] rs:main Q:Reg   D ffff880139f7d848     0   629      1 0x00000000
[63840.083511]  ffff880139f7d3f0 0000000000000086 0000000000012f40 ffff88013a30bfd8
[63840.083514]  0000000000012f40 ffff880139f7d3f0 ffff88013fc937f0 ffff88013ffb0d08
[63840.083516]  0000000000000002 ffffffffa01516c0 ffff88013a30bb30 ffff88010eb59d28
[63840.083518] Call Trace:
..
[63840.083653] INFO: task kworker/u8:2:6335 blocked for more than 120 seconds.
[63840.084130]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.084530] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.085276] kworker/u8:2    D ffff880139efe528     0  6335      2 0x00000000
[63840.085283] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[63840.085286]  ffff880139efe0d0 0000000000000046 0000000000012f40 ffff8800ba3f7fd8
[63840.085288]  0000000000012f40 ffff880139efe0d0 ffff88013fd937f0 ffff88013ffa7f60
[63840.085291]  0000000000000002 ffffffffa01516c0 ffff8800ba3f7780 ffff88013b381d90
..
[63840.085447] INFO: task kworker/u8:0:14095 blocked for more than 120 seconds.
[63840.085921]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.086303] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.086836] kworker/u8:0    D ffff880036c56f38     0 14095      2 0x00000000
[63840.086849] Workqueue: scsi_tmf_2 scmd_eh_abort_handler [scsi_mod]
[63840.086850]  ffff880036c56ae0 0000000000000046 0000000000012f40 ffff880065977fd8
[63840.086852]  0000000000012f40 ffff880036c56ae0 ffff880065977dc8 ffff880065977d60
[63840.086854]  ffff880065977dc0 ffff880036c56ae0 0000000000002003 0000000000000100

And from another system:
Code:
[    3.473724] random: nonblocking pool is initialized
[   51.880058] sd 2:0:0:0: [sda] abort
[  240.032080] INFO: task kworker/u2:0:6 blocked for more than 120 seconds.
[  240.032451]       Not tainted 3.16.0-4-amd64 #1
[  240.032728] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.033083] kworker/u2:0    D ffff88001e5be4a8     0     6      2 0x00000000
[  240.033107] Workqueue: scsi_tmf_2 scmd_eh_abort_handler [scsi_mod]
[  240.033109]  ffff88001e5be050 0000000000000046 0000000000012f40 ffff88001e5d7fd8
[  240.033112]  0000000000012f40 ffff88001e5be050 ffff88001e5d7dc8 ffff88001e5d7d60
[  240.033115]  ffff88001e5d7dc0 ffff88001e5be050 0000000000002003 0000000000000040
..
[  240.033175] INFO: task kworker/u2:1:33 blocked for more than 120 seconds.
[  240.033497]       Not tainted 3.16.0-4-amd64 #1
[  240.033744] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.034099] kworker/u2:1    D ffff88001a67a6e8     0    33      2 0x00000000
[  240.034107] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[  240.034110]  ffff88001a67a290 0000000000000046 0000000000012f40 ffff88001a6a7fd8
[  240.034112]  0000000000012f40 ffff88001a67a290 ffff88001fc137f0 ffff88001ffae728
[  240.034114]  0000000000000002 ffffffff811d9e70 ffff88001a6a7670 0000000000000000
[  240.034304] INFO: task jbd2/sda1-8:100 blocked for more than 120 seconds.
[  240.034638]       Not tainted 3.16.0-4-amd64 #1
[  240.034883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.035242] jbd2/sda1-8     D ffff88001f74aeb8     0   100      2 0x00000000
[  240.035245]  ffff88001f74aa60 0000000000000046 0000000000012f40 ffff88001a737fd8
[  240.035247]  0000000000012f40 ffff88001f74aa60 ffff88001fc137f0 ffff88001ffa0490
[  240.035250]  0000000000000002 ffffffff8113ee30 ffff88001a737bd0 ffff88001a737cb0

pveversion -v
Code:
proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.5-1~bpo80+1

Today I got more of the same when I took a node offline for maintenance. Noout was set first.
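
For reference, the noout flag is set before the maintenance and cleared again afterwards; roughly:
Code:
ceph osd set noout       # before taking the node / OSD down
# ... do the maintenance ...
ceph osd unset noout     # once the OSDs are back up and in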

Here are the settings for two of the VMs:
Code:
balloon: 1024
bootdisk: scsi0
cores: 2
memory: 2048
name: imap2
net0: virtio=62:65:36:65:30:38,bridge=vmbr0,tag=3
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: ceph-kvm:vm-8110-disk-1,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=195cf837-ebaa-49c2-95e9-5ba7a0869cb0
sockets: 1

Code:
boot: c
bootdisk: scsi0
cores: 1
memory: 4096
name: rsnapshot-2017
net0: virtio=32:ED:A6:09:95:B0,bridge=vmbr0,tag=3
numa: 0
onboot: 1
ostype: l26
scsi0: ceph-kvm:vm-150-disk-2,discard=on,size=4G
scsi2: ceph-kvm:vm-150-disk-3,discard=on,size=500G
scsihw: virtio-scsi-pci
smbios1: uuid=9035f5f9-60e8-42fa-b6ff-5ab38a160365
sockets: 4
 
And fstab inside the KVMs is like this on both:
Code:
UUID=xxxx     /  ext4 rw,relatime,nobarrier,data=ordered,errors=remount-ro 0 1

Maybe my KVM config and fstab settings are not ideal?
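
If one suspect is the nobarrier option, the same line with write barriers left at the ext4 default would simply drop it - just a variant to compare, not something I have verified helps:
Code:
UUID=xxxx     /  ext4 rw,relatime,data=ordered,errors=remount-ro 0 1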
 
And there is some data loss. I use Dovecot; we run a cron job every 10 minutes to back up IMAP mail (the doveadm backup command that shows up in the cron error below).


After every system hang the system needs to be rebooted, and due to my Dovecot or VM settings we end up with errors like this:
Code:
Date: Tue, 14 Mar 2017 12:10:43
..
Subject: Cron <root@imap> doveadm backup -f -A remote:10.192.54.18 | grep -v Warning

dsync-remote(user1): Error: Mailbox INBOX sync: mailbox_delete failed: INBOX can't be deleted.
dsync-remote(user2): Error: Mailbox INBOX sync: mailbox_delete failed: INBOX can't be deleted.
dsync-remote(user4): Error: Mailbox INBOX sync: mailbox_delete failed: INBOX can't be deleted.
 
Using write-through cache has helped; however, we still have aborts and hangs.
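
For the record, the cache mode is set per disk in the VM config; something like this should do it (VM 8110 as the example, and I am assuming that re-specifying the existing volume with qm set only changes the options):
Code:
qm set 8110 --scsi0 ceph-kvm:vm-8110-disk-1,cache=writethrough,discard=on,size=50G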

Today I had to turn off 3 PVE/Ceph nodes, one at a time, each shut down for about 2 minutes [in order to reroute the power cord to a new UPS].

I waited for ceph -s to show a healthy cluster before starting the next node.
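
Instead of watching ceph -s manually, a small wait loop could do the same (a rough sketch, untested):
Code:
# wait until the cluster reports HEALTH_OK before starting the next node
until ceph -s | grep -q HEALTH_OK; do
    sleep 10
done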

The 3 test KVMs all had the same issues as before:
Code:
[44658.916025] sd 2:0:0:0: [sda] abort

# dmesg|grep INFO
[44760.048079] INFO: task kworker/u4:0:6 blocked for more than 120 seconds.
[44760.048554] INFO: task jbd2/sda1-8:111 blocked for more than 120 seconds.
[44760.048948] INFO: task doveadm:14057 blocked for more than 120 seconds.
[44880.048030] INFO: task kworker/u4:0:6 blocked for more than 120 seconds.
[44880.048580] INFO: task jbd2/sda1-8:111 blocked for more than 120 seconds.
[44880.048978] INFO: task rs:main Q:Reg:503 blocked for more than 120 seconds.
[44880.049929] INFO: task sssd_be:593 blocked for more than 120 seconds.
[44880.050862] INFO: task zabbix_agentd:29133 blocked for more than 120 seconds.
[44880.051901] INFO: task kworker/u4:2:28135 blocked for more than 120 seconds.
[44880.052876] INFO: task doveadm:14057 blocked for more than 120 seconds.

The Dovecot backups suffered a few of these; to fix it I have to delete some mail folders at the target:
Code:
dsync-local(user1): Warning: Mailbox changes caused a desync. You may want to run dsync again.
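
The cleanup at the target is roughly like this (user and folder names here are only examples, not taken from the logs):
Code:
# on the backup target, delete the desynced folder for the affected user
doveadm mailbox delete -u user1 SomeFolder
# then re-run the backup from the source
doveadm backup -f -A remote:10.192.54.18 | grep -v Warning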

After 10 minutes they all settled down, with this at the end of dmesg on one of the systems:
Code:
[164575.481286] systemd-journald[167]: Received SIGTERM from PID 1 (systemd).
[164575.505747] systemd[1]: Unit systemd-journald.service entered failed state.
[164575.506568] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[164575.506740] systemd[1]: Stopping Journal Service...
[164575.506770] systemd[1]: Starting Journal Service...
[164575.507280] systemd[1]: Started Journal Service.
[164575.618859] systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
[164575.620128] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[164575.620941] systemd[1]: Stopping Journal Service...
[164575.620977] systemd[1]: Starting Journal Service...
[164575.621494] systemd[1]: Started Journal Service.
[164575.626421] systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
[164575.628311] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[164575.630239] systemd[1]: Stopping Journal Service...
[164575.630291] systemd[1]: Starting Journal Service...
[164575.630985] systemd[1]: Started Journal Service.
[164575.641041] systemd-journald[8919]: Received request to flush runtime journal from PID 1


I would of course like to get to the point where restarting a node does not cause these issues.

Any suggestions on what to try to fix this? If more data is needed, let me know.

thanks
Rob
 
Hi,
Are these nodes running ceph-mon + ceph-osd?

Because AFAIK there is still a bug where stopping mon + OSD at the same time produces a short "hang", because the OSD's mark-itself-out message is ignored.
Perhaps stopping these OSDs before the mon will help?
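
With the systemd units that would be something like this (the OSD IDs and the hostname are examples):
Code:
# stop the OSDs on the node first, one by one
systemctl stop ceph-osd@0
systemctl stop ceph-osd@1
systemctl stop ceph-osd@2
# then the monitor
systemctl stop ceph-mon@nodename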

Udo
 
These nodes are OSD-only: 6 nodes with three OSDs each, on SSD, no separate journal, so 18 OSDs. The SSDs are 480 GB Intel.

VMs and mons are on separate nodes [3 mons].
 
No, I did not set noout.

The reboots were fast except for one.

Next time I'll set noout ... the one node that had a delay probably caused a rebuild.
 
I'll change to those settings. Currently we are using "osd_max_backfills = 3", and "osd recovery max active" is not set.

I'll inject those settings tomorrow and redo the node restarts to test.

Thank you again for the help!
 
I was able to inject osd_max_backfills=1, but not the other:
Code:
# ceph tell osd.*  injectargs  -- --osd_recovery_max_active=1
osd.0: osd_recovery_max_active = '1' (unchangeable)
osd.1: osd_recovery_max_active = '1' (unchangeable)
osd.2: osd_recovery_max_active = '1' (unchangeable)
osd.3: osd_recovery_max_active = '1' (unchangeable)
..
osd.17: osd_recovery_max_active = '1' (unchangeable)

Any clues on how to set that?
 
Looks like a bug - http://tracker.ceph.com/issues/18424
The 'unchangeable' part is misleading - the change supposedly did get applied.

Now to find the command that shows the current settings, to verify the change got done.

Looks OK:
Code:
# ceph daemon osd.1 config show|grep recovery_max_active
    "osd_recovery_max_active": "1",

# ceph daemon osd.1 config show|grep max_backfills
    "osd_max_backfills": "1",
 
This morning I restarted 2 nodes; the settings have helped so far. There were no dmesg storage-related issues on the 3 test KVMs.

So I've got another question, regarding the Ceph upgrade from 10.2.5-1~bpo80+1 to 10.2.6-1~bpo80+1.

I assume that just this is needed:

1- set noout

2- upgrade then reboot one node at a time

3- do the next node when 'HEALTH_OK' (a rough per-node sketch is below)
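
Roughly, per node, something like this (only a sketch - noout is set once for the whole cluster, then one node is done at a time):
Code:
ceph osd set noout                     # once, from any admin node
# on each node, one at a time:
apt-get update && apt-get dist-upgrade
reboot
# after the node is back: wait for HEALTH_OK (ceph -s) before the next node
# when all nodes are done:
ceph osd unset noout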

Udo or anyone - how do you do your upgrades within the same Ceph release?
 
Hi,
I have done the update in a much simpler way:
1. update on all nodes.
2. restart the monitors, one after the other (wait for a healthy cluster before restarting the next one).
3. restart the OSDs - also one after the other (a rough sketch is below).

No reboot required.
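
A rough sketch of that sequence with the standard systemd units (the monitor name and OSD IDs are examples):
Code:
# on every node:
apt-get update && apt-get dist-upgrade

# restart the monitors one by one, checking ceph -s for a healthy cluster in between:
systemctl restart ceph-mon@nodename

# then restart the OSDs one by one, again checking ceph -s in between:
systemctl restart ceph-osd@0
systemctl restart ceph-osd@1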

Udo
 
So I ran into something.

I started to do the upgrades.

The 1st node worked; it does not have OSDs, it is a VM & mon system.

The 2nd has issues; it is an OSD system. From my notepad:
Code:
apt-get  dist-upgrade

** screen output displayed normally until it got stuck here:

Setting up python-cephfs (10.2.6-1~bpo80+1) ...
Setting up python-rbd (10.2.6-1~bpo80+1) ...
Setting up ceph-common (10.2.6-1~bpo80+1) ...
Setting system user ceph properties..usermod: no changes
..done
Fixing /var/run/ceph ownership....done
---------------------------------------------------------------------------------

from ps -ef - stuck at 'systemctl start ceph.target':
 
root     15845 15844  1 19:17 pts/0    00:00:02 apt-get dist-upgrade
root     16188     2  0 19:18 ?        00:00:00 [kworker/u16:3]
root     16706     2  0 19:18 ?        00:00:00 [kworker/7:0]
root     18248 15845  0 19:18 pts/1    00:00:00 /usr/bin/dpkg --force-confdef --force-confold --status-fd 25 --configure librados2:am
root     18276 18248  0 19:18 pts/1    00:00:00 /bin/sh /var/lib/dpkg/info/ceph-common.postinst configure 10.2.5-1~bpo80+1
root     18305 18276  0 19:18 pts/1    00:00:00 /bin/systemctl start ceph.target

node shows red in the cluster view.
OSDs are up per ceph -s and the PVE web page


sys4  ~ # pvecm nodes
Cannot initialize CMAP service

So I'll work on this on-site tomorrow.
 
From syslog:
Code:
sys4  ~ # llog
==> /var/log/syslog <==
Mar 17 19:55:29 sys4 pmxcfs[2167]: [dcdb] crit: cpg_initialize failed: 2
Mar 17 19:55:29 sys4 pmxcfs[2167]: [status] crit: cpg_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [quorum] crit: quorum_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [confdb] crit: cmap_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [dcdb] crit: cpg_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [status] crit: cpg_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [quorum] crit: quorum_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [confdb] crit: cmap_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [dcdb] crit: cpg_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [status] crit: cpg_initialize failed: 2

Our Ceph is set to not need certs. Maybe something overrode my setting?
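
Those cmap/cpg errors look like pmxcfs cannot talk to corosync; a quick check would be something like this (my guess at the next step, not taken from the logs):
Code:
systemctl status corosync pve-cluster
pvecm status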
 
Code:
systemctl start corosync
fixed the cluster issue.

The upgrade was still stuck; Ctrl+C does not interrupt it.
I tried killing the processes; that worked, and then I ran 'dpkg --configure -a'.
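
Roughly what I did, using the PIDs from the ps output above (they will differ on every run):
Code:
kill 18305                # the stuck 'systemctl start ceph.target'
kill 18276                # the ceph-common.postinst shell
dpkg --configure -a       # let dpkg finish configuring the packages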

The deb updates started again, then got stuck at the same point. ps -ef showed it stuck here again:
Code:
6139 ?        Ssl    0:02 corosync
 6278 ?        Ssl    0:06 /usr/bin/ceph-osd -f --cluster ceph --id 13 --setuser ceph --setgroup ceph
 6391 ?        Ssl    0:08 /usr/bin/ceph-osd -f --cluster ceph --id 9 --setuser ceph --setgroup ceph
 6504 ?        Ssl    0:05 /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph

There may be a bug.
 
Hi Rob,
oops - I forgot that I had problems on one node too - all nodes with mon+osd together work well (six in my case - two clusters) - only a node with OSDs only had the same issue (but not with pve-cluster, because I ran the Ceph upgrade before the new PVE packages were released).

The following helped me:
Code:
# I stopped the ceph-osd first - but not sure if that was necessary
# then killed the ceph-common.postinst process
dpkg --configure ceph-base:amd64
dpkg --configure ceph-common:amd64
apt dist-upgrade
service ceph-osd@3 start
Between the dpkg --configure calls I killed two more processes, but I don't remember exactly which process made the trouble.

I assume that the "10.2.5-1~bpo80+1" in the postinst makes the trouble, because the version is then 10.2.6...
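
Once everything is configured and the daemons have been restarted, a quick check that the OSDs really run 10.2.6 could be (assuming 'version' via ceph tell behaves like injectargs did above):
Code:
ceph tell osd.* version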

Udo
 
