Ceph issues [mostly operator errors]

RobFantini
Hello

Yesterday we swapped out an OSD, and most of our test KVMs had disk issues.

From dmesg on two systems:
Code:
[61499.871239] sd 2:0:0:2: [sdb] abort
[61499.871297] sd 2:0:0:2: [sdb] abort
[63670.824251] sd 2:0:0:2: [sdb] abort
[63840.080115] INFO: task jbd2/sda1-8:133 blocked for more than 120 seconds.
[63840.080821]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.081199] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.081861] jbd2/sda1-8     D ffff880139834668     0   133      2 0x00000000
[63840.081867]  ffff880139834210 0000000000000046 0000000000012f40 ffff880037267fd8
[63840.081869]  0000000000012f40 ffff880139834210 ffff88013fd937f0 ffff88013ffca8e0
[63840.081872]  0000000000000002 ffffffff811d9e70 ffff880037267c80 ffff880019fa32e0
[63840.081875] Call Trace:
[63840.081883]  [<ffffffff811d9e70>] ? generic_block_bmap+0x50/0x50
[63840.081888]  [<ffffffff81517959>] ? io_schedule+0x99/0x120
[63840.081891]  [<ffffffff811d9e7a>] ? sleep_on_buffer+0xa/0x10
[63840.081893]  [<ffffffff81517cdc>] ? __wait_on_bit+0x5c/0x90
[63840.081896]  [<ffffffff811d9e70>] ? generic_block_bmap+0x50/0x50
[63840.081912]  [<ffffffff81517d87>] ? out_of_line_wait_on_bit+0x77/0x90
[63840.081917]  [<ffffffff810a95f0>] ? autoremove_wake_function+0x30/0x30
[63840.081932]  [<ffffffffa0154be1>] ? jbd2_journal_commit_transaction+0xe91/0x1a30 [jbd2]
[63840.081938]  [<ffffffff810a4323>] ? pick_next_task_fair+0x3e3/0x820
[63840.081943]  [<ffffffffa0158d92>] ? kjournald2+0xb2/0x240 [jbd2]
[63840.081946]  [<ffffffff810a95c0>] ? prepare_to_wait_event+0xf0/0xf0
[63840.081950]  [<ffffffffa0158ce0>] ? commit_timeout+0x10/0x10 [jbd2]
[63840.081955]  [<ffffffff810894fd>] ? kthread+0xbd/0xe0
[63840.081958]  [<ffffffff81089440>] ? kthread_create_on_node+0x180/0x180
[63840.081961]  [<ffffffff8151ad98>] ? ret_from_fork+0x58/0x90
[63840.081964]  [<ffffffff81089440>] ? kthread_create_on_node+0x180/0x180
..
[63840.081983] INFO: task rs:main Q:Reg:629 blocked for more than 120 seconds.
[63840.082425]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.082822] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.083507] rs:main Q:Reg   D ffff880139f7d848     0   629      1 0x00000000
[63840.083511]  ffff880139f7d3f0 0000000000000086 0000000000012f40 ffff88013a30bfd8
[63840.083514]  0000000000012f40 ffff880139f7d3f0 ffff88013fc937f0 ffff88013ffb0d08
[63840.083516]  0000000000000002 ffffffffa01516c0 ffff88013a30bb30 ffff88010eb59d28
[63840.083518] Call Trace:
..
[63840.083653] INFO: task kworker/u8:2:6335 blocked for more than 120 seconds.
[63840.084130]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.084530] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.085276] kworker/u8:2    D ffff880139efe528     0  6335      2 0x00000000
[63840.085283] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[63840.085286]  ffff880139efe0d0 0000000000000046 0000000000012f40 ffff8800ba3f7fd8
[63840.085288]  0000000000012f40 ffff880139efe0d0 ffff88013fd937f0 ffff88013ffa7f60
[63840.085291]  0000000000000002 ffffffffa01516c0 ffff8800ba3f7780 ffff88013b381d90
..
[63840.085447] INFO: task kworker/u8:0:14095 blocked for more than 120 seconds.
[63840.085921]       Tainted: P           O  3.16.0-4-amd64 #1
[63840.086303] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[63840.086836] kworker/u8:0    D ffff880036c56f38     0 14095      2 0x00000000
[63840.086849] Workqueue: scsi_tmf_2 scmd_eh_abort_handler [scsi_mod]
[63840.086850]  ffff880036c56ae0 0000000000000046 0000000000012f40 ffff880065977fd8
[63840.086852]  0000000000012f40 ffff880036c56ae0 ffff880065977dc8 ffff880065977d60
[63840.086854]  ffff880065977dc0 ffff880036c56ae0 0000000000002003 0000000000000100

And from another system:
Code:
[    3.473724] random: nonblocking pool is initialized
[   51.880058] sd 2:0:0:0: [sda] abort
[  240.032080] INFO: task kworker/u2:0:6 blocked for more than 120 seconds.
[  240.032451]       Not tainted 3.16.0-4-amd64 #1
[  240.032728] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.033083] kworker/u2:0    D ffff88001e5be4a8     0     6      2 0x00000000
[  240.033107] Workqueue: scsi_tmf_2 scmd_eh_abort_handler [scsi_mod]
[  240.033109]  ffff88001e5be050 0000000000000046 0000000000012f40 ffff88001e5d7fd8
[  240.033112]  0000000000012f40 ffff88001e5be050 ffff88001e5d7dc8 ffff88001e5d7d60
[  240.033115]  ffff88001e5d7dc0 ffff88001e5be050 0000000000002003 0000000000000040
..
[  240.033175] INFO: task kworker/u2:1:33 blocked for more than 120 seconds.
[  240.033497]       Not tainted 3.16.0-4-amd64 #1
[  240.033744] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.034099] kworker/u2:1    D ffff88001a67a6e8     0    33      2 0x00000000
[  240.034107] Workqueue: writeback bdi_writeback_workfn (flush-8:0)
[  240.034110]  ffff88001a67a290 0000000000000046 0000000000012f40 ffff88001a6a7fd8
[  240.034112]  0000000000012f40 ffff88001a67a290 ffff88001fc137f0 ffff88001ffae728
[  240.034114]  0000000000000002 ffffffff811d9e70 ffff88001a6a7670 0000000000000000
[  240.034304] INFO: task jbd2/sda1-8:100 blocked for more than 120 seconds.
[  240.034638]       Not tainted 3.16.0-4-amd64 #1
[  240.034883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.035242] jbd2/sda1-8     D ffff88001f74aeb8     0   100      2 0x00000000
[  240.035245]  ffff88001f74aa60 0000000000000046 0000000000012f40 ffff88001a737fd8
[  240.035247]  0000000000012f40 ffff88001f74aa60 ffff88001fc137f0 ffff88001ffa0490
[  240.035250]  0000000000000002 ffffffff8113ee30 ffff88001a737bd0 ffff88001a737cb0

pveversion -v
Code:
proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80
ceph: 10.2.5-1~bpo80+1

Today I got more of the same when I took a node offline for maintenance. Noout was set first.
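
For reference, the noout flag is set before the maintenance and cleared again afterwards; roughly:
Code:
ceph osd set noout       # before taking the node / OSD down
# ... do the maintenance ...
ceph osd unset noout     # once the OSDs are back up and in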

Here are the settings for two of the VMs:
Code:
balloon: 1024
bootdisk: scsi0
cores: 2
memory: 2048
name: imap2
net0: virtio=62:65:36:65:30:38,bridge=vmbr0,tag=3
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: ceph-kvm:vm-8110-disk-1,discard=on,size=50G
scsihw: virtio-scsi-pci
smbios1: uuid=195cf837-ebaa-49c2-95e9-5ba7a0869cb0
sockets: 1

Code:
boot: c
bootdisk: scsi0
cores: 1
memory: 4096
name: rsnapshot-2017
net0: virtio=32:ED:A6:09:95:B0,bridge=vmbr0,tag=3
numa: 0
onboot: 1
ostype: l26
scsi0: ceph-kvm:vm-150-disk-2,discard=on,size=4G
scsi2: ceph-kvm:vm-150-disk-3,discard=on,size=500G
scsihw: virtio-scsi-pci
smbios1: uuid=9035f5f9-60e8-42fa-b6ff-5ab38a160365
sockets: 4
 
And fstab inside the KVMs is like this on both:
Code:
UUID=xxxx     /  ext4 rw,relatime,nobarrier,data=ordered,errors=remount-ro 0 1

Maybe my KVM config and fstab settings are not ideal?
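
If one suspect is the nobarrier option, the same line with write barriers left at the ext4 default would simply drop it - just a variant to compare, not something I have verified helps:
Code:
UUID=xxxx     /  ext4 rw,relatime,data=ordered,errors=remount-ro 0 1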
 
And there is some data loss. I use Dovecot; we run a cron job every 10 minutes to back up IMAP mail (the doveadm backup command that shows up in the cron error below).


After every system hang the system needs to be rebooted, and due to my Dovecot or VM settings we end up with errors like this:
Code:
Date: Tue, 14 Mar 2017 12:10:43
..
Subject: Cron <root@imap> doveadm backup -f -A remote:10.192.54.18 | grep -v Warning

dsync-remote(user1): Error: Mailbox INBOX sync: mailbox_delete failed: INBOX can't be deleted.
dsync-remote(user2): Error: Mailbox INBOX sync: mailbox_delete failed: INBOX can't be deleted.
dsync-remote(user4): Error: Mailbox INBOX sync: mailbox_delete failed: INBOX can't be deleted.
 
Using write-through cache has helped; however, we still have aborts and hangs.
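
For the record, the cache mode is set per disk in the VM config; something like this should do it (VM 8110 as the example, and I am assuming that re-specifying the existing volume with qm set only changes the options):
Code:
qm set 8110 --scsi0 ceph-kvm:vm-8110-disk-1,cache=writethrough,discard=on,size=50G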

Today I had to turn off 3 PVE/Ceph nodes, one at a time, each shut down for about 2 minutes [in order to reroute the power cord to a new UPS].

I waited for ceph -s to show a healthy cluster before starting the next node.
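
Instead of watching ceph -s manually, a small wait loop could do the same (a rough sketch, untested):
Code:
# wait until the cluster reports HEALTH_OK before starting the next node
until ceph -s | grep -q HEALTH_OK; do
    sleep 10
done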

The 3 test KVMs all had the same issues as before:
Code:
[44658.916025] sd 2:0:0:0: [sda] abort

# dmesg|grep INFO
[44760.048079] INFO: task kworker/u4:0:6 blocked for more than 120 seconds.
[44760.048554] INFO: task jbd2/sda1-8:111 blocked for more than 120 seconds.
[44760.048948] INFO: task doveadm:14057 blocked for more than 120 seconds.
[44880.048030] INFO: task kworker/u4:0:6 blocked for more than 120 seconds.
[44880.048580] INFO: task jbd2/sda1-8:111 blocked for more than 120 seconds.
[44880.048978] INFO: task rs:main Q:Reg:503 blocked for more than 120 seconds.
[44880.049929] INFO: task sssd_be:593 blocked for more than 120 seconds.
[44880.050862] INFO: task zabbix_agentd:29133 blocked for more than 120 seconds.
[44880.051901] INFO: task kworker/u4:2:28135 blocked for more than 120 seconds.
[44880.052876] INFO: task doveadm:14057 blocked for more than 120 seconds.

The Dovecot backups suffered a few of these; to fix it I have to delete some mail folders at the target:
Code:
dsync-local(user1): Warning: Mailbox changes caused a desync. You may want to run dsync again.
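
The cleanup at the target is roughly like this (user and folder names here are only examples, not taken from the logs):
Code:
# on the backup target, delete the desynced folder for the affected user
doveadm mailbox delete -u user1 SomeFolder
# then re-run the backup from the source
doveadm backup -f -A remote:10.192.54.18 | grep -v Warning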

After 10 minutes they all settled down, with this at the end of dmesg on one of the systems:
Code:
[164575.481286] systemd-journald[167]: Received SIGTERM from PID 1 (systemd).
[164575.505747] systemd[1]: Unit systemd-journald.service entered failed state.
[164575.506568] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[164575.506740] systemd[1]: Stopping Journal Service...
[164575.506770] systemd[1]: Starting Journal Service...
[164575.507280] systemd[1]: Started Journal Service.
[164575.618859] systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
[164575.620128] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[164575.620941] systemd[1]: Stopping Journal Service...
[164575.620977] systemd[1]: Starting Journal Service...
[164575.621494] systemd[1]: Started Journal Service.
[164575.626421] systemd[1]: Started Trigger Flushing of Journal to Persistent Storage.
[164575.628311] systemd[1]: systemd-journald.service has no holdoff time, scheduling restart.
[164575.630239] systemd[1]: Stopping Journal Service...
[164575.630291] systemd[1]: Starting Journal Service...
[164575.630985] systemd[1]: Started Journal Service.
[164575.641041] systemd-journald[8919]: Received request to flush runtime journal from PID 1


I would of course like to get to the point where restarting a node does not cause these issues.

Any suggestions on what to try to fix this? If more data is needed, let me know.

thanks
Rob
 
Hi,
Are these nodes running ceph-mon + ceph-osd?

Because AFAIK there is still a bug where stopping mon + OSD at the same time produces a short "hang", because the OSD's mark-itself-out message is ignored.
Perhaps stopping these OSDs before the mon will help?
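
With the systemd units that would be something like this (the OSD IDs and the hostname are examples):
Code:
# stop the OSDs on the node first, one by one
systemctl stop ceph-osd@0
systemctl stop ceph-osd@1
systemctl stop ceph-osd@2
# then the monitor
systemctl stop ceph-mon@nodename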

Udo
 
These nodes are OSD-only: 6 nodes with three OSDs each, on SSD, no separate journal, so 18 OSDs. The SSDs are 480 GB Intel.

VMs and mons are on separate nodes [3 mons].
 
No, I did not set noout.

The reboots were fast except for one.

Next time I'll set noout ... the one node that had a delay probably caused a rebuild.
 
I'll change to those settings. Currently we are using "osd_max_backfills = 3", and "osd recovery max active" is not set.

I'll inject those settings tomorrow and redo the node restarts to test.

Thank you again for the help!
 
I was able to inject osd_max_backfills=1, but not the other:
Code:
# ceph tell osd.*  injectargs  -- --osd_recovery_max_active=1
osd.0: osd_recovery_max_active = '1' (unchangeable)
osd.1: osd_recovery_max_active = '1' (unchangeable)
osd.2: osd_recovery_max_active = '1' (unchangeable)
osd.3: osd_recovery_max_active = '1' (unchangeable)
..
osd.17: osd_recovery_max_active = '1' (unchangeable)

Any clues on how to set that?
 
Looks like a bug - http://tracker.ceph.com/issues/18424
The 'unchangeable' part is misleading - the change supposedly did get applied.

Now to find the command that shows the current settings, to verify the change got done.

Looks OK:
Code:
# ceph daemon osd.1 config show|grep recovery_max_active
    "osd_recovery_max_active": "1",

# ceph daemon osd.1 config show|grep max_backfills
    "osd_max_backfills": "1",
 
This morning I restarted 2 nodes; the settings have helped so far. There were no dmesg storage-related issues on the 3 test KVMs.

So I've got another question, regarding the Ceph upgrade from 10.2.5-1~bpo80+1 to 10.2.6-1~bpo80+1.

I assume that just this is needed:

1- set noout

2- upgrade then reboot one node at a time

3- do the next node when 'HEALTH_OK' (a rough per-node sketch is below)
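
Roughly, per node, something like this (only a sketch - noout is set once for the whole cluster, then one node is done at a time):
Code:
ceph osd set noout                     # once, from any admin node
# on each node, one at a time:
apt-get update && apt-get dist-upgrade
reboot
# after the node is back: wait for HEALTH_OK (ceph -s) before the next node
# when all nodes are done:
ceph osd unset noout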

Udo or anyone - how do you do your upgrades within the same Ceph release?
 
Hi,
I have done the update in a much simpler way:
1. update on all nodes.
2. restart the monitors, one after the other (wait for a healthy cluster before restarting the next one).
3. restart the OSDs - also one after the other (a rough sketch is below).

No reboot required.
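
A rough sketch of that sequence with the standard systemd units (the monitor name and OSD IDs are examples):
Code:
# on every node:
apt-get update && apt-get dist-upgrade

# restart the monitors one by one, checking ceph -s for a healthy cluster in between:
systemctl restart ceph-mon@nodename

# then restart the OSDs one by one, again checking ceph -s in between:
systemctl restart ceph-osd@0
systemctl restart ceph-osd@1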

Udo
 
So I ran into something.

I started to do the upgrades.

The 1st node worked; it does not have OSDs, it is a VM & mon system.

The 2nd has issues; it is an OSD system. From my notepad:
Code:
apt-get  dist-upgrade

** screen output displayed normally until it got stuck here:

Setting up python-cephfs (10.2.6-1~bpo80+1) ...
Setting up python-rbd (10.2.6-1~bpo80+1) ...
Setting up ceph-common (10.2.6-1~bpo80+1) ...
Setting system user ceph properties..usermod: no changes
..done
Fixing /var/run/ceph ownership....done
---------------------------------------------------------------------------------

from ps -ef - stuck at 'systemctl start ceph.target':
 
root     15845 15844  1 19:17 pts/0    00:00:02 apt-get dist-upgrade
root     16188     2  0 19:18 ?        00:00:00 [kworker/u16:3]
root     16706     2  0 19:18 ?        00:00:00 [kworker/7:0]
root     18248 15845  0 19:18 pts/1    00:00:00 /usr/bin/dpkg --force-confdef --force-confold --status-fd 25 --configure librados2:am
root     18276 18248  0 19:18 pts/1    00:00:00 /bin/sh /var/lib/dpkg/info/ceph-common.postinst configure 10.2.5-1~bpo80+1
root     18305 18276  0 19:18 pts/1    00:00:00 /bin/systemctl start ceph.target

node shows red in the cluster view.
OSDs are up per ceph -s and the PVE web page


sys4  ~ # pvecm nodes
Cannot initialize CMAP service

So I'll work on this on-site tomorrow.
 
From syslog:
Code:
sys4  ~ # llog
==> /var/log/syslog <==
Mar 17 19:55:29 sys4 pmxcfs[2167]: [dcdb] crit: cpg_initialize failed: 2
Mar 17 19:55:29 sys4 pmxcfs[2167]: [status] crit: cpg_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [quorum] crit: quorum_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [confdb] crit: cmap_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [dcdb] crit: cpg_initialize failed: 2
Mar 17 19:55:35 sys4 pmxcfs[2167]: [status] crit: cpg_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [quorum] crit: quorum_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [confdb] crit: cmap_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [dcdb] crit: cpg_initialize failed: 2
Mar 17 19:55:41 sys4 pmxcfs[2167]: [status] crit: cpg_initialize failed: 2

Our Ceph is set to not need certs. Maybe something overrode my setting?
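
Those cmap/cpg errors look like pmxcfs cannot talk to corosync; a quick check would be something like this (my guess at the next step, not taken from the logs):
Code:
systemctl status corosync pve-cluster
pvecm status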
 
Code:
systemctl start corosync
fixed the cluster issue.

The upgrade was still stuck; Ctrl+C does not interrupt it.
I tried killing the processes; that worked, and then I ran 'dpkg --configure -a'.
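
Roughly what I did, using the PIDs from the ps output above (they will differ on every run):
Code:
kill 18305                # the stuck 'systemctl start ceph.target'
kill 18276                # the ceph-common.postinst shell
dpkg --configure -a       # let dpkg finish configuring the packages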

The deb updates started again, then got stuck at the same point. ps -ef showed it stuck here again:
Code:
6139 ?        Ssl    0:02 corosync
 6278 ?        Ssl    0:06 /usr/bin/ceph-osd -f --cluster ceph --id 13 --setuser ceph --setgroup ceph
 6391 ?        Ssl    0:08 /usr/bin/ceph-osd -f --cluster ceph --id 9 --setuser ceph --setgroup ceph
 6504 ?        Ssl    0:05 /usr/bin/ceph-osd -f --cluster ceph --id 8 --setuser ceph --setgroup ceph

There may be a bug.
 
Hi Rob,
oops - I forgot that I had problems on one node too - all nodes with mon+osd together work well (six in my case - two clusters) - only a node with OSDs only had the same issue (but not with pve-cluster, because I ran the Ceph upgrade before the new PVE packages were released).

The following helped me:
Code:
# I stopped the ceph-osd first - but not sure if that was necessary
# then killed the ceph-common.postinst process
dpkg --configure ceph-base:amd64
dpkg --configure ceph-common:amd64
apt dist-upgrade
service ceph-osd@3 start
Between the dpkg --configure calls I killed two more processes, but I don't remember exactly which process made the trouble.

I assume that the "10.2.5-1~bpo80+1" in the postinst makes the trouble, because the version is then 10.2.6...
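
Once everything is configured and the daemons have been restarted, a quick check that the OSDs really run 10.2.6 could be (assuming 'version' via ceph tell behaves like injectargs did above):
Code:
ceph tell osd.* version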

Udo
 
