Repeated panics on one node

athompso

Renowned Member
Sep 13, 2013
I'm getting repeated kernel panics on one of my nodes (see picture, below).

[Attachment: New Doc 1_1.jpg - photo of the kernel panic on the console]

I'm guessing that one of the CEPH OSDs on this system got broken somehow, and XFS is now really unhappy. So this produces two problems for me...

1) should I even try to fix the XFS filesystem?
2) CEPH isn't rebalancing properly even though it recognizes those OSDs as down/out.

And, of course, the root question would be "What happened???" but I have no answer for that.

Curiously, pveproxy/pvemanager are completely dead on this system; the server is half-dead - it responds to ping but not much else.
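
(For context, the usual way to see what Ceph itself thinks is going on, run from any still-working node - standard Ceph CLI, output omitted:)

# overall cluster health and any recovery/backfill activity
ceph -s
# per-host view of which OSDs are up/down and in/out
ceph osd tree
# details on degraded or stuck placement groups
ceph health detail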
 

1) should I even try to fix the XFS filesystem?
You could also try reformatting the OSD with ext4 - I have had good experiences with ext4 OSDs.
2) CEPH isn't rebalancing properly even though it recognizes those OSDs as down/out.
This sounds like too few OSDs/nodes for your selected replica count. If you have three nodes and a replica count of 3, Ceph won't rebalance when one node dies - with the default CRUSH rule there is no remaining host left to hold the third copy...
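
You can check what the pool expects versus what the cluster can still provide (the pool name "rbd" below is just an example):

# replica count and the minimum copies needed to keep serving I/O
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# the same information for all pools at once
ceph osd dump | grep 'replicated size'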

And, of course, the root question would be "What happened???" but I have no answer for that.
Any info in the log files?
Have you tried xfs_repair?
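
A typical repair sequence looks like this (assuming the OSD filesystem sits on /dev/sdb1 - adjust the device, and stop the OSD first):

# dry run: report problems without changing anything
xfs_repair -n /dev/sdb1
# if the log is dirty, mount once to replay the journal, then unmount
mount /dev/sdb1 /mnt && umount /mnt
# actual repair
xfs_repair /dev/sdb1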

Udo
 
I booted sysresccd, mounted (to replay the XFS journal), unmounted, and ran xfs_repair on each of the 3 OSDs on that host. xfs_repair didn't appear to complain about anything in particular, which worries me - if the kernel panicked during XFS metadata updates, there should have been *something* wrong...

As to OSD layout, yes, there are some design decisions that are severely suboptimal, but I can't fix them until these servers die and get replaced. I only saw CEPH fail to start rebalancing once, and it may have been because the server only half-died that time :-(.
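
(If Ceph again sees the OSDs as down but never starts rebalancing, they can be marked out by hand so CRUSH remaps their data; the OSD IDs below are placeholders for the ones on that host.)

# mark the affected OSDs out so recovery/backfill starts
ceph osd out 3
ceph osd out 4
ceph osd out 5
# watch recovery progress
ceph -w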
 
Hi,
What is the SMART (smartmontools) status of the HDDs?
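
For directly attached SATA/SAS disks smartctl can read them straight; behind a RAID controller you usually have to go through the controller (device names and the megaraid disk number below are placeholders):

# directly attached disk
smartctl -a /dev/sda
# disk hidden behind an LSI/PERC MegaRAID controller
smartctl -a -d megaraid,0 /dev/sda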

Udo
 
Well, actually, um, er... this node is a PowerEdge 2970 with the battery-backed PERC 5/i RAID controller. I'm actually running OSDs on three RAID1 volumes to take advantage of the battery-backed write cache (BBWC) performance, since I've got it. (If I didn't have the battery module, I would have just disabled RAID altogether, but I don't see much reason to throw away the huge performance boost.)
Three volumes, because I have 2 x 2TB drives, 2 x 1TB drives, and 2 x 250GB drives in the system.
The other nodes are a bit saner.
As the oldest servers die, they'll likely get replaced with E3-1200v3 systems with 1 x boot drive, 1 x SSD cache and 2 x OSD disks each. Especially since I can buy two of those for the cost of one E5-2600v3 system with 6 bays, and I don't need >32GB of RAM anywhere yet.
-Adam
 
