Repeated panics on one node

athompso

Member
Sep 13, 2013
I'm getting repeated kernel panics on one of my nodes (see picture, below).

(Attachment: New Doc 1_1.jpg - photo of the kernel panic)

I'm guessing that one of the CEPH OSDs on this system got broken somehow, and XFS is now really unhappy. So this produces two problems for me...

1) should I even try to fix the XFS filesystem?
2) CEPH isn't rebalancing properly even though it recognizes those OSDs as down/out (see the status checks sketched at the end of this post).

And, of course, the root question would be "What happened???" but I have no answer for that.

Curiously, pveproxy/pvemanager are completely dead on this system; the server is half-dead - it responds to ping but not much else.
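
For problem 2, a rough sketch of checking what Ceph actually thinks is going on (the OSD id below is just a placeholder):

  ceph -s            # overall health, degraded/misplaced object counts, recovery activity
  ceph osd tree      # which OSDs are up/down and in/out, per host
  ceph osd out 12    # if an OSD is only marked down but still "in", force it out so backfill can start

Rebalancing only starts once the OSDs are actually marked out - either by hand or automatically after the mon osd down out interval timeout.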
 

1) should I even try to fix the XFS filesystem?
You could also try formatting the OSD as ext4 - I have had good experiences with ext4 OSDs.
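
Roughly, recreating one OSD with a different filesystem would look like this (osd.2 and /dev/sdc are placeholders, and the exact create step depends on your Proxmox/Ceph version):

  ceph osd out 2                # mark the OSD out so its data gets remapped elsewhere
  service ceph stop osd.2       # or /etc/init.d/ceph stop osd.2 on sysvinit setups
  ceph osd crush remove osd.2   # drop it from the CRUSH map
  ceph auth del osd.2           # remove its auth key
  ceph osd rm 2                 # remove it from the OSD map
  pveceph createosd /dev/sdc    # recreate on the wiped disk; see 'man pveceph' for whether
                                # your version accepts a filesystem choice (ext4)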
2) CEPH isn't rebalancing properly even though it recognizes those OSDs as down/out.
That sounds like too few OSDs/nodes for your selected replica count.
If you have three nodes and a replica count of 3, Ceph won't rebalance when one node dies...
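
To check whether that is the case here (the pool name 'rbd' is just an example):

  ceph osd pool get rbd size        # replica count configured for the pool
  ceph osd pool get rbd min_size    # replicas required before I/O is allowed
  ceph osd tree                     # how many hosts are left to place replicas on

With a replica count equal to the number of nodes and the default per-host placement, a dead node leaves Ceph nowhere to put the missing copies, so the cluster stays degraded instead of rebalancing.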

And, of course, the root question would be "What happened???" but I have no answer for that.
Any info in the log files?
Have you tried xfs_repair?

Udo
 
I booted sysresccd, mounted (to replay the XFS journal), unmounted, and ran xfs_repair on each of the 3 OSDs on that host. xfs_repair didn't appear to complain about anything in particular, which worries me - if the kernel panicked in XFS metadata updates, there should have been *something* wrong...
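
For reference, the per-OSD sequence described above looks roughly like this (device and mountpoint are placeholders):

  mount /dev/sdb1 /mnt/osd     # mounting replays the XFS log
  umount /mnt/osd
  xfs_repair -n /dev/sdb1      # -n = no-modify dry run, only reports what it would fix
  xfs_repair /dev/sdb1         # actual repair; it refuses to run if the log is still dirty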

As to OSD layout, yes, there are some design decisions that are severely suboptimal, but I can't fix them until these servers die and get replaced. I only saw CEPH fail to start rebalancing once, and it may have been because the server only half-died that time :-(.
 
Hi,
what is the SMART status (smartmontools) of the HDDs?
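
For example, on a directly attached disk (/dev/sda is a placeholder):

  smartctl -H /dev/sda    # quick overall health verdict
  smartctl -a /dev/sda    # full SMART attributes and error log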

Udo
 
Well, actually, um, er... this node is a PowerEdge 2970 with the battery-backed PERC 5/i RAID controller. I'm actually running OSDs on three RAID1 volumes to take advantage of the enhanced BBWC performance, since I've got it. (If I didn't have the battery module, I would have just disabled RAID altogether, but I don't see much reason to throw away the huge performance boost.)
Three volumes, because I have 2 x 2TB drives, 2 x 1TB drives, and 2 x 250GB drives in the system.
The other nodes are a bit saner.
As the oldest servers die, they'll likely get replaced with E3-1200v3 systems with 1 x boot drive, 1 x SSD cache and 2 x OSD disks each. Especially since I can buy two of those for the cost of one E5-2600v3 system with 6 bays, and I don't need >32GB of RAM anywhere yet.
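
Note that with the disks behind the PERC (LSI MegaRAID based), smartctl needs the RAID passthrough to reach the physical drives - a sketch, with the disk index and block device as placeholders:

  smartctl -a -d megaraid,0 /dev/sda    # first physical disk behind the controller
  smartctl -a -d megaraid,1 /dev/sda    # second physical disk, and so on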
-Adam
 
