I have an HA 3-node cluster with 3-replica Ceph.
We just had a power outage, and one of the nodes is down after a UPS was flapping and the battery ran down. That node gets a kernel panic on boot. I think the boot drive is corrupted (but that's a different issue).
I believe the idea of Ceph is that we should be able to survive a 1-out-of-3 node outage? Instead, Ceph is completely unavailable.
Running ceph -s hangs; I hit Ctrl-C after 10 minutes of waiting:
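To spell out the expectation above: as I understand it, the Ceph monitors need a strict majority of their members to form a quorum. The sketch below is only that arithmetic, assuming one monitor per node (I have not confirmed the monitor layout on this cluster); the function names are made up for illustration.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest strict majority of num_monitors."""
    if num_monitors < 1:
        raise ValueError("need at least one monitor")
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can be down while a majority still remains."""
    return num_monitors - quorum_size(num_monitors)
```

With three monitors this gives a quorum of two and one tolerated failure, which is why I expected the cluster to stay up. If only two monitors were actually deployed, losing either one would break quorum.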
office-pmvm-01 {~} root# ceph -s
^CCluster connection aborted
strace shows that it keeps trying to connect to 10.200.6.11 (the node that is down) on ports 6789 and 3300:
[pid 73196] setsockopt(14, SOL_TCP, TCP_NODELAY, [1], 4 <unfinished ...>
[pid 73197] <... setsockopt resumed>) = 0
[pid 73196] <... setsockopt resumed>) = 0
[pid 73197] connect(12, {sa_family=AF_INET, sin_port=htons(6789), sin_addr=inet_addr("10.200.6.11")}, 16 <unfinished ...>
[pid 73196] connect(14, {sa_family=AF_INET, sin_port=htons(3300), sin_addr=inet_addr("10.200.6.11")}, 16 <unfinished ...>
[pid 73197] <... connect resumed>) = -1 EINPROGRESS (Operation now in progress)
[pid 73196] <... connect resumed>) = -1 EINPROGRESS (Operation now in progress)
[pid 73197] epoll_ctl(6, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLET, {u32=12, u64=12}} <unfinished ...>
[pid 73196] epoll_ctl(3, EPOLL_CTL_ADD, 14, {EPOLLIN|EPOLLET, {u32=14, u64=14}} <unfinished ...>
[pid 73197] <... epoll_ctl resumed>) = 0
[pid 73196] <... epoll_ctl resumed>) = 0
[pid 73197] connect(12, {sa_family=AF_INET, sin_port=htons(6789), sin_addr=inet_addr("10.200.6.11")}, 16 <unfinished ...>
[pid 73196] connect(14, {sa_family=AF_INET, sin_port=htons(3300), sin_addr=inet_addr("10.200.6.11")}, 16 <unfinished ...>
[pid 73197] <... connect resumed>) = -1 EALREADY (Operation already in progress)
[pid 73196] <... connect resumed>) = -1 EALREADY (Operation already in progress)
[pid 73197] epoll_ctl(6, EPOLL_CTL_MOD, 12, {EPOLLIN|EPOLLOUT|EPOLLET, {u32=12, u64=12}} <unfinished ...>
[pid 73196] epoll_ctl(3, EPOLL_CTL_MOD, 14, {EPOLLIN|EPOLLOUT|EPOLLET, {u32=14, u64=14}} <unfinished ...>
[pid 73197] <... epoll_ctl resumed>) = 0
[pid 73196] <... epoll_ctl resumed>) = 0
[pid 73197] epoll_wait(6, <unfinished ...>
[pid 73196] epoll_wait(3, ^C <unfinished ...>
[pid 73191] <... futex resumed>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
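To check whether the monitor ports on the other nodes answer at all, something like the following could be run from a surviving node. This is a rough sketch, not a Ceph tool: it only tests TCP reachability of ports 3300 and 6789 (the two ports seen in the strace above), and the helper names and the 2-second timeout are my own choices.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_monitors(hosts, ports=(3300, 6789), timeout=2.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}
```

Calling probe_monitors(["10.200.6.11"]) should report both ports as closed for the dead node; the addresses of the other two nodes would need to be added to the list. An open port only shows that something is listening there, not that the monitor is in quorum.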
What are the next steps?
Thanks!