Hello Forum,
We are running a hyperconverged 3-node Proxmox Ceph cluster (PVE 7.4.16, Ceph 16.3).
We use a meshed 10 GBit bonded network for the Ceph traffic, built with 3 x dual-port Intel NICs.
It seems that one port on node 3 has failed: node 1 cannot ping node 3 (and vice versa), but node 2 can still ping node 3.
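For reference, this is roughly how we checked connectivity over the mesh (the interface name and IP addresses below are only placeholders, not our actual configuration):

# link state of the mesh ports on the node (interface name is an example)
ip -br link show
ethtool enp65s0f0 | grep -E 'Speed|Link detected'

# ping the Ceph addresses of the other two nodes (addresses are examples)
ping -c 3 -I enp65s0f0 10.10.10.11
ping -c 3 -I enp65s0f0 10.10.10.12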
We cannot log in to any VM.
Current state: after rebooting node 3, no VM can start.
Ceph still has the noout and noscrub flags set, and it is stuck peering:
root@amcvh13:~# ceph -s
  cluster:
    id:     ae713943-83f3-48b4-a0c2-124c092c250b
    health: HEALTH_WARN
            noout,noscrub,nodeep-scrub flag(s) set
            Reduced data availability: 291 pgs inactive, 291 pgs peering
            1388 slow ops, oldest one blocked for 3578 sec, daemons [osd.1,osd.10,osd.12,osd.17,osd.2,osd.3,osd.45,osd.47,osd.5,osd.6]... have slow ops.

  services:
    mon: 3 daemons, quorum amcvh11,amcvh12,amcvh13 (age 59m)
    mgr: amcvh12(active, since 65m), standbys: amcvh11, amcvh13
    mds: 1/1 daemons up, 2 standby
    osd: 45 osds: 45 up (since 59m), 45 in (since 12M)
         flags noout,noscrub,nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 433 pgs
    objects: 600.49k objects, 1.1 TiB
    usage:   3.4 TiB used, 8.8 TiB / 12 TiB avail
    pgs:     67.206% pgs not active
             291 peering
             142 active+clean
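For completeness: the flags were set the usual way before rebooting node 3, and this is roughly what I would run to dig into the peering and to clear the flags again once the cluster has recovered (nothing exotic, just the standard commands):

# flags set before rebooting node 3
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# show details on the inactive/peering PGs and the slow ops
ceph health detail
ceph pg dump_stuck inactive

# clear the flags again once the cluster is healthy
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub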
Our biggest problem is that we had to stop our mail server, which no longer responded,
before we became aware of the underlying Ceph network problem - so mail on node 2 is offline now.
How can I proceed without causing any damage?
Is it appropriate to shut down node 3 and continue with a 2/3 degraded cluster, just in
order to start and run the mail server until we replace the NIC in node 3?
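In case it matters for an answer: my (possibly wrong) assumption is that with 3-way replication and min_size 2 the pools should stay active with node 3 down. Before shutting node 3 down I would double-check roughly like this (the loop just iterates over whatever pools exist, nothing is specific to our setup):

# replication settings per pool
for p in $(ceph osd pool ls); do
    echo "== $p"
    ceph osd pool get "$p" size
    ceph osd pool get "$p" min_size
done

# confirm which OSDs sit on host amcvh13 before taking it down
ceph osd tree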
P.S. We have a subscription, but our login to the customer portal failed, so I wasn't able to open
a ticket.
Any help is very much appreciated...