[SOLVED] Corosync and ceph issues!

tl5k5 · Aug 24, 2020

Hello all,
I walked in this morning to find problems.
My config consists of a 4 Node PVE cluster which has 3 Nodes of ceph storage
Ceph is inaccessible from the webgui
pve01 has corosync errors.
ceph status command will not run on the cli across the 3 ceph nodes.
I'm not sure if one issue caused the other or if it's just a coincidence.

Any help would be appreciated!

Code:

systemctl status corosync.service -l
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: resources)
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview

Aug 24 09:57:26 pve01 systemd[1]: Failed to start Corosync Cluster Engine.
Aug 24 10:14:12 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Aug 24 10:14:12 pve01 systemd[1]: corosync.service: Failed with result 'resources'.
Aug 24 10:14:12 pve01 systemd[1]: Failed to start Corosync Cluster Engine.
Aug 24 10:15:36 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Aug 24 10:15:36 pve01 systemd[1]: corosync.service: Failed with result 'resources'.
Aug 24 10:15:36 pve01 systemd[1]: Failed to start Corosync Cluster Engine.
Aug 24 10:49:46 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Aug 24 10:49:46 pve01 systemd[1]: corosync.service: Failed with result 'resources'.
Aug 24 10:49:46 pve01 systemd[1]: Failed to start Corosync Cluster Engine.

Code:

journalctl -xn
-- Logs begin at Mon 2020-08-24 09:57:20 CDT, end at Mon 2020-08-24 11:00:01 CDT. --
Aug 24 10:59:46 pve01 pvestatd[5779]: got timeout
Aug 24 10:59:46 pve01 pvestatd[5779]: status update time (5.373 seconds)
Aug 24 10:59:55 pve01 snmpd[1469]: error on subcontainer 'ia_addr' insert (-1)
Aug 24 10:59:56 pve01 pvestatd[5779]: got timeout
Aug 24 10:59:56 pve01 pvestatd[5779]: status update time (5.376 seconds)
Aug 24 11:00:00 pve01 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: A start job for unit pvesr.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pvesr.service has begun execution.
--
-- The job identifier is 6792.
Aug 24 11:00:01 pve01 pvesr[16026]: unable to open file '/var/lib/pve-manager/pve-replication-state.json.tmp.16026' - No space left
Aug 24 11:00:01 pve01 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An ExecStart= process belonging to unit pvesr.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 2.
Aug 24 11:00:01 pve01 systemd[1]: pvesr.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit pvesr.service has entered the 'failed' state with result 'exit-code'.
Aug 24 11:00:01 pve01 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: A start job for unit pvesr.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pvesr.service has finished with a failure.
--
-- The job identifier is 6792 and the job result is failed.

Code:

Aug 24 11:15:27 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
Aug 24 11:15:27 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'resources'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit ceph-mon@pve01.service has entered the 'failed' state with result 'resources'.
Aug 24 11:15:27 pve01 systemd[1]: Failed to start Ceph cluster monitor daemon.
-- Subject: A start job for unit ceph-mon@pve01.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit ceph-mon@pve01.service has finished with a failure.
--
-- The job identifier is 8382 and the job result is failed.

Stoiko Ivanov · Aug 24, 2020

tl5k5 said:
Aug 24 10:49:46 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device

Do you maybe have a full filesystem?
check with `df -h`

tl5k5 · Aug 24, 2020

Stoiko Ivanov said:
Do you maybe have a full filesystem?
check with `df -h`

I did look at that...I thought it looked ok. Here's the output:

Code:

df -h
Filesystem                     Size  Used Avail Use% Mounted on
udev                            18G     0   18G   0% /dev
tmpfs                          3.6G   15M  3.6G   1% /run
/dev/mapper/pve-root            29G  4.6G   23G  17% /
tmpfs                           18G   63M   18G   1% /dev/shm
tmpfs                          5.0M     0  5.0M   0% /run/lock
tmpfs                           18G     0   18G   0% /sys/fs/cgroup
tmpfs                           18G   24K   18G   1% /var/lib/ceph/osd/ceph-6
tmpfs                           18G   24K   18G   1% /var/lib/ceph/osd/ceph-0
tmpfs                           18G   24K   18G   1% /var/lib/ceph/osd/ceph-3
tmpfs                          3.6G     0  3.6G   0% /run/user/0
/dev/fuse                       30M   36K   30M   1% /etc/pve

Stoiko Ivanov · Aug 24, 2020

on another hunch - maybe the inodes ran out - `df -i` should indicate that

if this still is not the reason - you could try to start corosync in foreground mode like systemd would do:
`systemctl cat corosync`
just run the ExecStart line - maybe this gives a hint to the origin of the 'No space left on device'

I hope this helps!

tl5k5 · Aug 24, 2020

See output below:

Code:

df -i
Filesystem                     Inodes   IUsed   IFree IUse% Mounted on
udev                          4625469     566 4624903    1% /dev
tmpfs                         4631662    2312 4629350    1% /run
/dev/mapper/pve-root          1933312 1933312       0  100% /
tmpfs                         4631662     129 4631533    1% /dev/shm
tmpfs                         4631662      12 4631650    1% /run/lock
tmpfs                         4631662      18 4631644    1% /sys/fs/cgroup
tmpfs                         4631662       8 4631654    1% /var/lib/ceph/osd/ceph-6
tmpfs                         4631662       8 4631654    1% /var/lib/ceph/osd/ceph-0
tmpfs                         4631662       8 4631654    1% /var/lib/ceph/osd/ceph-3
//10.210.10.44/proxmox              0       0       0     - /mnt/pve/RAID04
//10.210.10.22/pve_cluster_VM       0       0       0     - /mnt/pve/mako
tmpfs                         4631662      10 4631652    1% /run/user/0
/dev/fuse                       10000      80    9920    1% /etc/pve

Code:

 /usr/sbin/corosync -f $COROSYNC_OPTIONS
Aug 24 11:50:24 notice  [MAIN  ] Corosync Cluster Engine 3.0.4 starting up
Aug 24 11:50:24 info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Aug 24 11:50:24 error   [MAIN  ] Another Corosync instance is already running.
Aug 24 11:50:24 error   [MAIN  ] Corosync Cluster Engine exiting with status 18 at main.c:1519.

tl5k5 · Aug 24, 2020

Also:

Code:

root@pve01:/# killall -9 corosync
root@pve01:/# /usr/sbin/corosync -f $COROSYNC_OPTIONS
Aug 24 11:52:59 notice  [MAIN  ] Corosync Cluster Engine 3.0.4 starting up
Aug 24 11:52:59 info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Aug 24 11:52:59 notice  [TOTEM ] Initializing transport (Kronosnet).
Aug 24 11:52:59 info    [TOTEM ] kronosnet crypto initialized: aes256/sha256
Aug 24 11:52:59 info    [TOTEM ] totemknet initialized
Aug 24 11:52:59 info    [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Aug 24 11:52:59 info    [QB    ] server name: cmap
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 24 11:52:59 info    [QB    ] server name: cfg
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 24 11:52:59 info    [QB    ] server name: cpg
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Aug 24 11:52:59 warning [WD    ] Watchdog not enabled by configuration
Aug 24 11:52:59 warning [WD    ] resource load_15min missing a recovery key.
Aug 24 11:52:59 warning [WD    ] resource memory_used missing a recovery key.
Aug 24 11:52:59 info    [WD    ] no resources configured.
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 24 11:52:59 notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 24 11:52:59 info    [QB    ] server name: votequorum
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 24 11:52:59 info    [QB    ] server name: quorum
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 notice  [TOTEM ] A new membership (1.1924) was formed. Members joined: 1
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 notice  [QUORUM] Members[1]: 1
Aug 24 11:52:59 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:53:01 info    [KNET  ] rx: host: 4 link: 1 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 4 link: 0 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 3 link: 0 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 2 link: 1 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 3 link: 1 is up
Aug 24 11:53:01 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] rx: host: 2 link: 0 is up
Aug 24 11:53:01 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 24 11:53:01 notice  [TOTEM ] A new membership (1.1928) was formed. Members joined: 2 3 4
Aug 24 11:53:01 notice  [QUORUM] This node is within the primary component and will provide service.
Aug 24 11:53:01 notice  [QUORUM] Members[4]: 1 2 3 4
Aug 24 11:53:01 notice  [MAIN  ] Completed service synchronization, ready to provide service.
^CAug 24 11:53:23 notice  [MAIN  ] Node was shut down by a signal
Aug 24 11:53:23 notice  [SERV  ] Unloading all Corosync service engines.
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync configuration map access
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync configuration service
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync profile loading service
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync watchdog service
Aug 24 11:53:24 notice  [MAIN  ] Corosync Cluster Engine exiting normally

tl5k5 · Aug 24, 2020

Stoiko Ivanov said:
on another hunch - maybe the inodes ran out - `df -i` should indicate that

if this still is not the reason - you could try to start corosync in foreground mode like systemd would do:
`systemctl cat corosync`
just run the ExecStart line - maybe this gives a hint to the origin of the 'No space left on device'

I hope this helps!

After starting it manually with "/usr/sbin/corosync -f $COROSYNC_OPTIONS", this seems to have fixed the corosync issue. It also came back up after the node was restarted. Now I need to focus on ceph as it is still problematic.
Any ideas?

Stoiko Ivanov · Aug 25, 2020

tl5k5 said:
/dev/mapper/pve-root 1933312 1933312 0 100% /

your root filesystem has ran out of inodes (a bit simplified you have too many files on it) - check where the inodes have been used up (maybe somewhere in /var there's a directory containing many 0 byte files)

I hope this helps!

tl5k5 · Aug 25, 2020

Stoiko Ivanov said:
your root filesystem has ran out of inodes (a bit simplified you have too many files on it) - check where the inodes have been used up (maybe somewhere in /var there's a directory containing many 0 byte files)

I hope this helps!

Looks like I found what's clogging up the disk: /var/lib/samba/private/msg.sock/
I found this old bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912717
My 3 ceph nodes all had 100%, or almost 100% with df -i. These 3 nodes all mount two SMB shares.
My 4th node has Samba running on it that shares out some internal storage to these first 3 ceph nodes.
All 4 nodes connect to an SMB server for backups.
Is this bug still an issue or is there something else going on?

Thanks for all your help!

Stoiko Ivanov · Aug 25, 2020

tl5k5 said:
Is this bug still an issue or is there something else going on?

seems like this is the case - see also https://bugzilla.proxmox.com/show_bug.cgi?id=2333

currently I would suggest to clean it up manually to get your system running sensibly again

I hope this helps!

tl5k5 · Aug 25, 2020

Thanks again for the help!

Stoiko Ivanov said:
seems like this is the case - see also https://bugzilla.proxmox.com/show_bug.cgi?id=2333

currently I would suggest to clean it up manually to get your system running sensibly again

I hope this helps!

Stoiko Ivanov · Aug 25, 2020

tl5k5 said:
Thanks again for the help!

You're welcome

- please mark the thread as 'SOLVED' - this helps others who run into the same issue - Thanks!

tl5k5 · Aug 25, 2020

Stoiko Ivanov said:
You're welcome - please mark the thread as 'SOLVED' - this helps others who run into the same issue - Thanks!

Sorry to admit, I can't figure out how to mark it solved.

Stoiko Ivanov · Aug 25, 2020

tl5k5 said:
Sorry to admit, I can't figure out how to mark it solved.

Above your first post there's the three dots '...' (a.k.a. more options) button -> Edit Thread -> set the Prefix to SOLVED

for the next time - I'll set this one to SOLVED

tl5k5 · Aug 25, 2020

@Stoiko Ivanov
I looked there and "Solved" was not listed.

EDIT: I think I found it. You have to go to Edit thread and then pick it by the thread title...correct?

Stoiko Ivanov · Aug 25, 2020

tl5k5 said:
I think I found it. You have to go to Edit thread and then pick it by the thread title...correct?

yes exactly

Search

Search

[SOLVED] Corosync and ceph issues!

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

tl5k5

Well-Known Member

tl5k5

Well-Known Member

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

tl5k5

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

We value your privacy