[SOLVED] Corosync and ceph issues!

tl5k5

Well-Known Member
Jul 28, 2017
Hello all,
I walked in this morning to find problems.
My config is a 4-node PVE cluster, 3 nodes of which provide Ceph storage.
Ceph is inaccessible from the web GUI.
pve01 has corosync errors.
The `ceph status` command will not run on the CLI on any of the 3 Ceph nodes.
I'm not sure if one issue caused the other or if it's just a coincidence.

Any help would be appreciated!


Code:
systemctl status corosync.service -l
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: failed (Result: resources)
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview

Aug 24 09:57:26 pve01 systemd[1]: Failed to start Corosync Cluster Engine.
Aug 24 10:14:12 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Aug 24 10:14:12 pve01 systemd[1]: corosync.service: Failed with result 'resources'.
Aug 24 10:14:12 pve01 systemd[1]: Failed to start Corosync Cluster Engine.
Aug 24 10:15:36 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Aug 24 10:15:36 pve01 systemd[1]: corosync.service: Failed with result 'resources'.
Aug 24 10:15:36 pve01 systemd[1]: Failed to start Corosync Cluster Engine.
Aug 24 10:49:46 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Aug 24 10:49:46 pve01 systemd[1]: corosync.service: Failed with result 'resources'.
Aug 24 10:49:46 pve01 systemd[1]: Failed to start Corosync Cluster Engine.

Code:
journalctl -xn
-- Logs begin at Mon 2020-08-24 09:57:20 CDT, end at Mon 2020-08-24 11:00:01 CDT. --
Aug 24 10:59:46 pve01 pvestatd[5779]: got timeout
Aug 24 10:59:46 pve01 pvestatd[5779]: status update time (5.373 seconds)
Aug 24 10:59:55 pve01 snmpd[1469]: error on subcontainer 'ia_addr' insert (-1)
Aug 24 10:59:56 pve01 pvestatd[5779]: got timeout
Aug 24 10:59:56 pve01 pvestatd[5779]: status update time (5.376 seconds)
Aug 24 11:00:00 pve01 systemd[1]: Starting Proxmox VE replication runner...
-- Subject: A start job for unit pvesr.service has begun execution
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pvesr.service has begun execution.
--
-- The job identifier is 6792.
Aug 24 11:00:01 pve01 pvesr[16026]: unable to open file '/var/lib/pve-manager/pve-replication-state.json.tmp.16026' - No space left
Aug 24 11:00:01 pve01 systemd[1]: pvesr.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- An ExecStart= process belonging to unit pvesr.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 2.
Aug 24 11:00:01 pve01 systemd[1]: pvesr.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit pvesr.service has entered the 'failed' state with result 'exit-code'.
Aug 24 11:00:01 pve01 systemd[1]: Failed to start Proxmox VE replication runner.
-- Subject: A start job for unit pvesr.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit pvesr.service has finished with a failure.
--
-- The job identifier is 6792 and the job result is failed.

Code:
Aug 24 11:15:27 pve01 systemd[1]: ceph-mon@pve01.service: Start request repeated too quickly.
Aug 24 11:15:27 pve01 systemd[1]: ceph-mon@pve01.service: Failed with result 'resources'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- The unit ceph-mon@pve01.service has entered the 'failed' state with result 'resources'.
Aug 24 11:15:27 pve01 systemd[1]: Failed to start Ceph cluster monitor daemon.
-- Subject: A start job for unit ceph-mon@pve01.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
--
-- A start job for unit ceph-mon@pve01.service has finished with a failure.
--
-- The job identifier is 8382 and the job result is failed.
 
Aug 24 10:49:46 pve01 systemd[1]: corosync.service: Failed to run 'start' task: No space left on device
Do you maybe have a full filesystem?
check with `df -h`
 
Do you maybe have a full filesystem?
check with `df -h`

I did look at that...I thought it looked ok. Here's the output:

Code:
df -h
Filesystem                     Size  Used Avail Use% Mounted on
udev                            18G     0   18G   0% /dev
tmpfs                          3.6G   15M  3.6G   1% /run
/dev/mapper/pve-root            29G  4.6G   23G  17% /
tmpfs                           18G   63M   18G   1% /dev/shm
tmpfs                          5.0M     0  5.0M   0% /run/lock
tmpfs                           18G     0   18G   0% /sys/fs/cgroup
tmpfs                           18G   24K   18G   1% /var/lib/ceph/osd/ceph-6
tmpfs                           18G   24K   18G   1% /var/lib/ceph/osd/ceph-0
tmpfs                           18G   24K   18G   1% /var/lib/ceph/osd/ceph-3
tmpfs                          3.6G     0  3.6G   0% /run/user/0
/dev/fuse                       30M   36K   30M   1% /etc/pve
 
on another hunch - maybe the inodes ran out - `df -i` should indicate that

if this still is not the reason - you could try to start corosync in foreground mode like systemd would do:
`systemctl cat corosync`
just run the ExecStart line - maybe this gives a hint to the origin of the 'No space left on device'

I hope this helps!
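
Concretely, that would look something like this (just a sketch; check the ExecStart line in the output of `systemctl cat corosync` on your node, it can differ between versions):

Code:
# show the unit file and look for the ExecStart= line
systemctl cat corosync
#   [Service]
#   ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS
# stop the (failed) unit, then run that command directly in the foreground
systemctl stop corosync
/usr/sbin/corosync -f $COROSYNC_OPTIONS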
 
See output below:

Code:
df -i
Filesystem                     Inodes   IUsed   IFree IUse% Mounted on
udev                          4625469     566 4624903    1% /dev
tmpfs                         4631662    2312 4629350    1% /run
/dev/mapper/pve-root          1933312 1933312       0  100% /
tmpfs                         4631662     129 4631533    1% /dev/shm
tmpfs                         4631662      12 4631650    1% /run/lock
tmpfs                         4631662      18 4631644    1% /sys/fs/cgroup
tmpfs                         4631662       8 4631654    1% /var/lib/ceph/osd/ceph-6
tmpfs                         4631662       8 4631654    1% /var/lib/ceph/osd/ceph-0
tmpfs                         4631662       8 4631654    1% /var/lib/ceph/osd/ceph-3
//10.210.10.44/proxmox              0       0       0     - /mnt/pve/RAID04
//10.210.10.22/pve_cluster_VM       0       0       0     - /mnt/pve/mako
tmpfs                         4631662      10 4631652    1% /run/user/0
/dev/fuse                       10000      80    9920    1% /etc/pve

Code:
 /usr/sbin/corosync -f $COROSYNC_OPTIONS
Aug 24 11:50:24 notice  [MAIN  ] Corosync Cluster Engine 3.0.4 starting up
Aug 24 11:50:24 info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Aug 24 11:50:24 error   [MAIN  ] Another Corosync instance is already running.
Aug 24 11:50:24 error   [MAIN  ] Corosync Cluster Engine exiting with status 18 at main.c:1519.
 
Also:
Code:
root@pve01:/# killall -9 corosync
root@pve01:/# /usr/sbin/corosync -f $COROSYNC_OPTIONS
Aug 24 11:52:59 notice  [MAIN  ] Corosync Cluster Engine 3.0.4 starting up
Aug 24 11:52:59 info    [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf snmp pie relro bindnow
Aug 24 11:52:59 notice  [TOTEM ] Initializing transport (Kronosnet).
Aug 24 11:52:59 info    [TOTEM ] kronosnet crypto initialized: aes256/sha256
Aug 24 11:52:59 info    [TOTEM ] totemknet initialized
Aug 24 11:52:59 info    [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync configuration map access [0]
Aug 24 11:52:59 info    [QB    ] server name: cmap
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync configuration service [1]
Aug 24 11:52:59 info    [QB    ] server name: cfg
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Aug 24 11:52:59 info    [QB    ] server name: cpg
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync profile loading service [4]
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Aug 24 11:52:59 warning [WD    ] Watchdog not enabled by configuration
Aug 24 11:52:59 warning [WD    ] resource load_15min missing a recovery key.
Aug 24 11:52:59 warning [WD    ] resource memory_used missing a recovery key.
Aug 24 11:52:59 info    [WD    ] no resources configured.
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync watchdog service [7]
Aug 24 11:52:59 notice  [QUORUM] Using quorum provider corosync_votequorum
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Aug 24 11:52:59 info    [QB    ] server name: votequorum
Aug 24 11:52:59 notice  [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Aug 24 11:52:59 info    [QB    ] server name: quorum
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 notice  [TOTEM ] A new membership (1.1924) was formed. Members joined: 1
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 0)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 notice  [QUORUM] Members[1]: 1
Aug 24 11:52:59 notice  [MAIN  ] Completed service synchronization, ready to provide service.
Aug 24 11:52:59 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 2 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 3 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:52:59 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:52:59 warning [KNET  ] host: host: 4 has no active links
Aug 24 11:53:01 info    [KNET  ] rx: host: 4 link: 1 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 4 link: 0 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 3 link: 0 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 2 link: 1 is up
Aug 24 11:53:01 info    [KNET  ] rx: host: 3 link: 1 is up
Aug 24 11:53:01 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] rx: host: 2 link: 0 is up
Aug 24 11:53:01 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 4 link: 0 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 4 link: 1 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 3 link: 1 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 2 link: 0 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: PMTUD link change for host: 2 link: 1 from 469 to 1397
Aug 24 11:53:01 info    [KNET  ] pmtud: Global data MTU changed to: 1397
Aug 24 11:53:01 notice  [TOTEM ] A new membership (1.1928) was formed. Members joined: 2 3 4
Aug 24 11:53:01 notice  [QUORUM] This node is within the primary component and will provide service.
Aug 24 11:53:01 notice  [QUORUM] Members[4]: 1 2 3 4
Aug 24 11:53:01 notice  [MAIN  ] Completed service synchronization, ready to provide service.
^CAug 24 11:53:23 notice  [MAIN  ] Node was shut down by a signal
Aug 24 11:53:23 notice  [SERV  ] Unloading all Corosync service engines.
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync configuration map access
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync configuration service
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Aug 24 11:53:23 info    [QB    ] withdrawing server sockets
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync profile loading service
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Aug 24 11:53:23 notice  [SERV  ] Service engine unloaded: corosync watchdog service
Aug 24 11:53:24 notice  [MAIN  ] Corosync Cluster Engine exiting normally
 
on another hunch - maybe the inodes ran out - `df -i` should indicate that

if this still is not the reason - you could try to start corosync in foreground mode like systemd would do:
`systemctl cat corosync`
just run the ExecStart line - maybe this gives a hint to the origin of the 'No space left on device'

I hope this helps!

Starting it manually with "/usr/sbin/corosync -f $COROSYNC_OPTIONS" seems to have fixed the corosync issue, and it also came back up after the node was restarted. Now I need to focus on Ceph, as it is still problematic.
Any ideas?
 
/dev/mapper/pve-root 1933312 1933312 0 100% /
your root filesystem has run out of inodes (put simply: you have too many files on it) - check where the inodes have been used up (maybe somewhere in /var there's a directory containing many 0-byte files)

I hope this helps!
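
For example, something along these lines should narrow down the culprit (a rough sketch; `du --inodes` needs a reasonably recent coreutils, otherwise count files per directory with `find ... | wc -l`):

Code:
# list the directories on the root filesystem using the most inodes
du --inodes -x / 2>/dev/null | sort -n | tail -20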
 
your root filesystem has run out of inodes (put simply: you have too many files on it) - check where the inodes have been used up (maybe somewhere in /var there's a directory containing many 0-byte files)

I hope this helps!

Looks like I found what's clogging up the disk: /var/lib/samba/private/msg.sock/
I found this old bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912717
My 3 Ceph nodes were all at or near 100% inode usage according to df -i. These 3 nodes all mount two SMB shares.
My 4th node has Samba running on it that shares out some internal storage to these first 3 ceph nodes.
All 4 nodes connect to an SMB server for backups.
Is this bug still an issue or is there something else going on?

Thanks for all your help!
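
In case it's useful to anyone else, this is roughly what I'm planning to clean it up with (untested sketch, assuming - as in that bug report - that the leftover entries in msg.sock are named after process IDs; please double-check before running it):

Code:
# how many leftover socket entries are there?
ls /var/lib/samba/private/msg.sock | wc -l
# remove entries whose process no longer exists (assumes entries are named after PIDs)
for f in /var/lib/samba/private/msg.sock/*; do
    [ -d "/proc/$(basename "$f")" ] || rm -f "$f"
done

Once the inodes are freed, I assume the monitors can simply be restarted, something like:

Code:
# clear the 'start request repeated too quickly' state and retry
systemctl reset-failed ceph-mon@pve01.service
systemctl start ceph-mon@pve01.service
ceph -s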
 
Sorry to admit, I can't figure out how to mark it solved.
Above your first post there's the three dots '...' (a.k.a. more options) button -> Edit Thread -> set the Prefix to SOLVED

for the next time - I'll set this one to SOLVED :)
 
@Stoiko Ivanov
I looked there and "Solved" was not listed.

(screenshot attached: Selection_461.png)

EDIT: I think I found it. You have to go to Edit thread and then pick it by the thread title...correct?
 
