Proxmox ceph pacific cluster becomes unstable after rebooting a node

silvered.dragon

Hi everyone,
I have a simple 3-node cluster that has worked well for many years and has come through every upgrade since Proxmox 4 without trouble. After updating to Proxmox 7 and Ceph Pacific, the system is affected by this issue:

Every time I reboot a node for any reason (e.g. updating to a new kernel), once the node has come back up and Ceph starts to resync, the cluster becomes unstable and I cannot access the VMs. In the GUI some things work and some don't: the node summary always seems to load, while the Ceph panel switches between being grayed out with the error "mon dump down" and showing a warning state with a lot of errors. All this instability occurs during the Ceph resync phase; after that phase the system is 100% stable again until the next node reboot. The resync times are not very different from what I would expect; the problem is that the system is inaccessible while the resync runs, which worries me. I have already gone through 3 kernel updates and the problem is always the same. So at the moment I only restart servers at night, when I know my colleagues aren't working on them.

I'm 100% sure that my HDDs are good and that there is no network failure (I can always ping across each cluster network, with ping times <0.01 ms) or CPU overload.
The servers are HP Gen8 with an HP Ethernet 10Gb 2-port 530FLR-SFP+ Adapter for the Ceph network (meshed) and 2 of the 4 ports of an HP NC365T 4-port Ethernet Server Adapter for the Proxmox cluster network (meshed), so I'm not using any kind of switch for the cluster networks.

I'm attaching some pics of the errors that I got.

I would really appreciate any advice.
 

Attachments

  • Cattura1.JPG (57.9 KB)
  • InkedCattura3_LI.jpg (941 KB)
  • InkedCattura5_LI.jpg (724.1 KB)
Any help with this? Yesterday I upgraded to the latest kernel and a new Ceph version, with the same issue. I'm attaching the syslogs of nodo1 and nodo2 from around the time nodo2 was rebooted.
 

Attachments

  • nodo1_syslog.txt (79.1 KB)
  • nodo2_syslog.txt (248.1 KB)
Hmm, searching the nodo2 syslog I found this symlink loop:
Code:
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:02 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:03 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1857]: sdc2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1856]: sdd2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1852]: sdb2: Failed to update device symlinks: Too many levels of symbolic links
Sep 28 23:38:04 nodo2 systemd-udevd[1854]: sde2: Failed to update device symlinks: Too many levels of symbolic links
so I think this is the same issue as this one:
https://forum.proxmox.com/threads/o...f-symbolic-links-after-6-4-to-7-update.92968/
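
Because the same udev message repeats many times per second, it can help to reduce the log to a per-device tally. The following is a minimal Python sketch, assuming the syslog lines have exactly the format shown above; the function name `count_symlink_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: ...
UDEV_SYMLINK_ERROR = re.compile(
    r"systemd-udevd\[\d+\]: (?P<dev>\S+): Failed to update device symlinks"
)


def count_symlink_errors(lines):
    """Return a Counter mapping device name -> number of symlink failures."""
    counts = Counter()
    for line in lines:
        match = UDEV_SYMLINK_ERROR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


# Small in-memory demo using lines copied from the log above.
demo_lines = [
    "Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links",
    "Sep 28 23:38:02 nodo2 systemd-udevd[1852]: sde2: Failed to update device symlinks: Too many levels of symbolic links",
    "Sep 28 23:38:02 nodo2 systemd-udevd[1854]: sdb2: Failed to update device symlinks: Too many levels of symbolic links",
]
for dev, n in sorted(count_symlink_errors(demo_lines).items()):
    print(f"{dev}: {n}")
```

To run it against a real file, pass an open file object (for example the attached nodo2_syslog.txt) instead of `demo_lines`; any iterable of lines works.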
 
I tried rebuilding all OSDs, and now I have the new partition scheme with a single partition for each OSD. The loop error above no longer appears, but the startup issue is still there.

What I noticed is the following message: when a restarted node comes back up, it is repeated continuously until Ceph finishes synchronizing, which takes a long time. After that everything becomes stable and I can work without problems and with the performance I would expect:
Code:
Oct 05 00:25:44 nodo1 kernel: libceph: mon1 (1)10.0.3.20:6789 socket closed (con state V1_BANNER)
Oct 05 00:25:45 nodo1 ceph-osd[2389178]: 2021-10-05T00:25:45.483+0200 7f4aa915d700 -1 osd.3 47109 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.315920719.0:50902276 1.4c 1:33115c5b:::rbd_data.8ec0912ae8944a.0000000000000a39:head [set-alloc-hint object_size 4194304 write_size 4194304,write 282624~4096 in=4096b] snapc 1ff6=[] ondisk+write+known_if_redirected e47109)
Oct 05 00:25:46 nodo1 ceph-osd[2389178]: 2021-10-05T00:25:46.439+0200 7f4aa915d700 -1 osd.3 47109 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.315920719.0:50902276 1.4c 1:33115c5b:::rbd_data.8ec0912ae8944a.0000000000000a39:head [set-alloc-hint object_size 4194304 write_size 4194304,write 282624~4096 in=4096b] snapc 1ff6=[] ondisk+write+known_if_redirected e47109)
Oct 05 00:25:46 nodo1 pvestatd[1909]: got timeout
Any suggestions?
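
To see which OSDs report slow ops during the resync and how bad it gets, the `get_health_metrics` lines can be summarized per OSD. This is only a sketch, assuming the journal lines look like the ones quoted above; `max_slow_ops_per_osd` is a hypothetical helper name, not part of any Ceph tooling.

```python
import re

# Matches the fragment "osd.3 47109 get_health_metrics reporting 1 slow ops"
# (OSD id, OSD map epoch, number of slow ops).
SLOW_OPS = re.compile(
    r"osd\.(?P<osd>\d+) \d+ get_health_metrics reporting (?P<count>\d+) slow ops"
)


def max_slow_ops_per_osd(lines):
    """Return {osd_id: highest slow-op count seen for that OSD}."""
    worst = {}
    for line in lines:
        match = SLOW_OPS.search(line)
        if match:
            osd = int(match.group("osd"))
            count = int(match.group("count"))
            worst[osd] = max(worst.get(osd, 0), count)
    return worst
```

Feeding it the lines of a saved journal or syslog file gives a quick picture of whether the slow ops are concentrated on one OSD or spread across the cluster.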
 
I noticed that the package versions list shows:

Code:
ifupdown: not correctly installed
ifupdown2: 3.1.0-1+pmx3

I switched to ifupdown2 before upgrading to PVE 7 to avoid those MAC address problems. Is this "ifupdown: not correctly installed" status expected? I'm wondering about network issues during startup (but I'm sure that I can ping all networks immediately after the system comes up).
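
To compare this status across all three nodes, the package list can be scanned for entries flagged this way. The sketch below assumes the list is plain "name: version" text as quoted above (for example saved from `pveversion -v`, which is an assumption about where the list came from); `find_broken_packages` is a hypothetical helper.

```python
def find_broken_packages(report):
    """Return package names whose status is 'not correctly installed'."""
    broken = []
    for line in report.splitlines():
        # Split "name: status" on the first colon only.
        name, sep, status = line.partition(":")
        if sep and status.strip() == "not correctly installed":
            broken.append(name.strip())
    return broken


# Demo with the two lines quoted above.
report = "ifupdown: not correctly installed\nifupdown2: 3.1.0-1+pmx3\n"
print(find_broken_packages(report))
```

This only reports what the list says; whether the flagged status matters with ifupdown2 in place is a separate question.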
 
