[SOLVED] Proxmox 6.4 -> 7 upgrade: Broken network and now ceph monitors

AlexLup

And now I am unable to get the ceph mons back up, possibly related to this network issue after the Proxmox 6.4 -> 7 upgrade, or to the fact that all the LVM IDs seem to have changed.
Any help is appreciated, as this upgrade has been a truly horrible experience so far, with no access to my data..

The monitor logs are huge after setting debugging to 20 on both mon and paxos, but basically they never reach quorum..
Tested so far:
* telnet into both the messenger v1 and messenger v2 ports
* Modified ceph.conf to only speak on either messenger v1 or messenger v2
* Reset all but one healthy mon, hoping that the healthy mon would pass its epoch on to the others; the other mons say "synchronizing", then go back into electing
* Removed all monitors but the one healthy one to achieve quorum (leader), which worked, but then the other monitors just do not want to sync

One observation is that the epochs all differ from each other; I hope this is not a split brain I am looking at?
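
For reference, each mon's view can be compared even without quorum via the admin socket (a rough sketch; mon IDs matching the hostnames pve21/pve22/pve23 and the default admin socket are assumed from this setup):

Code:
# Compare state and epochs across all three mons (run from any node)
for host in pve21 pve22 pve23; do
    echo "== $host =="
    ssh $host "ceph daemon mon.$host mon_status" | grep -e '"state"' -e 'epoch'
done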

As the logs are massive I do not even know where to start copying and pasting, so I will paste the most interesting portions below.. My ceph.conf:

[global]
auth client required = none
auth cluster required = none
auth service required = none
#bluestore_block_db_size = 13106127360
#bluestore_block_wal_size = 13106127360
cluster_network = 172.16.1.0/16
debug_asok = 0/0
debug_auth = 0/0
debug_buffer = 0/0
debug_client = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_filer = 0/0
debug_filestore = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_journal = 0/0
debug_journaler = 0/0
debug_lockdep = 0/0
debug_mds = 0/0
debug_mds_balancer = 0/0
debug_mds_locker = 0/0
debug_mds_log = 0/0
debug_mds_log_expire = 0/0
debug_mds_migrator = 0/0
debug_mon = 20
debug_monc = 0/0
debug_ms = 0/0
debug_objclass = 0/0
debug_objectcacher = 0/0
debug_objecter = 0/0
debug_optracker = 0/0
debug_osd = 1/1
debug_paxos = 0/0
debug_perfcounter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_rgw = 0/0
debug_throttle = 0/0
debug_timer = 0/0
debug_tp = 0/0
fsid = e44fbe1c-b1c7-481d-bd25-dc595eae2d13
mon_allow_pool_delete = true
mon_host = 192.168.1.21, 192.168.1.22, 192.168.1.23
mon_max_pg_per_osd = 500
mon_osd_allow_primary_affinity = true
osd_journal_size = 28120
osd_max_backfills = 5
osd_max_pg_per_osd_hard_ratio = 3
osd_pool_default_min_size = 2
osd_pool_default_size = 3
osd_recovery_max_active = 6
osd_recovery_op_priority = 3
osd_scrub_auto_repair = true
osd_scrub_begin_hour = 1
osd_scrub_end_hour = 8
osd_scrub_sleep = 0.1
public_network = 192.168.1.0/24
rbd_cache = true
bluestore_default_buffered_write = true # BlueStore has the ability to perform buffered writes. Buffered writes enable populating the read cache during the write process. This setting, in effect, changes the BlueStore cache into a write-through cache.
# It is advised that spinning media continue to use 64 kB while SSD/NVMe are likely to benefit from setting to 4 kB.
min_alloc_size_ssd=4096
min_alloc_size_hdd=65536
# https://yourcmc.ru/wiki/Ceph_performance
bluefs_preextend_wal_files = true
cephx_require_signatures = true
cephx_cluster_require_signatures = true
cephx_sign_messages = true
objecter_inflight_ops = 5120 # 24576 seems to be gold
objecter_inflight_op_bytes = 524288000 # (512 * 1024 000) on 512 PGs

[client]
client_reconnect_stale = true
keyring = /etc/pve/priv/$cluster.$name.keyring

[mds]
keyring = /var/lib/ceph/mds/ceph-$id/keyring
mds_data = /var/lib/ceph/mds/ceph-$id

[mon]
mon_compact_on_start = true
mon_compact_on_trim = true

[osd]
filestore_xattr_use_omap = true
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd_crush_update_on_start = true

[mds.pve23]
host = 192.168.1.23

[mds.pve21]
host = 192.168.1.21

[mds.pve22]
host = 192.168.1.22
 
OK, slowly slicing away at my forehead as I keep banging my head against the wall: I actually managed to get a ceph quorum by switching from the network card that is for OSDs and MONs to the backend network interface....



Now checking firewalls, even though I am able to telnet into both the v1 and v2 ports without issues on the 192 address...
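
For reference, the default mon ports are 6789 (msgr v1) and 3300 (msgr v2); the same reachability test as the telnet above, sketched with nc (assuming it is installed):

Code:
# Test the TCP handshake against both messenger ports on the other mons
for ip in 192.168.1.22 192.168.1.23; do
    for port in 3300 6789; do
        nc -zvw2 $ip $port
    done
done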

ceph-mon -i `hostname` --extract-monmap /tmp/monmap
# Get rid of the 192. mons
monmaptool /tmp/monmap --rm pve21 --rm pve22 --rm pve23

# Add 172 instead
monmaptool --add pve21 172.16.1.21 --add pve22 172.16.1.22 --add pve23 172.16.1.23 /tmp/monmap

# Check that the data is ok before injecting
monmaptool --print /tmp/monmap

ceph-mon -i `hostname` --inject-monmap /tmp/monmap

Start monitor and get quorum!

# Check mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.`hostname`.asok mon_status
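
For completeness, the mon has to be down while the monmap is extracted and injected; on Proxmox the systemd unit should be ceph-mon@<hostname> (a rough sketch):

Code:
systemctl stop ceph-mon@$(hostname)
# ...extract / edit / inject the monmap as above...
systemctl start ceph-mon@$(hostname)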
 
Thinking this might be a driver issue or some other corrupted packages, as I have the same behaviour on all three nodes even though I replaced the switch as well..
 
Here is an example that illustrates the problem:

root@pve21:~# scp 192.168.1.22:/etc/ceph/ceph.conf ceph.conf
ceph.conf 0% 0 0.0KB/s --:-- ETA

# Same machine on another IP
root@pve21:~# scp 172.16.1.22:/etc/ceph/ceph.conf ceph.conf
ceph.conf 100% 2989 5.9MB/s 00:00

root@pve21:~# ip a
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
link/ether 88:51:fb:5d:8c:fb brd ff:ff:ff:ff:ff:ff
altname enp0s25
3: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc mq master vmbr1 state UP group default qlen 1000
link/ether 00:02:c9:54:b5:b8 brd ff:ff:ff:ff:ff:ff
altname enp6s0
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc noqueue state UP group default qlen 1000
link/ether 88:51:fb:5d:8c:fb brd ff:ff:ff:ff:ff:ff
inet 192.168.1.21/24 scope global vmbr0
valid_lft forever preferred_lft forever
inet6 fe80::8a51:fbff:fe5d:8cfb/64 scope link
valid_lft forever preferred_lft forever
5: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc noqueue state UP group default qlen 1000
link/ether 00:02:c9:54:b5:b8 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.21/16 scope global vmbr1
valid_lft forever preferred_lft forever
inet6 fe80::202:c9ff:fe54:b5b8/64 scope link
valid_lft forever preferred_lft forever

root@pve21:~# ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1592
ether 88:51:fb:5d:8c:fb txqueuelen 1000 (Ethernet)
RX packets 2720524 bytes 527772109 (503.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2590499 bytes 418092849 (398.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 20 memory 0xee700000-ee720000

ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1592
ether 00:02:c9:54:b5:b8 txqueuelen 1000 (Ethernet)
RX packets 1480576 bytes 1030671152 (982.9 MiB)
RX errors 0 dropped 675 overruns 0 frame 0
TX packets 998166 bytes 384649573 (366.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vmbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1592
inet 192.168.1.21 netmask 255.255.255.0 broadcast 0.0.0.0
inet6 fe80::8a51:fbff:fe5d:8cfb prefixlen 64 scopeid 0x20<link>
ether 88:51:fb:5d:8c:fb txqueuelen 1000 (Ethernet)
RX packets 2720043 bytes 478646145 (456.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2589341 bytes 406718287 (387.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vmbr1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1592
inet 172.16.1.21 netmask 255.255.0.0 broadcast 0.0.0.0
inet6 fe80::202:c9ff:fe54:b5b8 prefixlen 64 scopeid 0x20<link>
ether 00:02:c9:54:b5:b8 txqueuelen 1000 (Ethernet)
RX packets 1029855 bytes 986415804 (940.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 855122 bytes 375206911 (357.8 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

And here is /etc/network/interfaces:

auto lo
iface lo inet loopback

iface enp1s0 inet manual
mtu 1592

iface eno1 inet manual
mtu 1592

iface ens3 inet manual
mtu 1592

iface enx803f5d0943ba inet manual
mtu 1592

auto vmbr0
iface vmbr0 inet static
address 192.168.1.21/24
gateway 192.168.1.1
bridge-ports eno1
bridge-stp off
bridge-fd 0
hwaddress 88:51:fb:5d:8c:fb
mtu 1592

auto vmbr1
iface vmbr1 inet static
address 172.16.1.21/16
bridge-ports ens3
bridge-stp off
bridge-fd 0
mtu 1592
hwaddress 00:02:c9:54:b5:b8
# post-up ifconfig ens3 mtu 9000
# post-up ifconfig vmbr1 mtu 9000
 
Dmesg: https://pastebin.com/5Fee9iWg

Code:
root@pve21:~# lspci -nn | grep 0200
00:19.0 Ethernet controller [0200]: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) [8086:1502] (rev 05)
06:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)

Code:
root@pve21:~# dmesg | grep -e eth0 -e e100
[    0.551031] pci 0000:06:00.0: reg 0x30: [mem 0xee100000-0xee1fffff pref]
[    0.552207] pci 0000:00:1c.5:   bridge window [mem 0xee100000-0xee2fffff]
[    0.585971] pci 0000:00:1c.5:   bridge window [mem 0xee100000-0xee2fffff]
[    0.586033] pci_bus 0000:06: resource 1 [mem 0xee100000-0xee2fffff]
[    2.038313] e1000e: Intel(R) PRO/1000 Network Driver
[    2.038314] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[    2.040092] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    2.131858] e1000e 0000:00:19.0 0000:00:19.0 (uninitialized): registered PHC clock
[    2.219166] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 88:51:fb:5d:8c:fb
[    2.219170] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[    2.219212] e1000e 0000:00:19.0 eth0: MAC: 10, PHY: 11, PBA No: 0100FF-0FF
[    2.220055] e1000e 0000:00:19.0 eno1: renamed from eth0
[    4.857094] mlx4_core 0000:06:00.0 ens3: renamed from eth0
[   39.712367] e1000e 0000:00:19.0: Some CPU C-states have been disabled in order to enable jumbo frames
[   43.138367] e1000e 0000:00:19.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Code:
root@pve21:~# cat /var/log/syslog | grep -e e100 -e eno1 | tail -n20
grep: (standard input): binary file matches
Aug 29 16:52:37 pve21 kernel: [   42.946697] vmbr0: port 1(eno1) entered forwarding state
Aug 30 17:37:56 pve21 kernel: [    0.545036] pci 0000:06:00.0: reg 0x30: [mem 0xee100000-0xee1fffff pref]
Aug 30 17:37:56 pve21 kernel: [    0.546214] pci 0000:00:1c.5:   bridge window [mem 0xee100000-0xee2fffff]
Aug 30 17:37:56 pve21 kernel: [    0.582604] pci 0000:00:1c.5:   bridge window [mem 0xee100000-0xee2fffff]
Aug 30 17:37:56 pve21 kernel: [    0.582665] pci_bus 0000:06: resource 1 [mem 0xee100000-0xee2fffff]
Aug 30 17:37:56 pve21 kernel: [    1.893041] e1000e: Intel(R) PRO/1000 Network Driver
Aug 30 17:37:56 pve21 kernel: [    1.893043] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Aug 30 17:37:56 pve21 kernel: [    1.893423] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
Aug 30 17:37:56 pve21 kernel: [    1.980425] e1000e 0000:00:19.0 0000:00:19.0 (uninitialized): registered PHC clock
Aug 30 17:37:56 pve21 kernel: [    2.071655] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 88:51:fb:5d:8c:fb
Aug 30 17:37:56 pve21 kernel: [    2.071659] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
Aug 30 17:37:56 pve21 kernel: [    2.071700] e1000e 0000:00:19.0 eth0: MAC: 10, PHY: 11, PBA No: 0100FF-0FF
Aug 30 17:37:56 pve21 kernel: [    2.072578] e1000e 0000:00:19.0 eno1: renamed from eth0
Aug 30 17:37:57 pve21 kernel: [   39.250733] vmbr0: port 1(eno1) entered blocking state
Aug 30 17:37:57 pve21 kernel: [   39.250741] vmbr0: port 1(eno1) entered disabled state
Aug 30 17:37:57 pve21 kernel: [   39.250909] device eno1 entered promiscuous mode
Aug 30 17:37:57 pve21 kernel: [   39.383178] e1000e 0000:00:19.0: Some CPU C-states have been disabled in order to enable jumbo frames
Aug 30 17:38:00 pve21 kernel: [   42.798837] e1000e 0000:00:19.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Aug 30 17:38:00 pve21 kernel: [   42.798888] vmbr0: port 1(eno1) entered blocking state
Aug 30 17:38:00 pve21 kernel: [   42.798891] vmbr0: port 1(eno1) entered forwarding state

Code:
root@pve21:~# modinfo e1000e | grep version
srcversion:     3CA93DC574FD3AB6D992ED0
vermagic:       5.11.22-4-pve SMP mod_unload modversions
 
Code:
ip addr | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc pfifo_fast master vmbr0 state UP group default qlen 1000
3: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc mq master vmbr1 state UP group default qlen 1000
4: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc noqueue state UP group default qlen 1000
5: vmbr1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1592 qdisc noqueue state UP group default qlen 1000
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
 
scp not working on the built-in NIC, but OK on the Mellanox:


root@pve21:~# scp 192.168.1.22:/etc/ceph/ceph.conf ceph.conf
ceph.conf 0% 0 0.0KB/s --:-- ETA

Do you have the same problem with ssh/https? It could be an MTU problem, as ssh/https set the "don't fragment" bit in the IP header, so with a wrong MTU the packets will be dropped.


Can you try "ping -Mdo -s 1560 192.168.1.22"? (-M do forbids fragmentation, so the ping only succeeds if the full 1588-byte packet fits through the path.)

 
Code:
root@pve22:~# pve-firewall status
Status: disabled/running
root@pve22:~# pve-firewall stop
root@pve22:~# pve-firewall status
Status: disabled/stopped
 
Do you have the same problem with ssh/https? It could be an MTU problem, as ssh/https set the "don't fragment" bit in the IP header, so with a wrong MTU the packets will be dropped.


Can you try "ping -Mdo -s 1560 192.168.1.22"?
Code:
root@pve21:~# ping -Mdo -s 1560 192.168.1.22
PING 192.168.1.22 (192.168.1.22) 1560(1588) bytes of data.
^C
--- 192.168.1.22 ping statistics ---
10 packets transmitted, 0 received, 100% packet loss, time 9215ms

Thank you @spirit - it seems this doesn't work on any of my eno1s, but works just fine on the Mellanoxes:
Code:
ping -Mdo -s 1560 172.16.1.22
PING 172.16.1.22 (172.16.1.22) 1560(1588) bytes of data.
1568 bytes from 172.16.1.22: icmp_seq=1 ttl=64 time=0.194 ms
1568 bytes from 172.16.1.22: icmp_seq=2 ttl=64 time=0.168 ms
1568 bytes from 172.16.1.22: icmp_seq=3 ttl=64 time=0.208 ms
^C
--- 172.16.1.22 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2029ms
rtt min/avg/max/mdev = 0.168/0.190/0.208/0.016 ms
 
Found something interesting in regards to the MTU, thank you spirit!


Code:
root@pve22:~# cat /var/log/syslog | grep  PMTUD
Aug 29 17:09:28 pve22 corosync[2944]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1461
Aug 29 17:16:17 pve22 corosync[2944]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1461
Aug 30 18:10:32 pve22 corosync[2941]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1461
Aug 30 18:18:16 pve22 corosync[2941]:   [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 469 to 1461
Aug 30 18:58:12 pve22 corosync[2973]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 1461

Code:
root@pve21:~# ping -Mdo -s 1472 192.168.1.22
PING 192.168.1.22 (192.168.1.22) 1472(1500) bytes of data.
1480 bytes from 192.168.1.22: icmp_seq=1 ttl=64 time=0.306 ms
^C
--- 192.168.1.22 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.306/0.306/0.306/0.000 ms
root@pve21:~# ping -Mdo -s 1473 192.168.1.22
PING 192.168.1.22 (192.168.1.22) 1473(1501) bytes of data.
^C
--- 192.168.1.22 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1027ms
 
So the largest ICMP payload that gets through eno1 is 1472 bytes, i.e. a 1500-byte packet (1472 + 8 bytes ICMP header + 20 bytes IP header), even though the interfaces are set to MTU 1592. And just like that!

Code:
root@pve22:~# ip link set dev eno1 mtu 1472
root@pve22:~# ip link set dev vmbr0 mtu 1472

root@pve21:~# ip link set dev eno1 mtu 1472
root@pve21:~# ip link set dev vmbr0 mtu 1472
root@pve21:~# scp 192.168.1.22:/etc/ceph/ceph.conf ceph.conf
ceph.conf   100% 2989     4.0MB/s   00:00
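
To make the workaround survive a reboot, the MTU can also go into /etc/network/interfaces (a sketch mirroring the pve21 config above), then apply it with ifreload -a (ifupdown2) or a reboot:

Code:
iface eno1 inet manual
        mtu 1500

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.21/24
        gateway 192.168.1.1
        bridge-ports eno1
        bridge-stp off
        bridge-fd 0
        hwaddress 88:51:fb:5d:8c:fb
        # 1472 (as used above) also works; 1500 matches the 1472-byte payload + 28 bytes of headers
        mtu 1500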
 
I still find it weird that the Mellanoxes can do an MTU of 1592 and up just fine though. I have run MTU 9000 on all my NICs since I first started with Proxmox, and now with Proxmox 7 the ifupdown2 package gets corrupted on 2 different nodes, and all nodes exhibit MTU issues with the e1000e driver and NIC!
 
I still find it weird that the Mellanoxes can do an MTU of 1592 and up just fine though. I have run MTU 9000 on all my NICs since I first started with Proxmox, and now with Proxmox 7 the ifupdown2 package gets corrupted on 2 different nodes, and all nodes exhibit MTU issues with the e1000e driver and NIC!
Are you sure that your physical switch ports are set up to allow an MTU larger than 1500?


BTW, why did you set up this special MTU of 1592? You can do MTU 9000 with the Mellanox NICs without any problem if you want jumbo frames for ceph.
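
For what it's worth, the same ping test verifies a jumbo-frame path end to end: 8972 bytes of ICMP payload + 28 bytes of headers = a 9000-byte packet.

Code:
ping -Mdo -s 8972 172.16.1.22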
 
Yes, the TP-Link switch was actually set up for that MTU. I do not remember where I got MTU 1592 from; it came up as I was troubleshooting.
 