[SOLVED] Ceph Node down after upgrade to 6.2

normic

Hello,
I just started upgrading our 3-node cluster. I followed the upgrade-from-5-to-6 docs and upgraded to Corosync 3 on all three nodes first. After that the pve5to6 script was happy, so I started with the first node.

I moved the running VMs to nodes 2 and 3 and upgraded node 1 to Debian 10 Buster with Proxmox 6.2.
Everything went fine, but after the required reboot the Ceph storage is not accessible any more.
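For reference, the per-node upgrade roughly followed the steps from the official wiki. A sketch, assuming the stock repository files; verify your own sources before running this:

```shell
# Switch APT sources from stretch to buster
# (check /etc/apt/sources.list.d/ for additional repo files on your system)
sed -i 's/stretch/buster/g' /etc/apt/sources.list
sed -i 's/stretch/buster/g' /etc/apt/sources.list.d/pve-enterprise.list

# Pull in the new package lists and upgrade the distribution
apt update
apt dist-upgrade

# Reboot into the new kernel
systemctl reboot
```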

The logs simply say: pvestatd got timeout

As I understand the docs, the needed upgrade from Ceph Luminous to Nautilus has to be done after _all_ nodes are on Proxmox 6.

Did I miss a step? Hit a bug?

Any help is welcome, as I'm stuck in the middle of the cluster upgrade.

Thanks in advance
normic
 
When I look at:
Bash:
pveversion -v

All nodes are on the same Ceph version; one is on Buster, the other two are still on Stretch:
Updated node:
ceph: 12.2.13-pve1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1

Other nodes:
ceph: 12.2.13-pve1~bpo9
corosync: 3.0.4-pve1~bpo9

There is no ceph-fuse on the older nodes.

To answer your question, they are all on Luminous.
 
I can't run it from the problem node, as Ceph is not running.

The command starts but just hangs forever until I press CTRL-C.


corosync-cmapctl:

Code:
config.totemconfig_reload_in_progress (u8) = 0
internal_configuration.service.0.name (str) = corosync_cmap
internal_configuration.service.0.ver (u32) = 0
internal_configuration.service.1.name (str) = corosync_cfg
internal_configuration.service.1.ver (u32) = 0
internal_configuration.service.2.name (str) = corosync_cpg
internal_configuration.service.2.ver (u32) = 0
internal_configuration.service.3.name (str) = corosync_quorum
internal_configuration.service.3.ver (u32) = 0
internal_configuration.service.4.name (str) = corosync_pload
internal_configuration.service.4.ver (u32) = 0
internal_configuration.service.5.name (str) = corosync_votequorum
internal_configuration.service.5.ver (u32) = 0
internal_configuration.service.6.name (str) = corosync_mon
internal_configuration.service.6.ver (u32) = 0
internal_configuration.service.7.name (str) = corosync_wd
internal_configuration.service.7.ver (u32) = 0
logging.debug (str) = off
logging.to_syslog (str) = yes
nodelist.local_node_pos (u32) = 0
nodelist.node.0.name (str) = px-node-01
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.quorum_votes (u32) = 1
nodelist.node.0.ring0_addr (str) = 172.16.1.200
nodelist.node.1.name (str) = px-node-02
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.quorum_votes (u32) = 1
nodelist.node.1.ring0_addr (str) = 172.16.1.201
nodelist.node.2.name (str) = px-node-03
nodelist.node.2.nodeid (u32) = 3
nodelist.node.2.quorum_votes (u32) = 1
nodelist.node.2.ring0_addr (str) = 172.16.1.202
quorum.provider (str) = corosync_votequorum
resources.system.load_15min.current (dbl) = 0.000000
resources.system.load_15min.last_updated (u64) = 0
resources.system.load_15min.poll_period (u64) = 3000
resources.system.load_15min.state (str) = stopped
resources.system.memory_used.current (i32) = 0
resources.system.memory_used.last_updated (u64) = 0
resources.system.memory_used.poll_period (u64) = 3000
resources.system.memory_used.state (str) = stopped
resources.watchdog_timeout (u32) = 6
runtime.blackbox.dump_flight_data (str) = no
runtime.blackbox.dump_state (str) = no
runtime.config.totem.block_unlisted_ips (u32) = 1
runtime.config.totem.consensus (u32) = 1980
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 303
runtime.config.totem.interface.0.knet_ping_interval (u32) = 412
runtime.config.totem.interface.0.knet_ping_timeout (u32) = 825
runtime.config.totem.join (u32) = 50
runtime.config.totem.knet_compression_level (i32) = 0
runtime.config.totem.knet_compression_model (str) = none
runtime.config.totem.knet_compression_threshold (u32) = 0
runtime.config.totem.knet_pmtud_interval (u32) = 30
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 1650
runtime.config.totem.token_retransmit (u32) = 392
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.token_warning (u32) = 75
runtime.config.totem.window_size (u32) = 50
runtime.force_gather (str) = no
runtime.members.1.config_version (u64) = 3
runtime.members.1.ip (str) = r(0) ip(172.16.1.200)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 3
runtime.members.2.ip (str) = r(0) ip(172.16.1.201)
runtime.members.2.join_count (u32) = 1
runtime.members.2.status (str) = joined
runtime.members.3.config_version (u64) = 3
runtime.members.3.ip (str) = r(0) ip(172.16.1.202)
runtime.members.3.join_count (u32) = 1
runtime.members.3.status (str) = joined
runtime.services.cfg.0.rx (u64) = 0
runtime.services.cfg.0.tx (u64) = 0
runtime.services.cfg.1.rx (u64) = 0
runtime.services.cfg.1.tx (u64) = 0
runtime.services.cfg.2.rx (u64) = 0
runtime.services.cfg.2.tx (u64) = 0
runtime.services.cfg.3.rx (u64) = 0
runtime.services.cfg.3.tx (u64) = 0
runtime.services.cfg.service_id (u16) = 1
runtime.services.cmap.0.rx (u64) = 37
runtime.services.cmap.0.tx (u64) = 13
runtime.services.cmap.service_id (u16) = 0
runtime.services.cpg.0.rx (u64) = 2
runtime.services.cpg.0.tx (u64) = 2
runtime.services.cpg.1.rx (u64) = 0
runtime.services.cpg.1.tx (u64) = 0
runtime.services.cpg.2.rx (u64) = 35
runtime.services.cpg.2.tx (u64) = 11
runtime.services.cpg.3.rx (u64) = 451318
runtime.services.cpg.3.tx (u64) = 171755
runtime.services.cpg.4.rx (u64) = 0
runtime.services.cpg.4.tx (u64) = 0
runtime.services.cpg.5.rx (u64) = 37
runtime.services.cpg.5.tx (u64) = 13
runtime.services.cpg.6.rx (u64) = 0
runtime.services.cpg.6.tx (u64) = 0
runtime.services.cpg.service_id (u16) = 2
runtime.services.mon.service_id (u16) = 6
runtime.services.pload.0.rx (u64) = 0
runtime.services.pload.0.tx (u64) = 0
runtime.services.pload.1.rx (u64) = 0
runtime.services.pload.1.tx (u64) = 0
runtime.services.pload.service_id (u16) = 4
runtime.services.quorum.service_id (u16) = 3
runtime.services.votequorum.0.rx (u64) = 75
runtime.services.votequorum.0.tx (u64) = 26
runtime.services.votequorum.1.rx (u64) = 0
runtime.services.votequorum.1.tx (u64) = 0
runtime.services.votequorum.2.rx (u64) = 0
runtime.services.votequorum.2.tx (u64) = 0
runtime.services.votequorum.3.rx (u64) = 0
runtime.services.votequorum.3.tx (u64) = 0
runtime.services.votequorum.service_id (u16) = 5
runtime.services.wd.service_id (u16) = 7
runtime.votequorum.ev_barrier (u32) = 3
runtime.votequorum.highest_node_id (u32) = 3
runtime.votequorum.lowest_node_id (u32) = 1
runtime.votequorum.this_node_id (u32) = 1
runtime.votequorum.two_node (u8) = 0
totem.cluster_name (str) = proxmox-cluster
totem.config_version (u64) = 3
totem.interface.0.bindnetaddr (str) = 172.16.1.200
totem.interface.ringnumber (str) = 0
totem.ip_version (str) = ipv4
totem.secauth (str) = on
totem.version (u32) = 2
 
Cluster information
-------------------
Name: proxmox-cluster
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Sat Jul 25 15:08:35 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.54
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.16.1.200 (local)
0x00000002 1 172.16.1.201
0x00000003 1 172.16.1.202
 
It looks good, I don't see any issue in the cluster. Just verify that the Ceph nodes are able to communicate and check the logs.
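Something like this would do as a quick check. A sketch only: the 172.16.1.x addresses and the px-node-01 hostname are taken from the corosync output above, and your Ceph public/cluster network may use different addresses (see /etc/pve/ceph.conf):

```shell
# Ping the other nodes on the cluster network
# (addresses from the corosync-cmapctl output; adjust for your Ceph network)
for ip in 172.16.1.201 172.16.1.202; do
    ping -c 3 "$ip"
done

# Check the monitor log on the affected node (unit name is an example)
journalctl -u ceph-mon@px-node-01 --since "-1 hour"
tail -n 50 /var/log/ceph/ceph.log
```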
 
Yes, I just wanted to be sure what the reason was, so I could mark this as solved. But I got it now.

In case it helps someone else: I hit the "predictable network interface names" issue.

After changing /etc/network/interfaces to the new interface names and doing an additional reboot, everything works fine as far as I've seen until now.
The Ceph cluster connection came back and after a while everything was clean again.
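For anyone hitting the same thing, this is roughly how to spot it. A sketch; eno1/vmbr0 are example names, not necessarily what your hardware gets:

```shell
# List the interface names the new kernel/udev actually assigned
ip -br link

# Compare against what the bridge config still references
grep -E 'iface|bridge-ports' /etc/network/interfaces

# If the names changed (e.g. eth0 -> eno1), update /etc/network/interfaces
# accordingly, e.g. "bridge-ports eth0" becomes "bridge-ports eno1",
# then reboot the node.
```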

Code:
systemd-udevd[12686]: Using default interface naming scheme 'v240'.
systemd-udevd[12686]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
systemd-udevd[12686]: Could not generate persistent MAC address for tap121i0: No such file or directory
These log entries look a bit strange now, but I'll investigate them further in a separate thread.

So far the main issue with this node is solved, on to the next ;)
 
