[SOLVED] Ceph Node down after upgrade to 6.2

normic

Active Member
Hello,
I just started upgrading our 3-node cluster. I followed the "Upgrade from 5 to 6" docs and upgraded to Corosync 3 on all three nodes first. After that the pve5to6 script was happy, so I started with the first node.

I moved the running VMs to nodes 2 and 3 and upgraded node 1 to Debian 10 (Buster) with Proxmox VE 6.2.
Everything went fine, but after the required reboot the Ceph storage is not accessible any more.
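For reference, the per-node upgrade itself was roughly the standard procedure from the wiki (the Corosync 3 packages on the still-Stretch nodes came from the corosync-3 repo mentioned in the guide; the repo line below is the no-subscription one, adjust it and any files under /etc/apt/sources.list.d/ to your setup):

Bash:
# switch the Debian repos from stretch to buster
sed -i 's/stretch/buster/g' /etc/apt/sources.list
# point the Proxmox repo at buster (no-subscription variant)
echo "deb http://download.proxmox.com/debian/pve buster pve-no-subscription" > /etc/apt/sources.list.d/pve.list
# pull in the new release
apt update
apt dist-upgrade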

The logs simply say: pvestatd: got timeout
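For anyone hitting the same symptom, these are the places I have been looking so far (standard unit names, nothing custom):

Bash:
# pvestatd itself, since this boot
journalctl -b -u pvestatd
# the local Ceph daemons on the upgraded node
journalctl -b -u 'ceph-mon@*' -u 'ceph-osd@*'
systemctl status ceph-mon.target ceph-osd.target
# plain Ceph log files
ls -l /var/log/ceph/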

As I understand the docs, the required upgrade from Ceph Luminous to Nautilus should only be done after _all_ nodes are on Proxmox VE 6.

Did I miss a step? Hit a bug?

Any help is welcome, as I'm stuck in the middle of the cluster upgrade.

Thanks in advance
normic
 
When I look at:
Bash:
pveversion -v

All nodes are on the same Ceph version; one is on Buster, the other two are still on Stretch:
Updated node:
ceph: 12.2.13-pve1
ceph-fuse: 12.2.13-pve1
corosync: 3.0.4-pve1

Other nodes:
ceph: 12.2.13-pve1~bpo9
corosync: 3.0.4-pve1~bpo9

There is no ceph-fuse on the older nodes.

To answer your question, they are all on Luminous.
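For reference, I checked that from one of the still-working nodes; `ceph versions` (available since Luminous) groups the running daemons by version, and dpkg shows the installed packages per node:

Bash:
# run on a node where the cluster is still reachable
ceph versions
# package view on each node
dpkg -l | grep ceph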
 
I can't run it from the problem node, as Ceph is not running there.

I can start the command, but it runs forever until I hit CTRL-C.
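If someone wants to poke at it without the shell hanging, wrapping the client call in coreutils `timeout` is enough (using `ceph -s` here just as an example):

Bash:
# give up after 10 seconds instead of blocking forever
timeout 10 ceph -s || echo "no answer from the cluster"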


corosync-cmapctl:

Code:
config.totemconfig_reload_in_progress (u8) = 0
internal_configuration.service.0.name (str) = corosync_cmap
internal_configuration.service.0.ver (u32) = 0
internal_configuration.service.1.name (str) = corosync_cfg
internal_configuration.service.1.ver (u32) = 0
internal_configuration.service.2.name (str) = corosync_cpg
internal_configuration.service.2.ver (u32) = 0
internal_configuration.service.3.name (str) = corosync_quorum
internal_configuration.service.3.ver (u32) = 0
internal_configuration.service.4.name (str) = corosync_pload
internal_configuration.service.4.ver (u32) = 0
internal_configuration.service.5.name (str) = corosync_votequorum
internal_configuration.service.5.ver (u32) = 0
internal_configuration.service.6.name (str) = corosync_mon
internal_configuration.service.6.ver (u32) = 0
internal_configuration.service.7.name (str) = corosync_wd
internal_configuration.service.7.ver (u32) = 0
logging.debug (str) = off
logging.to_syslog (str) = yes
nodelist.local_node_pos (u32) = 0
nodelist.node.0.name (str) = px-node-01
nodelist.node.0.nodeid (u32) = 1
nodelist.node.0.quorum_votes (u32) = 1
nodelist.node.0.ring0_addr (str) = 172.16.1.200
nodelist.node.1.name (str) = px-node-02
nodelist.node.1.nodeid (u32) = 2
nodelist.node.1.quorum_votes (u32) = 1
nodelist.node.1.ring0_addr (str) = 172.16.1.201
nodelist.node.2.name (str) = px-node-03
nodelist.node.2.nodeid (u32) = 3
nodelist.node.2.quorum_votes (u32) = 1
nodelist.node.2.ring0_addr (str) = 172.16.1.202
quorum.provider (str) = corosync_votequorum
resources.system.load_15min.current (dbl) = 0.000000
resources.system.load_15min.last_updated (u64) = 0
resources.system.load_15min.poll_period (u64) = 3000
resources.system.load_15min.state (str) = stopped
resources.system.memory_used.current (i32) = 0
resources.system.memory_used.last_updated (u64) = 0
resources.system.memory_used.poll_period (u64) = 3000
resources.system.memory_used.state (str) = stopped
resources.watchdog_timeout (u32) = 6
runtime.blackbox.dump_flight_data (str) = no
runtime.blackbox.dump_state (str) = no
runtime.config.totem.block_unlisted_ips (u32) = 1
runtime.config.totem.consensus (u32) = 1980
runtime.config.totem.downcheck (u32) = 1000
runtime.config.totem.fail_recv_const (u32) = 2500
runtime.config.totem.heartbeat_failures_allowed (u32) = 0
runtime.config.totem.hold (u32) = 303
runtime.config.totem.interface.0.knet_ping_interval (u32) = 412
runtime.config.totem.interface.0.knet_ping_timeout (u32) = 825
runtime.config.totem.join (u32) = 50
runtime.config.totem.knet_compression_level (i32) = 0
runtime.config.totem.knet_compression_model (str) = none
runtime.config.totem.knet_compression_threshold (u32) = 0
runtime.config.totem.knet_pmtud_interval (u32) = 30
runtime.config.totem.max_messages (u32) = 17
runtime.config.totem.max_network_delay (u32) = 50
runtime.config.totem.merge (u32) = 200
runtime.config.totem.miss_count_const (u32) = 5
runtime.config.totem.send_join (u32) = 0
runtime.config.totem.seqno_unchanged_const (u32) = 30
runtime.config.totem.token (u32) = 1650
runtime.config.totem.token_retransmit (u32) = 392
runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
runtime.config.totem.token_warning (u32) = 75
runtime.config.totem.window_size (u32) = 50
runtime.force_gather (str) = no
runtime.members.1.config_version (u64) = 3
runtime.members.1.ip (str) = r(0) ip(172.16.1.200)
runtime.members.1.join_count (u32) = 1
runtime.members.1.status (str) = joined
runtime.members.2.config_version (u64) = 3
runtime.members.2.ip (str) = r(0) ip(172.16.1.201)
runtime.members.2.join_count (u32) = 1
runtime.members.2.status (str) = joined
runtime.members.3.config_version (u64) = 3
runtime.members.3.ip (str) = r(0) ip(172.16.1.202)
runtime.members.3.join_count (u32) = 1
runtime.members.3.status (str) = joined
runtime.services.cfg.0.rx (u64) = 0
runtime.services.cfg.0.tx (u64) = 0
runtime.services.cfg.1.rx (u64) = 0
runtime.services.cfg.1.tx (u64) = 0
runtime.services.cfg.2.rx (u64) = 0
runtime.services.cfg.2.tx (u64) = 0
runtime.services.cfg.3.rx (u64) = 0
runtime.services.cfg.3.tx (u64) = 0
runtime.services.cfg.service_id (u16) = 1
runtime.services.cmap.0.rx (u64) = 37
runtime.services.cmap.0.tx (u64) = 13
runtime.services.cmap.service_id (u16) = 0
runtime.services.cpg.0.rx (u64) = 2
runtime.services.cpg.0.tx (u64) = 2
runtime.services.cpg.1.rx (u64) = 0
runtime.services.cpg.1.tx (u64) = 0
runtime.services.cpg.2.rx (u64) = 35
runtime.services.cpg.2.tx (u64) = 11
runtime.services.cpg.3.rx (u64) = 451318
runtime.services.cpg.3.tx (u64) = 171755
runtime.services.cpg.4.rx (u64) = 0
runtime.services.cpg.4.tx (u64) = 0
runtime.services.cpg.5.rx (u64) = 37
runtime.services.cpg.5.tx (u64) = 13
runtime.services.cpg.6.rx (u64) = 0
runtime.services.cpg.6.tx (u64) = 0
runtime.services.cpg.service_id (u16) = 2
runtime.services.mon.service_id (u16) = 6
runtime.services.pload.0.rx (u64) = 0
runtime.services.pload.0.tx (u64) = 0
runtime.services.pload.1.rx (u64) = 0
runtime.services.pload.1.tx (u64) = 0
runtime.services.pload.service_id (u16) = 4
runtime.services.quorum.service_id (u16) = 3
runtime.services.votequorum.0.rx (u64) = 75
runtime.services.votequorum.0.tx (u64) = 26
runtime.services.votequorum.1.rx (u64) = 0
runtime.services.votequorum.1.tx (u64) = 0
runtime.services.votequorum.2.rx (u64) = 0
runtime.services.votequorum.2.tx (u64) = 0
runtime.services.votequorum.3.rx (u64) = 0
runtime.services.votequorum.3.tx (u64) = 0
runtime.services.votequorum.service_id (u16) = 5
runtime.services.wd.service_id (u16) = 7
runtime.votequorum.ev_barrier (u32) = 3
runtime.votequorum.highest_node_id (u32) = 3
runtime.votequorum.lowest_node_id (u32) = 1
runtime.votequorum.this_node_id (u32) = 1
runtime.votequorum.two_node (u8) = 0
totem.cluster_name (str) = proxmox-cluster
totem.config_version (u64) = 3
totem.interface.0.bindnetaddr (str) = 172.16.1.200
totem.interface.ringnumber (str) = 0
totem.ip_version (str) = ipv4
totem.secauth (str) = on
totem.version (u32) = 2
 
Cluster information
-------------------
Name: proxmox-cluster
Config Version: 3
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Sat Jul 25 15:08:35 2020
Quorum provider: corosync_votequorum
Nodes: 3
Node ID: 0x00000001
Ring ID: 1.54
Quorate: Yes

Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000001 1 172.16.1.200 (local)
0x00000002 1 172.16.1.201
0x00000003 1 172.16.1.202
 
It looks good, I don't see any issue with the cluster. Just verify that the Ceph nodes are able to communicate and check the logs.
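A quick way to verify, assuming the Ceph public network is the same 172.16.1.0/24 shown above (adjust the addresses to your setup):

Bash:
# from the upgraded node, check basic reachability of the other nodes
ping -c 3 172.16.1.201
ping -c 3 172.16.1.202
# check that the monitors are listening / reachable on the mon port (6789 for Luminous)
ss -tlnp | grep 6789
nc -vz 172.16.1.201 6789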
 
Yes, I just wanted to be sure what the reason was before marking this as solved. But I got it now.

In case it helps someone else: I hit the "predictable network interface names" issue.

After changing /etc/network/interfaces to the newly detected interface names and doing an additional reboot, everything works fine as far as I can tell so far.
The Ceph cluster connection came back, and after a while everything was clean again.
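In case it helps: the names the new kernel/udev assigned can be listed with `ip -br link`, and /etc/network/interfaces has to match them. The interface names below are only placeholders, yours will differ:

Bash:
# list the names actually assigned after the upgrade
ip -br link
# example /etc/network/interfaces change (placeholder names):
#   iface eno1 inet manual   ->  iface enp2s0f0 inet manual
#   bridge-ports eno1        ->  bridge-ports enp2s0f0
# then reload networking or reboot
systemctl restart networking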

Code:
systemd-udevd[12686]: Using default interface naming scheme 'v240'.
systemd-udevd[12686]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
systemd-udevd[12686]: Could not generate persistent MAC address for tap121i0: No such file or directory
These log entries look a bit strange now, but I'll investigate them further in a different thread.

So far the main issue with this node is solved, on to the next one ;)