[SOLVED] Upgraded via GUI to latest kernel, 3 out of 4 nodes on Dell C6100 won't load the GUI, but do reach the VE CLI

nix235

My homelab environment with a Dell PowerEdge C6100 has been a lot of fun, but last week I went through the GUI update/upgrade process and rebooted the four nodes in the box. Three of the nodes boot, but the newer kernel produces some errors once Proxmox VE reaches the CLI, and the web URL it prints is not reachable.

Kernel 6.5.11-7 seems to be the culprit, so some googling suggested I reboot and choose a different kernel from the VE boot menu. The one node that DOES still run fine is on 6.2.16-3, so I copied down that node's CPU/BIOS settings and kernel version and attempted to reconfigure the other nodes. No dice...
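An alternative to picking the older kernel in the boot menu on every reboot would be pinning it, roughly like this (only a sketch; it assumes proxmox-boot-tool manages the boot entries on these nodes, and the version string is just an example):

Code:
# list the kernels that are installed and available to boot
proxmox-boot-tool kernel list

# pin the known-good kernel so it is booted by default
proxmox-boot-tool kernel pin 6.2.16-3-pve

# remove the pin again once the newer kernel works
proxmox-boot-tool kernel unpin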

There are some FAILURE messages just before Proxmox VE reaches the CLI (which I can access, but there is no web access).

I haven't been able to find a solution for this, and would love to get feedback on how I should report the error here and what additional information I should provide to help resolve the issue. Thanks!
 
Hi,
the error messages you are getting would be a good start to find out what went wrong.
Since you have SSH access, you should also take a look at dmesg, journalctl -eu pvedaemon, journalctl -eu pve-cluster and journalctl -eu pveproxy.
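For example, from the shell of one of the affected nodes (these are just the commands mentioned above; the grep filter is only there to narrow down the noise):

Code:
# kernel messages, filtered for obvious problems
dmesg | grep -iE 'error|fail'

# recent log entries of the relevant PVE services
journalctl -eu pvedaemon
journalctl -eu pve-cluster
journalctl -eu pveproxy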
 
Awesome, thanks for the starting debug info. Below are my results:

pvecm:
Code:
Cluster information
-------------------
Name:             pve-cluster
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 19 07:17:40 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.1f5
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 192.168.1.240 (local)

pveproxy:
Code:
Dec 18 07:34:01 pve3 systemd[1]: Starting pveproxy.service - PVE API Proxy Server...
Dec 18 07:34:32 pve3 pvecm[1400]: got timeout when trying to ensure cluster certificates and base file hierarchy is set up - no quorum (yet) or hung pmxcfs?
Dec 18 07:34:35 pve3 pveproxy[1496]: starting server
Dec 18 07:34:35 pve3 pveproxy[1496]: starting 3 worker(s)
Dec 18 07:34:35 pve3 pveproxy[1496]: worker 1497 started
Dec 18 07:34:35 pve3 pveproxy[1496]: worker 1498 started
Dec 18 07:34:35 pve3 pveproxy[1496]: worker 1499 started
Dec 18 07:34:35 pve3 systemd[1]: Started pveproxy.service - PVE API Proxy Server.

pve-cluster:
Code:
Dec 18 07:33:55 pve3 systemd[1]: Starting pve-cluster.service - The Proxmox VE cluster filesystem...
Dec 18 07:33:55 pve3 pmxcfs[1243]: [main] notice: resolved node name 'pve3' to '192.168.1.240' for default node IP address
Dec 18 07:33:55 pve3 pmxcfs[1243]: [main] notice: resolved node name 'pve3' to '192.168.1.240' for default node IP address
Dec 18 07:33:55 pve3 pmxcfs[1247]: [quorum] crit: quorum_initialize failed: 2
Dec 18 07:33:55 pve3 pmxcfs[1247]: [quorum] crit: can't initialize service
Dec 18 07:33:55 pve3 pmxcfs[1247]: [confdb] crit: cmap_initialize failed: 2
Dec 18 07:33:55 pve3 pmxcfs[1247]: [confdb] crit: can't initialize service
Dec 18 07:33:55 pve3 pmxcfs[1247]: [dcdb] crit: cpg_initialize failed: 2
Dec 18 07:33:55 pve3 pmxcfs[1247]: [dcdb] crit: can't initialize service
Dec 18 07:33:55 pve3 pmxcfs[1247]: [status] crit: cpg_initialize failed: 2
Dec 18 07:33:55 pve3 pmxcfs[1247]: [status] crit: can't initialize service
Dec 18 07:33:56 pve3 systemd[1]: Started pve-cluster.service - The Proxmox VE cluster filesystem.
Dec 18 07:34:01 pve3 pmxcfs[1247]: [status] notice: update cluster info (cluster name  pve-cluster, version = 9)
Dec 18 07:34:01 pve3 pmxcfs[1247]: [dcdb] notice: members: 2/1247
Dec 18 07:34:01 pve3 pmxcfs[1247]: [dcdb] notice: all data is up to date
Dec 18 07:34:01 pve3 pmxcfs[1247]: [status] notice: members: 2/1247
Dec 18 07:34:01 pve3 pmxcfs[1247]: [status] notice: all data is up to date
Dec 18 08:33:55 pve3 pmxcfs[1247]: [dcdb] notice: data verification successful

pvedaemon:
Code:
Dec 18 07:33:57 pve3 systemd[1]: Starting pvedaemon.service - PVE API Daemon...
Dec 18 07:34:01 pve3 pvedaemon[1395]: starting server
Dec 18 07:34:01 pve3 pvedaemon[1395]: starting 3 worker(s)
Dec 18 07:34:01 pve3 pvedaemon[1395]: worker 1396 started
Dec 18 07:34:01 pve3 pvedaemon[1395]: worker 1397 started
Dec 18 07:34:01 pve3 pvedaemon[1395]: worker 1398 started
Dec 18 07:34:01 pve3 systemd[1]: Started pvedaemon.service - PVE API Daemon.

dmesg:
Attached
 

Hm, the logs seem to be alright; there are some ACPI warnings, but those are unfortunately quite normal.
Could you check pvecm status on the affected node to see if the cluster filesystem is operational?
 
Hm, the logs seem to be alright; there are some ACPI warnings, but those are unfortunately quite normal.
Could you check pvecm status on the affected node to see if the cluster filesystem is operational?
I've updated the logs with the pvecm status results:

Code:
Cluster information
-------------------
Name:             pve-cluster
Config Version:   9
Transport:        knet
Secure auth:      on

Quorum information
------------------
Date:             Tue Dec 19 07:17:40 2023
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          0x00000002
Ring ID:          2.1f5
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
0x00000002          1 192.168.1.240 (local)
 
Ah sorry, you posted that before.
Could you test whether your nodes can reach each other? Your cluster has no quorum, which means not enough nodes can talk to each other to get on the same page.
If the nodes can reach each other, the output of journalctl -efu corosync could be helpful.
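Something along these lines should do, assuming the other nodes use the addresses listed in /etc/pve/corosync.conf (the two IPs below are only placeholders for whatever that file shows):

Code:
# addresses corosync expects for each node
grep ring0_addr /etc/pve/corosync.conf

# basic reachability check against the other nodes (replace with the addresses from the file)
ping -c 3 192.168.1.241
ping -c 3 192.168.1.242

# then follow the corosync log while the nodes try to talk to each other
journalctl -efu corosync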
 
corosync: NOTE: the journalctl -efu corosync command did not terminate properly, and I could only return to the CLI via Ctrl-Z. It also locked up the USB drive, and I was unable to umount it from the CLI.

I should also note that the only errors I see during boot appear JUST before it sends me to the CLI: I get a series of [FAILURE] warnings. I'm not sure what this relates to, but it's consistent on all the downed nodes. I would have just blown these nodes away and used my Proxmox Backup Server to restore them, but it's also one of the downed nodes (I really should NOT have upgraded the kernel...).

Kinda stumped. If I attempt to reinstall Proxmox from USB, does it provide a repair option instead of a full install? Thankfully my important data/software is on a dedicated TrueNAS Scale box and not part of this Proxmox cluster... I really like Proxmox, but this issue has me worried.

Code:
Dec 18 07:33:56 pve3 systemd[1]: Starting corosync.service - Corosync Cluster Engine...
Dec 18 07:33:56 pve3 corosync[1335]:   [MAIN  ] Corosync Cluster Engine  starting up
Dec 18 07:33:56 pve3 corosync[1335]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Dec 18 07:33:56 pve3 corosync[1335]:   [TOTEM ] Initializing transport (Kronosnet).
Dec 18 07:33:57 pve3 corosync[1335]:   [TOTEM ] totemknet initialized
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] pmtud: MTU manually set to: 0
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Dec 18 07:33:57 pve3 corosync[1335]:   [QB    ] server name: cmap
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Dec 18 07:33:57 pve3 corosync[1335]:   [QB    ] server name: cfg
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Dec 18 07:33:57 pve3 corosync[1335]:   [QB    ] server name: cpg
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Dec 18 07:33:57 pve3 corosync[1335]:   [WD    ] Watchdog not enabled by configuration
Dec 18 07:33:57 pve3 corosync[1335]:   [WD    ] resource load_15min missing a recovery key.
Dec 18 07:33:57 pve3 corosync[1335]:   [WD    ] resource memory_used missing a recovery key.
Dec 18 07:33:57 pve3 corosync[1335]:   [WD    ] no resources configured.
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Dec 18 07:33:57 pve3 corosync[1335]:   [QUORUM] Using quorum provider corosync_votequorum
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Dec 18 07:33:57 pve3 corosync[1335]:   [QB    ] server name: votequorum
Dec 18 07:33:57 pve3 corosync[1335]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Dec 18 07:33:57 pve3 corosync[1335]:   [QB    ] server name: quorum
Dec 18 07:33:57 pve3 corosync[1335]:   [TOTEM ] Configuring link 0
Dec 18 07:33:57 pve3 corosync[1335]:   [TOTEM ] Configured link number 0: local addr: 192.168.1.240, port=5405
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 1 has no active links
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 1 has no active links
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 1 has no active links
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] link: Resetting MTU for link 0 because host 2 joined
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 4 has no active links
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 4 has no active links
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 4 (passive) best link: 0 (pri: 1)
Dec 18 07:33:57 pve3 corosync[1335]:   [KNET  ] host: host: 4 has no active links
Dec 18 07:33:57 pve3 corosync[1335]:   [QUORUM] Sync members[1]: 2
Dec 18 07:33:57 pve3 corosync[1335]:   [QUORUM] Sync joined[1]: 2
Dec 18 07:33:57 pve3 corosync[1335]:   [TOTEM ] A new membership (2.1f5) was formed. Members joined: 2
Dec 18 07:33:57 pve3 corosync[1335]:   [QUORUM] Members[1]: 2
Dec 18 07:33:57 pve3 corosync[1335]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 18 07:33:57 pve3 systemd[1]: Started corosync.service - Corosync Cluster Engine.
 
Have you tested whether the nodes are able to reach each other? Especially try the IPs that are listed in /etc/pve/corosync.conf.

corosync: NOTE: the journalctl -efu corosync command did not terminate properly, and I could only return to the CLI via Ctrl-Z. It also locked up the USB drive, and I was unable to umount it from the CLI.
Ctrl-C should have done the job. You can also leave out the f in -efu, as it causes the command to follow and wait for new output from the service.

I should also note that the only errors I see during boot appear JUST before it sends me to the CLI: I get a series of [FAILURE] warnings. I'm not sure what this relates to, but it's consistent on all the downed nodes.
Those are probably systemd units then. If they are still in the failed state, you should be able to list them with systemctl --failed.

Kinda stumped. If I attempt to reinstall Proxmox from USB, does it provide a repair option instead of a full install?
The repair option causes the system to boot with the kernel from the USB drive instead of the one on disk. It is targeted more at rescuing a non-booting system than at a system that boots but is in a faulty state, so unfortunately it won't help us here.
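To tie the two suggestions together, a minimal sequence on one of the affected nodes could look like this (networking.service is only an example of a unit that might show up as failed):

Code:
# list units that are still in the failed state (these match the [FAILED] lines on the console)
systemctl --failed

# inspect a failed unit in more detail, e.g.
systemctl status networking.service
journalctl -eu networking.service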
 
Pinging from the Proxmox CLI to the gateway and to the other IPs in corosync.conf reports the same thing:
it pings the IP (e.g. 192.168.1.1), then the next line reports Destination Host Unreachable icmp_seq=X.
 
Well, without a connection between the nodes, the cluster won't work.
Since the issue correlated with a kernel update, it's possible that the network device got renamed.
On the affected nodes, check whether the correct network device is set in /etc/network/interfaces.
Available devices can be listed with ip a. After correcting the file, restart the networking service.
If that wasn't the problem, I'd continue with basic network debugging.
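A minimal sketch of that check, assuming the usual single-bridge PVE setup (interface names such as eno1 are only illustrative, yours may differ):

Code:
# devices the kernel currently sees
ip a

# what the config expects; the bridge-ports line of vmbr0 must name a device that ip a shows, e.g.
#   auto vmbr0
#   iface vmbr0 inet static
#       address 192.168.1.240/24
#       gateway 192.168.1.1
#       bridge-ports eno1
cat /etc/network/interfaces

# after fixing the interface name in the file, restart networking
systemctl restart networking
# (with ifupdown2, ifreload -a works as well)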
 
Thanks for all the great support; it helps me learn how to use the tools to debug issues. I checked /etc/network/interfaces and it's correct: vmbr0 is manually set to the IP assigned to the node. I cannot, however, ping www.google.com from the node, so something is wrong with my DNS settings and I will have to resolve that at my EdgeRouter or something. I think this will conclude this thread; however, if I discover the nature of my issue I will post it here in case it's relevant to the original issue (or not!).
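To narrow down whether it really is DNS, the rough checks I have in mind are (8.8.8.8 is just an arbitrary public IP to test raw connectivity against):

Code:
# if this works but the name lookup below fails, it's DNS; if both fail, it's connectivity
ping -c 3 8.8.8.8

# name resolution test
ping -c 3 www.google.com

# nameservers the node is actually configured to use
cat /etc/resolv.conf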
 
Code:
lspci | egrep -i --color 'network|ethernet'
produces no results. I think I lost my NIC driver in the kernel upgrade, which led to the server not booting fully. I need to find out how to identify the chipset and find a driver for the Dell PowerEdge C6100 to try to resolve this issue. It was odd that I could ping my gateway yet ping would return Destination Host Unreachable, so I guess the box can send a ping but cannot receive with the default driver it is using. I'm going to try falling back to the oldest kernel on the node, but I have tried this before with no results as well.
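Rough commands I plan to use to identify the chipset and check whether a driver got loaded at all (igb and e1000 are just examples of common Intel NIC modules):

Code:
# list all PCI devices with numeric vendor:device IDs, useful for looking up a driver
lspci -nn

# check whether any ethernet driver was loaded or complained at boot
dmesg | grep -i -e eth -e igb -e e1000
lsmod | grep -i -e igb -e e1000

# interfaces the kernel currently exposes
ip link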

The last step to attempt is to reinstall the Proxmox OS and hope that the data volume (VMs, etc.) remains intact...
 
Okay, so despite all the above, I am unable to install Proxmox from the install ISO. This is some kind of motherboard NIC failure on multiple nodes of the Dell PowerEdge C6100... perhaps due to overheating of the chipset. At this point, since my bootable USB installer won't recognize a NIC, I will attempt to add a PCI NIC to the node and see if that resolves my issue...
 
On my Dell PowerEdge C6100, I had to disable IPMI because networking wouldn't start:

# grep ipmi /etc/modprobe.d/pve-blacklist.conf
blacklist ipmi_si

# lspci |grep Eth
01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
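For anyone wanting to replicate this, the change boils down to roughly the following (a sketch assuming the stock /etc/modprobe.d/pve-blacklist.conf file; run as root):

Code:
# blacklist the IPMI module
echo "blacklist ipmi_si" >> /etc/modprobe.d/pve-blacklist.conf

# rebuild the initramfs so the blacklist also applies in early boot, then reboot
update-initramfs -u -k all
reboot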
 
Interesting, I will have to give it a try someday. I have long since ditched the hardware and moved to building my own NAS using standard PC components, and everything works much more smoothly and efficiently. Though I do miss the multi-node format of the C6100, I do not miss the Dell PowerEdge's weight, noise, and power usage... but thanks for following up on this; it will help me determine whether the machine is still viable to give to someone else or should be tossed.
 
