Cluster node failed for no reason, VMs got moved, but got no IPv4. How to debug?

maxim.webster

Hi there,

I have a 2-node Proxmox cluster using a dedicated QDevice. The nodes are "ernie" and "bert". This morning, for no reason, "bert" went offline. As a result, VMs with HA configuration were moved to node "ernie" and started there. However, none of the moved VMs received an IPv4 address via DHCP, so they were not accessible from my LAN. I had to go to the console and restart networking.

How can I debug this situation, starting with why bert went down? bert's syslog shows nothing special:

Code:
Jan 02 08:00:14 bert sshd[2527962]: Accepted publickey for root from 192.168.30.8 port 37234 ssh2: RSA SHA256:IsUxs2JXFlny/OXUxigsCBx69WQeZb8xX1fdoGjyEQU
Jan 02 08:00:14 bert sshd[2527962]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 02 08:00:14 bert systemd-logind[797]: New session 6526 of user root.
Jan 02 08:00:14 bert systemd[1]: Started session-6526.scope - Session 6526 of User root.
Jan 02 08:00:14 bert sshd[2527962]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 02 08:00:15 bert sshd[2527962]: Received disconnect from 192.168.30.8 port 37234:11: disconnected by user
Jan 02 08:00:15 bert sshd[2527962]: Disconnected from user root 192.168.30.8 port 37234
Jan 02 08:00:15 bert sshd[2527962]: pam_unix(sshd:session): session closed for user root
Jan 02 08:00:15 bert systemd-logind[797]: Session 6526 logged out. Waiting for processes to exit.
Jan 02 08:00:15 bert systemd[1]: session-6526.scope: Deactivated successfully.
Jan 02 08:00:15 bert systemd-logind[797]: Removed session 6526.
Jan 02 08:00:25 bert systemd[1]: Stopping user@0.service - User Manager for UID 0...
Jan 02 08:00:25 bert systemd[2527791]: Activating special unit exit.target...
Jan 02 08:00:25 bert systemd[2527791]: Stopped target default.target - Main User Target.
Jan 02 08:00:25 bert systemd[2527791]: Stopped target basic.target - Basic System.
Jan 02 08:00:25 bert systemd[2527791]: Stopped target paths.target - Paths.
Jan 02 08:00:25 bert systemd[2527791]: Stopped target sockets.target - Sockets.
Jan 02 08:00:25 bert systemd[2527791]: Stopped target timers.target - Timers.
Jan 02 08:00:25 bert systemd[2527791]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Jan 02 08:00:25 bert systemd[2527791]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 02 08:00:25 bert systemd[2527791]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jan 02 08:00:25 bert systemd[2527791]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jan 02 08:00:25 bert systemd[2527791]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jan 02 08:00:25 bert systemd[2527791]: Removed slice app.slice - User Application Slice.
Jan 02 08:00:25 bert systemd[2527791]: Reached target shutdown.target - Shutdown.
Jan 02 08:00:25 bert systemd[2527791]: Finished systemd-exit.service - Exit the Session.
Jan 02 08:00:25 bert systemd[2527791]: Reached target exit.target - Exit the Session.
Jan 02 08:00:25 bert systemd[1]: user@0.service: Deactivated successfully.
Jan 02 08:00:25 bert systemd[1]: Stopped user@0.service - User Manager for UID 0.
Jan 02 08:00:25 bert systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Jan 02 08:00:25 bert systemd[1]: run-user-0.mount: Deactivated successfully.
Jan 02 08:00:25 bert systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Jan 02 08:00:25 bert systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Jan 02 08:00:25 bert systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Jan 02 08:00:25 bert systemd[1]: user-0.slice: Consumed 4.245s CPU time.
Jan 02 08:17:01 bert CRON[2534266]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jan 02 08:17:01 bert CRON[2534267]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Jan 02 08:17:01 bert CRON[2534266]: pam_unix(cron:session): session closed for user root
Jan 02 08:20:50 bert pmxcfs[1072]: [dcdb] notice: data verification successful
-- Reboot --

What additional info can I provide?
 
However, none of the moved VMs received an IPv4 address via DHCP, so they were not accessible from my LAN. I had to go to the console and restart networking.
You mean your VMs, right?
 
You mean your VMs, right?

I'd also like to know why the Proxmox node itself went down for no reason. I became aware of this due to an e-mail notification I received:

The node 'bert' failed and needs manual intervention.

The PVE HA manager tries to fence it and recover the configured HA resources to a healthy node if possible.

Current fence status: FENCE
Try to fence node 'bert'

Overall Cluster status:

{
  "manager_status": {
    "master_node": "ernie",
    "node_status": {
      "bert": "unknown",
      "ernie": "online"
    },
    "service_status": {
      "ct:107": {
        "node": "bert",
        "running": 1,
        "state": "started",
        "uid": "YJe5IQ9eaVDV/8WVpP4xvQ"
      },
      "ct:202": {
        "node": "bert",
        "running": 1,
        "state": "started",
        "uid": "IBdQp+rCFuCyHhAapFof6g"
      },
      "vm:100": {
        "node": "bert",
        "running": 1,
        "state": "started",
        "uid": "yMTznhnLgN5I7Fma6WXSIQ"
      },
      "vm:102": {
        "node": "ernie",
        "running": 1,
        "state": "started",
        "uid": "z2Y21aY+W1tLEN7B55WH8A"
      },
      "vm:103": {
        "node": "ernie",
        "running": 1,
        "state": "started",
        "uid": "R99J5nNSWnK5L9/1zS//cg"
      }
    },
    "timestamp": 1735802640
  },
  "node_status": {
    "bert": "fence",
    "ernie": "online"
  }
}

and after some time

The node 'bert' failed and needs manual intervention.

The PVE HA manager tries to fence it and recover the configured HA resources to a healthy node if possible.

Current fence status: SUCCEED
fencing: acknowledged - got agent lock for node 'bert'

Overall Cluster status:

{
  "manager_status": {
    "master_node": "ernie",
    "node_status": {
      "bert": "unknown",
      "ernie": "online"
    },
    "service_status": {
      "ct:107": {
        "node": "bert",
        "running": 1,
        "state": "started",
        "uid": "YJe5IQ9eaVDV/8WVpP4xvQ"
      },
      "ct:202": {
        "node": "bert",
        "running": 1,
        "state": "started",
        "uid": "IBdQp+rCFuCyHhAapFof6g"
      },
      "vm:100": {
        "node": "bert",
        "running": 1,
        "state": "started",
        "uid": "yMTznhnLgN5I7Fma6WXSIQ"
      },
      "vm:102": {
        "node": "ernie",
        "running": 1,
        "state": "started",
        "uid": "z2Y21aY+W1tLEN7B55WH8A"
      },
      "vm:103": {
        "node": "ernie",
        "running": 1,
        "state": "started",
        "uid": "R99J5nNSWnK5L9/1zS//cg"
      }
    },
    "timestamp": 1735802640
  },
  "node_status": {
    "bert": "unknown",
    "ernie": "online"
  }
}
 
Yes, “bert” was fenced. In other words, “bert” lost its connection to the other nodes in the cluster and did exactly what it is supposed to do.
The question you are asking yourself now is, of course, why this happened.

Take a look at the logs going back a little further; you should find some clues there, at least something like “Ring0 has lost the connection...”.

Is your QDevice running as a VM or as a separate small physical device?

And where is your DHCP server running? Maybe on “bert” too?

If the network configuration of the two nodes is the same, then please post the config of one node.
Code:
cat /etc/network/interfaces
 
Yes, “bert” was fenced. In other words, “bert” lost its connection to the other nodes in the cluster and did exactly what it is supposed to do.
The question you are asking yourself now is, of course, why this happened.

Take a look at the logs going back a little further; you should find some clues there, at least something like “Ring0 has lost the connection...”.

Which log?

Is your QDevice running as a VM or as a separate small physical device?

Dedicated device (Raspberry Pi).

And where is your DHCP server running? Maybe on “bert” too?

External (Unifi Network Controller on Unifi Dream Machine Pro Appliance).

If the network configuration of the two nodes is the same, then please post the config of one node.
Code:
cat /etc/network/interfaces

Code:
auto lo
iface lo inet loopback

iface enp5s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.30.7/24
        gateway 192.168.30.1
        bridge-ports enp5s0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094

source /etc/network/interfaces.d/*

At the time, the node "bert" was shown as offline on the Unifi Network Controller, as if the NIC went down or the cable had been unplugged.
 
Which log?

You can search in the journal for errors like this:
Code:
journalctl -r -p5

journalctl -r -p4

-p, --priority=
Filter output by message priorities or priority ranges. Takes either a single numeric or textual log level (i.e. between 0/"emerg" and 7/"debug"), or a range of numeric/text log levels in the form FROM..TO. The log levels are the usual syslog log levels as documented in syslog(3), i.e. "emerg" (0), "alert" (1), "crit" (2), "err" (3), "warning" (4), "notice" (5), "info" (6), "debug" (7). If a single log level is specified, all messages with this log level or a lower (hence more important) log level are shown. If a range is specified, all messages within the range are shown, including both the start and the end value of the range. This will add "PRIORITY=" matches for the specified priorities.


Dedicated device (Raspberry Pi).
Very good :)

External (Unifi Network Controller on Unifi Dream Machine Pro Appliance).
Also good.

Your interfaces config also looks fine. What I noticed is that the cluster network also runs over the bridge. You can do that, yes, but I generally recommend keeping the networks separate [1], especially for when problems occur.

A fence can also occur if, for example, the bridge is fully utilized and latency increases [2].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_separate_cluster_network
[2] https://pve.proxmox.com/wiki/Cluster_Manager#_cluster_network
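For reference, a dedicated cluster link on a second NIC could look roughly like this in /etc/network/interfaces (the interface name enp6s0 and the 10.10.10.0/24 subnet are made-up examples; corosync would then use that address as its ring0 link):

Code:
```
auto enp6s0
iface enp6s0 inet static
        address 10.10.10.7/24
        # dedicated corosync network: no gateway, no bridge,
        # so VM and storage traffic cannot starve the cluster link
```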
 
Your interfaces config also looks fine. What I noticed is that the cluster network also runs over the bridge. You can do that, yes, but I generally recommend keeping the networks separate [1], especially for when problems occur.

Would love to, but there's only one NIC on the mainboard. It's a homelab setup using consumer hardware.

There is also nothing indicating the cause of the outage; the journalctl output on both nodes, ernie and bert, looks okay:

root@bert:~# journalctl -r -p4
Jan 02 10:48:26 bert pveproxy[1352]: proxy detected vanished client connection
Jan 02 10:48:25 bert pveproxy[1354]: proxy detected vanished client connection
Jan 02 10:43:06 bert pveproxy[1354]: proxy detected vanished client connection
Jan 02 10:42:46 bert pveproxy[1353]: proxy detected vanished client connection
Jan 02 10:42:46 bert pveproxy[1352]: proxy detected vanished client connection
Jan 02 10:42:41 bert pveproxy[1352]: proxy detected vanished client connection
Jan 02 10:42:40 bert pveproxy[1354]: proxy detected vanished client connection
Jan 02 10:42:39 bert pveproxy[1354]: proxy detected vanished client connection
Jan 02 10:42:36 bert pveproxy[1354]: proxy detected vanished client connection
Jan 02 10:37:09 bert pveproxy[1352]: proxy detected vanished client connection
Jan 02 10:25:07 bert kernel: hrtimer: interrupt took 3980 ns
Jan 02 08:50:34 bert QEMU[1374]: gl_version 46 - core profile enabled
Jan 02 08:50:13 bert corosync-qdevice[1310]: Connect timeout
Jan 02 08:50:08 bert corosync-qdevice[1310]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
Jan 02 08:50:05 bert corosync[1286]: [KNET ] host: host: 1 has no active links
Jan 02 08:50:05 bert corosync[1286]: [KNET ] host: host: 1 has no active links
Jan 02 08:50:05 bert corosync[1286]: [KNET ] host: host: 1 has no active links
Jan 02 08:50:05 bert corosync[1286]: [WD ] resource memory_used missing a recovery key.
Jan 02 08:50:05 bert corosync[1286]: [WD ] resource load_15min missing a recovery key.
Jan 02 08:50:05 bert corosync[1286]: [WD ] Watchdog not enabled by configuration
Jan 02 08:50:04 bert pmxcfs[1195]: [status] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [status] crit: cpg_initialize failed: 2
Jan 02 08:50:04 bert pmxcfs[1195]: [dcdb] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [dcdb] crit: cpg_initialize failed: 2
Jan 02 08:50:04 bert pmxcfs[1195]: [confdb] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [confdb] crit: cmap_initialize failed: 2
Jan 02 08:50:04 bert pmxcfs[1195]: [quorum] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [quorum] crit: quorum_initialize failed: 2
Jan 02 08:50:03 bert smartd[920]: Device: /dev/nvme0, number of Error Log entries increased from 40 to 41
Jan 02 08:50:02 bert lvm[836]: /dev/zd160p5 excluded: device is rejected by filter config.
Jan 02 08:50:01 bert lvm[806]: /dev/zd128p5 excluded: device is rejected by filter config.
Jan 02 08:49:59 bert lvm[754]: /dev/zd32p3 excluded: device is rejected by filter config.
Jan 02 08:49:56 bert kernel: amdgpu 0000:08:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0
Jan 02 08:49:56 bert kernel: amdgpu 0000:08:00.0: amdgpu: Secure display: Generic Failure.
Jan 02 08:49:56 bert kernel: [drm] psp gfx command INVOKE_CMD(0x3) failed and response status is (0x4)
Jan 02 08:49:56 bert kernel: [drm] psp gfx command LOAD_TA(0x1) failed and response status is (0x7)
Jan 02 08:49:53 bert lvm[467]: VG local-thin finished
Jan 02 08:49:53 bert lvm[467]: PV /dev/sdc online, VG local-thin is complete.
Jan 02 08:49:53 bert lvm[458]: VG pve finished
Jan 02 08:49:53 bert lvm[458]: PV /dev/nvme0n1p3 online, VG pve is complete.
Jan 02 08:49:52 bert kernel: zfs: module license taints kernel.
Jan 02 08:49:52 bert kernel: Disabling lock debugging due to kernel taint
Jan 02 08:49:52 bert kernel: zfs: module license 'CDDL' taints kernel.
Jan 02 08:49:52 bert systemd-journald[378]: File /var/log/journal/436d56155efd4466996927d89c0fd81c/system.journal corrupted or uncleanly shut down, renaming and replacing.
Jan 02 08:49:52 bert kernel: spl: loading out-of-tree module taints kernel.
Jan 02 08:49:52 bert kernel: device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
Jan 02 08:49:52 bert kernel: sd 8:0:0:0: [sdc] Optimal transfer size 33553920 bytes not a multiple of preferred minimum block size (4096 bytes)
Jan 02 08:49:52 bert kernel: nvme nvme0: missing or invalid SUBNQN field.
Jan 02 08:49:52 bert kernel: amd_pstate: the _CPC object is not present in SBIOS or ACPI disabled
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 8
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 7
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 6
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 5
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 4
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 3
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 2
Jan 02 08:49:52 bert kernel: platform eisa.0: Cannot allocate resource for EISA slot 1
Jan 02 08:49:52 bert kernel: platform eisa.0: EISA: Cannot allocate resource for mainboard
Jan 02 08:49:52 bert kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
-- Boot 69ab41b7fa5a45da913b7847227576f9 --
Jan 01 10:08:34 bert pveproxy[1245721]: proxy detected vanished client connection
Jan 01 10:04:40 bert pveproxy[1245721]: proxy detected vanished client connection

ernie (the remaining one):

root@ernie:~# journalctl -r -p4
Jan 02 10:53:34 ernie kernel: device-mapper: thin: Data device (dm-3) discard unsupported: Disabling discard passdown.
Jan 02 10:48:26 ernie pveproxy[1753378]: proxy detected vanished client connection
Jan 02 10:48:25 ernie pveproxy[1753379]: proxy detected vanished client connection
Jan 02 10:43:09 ernie pveproxy[1753379]: proxy detected vanished client connection
Jan 02 10:43:06 ernie pveproxy[1753377]: proxy detected vanished client connection
Jan 02 10:43:06 ernie pveproxy[1753379]: proxy detected vanished client connection
Jan 02 10:42:46 ernie pveproxy[1753377]: proxy detected vanished client connection
Jan 02 10:42:46 ernie pveproxy[1753378]: proxy detected vanished client connection
Jan 02 10:42:41 ernie pveproxy[1753377]: proxy detected vanished client connection
Jan 02 10:42:39 ernie pveproxy[1753379]: proxy detected vanished client connection
Jan 02 10:42:36 ernie pveproxy[1753379]: proxy detected vanished client connection
Jan 02 10:37:13 ernie pveproxy[1753378]: proxy detected vanished client connection
Jan 02 10:37:13 ernie pveproxy[1753377]: proxy detected vanished client connection
Jan 02 09:15:15 ernie lvm[1943871]: /dev/zd0p3 excluded: device is rejected by filter config.
Jan 02 09:14:56 ernie QEMU[1530]: kvm: Bitmap 'repl_scsi0' is currently in use by another operation and cannot be used
Jan 02 09:07:52 ernie lvm[1940868]: /dev/zd16p5 excluded: device is rejected by filter config.
Jan 02 09:07:26 ernie QEMU[1922000]: kvm: Bitmap 'repl_scsi0' is currently in use by another operation and cannot be used
Jan 02 08:56:51 ernie pvedaemon[97973]: authentication failure; rhost=::ffff:192.168.30.172 user=root@pve msg=no such user ('root@pve')
Jan 02 08:30:07 ernie pvescheduler[1925448]: 102-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=bert' -o 'UserKnownHostsFile=/etc/pve/nodes/bert/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.30.7 -- pves>
Jan 02 08:25:13 ernie pvescheduler[1923381]: 107-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=bert' -o 'UserKnownHostsFile=/etc/pve/nodes/bert/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.30.7 -- pves>
Jan 02 08:25:10 ernie pvescheduler[1923381]: 202-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=bert' -o 'UserKnownHostsFile=/etc/pve/nodes/bert/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.30.7 -- pves>
Jan 02 08:25:06 ernie pvescheduler[1923381]: 100-0: got unexpected replication job error - command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=bert' -o 'UserKnownHostsFile=/etc/pve/nodes/bert/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@192.168.30.7 -- pves>
Jan 02 08:21:17 ernie corosync[1143]: [KNET ] host: host: 2 has no active links

The failed authentication after bert's reboot was caused by me (I used the wrong realm).

I also checked the boot disk on both nodes. Both systems are identical from a hardware perspective, but bert is older: the wearout on its boot SSD is 2%, on ernie it's 0%.

root@bert:~# smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: CT500P3SSD8
Serial Number: 2242E67A3C8F
Firmware Version: P9CR30A
PCI Vendor/Subsystem ID: 0xc0a9
IEEE OUI Identifier: 0x00a075
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 6cf0000034
Local Time is: Thu Jan 2 16:36:34 2025 CET
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x06): Cmd_Eff_Lg Ext_Get_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 95 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.00W 0.0000W - 0 0 0 0 0 0
1 + 3.00W 0.0000W - 0 0 0 0 0 0
2 + 1.50W 0.0000W - 0 0 0 0 0 0
3 - 0.0250W 0.0000W - 3 3 3 3 5000 1900
4 - 0.0030W - - 4 4 4 4 13000 100000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 1
1 - 4096 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 52 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 2%
Data Units Read: 3,946,736 [2.02 TB]
Data Units Written: 6,533,326 [3.34 TB]
Host Read Commands: 25,575,541
Host Write Commands: 348,716,155
Controller Busy Time: 163
Power Cycles: 30
Power On Hours: 15,883
Unsafe Shutdowns: 13
Media and Data Integrity Errors: 0
Error Information Log Entries: 41
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 52 Celsius
Temperature Sensor 2: 63 Celsius
Temperature Sensor 8: 52 Celsius

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 41 0 0x0008 0x4005 0x028 0 0 -

root@ernie:~# smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: CT500P3PSSD8
Serial Number: 2423493522E3
Firmware Version: P9CR413
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00a075
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 00a075 01493522e3
Local Time is: Thu Jan 2 16:37:23 2025 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00d7): Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 83 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x08): No_ID_Reuse

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 5.50W - - 0 0 0 0 0 0
1 + 3.00W - - 1 1 1 1 0 0
2 + 1.50W - - 2 2 2 2 0 0
3 - 0.0300W - - 3 3 3 3 5000 2500
4 - 0.0025W - - 4 4 4 4 8000 40000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 1
1 - 4096 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 43 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 308,086 [157 GB]
Data Units Written: 1,805,960 [924 GB]
Host Read Commands: 3,135,138
Host Write Commands: 22,514,926
Controller Busy Time: 44
Power Cycles: 22
Power On Hours: 1,242
Unsafe Shutdowns: 6
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 43 Celsius

Error Information (NVMe Log 0x01, 16 of 255 entries)
No Errors Logged

Note: Since I don't have a KVM attached to the nodes, bert was restarted using the reset switch.
 
Thanks for the info.

Would love to, but there's only one NIC on the mainboard. It's a homelab setup using consumer hardware.

Nevertheless, I would recommend installing an additional network card in each node if you have a PCIe slot available.

The smart values look good.

Jan 02 10:43:06 bert pveproxy[1354]: proxy detected vanished client connection
To be on the safe side, also check the time on the nodes and on the client you are accessing them from.

Jan 02 08:50:08 bert corosync-qdevice[1310]: Can't connect to qnetd host. (-5986): Network address not available (in use?)
The QDevice was unavailable.
Jan 02 08:50:04 bert pmxcfs[1195]: [status] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [status] crit: cpg_initialize failed: 2
Jan 02 08:50:04 bert pmxcfs[1195]: [dcdb] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [dcdb] crit: cpg_initialize failed: 2
Jan 02 08:50:04 bert pmxcfs[1195]: [confdb] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [confdb] crit: cmap_initialize failed: 2
Jan 02 08:50:04 bert pmxcfs[1195]: [quorum] crit: can't initialize service
Jan 02 08:50:04 bert pmxcfs[1195]: [quorum] crit: quorum_initialize failed: 2
There was no quorum.

To be on the safe side, I would also check the network cable and switch.
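If you want to rule out a flapping link, the kernel logs carrier changes and ethtool shows the negotiated state. A sketch, using the interface and bridge names from your config:

```shell
# Current link state, speed and duplex of the physical NIC
ethtool enp5s0

# Search the kernel log for carrier up/down events on the NIC or bridge
journalctl -k | grep -i -E "enp5s0|vmbr0"
```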
 
