Proxmox 7 - Losing only SSH connection to node

gouthamravee

SOLVED: I am not a smart person. I gave node 4 the same IP address as another, much older VM and didn't notice because I use FQDNs within my network.
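For anyone landing here with the same symptom, a duplicate address can be spotted by checking whether one IP is answered by more than one MAC address. The snippet below is only a minimal sketch of that check: the function name `find_duplicate_ips` and the sample addresses are made up for illustration, and it assumes you have already collected (IP, MAC) pairs yourself, for example from a packet capture.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data: two different machines claim 192.0.2.14.
observations = [
    ("192.0.2.14", "aa:bb:cc:00:00:04"),
    ("192.0.2.14", "52:54:00:12:34:56"),
    ("192.0.2.11", "aa:bb:cc:00:00:01"),
]

duplicates = find_duplicate_ips(observations)
print(duplicates)
```

Any IP that shows up in the result is claimed by more than one machine, which would match what happened here between node 4 and the older VM.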

This is a homelab.

After I upgraded all four servers in my pool to 7, I have had a couple of issues.
One is described here: https://forum.proxmox.com/threads/kernel-panic-whole-server-crashes-about-every-day.91803/

The other is that I randomly lose SSH connectivity to the node, while the VMs on that node keep their network connectivity.
I have replaced the Ethernet cable with a known-good cable and tried different ports on the switch.
I just caught this happening, and the syslog output is below.


Code:
Jul 30 00:32:01 pmx4 systemd[1]: Finished Proxmox VE replication runner.

Jul 30 00:32:13 pmx4 pvestatd[5587]: status update time (10.286 seconds)

Jul 30 00:32:18 pmx4 pvestatd[5587]: status update time (5.282 seconds)

Jul 30 00:32:21 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:23 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:36 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:36 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:37 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:41 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:42 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:50 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:32:51 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:33:00 pmx4 systemd[1]: Starting Proxmox VE replication runner...

Jul 30 00:33:01 pmx4 corosync[5559]:   [KNET  ] link: host: 3 link: 0 is down

Jul 30 00:33:01 pmx4 corosync[5559]:   [KNET  ] link: host: 2 link: 0 is down

Jul 30 00:33:01 pmx4 corosync[5559]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)

Jul 30 00:33:01 pmx4 corosync[5559]:   [KNET  ] host: host: 3 has no active links

Jul 30 00:33:01 pmx4 corosync[5559]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Jul 30 00:33:01 pmx4 corosync[5559]:   [KNET  ] host: host: 2 has no active links

Jul 30 00:33:02 pmx4 corosync[5559]:   [TOTEM ] Token has not been received in 3225 ms

Jul 30 00:33:03 pmx4 corosync[5559]:   [KNET  ] rx: host: 2 link: 0 is up

Jul 30 00:33:03 pmx4 corosync[5559]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Jul 30 00:33:04 pmx4 corosync[5559]:   [TOTEM ] A processor failed, forming new configuration: token timed out (4300ms), waiting 5160ms for consensus.

Jul 30 00:33:10 pmx4 corosync[5559]:   [KNET  ] link: host: 2 link: 0 is down

Jul 30 00:33:10 pmx4 corosync[5559]:   [KNET  ] link: host: 1 link: 0 is down

Jul 30 00:33:10 pmx4 corosync[5559]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Jul 30 00:33:10 pmx4 corosync[5559]:   [KNET  ] host: host: 2 has no active links

Jul 30 00:33:10 pmx4 corosync[5559]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

Jul 30 00:33:10 pmx4 corosync[5559]:   [KNET  ] host: host: 1 has no active links

Jul 30 00:33:14 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:33:14 pmx4 corosync[5559]:   [QUORUM] Sync left[3]: 1 2 3

Jul 30 00:33:14 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7d92) was formed. Members left: 1 2 3

Jul 30 00:33:14 pmx4 corosync[5559]:   [TOTEM ] Failed to receive the leave message. failed: 1 2 3

Jul 30 00:33:14 pmx4 pmxcfs[3544]: [dcdb] notice: members: 4/3544

Jul 30 00:33:14 pmx4 corosync[5559]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.

Jul 30 00:33:14 pmx4 pmxcfs[3544]: [status] notice: members: 4/3544

Jul 30 00:33:14 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:33:14 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:33:14 pmx4 pmxcfs[3544]: [status] notice: node lost quorum

Jul 30 00:33:14 pmx4 pmxcfs[3544]: [dcdb] crit: received write while not quorate - trigger resync

Jul 30 00:33:14 pmx4 pmxcfs[3544]: [dcdb] crit: leaving CPG group

Jul 30 00:33:14 pmx4 pve-ha-lrm[6663]: unable to write lrm status file - unable to open file '/etc/pve/nodes/pmx4/lrm_status.tmp.6663' - Permission denied

Jul 30 00:33:14 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:15 pmx4 pmxcfs[3544]: [dcdb] notice: start cluster connection

Jul 30 00:33:15 pmx4 pmxcfs[3544]: [dcdb] crit: cpg_join failed: 14

Jul 30 00:33:15 pmx4 pmxcfs[3544]: [dcdb] crit: can't initialize service

Jul 30 00:33:15 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:16 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:17 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:18 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:19 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:20 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:21 pmx4 pmxcfs[3544]: [dcdb] notice: members: 4/3544

Jul 30 00:33:21 pmx4 pmxcfs[3544]: [dcdb] notice: all data is up to date

Jul 30 00:33:21 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:22 pmx4 pvesr[224330]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:33:23 pmx4 pvesr[224330]: cfs-lock 'file-replication_cfg' error: no quorum!

Jul 30 00:33:23 pmx4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a

Jul 30 00:33:23 pmx4 systemd[1]: pvesr.service: Failed with result 'exit-code'.

Jul 30 00:33:23 pmx4 systemd[1]: Failed to start Proxmox VE replication runner.

Jul 30 00:33:24 pmx4 pvestatd[5587]: pbs1: error fetching datastores - 500 Can't connect to pbs.thisisafakehostname.com:8007 (Connection timed out)

Jul 30 00:33:25 pmx4 pvestatd[5587]: status update time (21.209 seconds)

Jul 30 00:33:32 pmx4 pvestatd[5587]: pbs1: error fetching datastores - 500 Can't connect to pbs.thisisafakehostname.com:8007 (Connection timed out)

Jul 30 00:33:32 pmx4 pvestatd[5587]: status update time (7.227 seconds)

Jul 30 00:33:42 pmx4 pvestatd[5587]: pbs1: error fetching datastores - 500 Can't connect to pbs.thisisafakehostname.com:8007 (Connection timed out)

Jul 30 00:33:42 pmx4 pvestatd[5587]: status update time (7.216 seconds)

Jul 30 00:33:46 pmx4 corosync[5559]:   [KNET  ] rx: host: 3 link: 0 is up

Jul 30 00:33:46 pmx4 corosync[5559]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)

Jul 30 00:33:52 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:33:52 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7d9a) was formed. Members

Jul 30 00:33:52 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:33:52 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:33:52 pmx4 corosync[5559]:   [KNET  ] rx: host: 2 link: 0 is up

Jul 30 00:33:52 pmx4 corosync[5559]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Jul 30 00:33:57 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:33:57 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7d9e) was formed. Members

Jul 30 00:33:57 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:33:57 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:00 pmx4 systemd[1]: Starting Proxmox VE replication runner...

Jul 30 00:34:01 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:02 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:03 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:34:03 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7da2) was formed. Members

Jul 30 00:34:03 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:34:03 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:03 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:04 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:05 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:06 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:07 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:08 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:34:08 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7da6) was formed. Members

Jul 30 00:34:08 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:34:08 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:08 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:09 pmx4 pvesr[224528]: trying to acquire cfs lock 'file-replication_cfg' ...

Jul 30 00:34:10 pmx4 pvesr[224528]: cfs-lock 'file-replication_cfg' error: no quorum!

Jul 30 00:34:10 pmx4 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a

Jul 30 00:34:10 pmx4 systemd[1]: pvesr.service: Failed with result 'exit-code'.

Jul 30 00:34:10 pmx4 systemd[1]: Failed to start Proxmox VE replication runner.

Jul 30 00:34:13 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:34:13 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7daa) was formed. Members

Jul 30 00:34:13 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:34:13 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:18 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:34:18 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7dae) was formed. Members

Jul 30 00:34:18 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:34:18 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:23 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:34:23 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7db2) was formed. Members

Jul 30 00:34:23 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:34:23 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:28 pmx4 corosync[5559]:   [QUORUM] Sync members[1]: 4

Jul 30 00:34:28 pmx4 corosync[5559]:   [TOTEM ] A new membership (4.7db6) was formed. Members

Jul 30 00:34:28 pmx4 corosync[5559]:   [QUORUM] Members[1]: 4

Jul 30 00:34:28 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:30 pmx4 corosync[5559]:   [KNET  ] rx: host: 1 link: 0 is up

Jul 30 00:34:30 pmx4 corosync[5559]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)

Jul 30 00:34:30 pmx4 corosync[5559]:   [QUORUM] Sync members[4]: 1 2 3 4

Jul 30 00:34:30 pmx4 corosync[5559]:   [QUORUM] Sync joined[3]: 1 2 3

Jul 30 00:34:30 pmx4 corosync[5559]:   [TOTEM ] A new membership (1.7dba) was formed. Members joined: 1 2 3

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: members: 1/871, 2/3222, 3/980, 4/3544

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: starting data syncronisation

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: members: 1/871, 2/3222, 3/980, 4/3544

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: starting data syncronisation

Jul 30 00:34:30 pmx4 corosync[5559]:   [QUORUM] This node is within the primary component and will provide service.

Jul 30 00:34:30 pmx4 corosync[5559]:   [QUORUM] Members[4]: 1 2 3 4

Jul 30 00:34:30 pmx4 corosync[5559]:   [MAIN  ] Completed service synchronization, ready to provide service.

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: node has quorum

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: received sync request (epoch 1/871/00000004)

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: received sync request (epoch 1/871/00000004)

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: received all states

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: leader is 1/871

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: synced members: 1/871, 2/3222, 3/980

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: waiting for updates from leader

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: dfsm_deliver_queue: queue length 3

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: received all states

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: all data is up to date

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: dfsm_deliver_queue: queue length 32

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [main] notice: ignore duplicate

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: update complete - trying to commit (got 4 inode updates)

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: all data is up to date

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [dcdb] notice: dfsm_deliver_sync_queue: queue length 3

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx3/pbs1: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx/pbs1: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx/local: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx/local-lvm: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx3/local-lvm: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx3/local: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx2/local-lvm: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx2/local: -1

Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pmx2/pbs1: -1

Jul 30 00:34:32 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:34:33 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:34:48 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:34:50 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:35:00 pmx4 pmxcfs[3544]: [status] notice: received log

Jul 30 00:35:00 pmx4 systemd[1]: Starting Proxmox VE replication runner...

Jul 30 00:35:01 pmx4 systemd[1]: pvesr.service: Succeeded.
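To see how long the node was actually out of quorum, the log can be scanned for the pmxcfs "node lost quorum" and "node has quorum" messages. This is just a rough sketch, not a Proxmox tool: the function name `quorum_outages` is hypothetical, and because syslog timestamps carry no year, 2021 is assumed here.

```python
import re
from datetime import datetime

# Matches lines like: "Jul 30 00:33:14 pmx4 pmxcfs[3544]: [status] notice: ..."
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ \S+?\[\d+\]: +(.*)$")


def quorum_outages(lines, year=2021):
    """Return (lost_at, regained_at, seconds) for each quorum loss window."""
    outages = []
    lost_at = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unparseable lines
        stamp, msg = m.groups()
        # The year is an assumption; syslog lines do not include one.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if "node lost quorum" in msg and lost_at is None:
            lost_at = when
        elif "node has quorum" in msg and lost_at is not None:
            outages.append((lost_at, when, (when - lost_at).total_seconds()))
            lost_at = None
    return outages


# Three lines copied from the syslog output above.
sample = [
    "Jul 30 00:33:14 pmx4 pmxcfs[3544]: [status] notice: node lost quorum",
    "Jul 30 00:33:23 pmx4 pvesr[224330]: cfs-lock 'file-replication_cfg' error: no quorum!",
    "Jul 30 00:34:30 pmx4 pmxcfs[3544]: [status] notice: node has quorum",
]

for lost, regained, seconds in quorum_outages(sample):
    print(f"quorum lost {lost:%H:%M:%S}, regained {regained:%H:%M:%S} ({seconds:.0f} s)")
```

For the excerpt above this gives a single window from 00:33:14 to 00:34:30, which lines up with the period where `pvesr` kept failing with "no quorum!".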
 
The system specs for node 4 are detailed below:
Rich (BB code):
PCPartPicker Part List

CPU: Intel Core i5-3570K 3.4 GHz Quad-Core Processor
CPU Cooler: Thermaltake CLP0556 39.7 CFM Sleeve Bearing CPU Cooler
Motherboard: Gigabyte GA-Z77X-UD3H ATX LGA1155 Motherboard
Memory: G.Skill Ripjaws X Series 32 GB (4 x 8 GB) DDR3-1600 CL9 Memory
Storage: Western Digital Blue 3 TB 3.5" 5400RPM Internal Hard Drive
Storage: Western Digital Blue 3 TB 3.5" 5400RPM Internal Hard Drive
Storage: Western Digital Red 4 TB 3.5" 5400RPM Internal Hard Drive
Storage: Western Digital Red 4 TB 3.5" 5400RPM Internal Hard Drive
Storage: Western Digital RE 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Western Digital RE 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Western Digital RE 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Western Digital RE 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Hitachi Ultrastar 7K4000 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Hitachi Ultrastar 7K4000 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Hitachi Ultrastar 7K4000 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Hitachi Ultrastar 7K4000 4 TB 3.5" 7200RPM Internal Hard Drive
Storage: Hitachi Ultrastar 7K4000 4 TB 3.5" 7200RPM Internal Hard Drive
Power Supply: Rosewill Fortress 650 W 80+ Platinum Certified ATX Power Supply
Custom: SAS9211-8I 8PORT Int 6GB Sata+sas Pcie 2.0
Custom: SAS9211-8I 8PORT Int 6GB Sata+sas Pcie 2.0
Custom: 4U (6 x 5.25" + 7 x 3.5" HDD Bay) Rackmount Chassis
Generated by PCPartPicker 2021-07-30 00:40 EDT-0400
 
The corosync logs show that the network link was down at that time.
How is it possible for the host's network link to go down while the VMs running on the host are still able to connect?
The host and the VMs share the same NIC.

I'm not doing anything exotic with my Proxmox setup; all nodes are running the same version of 7 and are part of a simple cluster.
 
Did I understand correctly that you run Corosync over the same interface as the web GUI and the VMs?
Then you have most probably already found your culprit.
 
Oh, I did not know that was not recommended. I didn't have any issues with 6.x.
Looks like I'm gonna be buying some additional NICs.
 
I am quite sure that is not possible.
I promise you I'm not crazy. I'm no Linux grandmaster, but I've been a sysadmin for 3+ years now. I wouldn't post on the forums unless I was truly stuck.

I'm not finding anything in the system logs, network logs, or Wireshark captures, and replacing cables didn't help.

So far it's just two out of four nodes that are giving me grief. Node 4 was the main one that was dropping SSH before, but last night node 2 started doing the same thing: I would lose access to Proxmox for a few minutes, but all the VMs would stay accessible. Node 2 is extra special because it's hosting my NVR, so there's hundreds of megabytes of data going into it per second. Node 2 has two NICs: one is dedicated to the NVR VM, and the other is shared by everything else on the node.

Both nodes are also set up to do a quick memtest at boot, because I initially thought it was bad RAM. I did a full memtest on all the sticks and didn't find any bad addresses. I swapped RAM around between all four nodes so that both node 2 and node 4 now have 32 GB of RAM to use, and the boot memtest isn't throwing errors.

I am not opposed to reinstalling Proxmox from scratch, since I did a 6 -> 7 upgrade. I'll have time to do that over the weekend, but I want to make sure there isn't anything wrong with my configuration that could show up again later.

I am planning on adding a dedicated NIC for the host soon.

I posted the specs for Node 4 above, Node 2 is

 