No ssh, no GUI

martind

Member
Aug 3, 2020
We recently had a problem with one of the PVE nodes in our cluster: our monitoring started to flag issues (no access to NRPE, SSH down, etc.).

I would log in via the GUI and see everything was fine, with all VMs running without issue. I tried restarting NRPE and SSH with no luck. I shelved it for the time being as I had other things to deal with. Now I have no SSH access and no GUI access ('Connection Refused'). I can get a shell prompt by logging in to the GUI on another node, navigating to the problematic node and clicking on Shell.

There's no firewall running, and nothing like fail2ban or denyhosts causing problems. While I'm using the GUI from another node, it occasionally drops out, with the node in question showing the red x and the shell disconnecting. If I wait 30-60 seconds, it comes back up.

Physically, there are no issues with the machine; the switch is fine and the network cables are all fine. The server has plenty of RAM (256 GB, over 150 GB available) and plenty of disk space (over 1 TB free). There is no CPU throttling and no high load.

Has anyone experienced this? Can you offer any suggestions? At this point I'm at a total loss!
 
I've experienced the same thing on PBS: after rebooting the server, everything runs fine for some time, then both SSH and the Web UI die. Logging in on the console works, and backups are running fine, but SSH and the Web UI are dead. I don't know whether I should open a separate ticket or whether posting here is OK.

I'm running PBS as a VM on PVE 7.3-3. PBS is version 2.3.

I checked syslog, but couldn't find anything that looks related.
 
Nothing in syslog or the journal, but I have since found that it appears to be network related. I tried killing off a pile of VMs on the host, but it made no difference.

When I run tcpdump on the machine for pings or SSH connections from specific hosts, I sometimes get results and other times I don't, or they all come flooding in at once.
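To complement the tcpdump observation from the other direction, a small probe run from a second machine can log, with timestamps, whether the node's TCP ports accept connections, refuse them, or time out. This is only a sketch: the function names `probe` and `watch` are made up for illustration, and which ports to watch is an assumption (22 for SSH, 8006 for the PVE web GUI, 5666 for a default NRPE install).

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=2.0):
    """Try one TCP connection and classify the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packets are likely being lost.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host.
        return "error:" + type(exc).__name__


def watch(host, ports, interval=5.0, rounds=12):
    """Probe every port once per round and print one timestamped line per round."""
    history = []
    for _ in range(rounds):
        stamp = datetime.now().isoformat(timespec="seconds")
        results = {port: probe(host, port) for port in ports}
        print(stamp, host, results)
        history.append((stamp, results))
        time.sleep(interval)
    return history
```

Calling `watch("192.0.2.10", [22, 8006, 5666])` with the node's real address in place of the placeholder prints one line every five seconds. A steady "refused" would point at the services on the node, whereas results that flip between "open" and "timeout" would fit the intermittent network behaviour described above.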
 
Edit: okay, someone had weird aliases in.

I do indeed have issues in the journal.

Bash:
Dec 08 12:26:49 node2 corosync[68770]:   [KNET  ] rx: host: 1 link: 0 is up
Dec 08 12:26:49 node2 corosync[68770]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 08 12:26:49 node2 corosync[68770]:   [QUORUM] Sync members[2]: 1 3
Dec 08 12:26:49 node2 corosync[68770]:   [QUORUM] Sync joined[1]: 1
Dec 08 12:26:49 node2 corosync[68770]:   [TOTEM ] A new membership (1.4c5b) was formed. Members joined: 1
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: members: 1/25475, 3/97003
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: starting data syncronisation
Dec 08 12:26:49 node2 pmxcfs[97003]: [status] notice: members: 1/25475, 3/97003
Dec 08 12:26:49 node2 pmxcfs[97003]: [status] notice: starting data syncronisation
Dec 08 12:26:49 node2 corosync[68770]:   [QUORUM] This node is within the primary component and will provide service.
Dec 08 12:26:49 node2 corosync[68770]:   [QUORUM] Members[2]: 1 3
Dec 08 12:26:49 node2 corosync[68770]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 08 12:26:49 node2 pmxcfs[97003]: [status] notice: node has quorum
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: received sync request (epoch 1/25475/00000D49)
Dec 08 12:26:49 node2 pmxcfs[97003]: [status] notice: received sync request (epoch 1/25475/00000719)
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: received all states
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: leader is 1/25475
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: synced members: 1/25475, 3/97003
Dec 08 12:26:49 node2 pmxcfs[97003]: [dcdb] notice: all data is up to date
Dec 08 12:26:49 node2 pmxcfs[97003]: [status] notice: received all states
Dec 08 12:26:49 node2 pmxcfs[97003]: [status] notice: all data is up to date
Dec 08 12:29:03 node2 corosync[68770]:   [KNET  ] link: host: 1 link: 0 is down
Dec 08 12:29:03 node2 corosync[68770]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 08 12:29:03 node2 corosync[68770]:   [KNET  ] host: host: 1 has no active links
Dec 08 12:29:04 node2 corosync[68770]:   [TOTEM ] Token has not been received in 2250 ms
Dec 08 12:29:04 node2 corosync[68770]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Dec 08 12:29:08 node2 corosync[68770]:   [QUORUM] Sync members[1]: 3
Dec 08 12:29:08 node2 corosync[68770]:   [QUORUM] Sync left[1]: 1
Dec 08 12:29:08 node2 corosync[68770]:   [TOTEM ] A new membership (3.4c5f) was formed. Members left: 1
Dec 08 12:29:08 node2 corosync[68770]:   [TOTEM ] Failed to receive the leave message. failed: 1
Dec 08 12:29:08 node2 pmxcfs[97003]: [dcdb] notice: members: 3/97003
Dec 08 12:29:08 node2 pmxcfs[97003]: [status] notice: members: 3/97003
Dec 08 12:29:08 node2 corosync[68770]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Dec 08 12:29:08 node2 corosync[68770]:   [QUORUM] Members[1]: 3
Dec 08 12:29:08 node2 pmxcfs[97003]: [status] notice: node lost quorum
Dec 08 12:29:08 node2 corosync[68770]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 08 12:29:08 node2 pmxcfs[97003]: [dcdb] crit: received write while not quorate - trigger resync
Dec 08 12:29:08 node2 pmxcfs[97003]: [dcdb] crit: leaving CPG group
Dec 08 12:29:08 node2 pmxcfs[97003]: [dcdb] notice: start cluster connection
Dec 08 12:29:08 node2 pmxcfs[97003]: [dcdb] crit: cpg_join failed: 14
Dec 08 12:29:08 node2 pmxcfs[97003]: [dcdb] crit: can't initialize service
Dec 08 12:29:14 node2 pmxcfs[97003]: [dcdb] notice: members: 3/97003
Dec 08 12:29:14 node2 pmxcfs[97003]: [dcdb] notice: all data is up to date
Dec 08 12:29:54 node2 corosync[68770]:   [KNET  ] rx: host: 1 link: 0 is up
Dec 08 12:29:54 node2 corosync[68770]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Dec 08 12:29:54 node2 corosync[68770]:   [QUORUM] Sync members[2]: 1 3
Dec 08 12:29:54 node2 corosync[68770]:   [QUORUM] Sync joined[1]: 1
Dec 08 12:29:54 node2 corosync[68770]:   [TOTEM ] A new membership (1.4c63) was formed. Members joined: 1
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: members: 1/25475, 3/97003
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: starting data syncronisation
Dec 08 12:29:54 node2 pmxcfs[97003]: [status] notice: members: 1/25475, 3/97003
Dec 08 12:29:54 node2 pmxcfs[97003]: [status] notice: starting data syncronisation
Dec 08 12:29:54 node2 corosync[68770]:   [QUORUM] This node is within the primary component and will provide service.
Dec 08 12:29:54 node2 corosync[68770]:   [QUORUM] Members[2]: 1 3
Dec 08 12:29:54 node2 pmxcfs[97003]: [status] notice: node has quorum
Dec 08 12:29:54 node2 corosync[68770]:   [MAIN  ] Completed service synchronization, ready to provide service.
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: received sync request (epoch 1/25475/00000D4D)
Dec 08 12:29:54 node2 pmxcfs[97003]: [status] notice: received sync request (epoch 1/25475/0000071B)
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: received all states
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: leader is 1/25475
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: synced members: 1/25475, 3/97003
Dec 08 12:29:54 node2 pmxcfs[97003]: [dcdb] notice: all data is up to date
Dec 08 12:29:54 node2 pmxcfs[97003]: [status] notice: received all states
Dec 08 12:29:54 node2 pmxcfs[97003]: [status] notice: all data is up to date
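For anyone comparing a similar journal, the KNET "link ... is down" and "link ... is up" lines can be paired up to see how long each corosync link outage lasted. The sketch below is an illustration, not an official tool: the function name `link_outages` is made up, the regular expression is based only on the message format visible in this excerpt, and the year is an assumption because these timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches e.g. "link: host: 1 link: 0 is down" and "rx: host: 1 link: 0 is up".
EVENT = re.compile(r"host: (\d+) link: (\d+) is (up|down)")


def link_outages(lines, year=2022):
    """Pair each 'down' event with the next 'up' event for the same host/link.

    Returns tuples of (host, link, down_time, up_time, seconds).
    The year is an assumption; it only matters if the log crosses New Year.
    """
    down_since = {}
    outages = []
    for raw in lines:
        line = raw.strip()
        match = EVENT.search(line)
        if match is None or "corosync[" not in line:
            continue
        # The first 15 characters are the timestamp, e.g. "Dec 08 12:29:03".
        stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        key = (int(match.group(1)), int(match.group(2)))
        if match.group(3) == "down":
            down_since.setdefault(key, stamp)
        elif key in down_since:
            start = down_since.pop(key)
            seconds = (stamp - start).total_seconds()
            outages.append((key[0], key[1], start, stamp, seconds))
    return outages


# Three lines copied from the journal excerpt above.
SAMPLE = [
    "Dec 08 12:26:49 node2 corosync[68770]:   [KNET  ] rx: host: 1 link: 0 is up",
    "Dec 08 12:29:03 node2 corosync[68770]:   [KNET  ] link: host: 1 link: 0 is down",
    "Dec 08 12:29:54 node2 corosync[68770]:   [KNET  ] rx: host: 1 link: 0 is up",
]

for host, link, start, end, seconds in link_outages(SAMPLE):
    print(f"host {host} link {link}: down {start:%H:%M:%S}, up {end:%H:%M:%S}, {seconds:.0f} s")
# prints: host 1 link 0: down 12:29:03, up 12:29:54, 51 s
```

On this excerpt the link to host 1 was down for 51 seconds, which is in the same range as the 30-60 second dropouts seen in the GUI. To run it over a full log, the lines could be read from a file saved with `journalctl -u corosync`.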