Connection failed (Error 500: hostname lookup 'mouse' failed ...

jweese74

New Member
Oct 3, 2022
I am running a seven-node cluster. After a reboot this morning, I have been receiving the following error on one of my nodes:

Code:
Connection failed (Error 500: hostname lookup 'mouse' failed - failed to get address info for: mouse: Name or service not known)

I SSH'd into 'mouse' and checked the following:

Code:
root@mouse:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.100.35 mouse.elephanteggs.com mouse

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Code:
root@mouse:~# hostnamectl
   Static hostname: mouse
         Icon name: computer-server
           Chassis: server
        Machine ID: 8fe85cdb52ef4002bf36639f7e4cc59d
           Boot ID: 9d37017b5a5f4035a742703c5cc2cf89
  Operating System: Debian GNU/Linux 11 (bullseye)
            Kernel: Linux 5.15.60-1-pve
      Architecture: x86-64

Code:
root@mouse:~# cat /etc/hostname
mouse

Strange, because I made no changes to anything; so off to Google I went and it led me to try the following things...

Code:
root@mouse:~# pvecm updatecerts
ipcc_send_rec[1] failed: Connection refused
ipcc_send_rec[2] failed: Connection refused
ipcc_send_rec[3] failed: Connection refused
Unable to load access control list: Connection refused

Code:
root@mouse:~# ls -l /etc/pve
total 0

Code:
root@mouse:~# tail -100 /var/log/syslog
Oct  3 10:55:27 mouse pveproxy[3264]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1943.
Oct  3 10:55:27 mouse pveproxy[3265]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1943.
Oct  3 10:55:27 mouse pveproxy[3266]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1943.
Oct  3 10:55:32 mouse pveproxy[3264]: worker exit
Oct  3 10:55:32 mouse pveproxy[3265]: worker exit
Oct  3 10:55:32 mouse pveproxy[3266]: worker exit
Oct  3 10:55:32 mouse pveproxy[1467]: worker 3264 finished
Oct  3 10:55:32 mouse pveproxy[1467]: worker 3265 finished
Oct  3 10:55:32 mouse pveproxy[1467]: starting 2 worker(s)
Oct  3 10:55:32 mouse pveproxy[1467]: worker 3267 started
Oct  3 10:55:32 mouse pveproxy[1467]: worker 3266 finished
Oct  3 10:55:32 mouse pveproxy[1467]: worker 3268 started

Code:
root@mouse:~# systemctl restart pve-cluster.service
Job for pve-cluster.service failed because the control process exited with error code.
See "systemctl status pve-cluster.service" and "journalctl -xe" for details.

Code:
root@mouse:~# journalctl -xe
░░ Support: https://www.debian.org/support
░░
░░ Automatic restarting of the unit pve-cluster.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Oct 03 10:58:25 mouse systemd[1]: Stopped The Proxmox VE cluster filesystem.
░░ Subject: A stop job for unit pve-cluster.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit pve-cluster.service has finished.
░░
░░ The job identifier is 2560 and the job result is done.
Oct 03 10:58:25 mouse systemd[1]: pve-cluster.service: Start request repeated too quickly.
Oct 03 10:58:25 mouse systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit pve-cluster.service has entered the 'failed' state with result 'exit-code'.
Oct 03 10:58:25 mouse systemd[1]: Failed to start The Proxmox VE cluster filesystem.
░░ Subject: A start job for unit pve-cluster.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit pve-cluster.service has finished with a failure.
░░
░░ The job identifier is 2560 and the job result is failed.
Oct 03 10:58:29 mouse pveproxy[3389]: worker exit
Oct 03 10:58:29 mouse pveproxy[3390]: worker exit
Oct 03 10:58:29 mouse pveproxy[1467]: worker 3389 finished
Oct 03 10:58:29 mouse pveproxy[1467]: starting 1 worker(s)
Oct 03 10:58:29 mouse pveproxy[1467]: worker 3397 started
Oct 03 10:58:29 mouse pveproxy[3391]: worker exit
Oct 03 10:58:29 mouse pveproxy[1467]: worker 3390 finished
Oct 03 10:58:29 mouse pveproxy[1467]: starting 1 worker(s)
Oct 03 10:58:29 mouse pveproxy[1467]: worker 3398 started
Oct 03 10:58:29 mouse pveproxy[3397]: /etc/pve/local/pve-ssl.key: failed to load local private key (key_file or key) at /usr/share/perl5/PVE/APIServer/AnyEvent.pm line 1943.

Code:
root@mouse:~# systemctl status -l pve-cluster
● pve-cluster.service - The Proxmox VE cluster filesystem
     Loaded: loaded (/lib/systemd/system/pve-cluster.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2022-10-03 10:59:24 EDT; 2min 41s ago
    Process: 3444 ExecStart=/usr/bin/pmxcfs (code=exited, status=255/EXCEPTION)
        CPU: 11ms

Oct 03 10:59:24 mouse systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 5.
Oct 03 10:59:24 mouse systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 03 10:59:24 mouse systemd[1]: pve-cluster.service: Start request repeated too quickly.
Oct 03 10:59:24 mouse systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 03 10:59:24 mouse systemd[1]: Failed to start The Proxmox VE cluster filesystem.

Moving on to the cluster manager, rabbit - I checked the following:

Code:
root@rabbit:~# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost
192.168.100.5 rabbit.elephanteggs.com pinnacle

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Code:
root@rabbit:~# pvecm updatecerts
(re)generate node files
merge authorized SSH keys and known hosts

Code:
root@rabbit:~# pvecm nodes

Membership information
----------------------
    Nodeid      Votes Name
         1          1 rabbit (local)
         2          1 hedgehog
         3          1 fennec
         4          1 hamster
         5          1 quokka
         6          1 weasel
         7          1 mouse

And, that's all I've got; hoping someone can provide some help. Thanks terribly in advance.
 
Hi,

can you please post the full output of journalctl -b -u pve-cluster -u corosync -u pveproxy -u pvestatd?

Most absolutely - thank you for your quick reply.

Code:
root@mouse:~# journalctl -b -u pve-cluster -u corosync -u pveproxy -u pvestatd
-- Journal begins at Fri 2022-09-16 10:22:53 EDT, ends at Mon 2022-10-03 11:17:37 EDT. --
Oct 03 10:19:00 mouse systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 03 10:19:01 mouse pmxcfs[1333]: [database] crit: unable to set WAL mode: near "PRAGMA": syntax error#010
Oct 03 10:19:01 mouse pmxcfs[1333]: [database] crit: unable to set WAL mode: near "PRAGMA": syntax error#010
Oct 03 10:19:01 mouse pmxcfs[1333]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Oct 03 10:19:01 mouse pmxcfs[1333]: [main] notice: exit proxmox configuration filesystem (-1)
Oct 03 10:19:01 mouse pmxcfs[1333]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Oct 03 10:19:01 mouse pmxcfs[1333]: [main] notice: exit proxmox configuration filesystem (-1)
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 03 10:19:01 mouse systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Oct 03 10:19:01 mouse systemd[1]: Starting Corosync Cluster Engine...
Oct 03 10:19:01 mouse systemd[1]: Starting PVE Status Daemon...
Oct 03 10:19:01 mouse corosync[1362]:   [MAIN  ] Corosync Cluster Engine 3.1.5 starting up
Oct 03 10:19:01 mouse corosync[1362]:   [MAIN  ] Corosync built-in features: dbus monitoring watchdog systemd xmlconf vqsim nozzle snmp pie relro bindnow
Oct 03 10:19:01 mouse corosync[1362]:   [TOTEM ] Initializing transport (Kronosnet).
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 1.
Oct 03 10:19:01 mouse systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 03 10:19:01 mouse systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 03 10:19:01 mouse pmxcfs[1377]: [database] crit: unable to set WAL mode: near "PRAGMA": syntax error#010
Oct 03 10:19:01 mouse pmxcfs[1377]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Oct 03 10:19:01 mouse pmxcfs[1377]: [database] crit: unable to set WAL mode: near "PRAGMA": syntax error#010
Oct 03 10:19:01 mouse pmxcfs[1377]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Oct 03 10:19:01 mouse pmxcfs[1377]: [main] notice: exit proxmox configuration filesystem (-1)
Oct 03 10:19:01 mouse pmxcfs[1377]: [main] notice: exit proxmox configuration filesystem (-1)
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 03 10:19:01 mouse systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Scheduled restart job, restart counter is at 2.
Oct 03 10:19:01 mouse systemd[1]: Stopped The Proxmox VE cluster filesystem.
Oct 03 10:19:01 mouse corosync[1362]:   [TOTEM ] totemknet initialized
Oct 03 10:19:01 mouse corosync[1362]:   [KNET  ] common: crypto_nss.so has been loaded from /usr/lib/x86_64-linux-gnu/kronosnet/crypto_nss.so
Oct 03 10:19:01 mouse systemd[1]: Starting The Proxmox VE cluster filesystem...
Oct 03 10:19:01 mouse pmxcfs[1448]: [database] crit: unable to set WAL mode: near "PRAGMA": syntax error#010
Oct 03 10:19:01 mouse pmxcfs[1448]: [database] crit: unable to set WAL mode: near "PRAGMA": syntax error#010
Oct 03 10:19:01 mouse pmxcfs[1448]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Oct 03 10:19:01 mouse pmxcfs[1448]: [main] notice: exit proxmox configuration filesystem (-1)
Oct 03 10:19:01 mouse pmxcfs[1448]: [main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
Oct 03 10:19:01 mouse pmxcfs[1448]: [main] notice: exit proxmox configuration filesystem (-1)
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Control process exited, code=exited, status=255/EXCEPTION
Oct 03 10:19:01 mouse systemd[1]: pve-cluster.service: Failed with result 'exit-code'.
Oct 03 10:19:01 mouse systemd[1]: Failed to start The Proxmox VE cluster filesystem.
Oct 03 10:19:01 mouse corosync[1362]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 03 10:19:01 mouse corosync[1362]:   [QB    ] server name: cmap
Oct 03 10:19:01 mouse corosync[1362]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 03 10:19:01 mouse corosync[1362]:   [QB    ] server name: cfg
Oct 03 10:19:01 mouse corosync[1362]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 03 10:19:01 mouse corosync[1362]:   [QB    ] server name: cpg
Oct 03 10:19:01 mouse corosync[1362]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 03 10:19:01 mouse corosync[1362]:   [SERV  ] Service engine loaded: corosync resource monitoring service [6]
Oct 03 10:19:01 mouse corosync[1362]:   [WD    ] Watchdog not enabled by configuration
Oct 03 10:19:01 mouse corosync[1362]:   [WD    ] resource load_15min missing a recovery key.
Oct 03 10:19:01 mouse corosync[1362]:   [WD    ] resource memory_used missing a recovery key.
Oct 03 10:19:01 mouse corosync[1362]:   [WD    ] no resources configured.
Oct 03 10:19:01 mouse corosync[1362]:   [SERV  ] Service engine loaded: corosync watchdog service [7]
Oct 03 10:19:01 mouse corosync[1362]:   [QUORUM] Using quorum provider corosync_votequorum
 
[main] crit: memdb_open failed - unable to open database '/var/lib/pve-cluster/config.db'
I think that's your actual underlying issue. Did the backing database get deleted? In other words, what's the output of ls -la /var/lib/pve-cluster/*?

FWIW, you may be able to restore the DB from another node after stopping pve-cluster there to get the DB into a clean state, but that doesn't explain why the DB files got deleted in the first place; that definitely does not happen on its own or just due to a normal apt update...
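
Roughly, such a restore could look like the sketch below. This is only an illustration based on the node names in this thread, not an official procedure, and you should keep a copy of the broken node's DB before overwriting anything:

Code:
# on the broken node: stop the services that use the DB and keep the old file around
root@mouse:~# systemctl stop pve-cluster corosync
root@mouse:~# cp /var/lib/pve-cluster/config.db /root/config.db.broken

# on a healthy node: briefly stop pve-cluster so the DB is in a clean state, copy it over, then start it again
root@rabbit:~# systemctl stop pve-cluster
root@rabbit:~# scp /var/lib/pve-cluster/config.db root@mouse:/var/lib/pve-cluster/
root@rabbit:~# systemctl start pve-cluster

# back on the broken node: start the services again
root@mouse:~# systemctl start corosync pve-cluster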
 
I think that's your actual underlying issue. Did the backing database get deleted? In other words, what's the output of ls -la /var/lib/pve-cluster/*?

FWIW, you may be able to restore the DB from another node after stopping pve-cluster there to get the DB into a clean state, but that doesn't explain why the DB files got deleted in the first place; that definitely does not happen on its own or just due to a normal apt update...

Very interesting...

Code:
root@mouse:~# ls -la /var/lib/pve-cluster/*
-rw------- 1 root root 106496 Oct  3 08:48 /var/lib/pve-cluster/config.db

/var/lib/pve-cluster/backup:
total 24
drwxr-xr-x 2 root root  4096 Sep 16 10:37 .
drwxr-xr-x 3 root root  4096 Oct  3 08:48 ..
-rw-r--r-- 1 root root 13647 Sep 16 10:37 config-1663339061.sql.gz

How does such a thing happen?
 
Hmm, OK, the database file is actually still there. After looking more closely, it seems that the write-ahead log (WAL) cannot be created; the error #10 would indicate an I/O error: https://www.sqlite.org/rescode.html#ioerr

So, can you create a file there manually, as root: touch /var/lib/pve-cluster/test?

One possibility could be either a full file system (df -h) or a read-only one, for example due to a HW failure.
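
For example, something along these lines (generic checks only, assuming /var/lib/pve-cluster sits on the root file system):

Code:
root@mouse:~# touch /var/lib/pve-cluster/test          # can a file be created at all?
root@mouse:~# df -h /var/lib/pve-cluster               # is the file system full?
root@mouse:~# findmnt -no OPTIONS /                    # does the root mount show 'ro'?
root@mouse:~# dmesg | grep -iE 'i/o error|read-only'   # any kernel hints of a failing disk?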
 
Hmm, OK, the database file is actually still there. After looking more closely, it seems that the write-ahead log (WAL) cannot be created; the error #10 would indicate an I/O error: https://www.sqlite.org/rescode.html#ioerr

So, can you create a file there manually, as root: touch /var/lib/pve-cluster/test?

One possibility could be either a full file system (df -h) or a read-only one, for example due to a HW failure.

Doesn't look to be the issue; I thought of that one as well...

Code:
root@mouse:/var/lib/pve-cluster# ls -la
total 120
drwxr-xr-x  3 root root   4096 Oct  3 11:49 .
drwxr-xr-x 38 root root   4096 Sep 16 11:19 ..
drwxr-xr-x  2 root root   4096 Sep 16 10:37 backup
-rw-------  1 root root 106496 Oct  3 08:48 config.db
-rw-------  1 root root      0 Sep 16 10:22 .pmxcfs.lockfile
-rw-r--r--  1 root root      5 Oct  3 11:49 test
 
Hmm, what version are you even running? pveversion -v and apt list --installed | grep sqlite

How was this Proxmox VE cluster installed, with the official ISO? Also, do you compile your own packages?
 
Hmm, what version are you even running? pveversion -v and apt list --installed | grep sqlite

How was this Proxmox VE cluster installed, with the official ISO? Also, do you compile your own packages?

I only ever install from the official ISO - I've been playing with PVE for a few years now and never encountered this issue.

Code:
root@mouse:~# pveversion -v
proxmox-ve: 7.2-1 (running kernel: 5.15.60-1-pve)
pve-manager: 7.2-11 (running version: 7.2-11/b76d3178)
pve-kernel-helper: 7.2-12
pve-kernel-5.15: 7.2-11
pve-kernel-5.13: 7.1-9
pve-kernel-5.15.60-1-pve: 5.15.60-1
pve-kernel-5.15.53-1-pve: 5.15.53-1
pve-kernel-5.13.19-6-pve: 5.13.19-15
pve-kernel-5.13.19-1-pve: 5.13.19-3
ceph: 16.2.9-pve1
ceph-fuse: 16.2.9-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve1
libproxmox-acme-perl: 1.4.2
libproxmox-backup-qemu0: 1.3.1-1
libpve-access-control: 7.2-4
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.2-2
libpve-guest-common-perl: 4.1-2
libpve-http-server-perl: 4.1-3
libpve-storage-perl: 7.2-10
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.0-3
lxcfs: 4.0.12-pve1
novnc-pve: 1.3.0-3
proxmox-backup-client: 2.2.6-1
proxmox-backup-file-restore: 2.2.6-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.5.1
pve-cluster: 7.2-2
pve-container: 4.2-2
pve-docs: 7.2-2
pve-edk2-firmware: 3.20220526-1
pve-firewall: 4.2-6
pve-firmware: 3.5-3
pve-ha-manager: 3.4.0
pve-i18n: 2.7-2
pve-qemu-kvm: 7.0.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.2-4
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.7.1~bpo11+1
vncterm: 1.7-1
zfsutils-linux: 2.1.5-pve1

...and...

Code:
root@mouse:~# apt list --installed | grep sqlite

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libsqlite3-0/stable,now 3.34.1-3 amd64 [installed]
sqlite3/stable,now 3.34.1-3 amd64 [installed]
 
I only ever install from the official ISO - I've been playing with PVE for a few years now and never encountered this issue.
OK, just wanted to be sure, as this is quite a strange error; it would normally indicate a bogus SQL query, but the ones here are static and haven't changed in well over 5 years...

Currently, I still have two routes that we can check for an issue...

First, the DB itself: maybe it got corrupted and triggers something weird in sqlite (IMO not that realistic, as sqlite is extremely battle-tested and produces quite accurate errors, but well):

Code:
sqlite3 /var/lib/pve-cluster/config.db 'PRAGMA integrity_check'
sqlite3 /var/lib/pve-cluster/config.db .schema

Then, a botched update or some other thing that made the pmxcfs binary, sqlite library or another dependency wonky.
Here it would be good to know exactly what happened before the reboot, any upgrade (/var/log/apt/history.log) or the like. Reinstalling pve-cluster and the sqlite dependency via apt install --reinstall pve-cluster libsqlite3-0 could be worth a try. Otherwise a closer look at the system would be required, which we don't do here in the community forum; that would be covered by enterprise support levels (analyzing via SSH could be a fast option, which is included in the standard level and higher), though.
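
Sketched out, that route could look roughly like this (the log check is only there to see what the last upgrade actually did before trying the reinstall):

Code:
# what did the last apt runs do before the reboot?
root@mouse:~# less /var/log/apt/history.log

# reinstall the packages mentioned above, then try the cluster filesystem again
root@mouse:~# apt install --reinstall pve-cluster libsqlite3-0
root@mouse:~# systemctl restart pve-cluster
root@mouse:~# systemctl status pve-cluster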
 
... via apt install --reinstall pve-cluster libsqlite3-0 could be worth a try...

This seemed to do the trick and everything is back online.

Otherwise a closer look at the system would be required, which we don't do here in the community forum; that would be covered by enterprise support levels (analyzing via SSH could be a fast option, which is included in the standard level and higher), though.

The eventual goal with this cluster will be to convert to an enterprise license. While I've been using PVE for my home lab for a number of years now, I need to prove its viability as a better alternative to our current, much more expensive solution before I can get others in my department on the same page.

It is somewhat of an uphill battle, unfortunately.

Thank you again for your help.
 
This seemed to do the trick and everything is back online.
OK, glad to hear that it fixed the issue, but IMO this is still strange. For a root cause analysis, I'd first check /var/log/apt/history.log and /var/log/apt/term.log for what actually happened in the last updates; maybe there was an unnoticed error or the like.
Then I'd also check the disk hardware and file system integrity, as this could possibly indicate some bit rot, i.e., check the S.M.A.R.T. attributes of the disks (doable via the Proxmox VE web interface) and check the dmesg output for I/O or other disk-related errors.
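
For reference, those checks could look something like this; /dev/sda is only a placeholder, substitute the actual disks in the node:

Code:
# last package operations before the reboot
root@mouse:~# less /var/log/apt/history.log /var/log/apt/term.log

# disk health via smartmontools (repeat per disk; /dev/sda is just an example)
root@mouse:~# smartctl -a /dev/sda

# kernel messages hinting at I/O or file system trouble
root@mouse:~# dmesg | grep -iE 'error|fail|corrupt'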