[SOLVED] pveproxy only listening on ipv6 address, not on ipv4

loneboat

I have a Proxmox 7 install ("xxxx" below), which previously was working fine. However I noticed this morning that I cannot connect to the web GUI as I usually do (pveproxy on port 8006). I haven't made any recent changes that I'm aware of.

I can see from systemctl that it's up and running, and I can see from lsof that it's listening on ipv6 only:


Code:
root@xxxxx:~# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; vendor preset: en>
     Active: active (running) since Tue 2021-11-16 09:08:37 CST; 11min ago
    Process: 9394 ExecStartPre=/usr/bin/pvecm updatecerts --silent (code=exited, stat>
    Process: 9396 ExecStart=/usr/bin/pveproxy start (code=exited, status=0/SUCCESS)
   Main PID: 9399 (pveproxy)
      Tasks: 4 (limit: 231981)
     Memory: 133.3M
        CPU: 2.437s
     CGroup: /system.slice/pveproxy.service
             ├─9399 pveproxy
             ├─9400 pveproxy worker
             ├─9401 pveproxy worker
             └─9402 pveproxy worker
            
root@xxxxx:~# lsof -i:8006
COMMAND    PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
pveproxy  9399 www-data    6u  IPv6  40326      0t0  TCP *:8006 (LISTEN)
pveproxy  9400 www-data    6u  IPv6  40326      0t0  TCP *:8006 (LISTEN)
pveproxy  9401 www-data    6u  IPv6  40326      0t0  TCP *:8006 (LISTEN)
pveproxy  9402 www-data    6u  IPv6  40326      0t0  TCP *:8006 (LISTEN)

I have another node ("yyyy" below) in the same cluster (which is still on Proxmox 6.3-4), which produced different results from lsof:

Code:
root@yyyy:~# lsof -i:8006
COMMAND     PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
pveproxy   8830 www-data    6u  IPv4  43806      0t0  TCP *:8006 (LISTEN)
pveproxy  13639 www-data    6u  IPv4  43806      0t0  TCP *:8006 (LISTEN)
pveproxy  20350 www-data    6u  IPv4  43806      0t0  TCP *:8006 (LISTEN)
pveproxy  31762 www-data    6u  IPv4  43806      0t0  TCP *:8006 (LISTEN)

I should note that previously, node yyyy's web interface showed the whole cluster (including node xxxx) just fine, and I've typically used node yyyy to manage the whole cluster without issue.

Have I somehow disabled IPv4 on the web GUI? The IPv4 stack on the node itself is working, since I can SSH into xxxx via its IPv4 address.

I did find the following thread from 2018 with the OPPOSITE problem (the person was asking about enabling IPv6 when only IPv4 was responding): https://forum.proxmox.com/threads/pveproxy-not-listening-on-ipv6-only-ipv4.45417/ . It did not really help me, though - the only suggestion was to add the IPv6 address to the hosts file.
 
What's the content of `/etc/default/pveproxy` and the result of `cat /proc/sys/net/ipv6/bindv6only`?
 
I'm afraid I don't have a pveproxy file in /etc/default:

Code:
root@xxxx:~# ls -al /etc/default/
total 110
drwxr-xr-x  3 root root   27 Sep  9 12:48 .
drwxr-xr-x 86 root root  174 Nov 16 09:19 ..
-rw-r--r--  1 root root  124 Feb 24  2021 bridge-utils
-rw-r--r--  1 root root  159 Aug  9 07:41 ceph
-rw-r--r--  1 root root  255 May 13  2021 chrony
-rw-r--r--  1 root root  285 Sep  9 12:46 console-setup
-rw-r--r--  1 root root   35 May 16  2021 corosync
-rw-r--r--  1 root root  955 Feb 22  2021 cron
-rw-r--r--  1 root root  297 Feb 21  2021 dbus
-rw-r--r--  1 root root 1176 Sep  2 04:51 grub
drwxr-xr-x  2 root root    5 Sep  9 12:48 grub.d
-rw-r--r--  1 root root   81 Jul 28 14:09 hwclock
-rw-r--r--  1 root root  150 Sep  9 12:46 keyboard
-rw-r--r--  1 root root   52 Sep  9 12:46 locale
-rw-r--r--  1 root root  913 Jul 20 10:28 lxc
-rw-r--r--  1 root root  625 May 24 05:29 networking
-rw-r--r--  1 root root  793 Jun 28 02:15 nfs-common
-rw-r--r--  1 root root 1756 Jul  6 14:16 nss
-rw-r--r--  1 root root 2691 Apr 28  2021 open-iscsi
-rw-r--r--  1 root root   76 May 12  2021 pve-ha-manager
-rw-r--r--  1 root root  356 Jul 12  2020 rpcbind
-rw-r--r--  1 root root 1900 Sep  1  2019 rrdcached
-rw-r--r--  1 root root 2062 Feb  2  2021 rsync
-rw-r--r--  1 root root  363 Oct  9  2019 smartmontools
-rw-r--r--  1 root root  133 Mar 13  2021 ssh
-rw-r--r--  1 root root 1118 Feb  7  2020 useradd
-rw-r--r--  1 root root 4206 Jul  9 11:23 zfs

Here is bindv6only:

Code:
root@xxxx:~# cat /proc/sys/net/ipv6/bindv6only
0
 
pve-manager/7.1-5/6fe2299a0 (running kernel: 5.13.19-1-pve)

After further investigation, my installation is significantly more screwed up than I initially thought. Essential services aren't starting, I can't even SSH into the box, and there's all sorts of weird connectivity issues.
 
Thank you for the output of those commands.
Could you also run `ss -tlpn` and provide the output?
 
You bet:

Code:
root@xxxx:~# ss -tlpn
State      Recv-Q     Send-Q       Local Address:Port       Peer Address:Port   Process
LISTEN     0          4096             127.0.0.1:85              0.0.0.0:*       users:(("pvedaemon worke",pid=9392,fd=6),("pvedaemon worke",pid=9391,fd=6),("pvedaemon worke",pid=9390,fd=6),("pvedaemon",pid=9389,fd=6))
LISTEN     0          128                0.0.0.0:22              0.0.0.0:*       users:(("sshd",pid=5073,fd=3))
LISTEN     0          100              127.0.0.1:25              0.0.0.0:*       users:(("master",pid=5243,fd=13))
LISTEN     0          4096               0.0.0.0:111             0.0.0.0:*       users:(("rpcbind",pid=4559,fd=4),("systemd",pid=1,fd=35))
LISTEN     0          128                   [::]:22                 [::]:*       users:(("sshd",pid=5073,fd=4))
LISTEN     0          4096                     *:3128                  *:*       users:(("spiceproxy work",pid=9407,fd=6),("spiceproxy",pid=9406,fd=6))
LISTEN     0          100                  [::1]:25                 [::]:*       users:(("master",pid=5243,fd=14))
LISTEN     0          4096                     *:8006                  *:*       users:(("pveproxy worker",pid=9402,fd=6),("pveproxy worker",pid=9401,fd=6),("pveproxy worker",pid=9400,fd=6),("pveproxy",pid=9399,fd=6))
LISTEN     0          4096                  [::]:111                [::]:*       users:(("rpcbind",pid=4559,fd=6),("systemd",pid=1,fd=37))
 
Thank you for the output.
As you can see, `pveproxy` is bound to the wildcard address (`*:8006`). This means that it accepts connections over both IPv4 and IPv6.
It seems the issue you have is caused by something else.

What is the exact error you get when you try to connect to the Web GUI?
A screenshot would be great.
And if possible the pveproxy log (/var/log/pveproxy/access.log) after trying to connect to the GUI.
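For anyone reading along, here is a minimal sketch of how the "Local Address" column of `ss -tln` can be read. It assumes, as far as I understand the `ss` output format, that a bare `*` is printed for an IPv6 wildcard socket that also accepts IPv4, while `[::]` is printed when the socket is restricted to IPv6. The function name `classify_listener` is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox or iproute2): classify the
# "Local Address:Port" column of `ss -tln` output.
classify_listener() {
    addr="${1%:*}"                 # strip the trailing ":port"
    case "$addr" in
        '*')       echo "dual-stack" ;;        # IPv6 socket that also accepts IPv4
        '[::]')    echo "ipv6-only" ;;         # IPv6 wildcard restricted to IPv6
        '0.0.0.0') echo "ipv4-only" ;;         # IPv4 wildcard
        *)         echo "specific-address" ;;  # bound to one address
    esac
}

# Values taken from the ss output above.
classify_listener '*:8006'       # pveproxy
classify_listener '[::]:22'      # sshd (IPv6 socket)
classify_listener '0.0.0.0:22'   # sshd (IPv4 socket)
```

This would also explain why `lsof` reports the pveproxy socket as type IPv6 even though IPv4 clients can reach it.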
 
I don't get an error when accessing the web GUI - just a timeout.

I first noticed something was wrong when I saw a red "X" on the node in the web GUI of one of my OTHER nodes in the cluster. Note that in this screenshot, I'm accessing node YYYY's web GUI - node YYYY seems perfectly healthy, and I can access it just fine. Node XXXX is the troublemaker, and you can see that it shows up with red indicators in the GUI here. It also shows "Permission denied - invalid PVE ticket (401)" on the status page for XXXX. However, I can SSH into node XXXX just fine.

[Attached screenshot: 2021-11-18_09-37-26-Window.png]


/var/log/pveproxy/access.log shows nothing when I try to access the web GUI. However, I DO see node YYYY being issued a 401 response (probably the one producing the "permission denied" message above):

Code:
root@XXXX:~# tail -f /var/log/pveproxy/access.log
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:35:53 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:36:03 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:36:14 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:36:24 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:36:48 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:36:59 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:37:09 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:37:19 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:37:29 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
::ffff:{IP Addr of YYYY} - - [18/11/2021:09:37:39 -0600] "GET /api2/json/nodes/XXXX/status HTTP/1.1" 401 -
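A side note on the log format: the `::ffff:` prefix marks an IPv4-mapped IPv6 address, so these requests from YYYY arrived over IPv4 on the same socket. Below is a rough sketch for summarizing 401 responses per client from a log in this format. The function name `count_401_by_client` is hypothetical, and it assumes the client field is a real address without spaces (unlike the redacted placeholder above) and that the status code is the ninth whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helper: count HTTP 401 responses per client in a
# pveproxy-style access log, dropping the IPv4-mapped "::ffff:" prefix.
# Reads the files given as arguments, or stdin if none are given.
count_401_by_client() {
    awk '$9 == 401 { sub(/^::ffff:/, "", $1); n[$1]++ }
         END { for (c in n) print n[c], c }' "$@" | sort -rn
}

# Example on a real node (path taken from the thread):
#   count_401_by_client /var/log/pveproxy/access.log
```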
 
This could be an issue with either the time, or the certificate.
Check if the time is in sync between the nodes. If it is, try updating the certificates of the cluster with `pvecm updatecerts --force`.
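One way to put a number on the time difference is to compare Unix timestamps from both nodes. This is only a sketch: the helper name `clock_skew_seconds` is made up, the SSH round trip adds some error, and the host name `yyyy` stands in for the other node.

```shell
#!/bin/sh
# Hypothetical helper: absolute difference between two Unix timestamps.
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    echo "$diff"
}

# Example with fixed values 9 seconds apart.
clock_skew_seconds 1637249753 1637249762   # prints 9

# On a real node (assumes root SSH access to the other node):
#   local_ts=$(date +%s)
#   remote_ts=$(ssh root@yyyy date +%s)
#   clock_skew_seconds "$local_ts" "$remote_ts"
```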
 
Looks like they're out of sync by about 9 seconds - is that a large enough difference to be of concern?
 
One other interesting behavior I'm noticing (don't know if it's relevant) is that when I SSH into YYYY (the healthy node) from my desktop, the connection is immediate. However when I SSH into XXXX (the unhealthy one), it takes a good solid 30 seconds for it to connect. This is odd, because XXXX is actually a faster server, and both of these hosts are on the same network, running through the same router and switch. Makes me wonder if the network is doing something funky with ipv6 on the newer proxmox, but not on the old.
 
Did you upgrade from PVE 6 to 7 or is it a new install?
If it is an upgraded PVE 6 install, I'd suggest switching from `systemd-timesyncd` to `chrony`, as it works a lot better by implementing the full NTP spec instead of just SNTP.

Have you rebooted the node since the issues began? Could be a stuck process that slows everything down.
You could check the output of `ps aux` for any processes in `D` state.
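A small sketch for picking out `D` state (uninterruptible sleep) processes, assuming the procps `ps` with a `pid,stat,comm` column selection. The function name `list_d_state` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -eo pid,stat,comm` output on stdin and
# print the rows whose STAT column starts with "D".
list_d_state() {
    awk 'NR > 1 && $2 ~ /^D/'   # NR > 1 skips the header line
}

# On a real node:
#   ps -eo pid,stat,comm | list_d_state
```

An empty result means no process was in `D` state at the moment `ps` ran, so it can be worth repeating the check a few times.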
 
It's a new install. I'm actually thinking of just decommissioning it and rebuilding it - it's not currently hosting anything important, and ultimately may be less trouble. But I hate not knowing what's causing it. :-(
 
Well this is very very strange - Poking around the filesystem, I see that write permissions for most of my /etc/pve/ directory are absent on host XXXX:

Code:
root@XXXX:/etc/pve# ll
total 14K
drwxr-xr-x  2 root www-data    0 Dec 31  1969 .
drwxr-xr-x 87 root root      177 Nov 18 10:57 ..
-r--r-----  1 root www-data  451 Nov  9 00:50 authkey.pub
-r--r-----  1 root www-data  451 Nov  9 00:50 authkey.pub.old
-r--r-----  1 root www-data  501 Dec 31  1969 .clusterlog
-r--r-----  1 root www-data  521 Sep  9 13:01 corosync.conf
-r--r-----  1 root www-data   16 Dec 19  2018 datacenter.cfg
-rw-r-----  1 root www-data    2 Dec 31  1969 .debug
dr-xr-xr-x  2 root www-data    0 Mar 28  2019 firewall
dr-xr-xr-x  2 root www-data    0 Jul 15  2020 ha
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 local -> nodes/XXXX
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 lxc -> nodes/XXXX/lxc
-r--r-----  1 root www-data   37 Dec 31  1969 .members
dr-xr-xr-x  2 root www-data    0 Dec 19  2018 nodes
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 openvz -> nodes/XXXX/openvz
dr-x------  2 root www-data    0 Dec 19  2018 priv
-r--r-----  1 root www-data 2.1K Dec 19  2018 pve-root-ca.pem
-r--r-----  1 root www-data 1.7K Dec 19  2018 pve-www.key
lr-xr-xr-x  1 root www-data    0 Dec 31  1969 qemu-server -> nodes/XXXX/qemu-server
-r--r-----  1 root www-data 1.5K Oct 27 09:24 replication.cfg
-r--r-----  1 root www-data  966 Dec 31  1969 .rrd
dr-xr-xr-x  2 root www-data    0 Jul 15  2020 sdn
-r--r-----  1 root www-data  557 Sep 17 16:13 storage.cfg
-r--r-----  1 root www-data  335 Sep 22 08:25 user.cfg
-r--r-----  1 root www-data  734 Dec 31  1969 .version
dr-xr-xr-x  2 root www-data    0 Jul 15  2020 virtual-guest
-r--r-----  1 root www-data 5.4K Dec 31  1969 .vmlist
-r--r-----  1 root www-data  263 Oct 27 09:24 vzdump.cron

A similar listing from node YYYY shows that most of the files/dirs have write permissions, at least for owner:

Code:
root@YYYY:~# ll /etc/pve
total 14K
drwxr-xr-x   2 root www-data    0 Dec 31  1969 .
drwxr-xr-x 107 root root      206 Nov 18 10:57 ..
-rw-r-----   1 root www-data  451 Nov 18 00:51 authkey.pub
-rw-r-----   1 root www-data  451 Nov 18 00:51 authkey.pub.old
-r--r-----   1 root www-data 8.4K Dec 31  1969 .clusterlog
-rw-r-----   1 root www-data  521 Sep  9 13:01 corosync.conf
-rw-r-----   1 root www-data   16 Dec 19  2018 datacenter.cfg
-rw-r-----   1 root www-data    2 Dec 31  1969 .debug
drwxr-xr-x   2 root www-data    0 Mar 28  2019 firewall
drwxr-xr-x   2 root www-data    0 Jul 15  2020 ha
lrwxr-xr-x   1 root www-data    0 Dec 31  1969 local -> nodes/YYYY
lrwxr-xr-x   1 root www-data    0 Dec 31  1969 lxc -> nodes/YYYY/lxc
-r--r-----   1 root www-data  313 Dec 31  1969 .members
drwxr-xr-x   2 root www-data    0 Dec 19  2018 nodes
lrwxr-xr-x   1 root www-data    0 Dec 31  1969 openvz -> nodes/YYYY/openvz
drwx------   2 root www-data    0 Dec 19  2018 priv
-rw-r-----   1 root www-data 2.1K Dec 19  2018 pve-root-ca.pem
-rw-r-----   1 root www-data 1.7K Dec 19  2018 pve-www.key
lrwxr-xr-x   1 root www-data    0 Dec 31  1969 qemu-server -> nodes/YYYY/qemu-server
-rw-r-----   1 root www-data 1.5K Oct 27 09:24 replication.cfg
-r--r-----   1 root www-data 9.1K Dec 31  1969 .rrd
drwxr-xr-x   2 root www-data    0 Jul 15  2020 sdn
-rw-r-----   1 root www-data  557 Sep 17 16:13 storage.cfg
-rw-r-----   1 root www-data  335 Sep 22 08:25 user.cfg
-r--r-----   1 root www-data  813 Dec 31  1969 .version
drwxr-xr-x   2 root www-data    0 Jul 15  2020 virtual-guest
-r--r-----   1 root www-data 5.5K Dec 31  1969 .vmlist
-rw-r-----   1 root www-data  263 Oct 27 09:24 vzdump.cron

This seems likely to have serious consequences, possibly including the weird behavior I'm seeing. How on earth could these permissions have been removed? I rarely log into this box directly via SSH, and I haven't done any poking around with permissions that I can think of.
 
As Dietmar pointed out in the other thread, this might be because pmxcfs (the cluster filesystem mounted at `/etc/pve`) is mounted read-only.

Do you want to continue debugging the cause for all these issues here, or do you plan on reinstalling the complete node?
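One common reason for pmxcfs being read-only is that the node has lost quorum, although nothing in this thread confirms that was the cause here. A minimal sketch for checking it, assuming `pvecm status` prints a `Quorate:` line with `Yes` or `No` as shown in the Proxmox documentation. The function name `is_quorate` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `pvecm status` output on stdin and succeed
# only if it reports "Quorate: Yes".
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# On a real node:
#   if pvecm status | is_quorate; then
#       echo "node is quorate"
#   else
#       echo "no quorum: /etc/pve will be read-only"
#   fi
```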
 
Honestly I'm just gonna nuke the node and reinstall. There's a decent chance that it's not even the node's fault, and I've messed up something upstream in my network. We could spend days on it and still find nothing.

I appreciate your help along the way though - Thank you for your time! I'll leave this closed for now and start a new thread if I see this behavior again in the future.

Thanks!
 
Alright, feel free to ping me if you open a new thread regarding issues with that node.
 