Node getting "unknown" status after upgrade to pve8

Lumber4236

Hi,

I just upgraded my two PVE 7 nodes to 8, and while everything seemed fine at first, one of the two nodes is now in "unknown" state in the PVE web UI and appears with a grey icon. My services are still up, but it seems I can't do anything on that node.

Any idea how I can fix this?

Thanks in advance,

2024-01-19_16-51-40.png
 
Hi,

Did you see anything interesting in the syslog of the `castor` node?
Could you please post the output of `systemctl status pveproxy` and `systemctl status pvedaemon`?
 
Hey! Sorry for not responding quickly. This was just the beginning of a series of problems caused by a very noisy kern.log on castor filling the root partition, which I failed to recognize quickly enough.

I managed to contain it with an aggressive rotate and purge script, but that's not a proper solution.
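
For reference, it's roughly along these lines (a rough sketch of what I'm doing; the size, retention and file name are just the values I picked, and since kern.log is already covered by /etc/logrotate.d/rsyslog on Debian, this has to replace that entry rather than sit next to it):

Code:
# /etc/logrotate.d/kern-aggressive  (file name is just what I chose)
/var/log/kern.log {
    size 200M        # rotate once the file exceeds 200M (only checked when logrotate runs)
    rotate 2         # keep only two rotated copies, purge the rest
    compress
    missingok
    notifempty
    postrotate
        /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
# logrotate is only triggered daily by default, so I also run it more often from cron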

Attached is a small portion of the kern.log, and here is the output of the two commands you gave me. As far as I can see, there's an error with some PCI device that the kernel is able to correct, but it spams my logs while doing so. I could very well be missing something important here. Even if I'm not, how should I deal with such a growing log? Is there any way to tell the kernel to be less verbose? I don't want to miss other, unrelated issues in the future.

2024-01-23_12-09-23.png

Code:
root@castor:/home/mathieu# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-01-22 20:08:57 EST; 20h ago
    Process: 41015 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
   Main PID: 2048 (pveproxy)
      Tasks: 4 (limit: 76755)
     Memory: 201.9M
        CPU: 53.079s
     CGroup: /system.slice/pveproxy.service
             ├─  2048 pveproxy
             ├─161400 "pveproxy worker"
             ├─163218 "pveproxy worker"
             └─203650 "pveproxy worker"

Jan 23 12:08:35 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 12:08:35 castor pveproxy[2048]: worker 161400 started
Jan 23 12:13:28 castor pveproxy[136417]: worker exit
Jan 23 12:13:28 castor pveproxy[2048]: worker 136417 finished
Jan 23 12:13:28 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 12:13:28 castor pveproxy[2048]: worker 163218 started
Jan 23 16:20:14 castor pveproxy[138179]: worker exit
Jan 23 16:20:14 castor pveproxy[2048]: worker 138179 finished
Jan 23 16:20:14 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 16:20:14 castor pveproxy[2048]: worker 203650 started

root@castor:/home/mathieu# systemctl status pvedaemon
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-01-22 20:08:52 EST; 20h ago
   Main PID: 2039 (pvedaemon)
      Tasks: 4 (limit: 76755)
     Memory: 283.5M
        CPU: 30.980s
     CGroup: /system.slice/pvedaemon.service
             ├─2039 pvedaemon
             ├─2040 "pvedaemon worker"
             ├─2041 "pvedaemon worker"
             └─2042 "pvedaemon worker"

Jan 23 09:16:45 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 09:31:45 castor pvedaemon[2041]: <root@pam> successful auth for user 'root@pam'
Jan 23 09:46:45 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 10:01:45 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 12:07:30 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 12:07:34 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:16:07 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:16:11 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:21:09 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:36:10 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
 

It's always the same PCIe device. What is the device "1b4b:9215" at address "0000:05:00.0"? This would help to identify it: `pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""`
I would first try to fix the problem, and maybe replace the faulty PCIe device or board, instead of searching for ways to hide those error messages.
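
If you need a stopgap to keep the root partition from filling up while you sort out the hardware, correctable AER reports can be switched off entirely with the `pci=noaer` kernel parameter. That only hides the symptom, so take it as an example rather than a recommendation, and it assumes the node boots via GRUB (with ZFS root / systemd-boot the parameter goes into `/etc/kernel/cmdline` and is applied with `proxmox-boot-tool refresh` instead):

Code:
# /etc/default/grub -- append pci=noaer to the existing options, don't drop the rest
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=noaer"

# regenerate the boot config and reboot afterwards
update-grub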
 
That's a SATA controller (I pass it through to a TrueNAS SCALE VM, and have no issues whatsoever with that VM or the disks).

Code:
root@castor:/home/mathieu# pvesh get /nodes/castor/hardware/pci --pci-class-blacklist "" | grep :05
│ 0x010601 │ 0x9215 │ 0000:05:00.0 │         17 │ 0x1b4b │ 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller                      │      │ 0x9215           │                                                     │ 0x1b4b           │ Marvell Technology Group Ltd. │ Marvell Technology Group Ltd. │
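
For context, the passthrough itself is just the usual hostpci entry in the VM config (the VM ID below is made up):

Code:
# /etc/pve/qemu-server/105.conf (excerpt, VM ID is only an example)
hostpci0: 0000:05:00.0,pcie=1   # pcie=1 needs the q35 machine type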
 
Then I would try to replace that SATA controller and see if the errors stop. Or at least unplug it, clean the contacts and plug it in again, in case it's just a bad connection. I personally wouldn't trust any data written to those disks with permanent errors...
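
If you want to check whether anything already written to those disks was silently corrupted, a ZFS scrub inside the TrueNAS VM should surface it (the pool name below is only a placeholder):

Code:
# run inside the TrueNAS VM -- replace "tank" with the actual pool name
zpool scrub tank
zpool status -v tank    # watch the CKSUM column and the "errors:" line once the scrub finishes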