Node getting "unknown" status after upgrade to pve8

Lumber4236

Hi,

I just upgraded my two PVE 7 nodes to 8, and while everything seemed fine at first, one of the two nodes is now in "unknown" state in the PVE web UI and appears with a grey icon. My services are still up, but it seems I can't do anything on that node.

Any idea how I can fix this?

Thanks in advance,

2024-01-19_16-51-40.png
 
Hi,

Did you see anything interesting in the syslog of the `castor` node?
Could you please post the output of `systemctl status pveproxy` and `systemctl status pvedaemon`?
 
Hey! Sorry for not responding quickly. This was just the beginning of a series of problems caused by a very noisy kern.log on castor filling the root partition, which I failed to recognize quickly enough.

I managed to contain it with an aggressive rotate and purge script, but that's not a proper solution.
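
For reference, it's roughly along these lines (a rough sketch of what I'm doing; the size, retention and file name are just the values I picked, and since kern.log is already covered by /etc/logrotate.d/rsyslog on Debian, this has to replace that entry rather than sit next to it):

Code:
# /etc/logrotate.d/kern-aggressive  (file name is just what I chose)
/var/log/kern.log {
    size 200M        # rotate once the file exceeds 200M (only checked when logrotate runs)
    rotate 2         # keep only two rotated copies, purge the rest
    compress
    missingok
    notifempty
    postrotate
        /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
# logrotate is only triggered daily by default, so I also run it more often from cron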

Attached is a small portion of the kern.log, and here is the output of the two commands you gave me. As far as I can see, there's an error with some PCI device that the kernel is able to correct, but it spams my logs while doing so. I could very well be missing something important here. Even if I'm not, how should I deal with such a growing log? Is there any way to tell the kernel to be less verbose? I don't want to miss other, unrelated issues in the future.

2024-01-23_12-09-23.png

Code:
root@castor:/home/mathieu# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-01-22 20:08:57 EST; 20h ago
    Process: 41015 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
   Main PID: 2048 (pveproxy)
      Tasks: 4 (limit: 76755)
     Memory: 201.9M
        CPU: 53.079s
     CGroup: /system.slice/pveproxy.service
             ├─  2048 pveproxy
             ├─161400 "pveproxy worker"
             ├─163218 "pveproxy worker"
             └─203650 "pveproxy worker"

Jan 23 12:08:35 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 12:08:35 castor pveproxy[2048]: worker 161400 started
Jan 23 12:13:28 castor pveproxy[136417]: worker exit
Jan 23 12:13:28 castor pveproxy[2048]: worker 136417 finished
Jan 23 12:13:28 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 12:13:28 castor pveproxy[2048]: worker 163218 started
Jan 23 16:20:14 castor pveproxy[138179]: worker exit
Jan 23 16:20:14 castor pveproxy[2048]: worker 138179 finished
Jan 23 16:20:14 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 16:20:14 castor pveproxy[2048]: worker 203650 started

root@castor:/home/mathieu# systemctl status pvedaemon
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-01-22 20:08:52 EST; 20h ago
   Main PID: 2039 (pvedaemon)
      Tasks: 4 (limit: 76755)
     Memory: 283.5M
        CPU: 30.980s
     CGroup: /system.slice/pvedaemon.service
             ├─2039 pvedaemon
             ├─2040 "pvedaemon worker"
             ├─2041 "pvedaemon worker"
             └─2042 "pvedaemon worker"

Jan 23 09:16:45 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 09:31:45 castor pvedaemon[2041]: <root@pam> successful auth for user 'root@pam'
Jan 23 09:46:45 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 10:01:45 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 12:07:30 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 12:07:34 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:16:07 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:16:11 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:21:09 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:36:10 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
 

It's always the same PCIe device. What is the device "1b4b:9215" at address "0000:05:00.0"? This would help to identify it: `pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""`
I would first try to fix the problem, and maybe replace the faulty PCIe device or board, instead of searching for ways to hide those error messages.
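
If you need a stopgap to keep the root partition from filling up while you sort out the hardware, correctable AER reports can be switched off entirely with the `pci=noaer` kernel parameter. That only hides the symptom, so take it as an example rather than a recommendation, and it assumes the node boots via GRUB (with ZFS root / systemd-boot the parameter goes into `/etc/kernel/cmdline` and is applied with `proxmox-boot-tool refresh` instead):

Code:
# /etc/default/grub -- append pci=noaer to the existing options, don't drop the rest
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=noaer"

# regenerate the boot config and reboot afterwards
update-grub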
 
That's a SATA controller (I pass it through to a TrueNAS SCALE VM, and have no issues whatsoever with that VM or the disks).

Code:
root@castor:/home/mathieu# pvesh get /nodes/castor/hardware/pci --pci-class-blacklist "" | grep :05
│ 0x010601 │ 0x9215 │ 0000:05:00.0 │         17 │ 0x1b4b │ 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller                      │      │ 0x9215           │                                                     │ 0x1b4b           │ Marvell Technology Group Ltd. │ Marvell Technology Group Ltd. │
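
For context, the passthrough itself is just the usual hostpci entry in the VM config (the VM ID below is made up):

Code:
# /etc/pve/qemu-server/105.conf (excerpt, VM ID is only an example)
hostpci0: 0000:05:00.0,pcie=1   # pcie=1 needs the q35 machine type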
 
Then I would try to replace that SATA controller and see if the errors stop. Or at least unplug it, clean the contacts and plug it in again, in case it's just a bad connection. I personally wouldn't trust any data written to those disks with permanent errors...
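
If you want to check whether anything already written to those disks was silently corrupted, a ZFS scrub inside the TrueNAS VM should surface it (the pool name below is only a placeholder):

Code:
# run inside the TrueNAS VM -- replace "tank" with the actual pool name
zpool scrub tank
zpool status -v tank    # watch the CKSUM column and the "errors:" line once the scrub finishes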