Node getting "unknown" status after upgrade to pve8

Lumber4236

Hi,

I just upgraded my two PVE 7 nodes to 8, and while everything seemed fine at first, one of the two nodes now shows as "unknown" in the PVE web UI, with a grey icon. My services are still up, but it seems I can't do anything on that node.

Any idea how I can fix this?

Thanks in advance,

 
Hi,

Did you see anything interesting in the syslog of the `castor` node?
Could you please post the output of `systemctl status pveproxy` and `systemctl status pvedaemon`?
 
Hey! Sorry for not responding quickly. This turned out to be just the beginning of a series of problems caused by a very noisy kern.log on castor filling the root partition, which I failed to recognize quickly enough.

I managed to contain it with an aggressive rotate and purge script, but that's not a proper solution.
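
Roughly, that kind of containment amounts to something like the sketch below (the path and size threshold are arbitrary choices for illustration); it only caps the log and fixes nothing.

Code:
#!/bin/sh
# Crude size cap for a runaway kern.log (interim workaround, not a fix).
# Path and threshold are arbitrary choices for this sketch.
LOG=/var/log/kern.log
MAX=$((100 * 1024 * 1024))   # 100 MiB

if [ -f "$LOG" ] && [ "$(stat -c%s "$LOG")" -gt "$MAX" ]; then
    gzip -c "$LOG" > "${LOG}.1.gz"   # keep one compressed copy
    : > "$LOG"                       # truncate in place so the syslog daemon keeps writing
fi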

Attached is a small portion of the kern.log, and here is the output of the two commands you gave me. As far as I can see, there's an error with some PCI device, but the kernel is able to correct it while spamming my logs. I could very well be missing something important here, though. And even if I'm not, how should I deal with such a fast-growing log? Is there any way to tell the kernel to be less verbose? I don't want to miss other, unrelated issues in the future.


Code:
root@castor:/home/mathieu# systemctl status pveproxy
● pveproxy.service - PVE API Proxy Server
     Loaded: loaded (/lib/systemd/system/pveproxy.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-01-22 20:08:57 EST; 20h ago
    Process: 41015 ExecReload=/usr/bin/pveproxy restart (code=exited, status=0/SUCCESS)
   Main PID: 2048 (pveproxy)
      Tasks: 4 (limit: 76755)
     Memory: 201.9M
        CPU: 53.079s
     CGroup: /system.slice/pveproxy.service
             ├─  2048 pveproxy
             ├─161400 "pveproxy worker"
             ├─163218 "pveproxy worker"
             └─203650 "pveproxy worker"

Jan 23 12:08:35 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 12:08:35 castor pveproxy[2048]: worker 161400 started
Jan 23 12:13:28 castor pveproxy[136417]: worker exit
Jan 23 12:13:28 castor pveproxy[2048]: worker 136417 finished
Jan 23 12:13:28 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 12:13:28 castor pveproxy[2048]: worker 163218 started
Jan 23 16:20:14 castor pveproxy[138179]: worker exit
Jan 23 16:20:14 castor pveproxy[2048]: worker 138179 finished
Jan 23 16:20:14 castor pveproxy[2048]: starting 1 worker(s)
Jan 23 16:20:14 castor pveproxy[2048]: worker 203650 started

root@castor:/home/mathieu# systemctl status pvedaemon
● pvedaemon.service - PVE API Daemon
     Loaded: loaded (/lib/systemd/system/pvedaemon.service; enabled; preset: enabled)
     Active: active (running) since Mon 2024-01-22 20:08:52 EST; 20h ago
   Main PID: 2039 (pvedaemon)
      Tasks: 4 (limit: 76755)
     Memory: 283.5M
        CPU: 30.980s
     CGroup: /system.slice/pvedaemon.service
             ├─2039 pvedaemon
             ├─2040 "pvedaemon worker"
             ├─2041 "pvedaemon worker"
             └─2042 "pvedaemon worker"

Jan 23 09:16:45 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 09:31:45 castor pvedaemon[2041]: <root@pam> successful auth for user 'root@pam'
Jan 23 09:46:45 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 10:01:45 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 12:07:30 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 12:07:34 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:16:07 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:16:11 castor pvedaemon[2042]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:21:09 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
Jan 23 16:36:10 castor pvedaemon[2040]: <root@pam> successful auth for user 'root@pam'
 

It's always the same PCIe device. What's the device "1b4b:9215" at address "0000:05:00.0"? This would help to identify it: `pvesh get /nodes/{nodename}/hardware/pci --pci-class-blacklist ""`
I would first try to fix the problem, and maybe replace the faulty PCIe device or board, instead of searching for ways to hide those error messages.
 
That's a SATA controller (I pass it through to a TrueNAS SCALE VM, and I experience no issues whatsoever with that VM or the disks).

Code:
root@castor:/home/mathieu# pvesh get /nodes/castor/hardware/pci --pci-class-blacklist "" | grep :05
│ 0x010601 │ 0x9215 │ 0000:05:00.0 │         17 │ 0x1b4b │ 88SE9215 PCIe 2.0 x1 4-port SATA 6 Gb/s Controller                      │      │ 0x9215           │                                                     │ 0x1b4b           │ Marvell Technology Group Ltd. │ Marvell Technology Group Ltd. │
 
Then I would try to replace that SATA controller and see if the errors stop. Or at least unplug it, clean the contacts, and plug it back in, in case it's just a bad connection. I personally wouldn't trust any data written to those disks with permanent errors...
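
To tell whether reseating or replacing the card actually helps, one way (a sketch; exact log wording and lspci output vary by kernel and device) is to watch the kernel log for new messages mentioning that address, and to look at the device's AER capability:

Code:
# Follow kernel messages and filter for AER reports or the device address seen above
journalctl -k -f | grep -i -e aer -e '0000:05:00.0'

# Inspect the device's Advanced Error Reporting capability directly (run as root)
lspci -vvv -s 0000:05:00.0 | grep -i -A2 'Advanced Error Reporting'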
 
