Proxmox host regularly loses network connection, needs full reboot to restore

fonix232

I'm running the latest Proxmox VE 7.4-13, with kernel 6.1.15-1 at the moment, on my QNAP TS-h973AX NAS. The device runs an AMD Ryzen V1500B APU with two Intel I225-V 2.5Gb network interfaces and a Marvell Aquantia AQC107 10Gb interface.

I'm only using the 10Gb interface at the moment, although the other two are part of the same default bridge.
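For context, a rough sketch of the bridge stanza in my /etc/network/interfaces - the interface names and addresses here are illustrative rather than my exact config:

Code:
auto lo
iface lo inet loopback

iface eth0 inet manual
iface eth1 inet manual
iface eth2 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.10/24
        gateway 192.168.1.1
        bridge-ports eth0 eth1 eth2
        bridge-stp off
        bridge-fd 0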

A peculiar issue has started happening: every few days I lose network connectivity for no apparent reason. The kernel and system logs show absolutely no issue; then suddenly corosync notices it can't see my other node and begins complaining about the lack of quorum:

Code:
Jun 13 01:18:58 nas systemd[1]: Starting Daily PVE download activities...
Jun 13 01:19:00 nas pveupdate[680312]: <root@pam> starting task UPID:nas:000A617D:01CA3B6F:6487B5F4:aptupdate::root@pam:
Jun 13 01:19:02 nas pveupdate[680317]: update new package list: /var/lib/pve-manager/pkgupdates
Jun 13 01:19:05 nas pveupdate[680312]: <root@pam> end task UPID:nas:000A617D:01CA3B6F:6487B5F4:aptupdate::root@pam: OK
Jun 13 01:19:05 nas systemd[1]: pve-daily-update.service: Succeeded.
Jun 13 01:19:05 nas systemd[1]: Finished Daily PVE download activities.
Jun 13 01:19:05 nas systemd[1]: pve-daily-update.service: Consumed 6.238s CPU time.
Jun 13 02:09:29 nas pmxcfs[2256]: [dcdb] notice: data verification successful
Jun 13 02:45:13 nas pmxcfs[2256]: [status] notice: received log
Jun 13 02:45:17 nas pmxcfs[2256]: [status] notice: received log
Jun 13 03:09:29 nas pmxcfs[2256]: [dcdb] notice: data verification successful
Jun 13 03:10:01 nas CRON[695793]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 13 03:10:01 nas CRON[695794]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jun 13 03:10:01 nas CRON[695793]: pam_unix(cron:session): session closed for user root
Jun 13 03:33:35 nas corosync[2261]:   [KNET  ] link: host: 1 link: 0 is down
Jun 13 03:33:35 nas corosync[2261]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 13 03:33:35 nas corosync[2261]:   [KNET  ] host: host: 1 has no active links
Jun 13 03:33:37 nas corosync[2261]:   [TOTEM ] Token has not been received in 2250 ms
Jun 13 03:33:37 nas corosync[2261]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 13 03:33:41 nas corosync[2261]:   [QUORUM] Sync members[1]: 2
Jun 13 03:33:41 nas corosync[2261]:   [QUORUM] Sync left[1]: 1
Jun 13 03:33:41 nas corosync[2261]:   [TOTEM ] A new membership (2.3b5) was formed. Members left: 1
Jun 13 03:33:41 nas corosync[2261]:   [TOTEM ] Failed to receive the leave message. failed: 1
Jun 13 03:33:41 nas corosync[2261]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 13 03:33:41 nas corosync[2261]:   [QUORUM] Members[1]: 2
Jun 13 03:33:41 nas corosync[2261]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 13 03:33:41 nas pmxcfs[2256]: [dcdb] notice: members: 2/2256
Jun 13 03:33:41 nas pmxcfs[2256]: [status] notice: node lost quorum
Jun 13 03:33:41 nas pmxcfs[2256]: [status] notice: members: 2/2256
Jun 13 03:33:41 nas pmxcfs[2256]: [dcdb] crit: received write while not quorate - trigger resync
Jun 13 03:33:41 nas pmxcfs[2256]: [dcdb] crit: leaving CPG group
Jun 13 03:33:41 nas pve-ha-lrm[2325]: unable to write lrm status file - unable to open file '/etc/pve/nodes/nas/lrm_status.tmp.2325' - Permission denied
Jun 13 03:33:42 nas pmxcfs[2256]: [dcdb] notice: start cluster connection
Jun 13 03:33:42 nas pmxcfs[2256]: [dcdb] crit: cpg_join failed: 14
Jun 13 03:33:42 nas pmxcfs[2256]: [dcdb] crit: can't initialize service
Jun 13 03:33:48 nas pmxcfs[2256]: [dcdb] notice: members: 2/2256
Jun 13 03:33:48 nas pmxcfs[2256]: [dcdb] notice: all data is up to date
Jun 13 03:34:10 nas pvescheduler[699050]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 13 03:34:10 nas pvescheduler[699049]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 13 03:35:10 nas pvescheduler[699185]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 13 03:35:10 nas pvescheduler[699184]: replication: cfs-lock 'file-replication_cfg' error: no quorum!


The network interface is detected as up and shows traffic, yet the NAS becomes unreachable and cannot reach the outside world. Bringing `vmbr0` down and back up restores the host's network connection, but the VMs remain unreachable until a reboot.

This is incredibly annoying, as I can't leave the NAS running while I'm away, nor can I access it remotely once it goes down.

I've tried excluding the Intel interfaces from the bridge and both downgrading and upgrading the kernel, to no avail.

What exactly could be going wrong here?
 
Well ... remove one interface from the problem bridge, install tmux, install tmux logging, and then capture dmesg -w?


The only arcana there is tmux logging - do the clone below, start tmux, then Ctrl-b I (capital I) to make it active. Ctrl-b P will toggle logging.

git clone https://github.com/tmux-plugins/tpm ~/.tmux/plugins/tpm

I guess you could just do this w/o changing any of the network config, since you're going to be hands on until you get it stable for remote access.
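For completeness, a minimal ~/.tmux.conf sketch that wires up tpm and the tmux-logging plugin (assuming the default clone location from the command above):

Code:
# ~/.tmux.conf - minimal sketch for tpm + tmux-logging
set -g @plugin 'tmux-plugins/tpm'
set -g @plugin 'tmux-plugins/tmux-logging'

# keep this line at the very bottom of the file
run '~/.tmux/plugins/tpm/tpm'

By default the captured logs land in your home directory, so they survive the session as long as the disk stays writable.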
 

Thanks, I'll try - although as I said, removing interfaces from the bridge didn't help much (or at all).

Dmesg so far hasn't shown any hardware-related failure; it's as if routing just suddenly... stopped working. And it happens almost precisely on the dot, roughly every four days. Last time it happened on 9th June at 3:50am, and I reset it around 11am when I noticed the downtime. Now it happened at 3:33am on the 13th; we'll see how long this reset lasts.
 
Whelp, it happened again. There's nothing in the tmux logs beyond the early boot info (the last message is at system timestamp 103.959676), but I'm starting to suspect it might be related to some buffer. The main reason I think so is that I'd done some heavy file copying (some 300GB) from the NAS (I'm running TrueNAS in a VM), then streamed a high-bitrate video, during which it died... And on top of that, when I tried to access it through the serial console, that also died after outputting ~2000 characters.
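If a NIC buffer really is to blame, the ring buffer sizes and drop counters should show it before the next hang. What I plan to check (assuming eth2 is the 10Gb Aquantia port):

Code:
# current vs. hardware-maximum ring buffer sizes
ethtool -g eth2

# NIC statistics - look for steadily rising drop/miss/error counters
ethtool -S eth2 | grep -Ei 'drop|miss|err'

# optionally grow the RX ring towards its maximum
ethtool -G eth2 rx 4096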

Could this be a memory-related issue? I have 2x16GB of RAM installed, and 30000MB is assigned to the TrueNAS VM, which it will obviously use up because of ZFS. I'll try decreasing the VM's RAM allocation and see if stability improves.
 
So your PVE host has 32GB of memory and you are running a VM with 30GB of memory...
That is a tight fit.
As you mentioned yourself, decrease the VM's memory.
 
Well, ZFS is memory-hungry... With the 30TB of disks I have in this bad boy, and deduplication enabled (although I am using an NVMe array for dedup cache purposes), it's a heavy load.
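If I do need to rein it in, capping the ARC inside the guest seems like the cleanest lever, since the dedup tables get cached in ARC. A minimal sketch for a Linux-based guest (the 16GiB value is just an example):

Code:
# /etc/modprobe.d/zfs.conf inside the guest - cap ARC at 16 GiB
options zfs zfs_arc_max=17179869184

# apply immediately without a reboot (value in bytes)
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max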
 
Can you try to use your 2.5Gb NIC just to rule out the 10Gb NIC/switch/driver issues?

Also, if you have two nodes in the cluster, I recommend assigning one of them (a 'primary' one) two votes, so that the cluster stays quorate when the other node is down (better than both nodes dropping out of quorum).
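The vote change is a small edit to /etc/pve/corosync.conf. A sketch of the relevant nodelist section - names and addresses are placeholders, and remember to bump config_version in the totem section so the change propagates:

Code:
nodelist {
  node {
    name: primary-node
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 192.168.1.10
  }
  node {
    name: nas
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.11
  }
}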
 

The 2.5G interfaces are Intel I225-V, which are notoriously buggy at the hardware level :( Nonetheless, I'll try switching over.

Unfortunately, lowering the TrueNAS VM's RAM allocation did not help at all - I got yet another outage today a little past 3am.

However, I'm now starting to see another pattern - every time this happens, it's right around 3am. The number of days between outages is random (anywhere between 1 and 5), but the network always goes down sometime after 3am. No exceptions.
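Given the timing, I'm going to enumerate everything scheduled to run around 3am and see what lines up:

Code:
# anything systemd fires around 03:00?
systemctl list-timers --all

# plus the classic cron locations
cat /etc/crontab /etc/cron.d/*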

Another interesting issue: once the network goes down and I try to access the onboard serial port (ttyS0), after a while (sometimes even during the initial `ip a` command) the serial interface ALSO dies completely. I've managed to recover the logs, though (it's not much):

Code:
Jun 14 04:58:29 nas kernel: perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
Jun 14 05:36:03 nas kernel: perf: interrupt took too long (3147 > 3141), lowering kernel.perf_event_max_sample_rate to 63500
Jun 15 05:25:17 nas kernel: hrtimer: interrupt took 5251 ns
Jun 16 09:29:09 nas kernel: vmbr0: port 2(tap200i0) entered disabled state
Jun 16 09:29:09 nas kernel: vmbr0: port 1(eth2) entered disabled state
Jun 16 09:29:09 nas kernel: vmbr0: port 2(tap200i0) entered disabled state
Jun 16 09:29:09 nas kernel: device eth2 left promiscuous mode
Jun 16 09:29:09 nas kernel: vmbr0: port 1(eth2) entered disabled state
Jun 16 09:29:28 nas kernel: atlantic 0000:0b:00.0 eth2: atlantic: link change old 10000 new 0
Jun 16 09:29:28 nas kernel: atlantic 0000:0b:00.0 eth2: atlantic: link change old 10000 new 0
Jun 16 09:29:28 nas kernel: vmbr0: port 1(eth2) entered blocking state
Jun 16 09:29:28 nas kernel: vmbr0: port 1(eth2) entered disabled state
Jun 16 09:29:28 nas kernel: device eth2 entered promiscuous mode
Jun 16 09:29:33 nas kernel: atlantic 0000:0b:00.0 eth2: atlantic: link change old 0 new 10000
Jun 16 09:29:33 nas kernel: vmbr0: port 1(eth2) entered blocking state
Jun 16 09:29:33 nas kernel: vmbr0: port 1(eth2) entered forwarding state
Jun 16 09:29:33 nas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vmbr0: link becomes ready
Jun 16 09:30:30 nas kernel: ttyS ttyS0: 1 input overrun(s)

As you can see, the kernel logs contain nothing out of place (I copy-pasted the last few lines without removing anything). On the 14th the kernel adjusted some perf variables, on the 15th it complained about an interrupt taking too long, and on the 16th you can see me restarting the network interface with a straightforward `ifdown vmbr0 && ifup vmbr0`.
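(For what it's worth, since PVE 7 ships ifupdown2, reapplying the whole network configuration is also an option instead of bouncing the bridge:)

Code:
# reapply /etc/network/interfaces in place
ifreload -a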

Even in full syslog, I don't see anything particular to happen when the network goes down:

Code:
Jun 16 00:00:44 nas systemd[1]: Starting Rotate log files...
Jun 16 00:00:44 nas systemd[1]: Starting Daily man-db regeneration...
Jun 16 00:00:44 nas systemd[1]: Reloading PVE API Proxy Server.
Jun 16 00:00:44 nas systemd[1]: man-db.service: Succeeded.
Jun 16 00:00:44 nas systemd[1]: Finished Daily man-db regeneration.
Jun 16 00:00:45 nas pveproxy[365440]: send HUP to 2312
Jun 16 00:00:45 nas pveproxy[2312]: received signal HUP
Jun 16 00:00:45 nas pveproxy[2312]: server closing
Jun 16 00:00:45 nas pveproxy[2312]: server shutdown (restart)
Jun 16 00:00:45 nas systemd[1]: Reloaded PVE API Proxy Server.
Jun 16 00:00:46 nas systemd[1]: Reloading PVE SPICE Proxy Server.
Jun 16 00:00:46 nas spiceproxy[365461]: send HUP to 2318
Jun 16 00:00:46 nas spiceproxy[2318]: received signal HUP
Jun 16 00:00:46 nas spiceproxy[2318]: server closing
Jun 16 00:00:46 nas spiceproxy[2318]: server shutdown (restart)
Jun 16 00:00:46 nas systemd[1]: Reloaded PVE SPICE Proxy Server.
Jun 16 00:00:46 nas pvefw-logger[169962]: received terminate request (signal)
Jun 16 00:00:46 nas pvefw-logger[169962]: stopping pvefw logger
Jun 16 00:00:46 nas systemd[1]: Stopping Proxmox VE firewall logger...
Jun 16 00:00:46 nas systemd[1]: pvefw-logger.service: Succeeded.
Jun 16 00:00:46 nas systemd[1]: Stopped Proxmox VE firewall logger.
Jun 16 00:00:46 nas systemd[1]: pvefw-logger.service: Consumed 7.194s CPU time.
Jun 16 00:00:46 nas systemd[1]: Starting Proxmox VE firewall logger...
Jun 16 00:00:46 nas pvefw-logger[365471]: starting pvefw logger
Jun 16 00:00:46 nas systemd[1]: Started Proxmox VE firewall logger.
Jun 16 00:00:46 nas systemd[1]: logrotate.service: Succeeded.
Jun 16 00:00:46 nas systemd[1]: Finished Rotate log files.
Jun 16 00:00:47 nas spiceproxy[2318]: restarting server
Jun 16 00:00:47 nas spiceproxy[2318]: starting 1 worker(s)
Jun 16 00:00:47 nas spiceproxy[2318]: worker 365475 started
Jun 16 00:00:47 nas pveproxy[2312]: restarting server
Jun 16 00:00:47 nas pveproxy[2312]: starting 3 worker(s)
Jun 16 00:00:47 nas pveproxy[2312]: worker 365476 started
Jun 16 00:00:47 nas pveproxy[2312]: worker 365477 started
Jun 16 00:00:47 nas pveproxy[2312]: worker 365478 started
Jun 16 00:00:52 nas spiceproxy[169966]: worker exit
Jun 16 00:00:52 nas spiceproxy[2318]: worker 169966 finished
Jun 16 00:00:52 nas pveproxy[169969]: worker exit
Jun 16 00:00:52 nas pveproxy[169968]: worker exit
Jun 16 00:00:52 nas pveproxy[169967]: worker exit
Jun 16 00:00:52 nas pveproxy[2312]: worker 169969 finished
Jun 16 00:00:52 nas pveproxy[2312]: worker 169967 finished
Jun 16 00:00:52 nas pveproxy[2312]: worker 169968 finished
Jun 16 00:09:29 nas pmxcfs[2252]: [dcdb] notice: data verification successful
Jun 16 00:17:01 nas CRON[367667]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 16 00:17:01 nas CRON[367668]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 16 00:17:01 nas CRON[367667]: pam_unix(cron:session): session closed for user root
Jun 16 01:09:29 nas pmxcfs[2252]: [dcdb] notice: data verification successful
Jun 16 01:17:01 nas CRON[375766]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 16 01:17:01 nas CRON[375767]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 16 01:17:01 nas CRON[375766]: pam_unix(cron:session): session closed for user root
Jun 16 02:09:29 nas pmxcfs[2252]: [dcdb] notice: data verification successful
Jun 16 02:17:01 nas CRON[383871]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 16 02:17:01 nas CRON[383872]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 16 02:17:01 nas CRON[383871]: pam_unix(cron:session): session closed for user root
Jun 16 03:09:29 nas pmxcfs[2252]: [dcdb] notice: data verification successful
Jun 16 03:10:01 nas CRON[391027]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 16 03:10:01 nas CRON[391028]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jun 16 03:10:01 nas CRON[391027]: pam_unix(cron:session): session closed for user root
Jun 16 03:16:19 nas corosync[2257]:   [KNET  ] link: host: 1 link: 0 is down
Jun 16 03:16:19 nas corosync[2257]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 16 03:16:19 nas corosync[2257]:   [KNET  ] host: host: 1 has no active links
Jun 16 03:16:20 nas corosync[2257]:   [TOTEM ] Token has not been received in 2250 ms
Jun 16 03:16:20 nas corosync[2257]:   [TOTEM ] A processor failed, forming new configuration: token timed out (3000ms), waiting 3600ms for consensus.
Jun 16 03:16:24 nas corosync[2257]:   [QUORUM] Sync members[1]: 2
Jun 16 03:16:24 nas corosync[2257]:   [QUORUM] Sync left[1]: 1
Jun 16 03:16:24 nas corosync[2257]:   [TOTEM ] A new membership (2.3d3) was formed. Members left: 1
Jun 16 03:16:24 nas corosync[2257]:   [TOTEM ] Failed to receive the leave message. failed: 1
Jun 16 03:16:24 nas pmxcfs[2252]: [dcdb] notice: members: 2/2252
Jun 16 03:16:24 nas pmxcfs[2252]: [status] notice: members: 2/2252
Jun 16 03:16:24 nas corosync[2257]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Jun 16 03:16:24 nas corosync[2257]:   [QUORUM] Members[1]: 2
Jun 16 03:16:24 nas corosync[2257]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 16 03:16:24 nas pmxcfs[2252]: [status] notice: node lost quorum
Jun 16 03:16:24 nas pmxcfs[2252]: [dcdb] crit: received write while not quorate - trigger resync
Jun 16 03:16:24 nas pmxcfs[2252]: [dcdb] crit: leaving CPG group
Jun 16 03:16:24 nas pve-ha-lrm[2320]: unable to write lrm status file - unable to open file '/etc/pve/nodes/nas/lrm_status.tmp.2320' - Permission denied
Jun 16 03:16:24 nas pmxcfs[2252]: [dcdb] notice: start cluster connection
Jun 16 03:16:24 nas pmxcfs[2252]: [dcdb] crit: cpg_join failed: 14
Jun 16 03:16:24 nas pmxcfs[2252]: [dcdb] crit: can't initialize service
Jun 16 03:16:30 nas pmxcfs[2252]: [dcdb] notice: members: 2/2252
Jun 16 03:16:30 nas pmxcfs[2252]: [dcdb] notice: all data is up to date
Jun 16 03:17:01 nas CRON[391979]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 16 03:17:01 nas CRON[391980]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 16 03:17:01 nas CRON[391979]: pam_unix(cron:session): session closed for user root
Jun 16 03:17:15 nas pvescheduler[392000]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 16 03:17:15 nas pvescheduler[392001]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 16 03:18:15 nas pvescheduler[392135]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!
Jun 16 03:18:15 nas pvescheduler[392134]: replication: cfs-lock 'file-replication_cfg' error: no quorum!
Jun 16 03:19:15 nas pvescheduler[392270]: jobs: cfs-lock 'file-jobs_cfg' error: no quorum!

At midnight, the scheduled log rotation runs alongside other maintenance tasks, and the PVE API proxy, SPICE proxy, and firewall logger get restarted. After that there are only the hourly cron job logs (I don't have any cron jobs at the moment beyond the defaults installed by Proxmox), until at 03:16:19 corosync realises it has lost the connection to my other node.

The only even remotely relevant bit of logging here is the daily ext4 scrub task (e2scrub_all), which seems to run successfully and without any issues just minutes before the network goes down:

Code:
Jun 16 03:10:01 nas CRON[391027]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Jun 16 03:10:01 nas CRON[391028]: (root) CMD (test -e /run/systemd/system || SERVICE_MODE=1 /sbin/e2scrub_all -A -r)
Jun 16 03:10:01 nas CRON[391027]: pam_unix(cron:session): session closed for user root

However, I doubt this is relevant.

I think I'll work around this with a simple cron job that runs at, say, 4am, pings my gateway, and reboots the device if it's unreachable.
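Something along these lines - the gateway address is a placeholder and the script path is arbitrary:

Code:
#!/bin/sh
# /usr/local/sbin/net-watchdog.sh - reboot if the gateway stops answering
GATEWAY=192.168.1.1

if ! ping -c 3 -W 2 "$GATEWAY" > /dev/null 2>&1; then
    logger -t net-watchdog "gateway $GATEWAY unreachable, rebooting"
    /sbin/reboot
fi

# root crontab entry to run it daily at 4am:
# 0 4 * * * /usr/local/sbin/net-watchdog.sh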


As for the quorum question, I really wish there were a way to disable all the "co-working" bits and bobs of clustering - all I need is the ability to manage all my nodes from a single interface. I get it, it's not a feature enterprise users running 16/24/32-machine clusters would want, but for low node-count home use it would be vastly useful.



One thing that occurred to me - I have a bunch of disks (literally all but the boot disk) that are directly passed through to the TrueNAS VM. Not just as block devices (since that wouldn't provide SMART readouts etc.) - the whole PCIe controller is attached to the VM:

[screenshot: VM hardware panel showing the PCIe storage controller passed through]

These disks are all ZFS-based and are detected during boot (though once the controller is handed to the VM, the pools disappear from the host). Is it possible that the daily ZFS scrub task triggers this issue?

The network interfaces are on completely different PCIe buses:

Code:
0b:00.0 Ethernet controller: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 02)
0c:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
0d:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 03)
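Given the whole storage controller is passed through, one thing I should rule out is the NICs sharing an IOMMU group with it - if they did, the passthrough could plausibly drag the NIC down with it. A quick check:

Code:
# list every PCI device, grouped by IOMMU group
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        lspci -nns "${d##*/}"
    done
done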
 
Hello
Sorry to digress from the subject a little, but I have two quick questions.
I have this NAS and would also like to run Proxmox instead of QTS or QuTS - how did you go about installing it?
Have you found a way to connect a screen for BIOS configuration and the Proxmox installation?
 
Hi, did you ever resolve this? I fear I may have a similar issue. My server randomly becomes unreachable until the next reboot.
Sometimes this happens two minutes after I start the server; other times it won't happen for days.

When I view the logs with `journalctl -r`, I don't see anything that indicates a cause.

For example, my server went offline at 20:55 yesterday while cloning/destroying a VM. I thought it might be a disk or heavy-load issue, but right after the reboot I redid the same task three times over to really load the disks - no issues. At 21:06 I decided to power reset.

[screenshot: log excerpt around the time of the outage]

Any idea what else I can check?
 
