astramateria

New Member
Nov 11, 2024
Setup:

Hardware:
Motherboard: MSI B350M Gaming Pro (MS-7A39)
CPU: AMD Ryzen 7 1700 (8 cores)
GPU: AMD Radeon RX 470
RAM: 16GB DDR4 @ 1300MHz
Storage: 2x 2TB WD HDD, 1x 1TB WD SSD, all SATA
Network: ASM1182e 2-Port PCIe x1 Gen2 Packet Switch; static IP
Boot drive: Crucial BX500 240GB 3D NAND SATA 2.5-inch SSD
Proxmox version: 8.2.7

VMs:

"Hardware"
RAM: 8GB
CPU: 4 (1 socket, 4 cores) [x86-64-v2-AES]
BIOS: SeaBIOS
Display: Default
Machine: Default (i440fx)
SCSI Controller: VirtIO SCSI single
Hard Disk: local-lvm
Network Device (net0): virtio=::::,bridge=vmbr0,firewall=1
OS: TrueNAS-SCALE-24.10.0.2

Steps to reproduce:

1. Turn on Proxmox host. Wait for bootup.
2. Go to Proxmox web GUI and start up one VM (TrueNAS).
3. Wait a few hours or overnight. System uptime ranges from 2 hours to 12 hours.

PROBLEMS:

This is the most common problem:

4. The web GUI is unresponsive; before the hang, the Proxmox server could be reached at its internal address, 192.168.1.151:8006.
5. The Proxmox server is gone from the network; it is not visible on the router.
6. The physical machine is on and the Ethernet port lights are lit, but a monitor connected via HDMI to the GPU shows nothing.

After restarting Proxmox, I searched the logs. A normal shutdown ends with something like:

Code:
Nov 09 13:44:39 athena systemd-shutdown: Watchdog running with a hardware timeout of 10min.
Nov 09 13:44:39 athena systemd-shutdown: Syncing filesystems and block devices.
Nov 09 13:44:39 athena systemd-shutdown: Sending SIGTERM to remaining processes...
Nov 09 13:44:39 athena systemd-journald[519]: Journal stopped
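
One way to tell clean shutdowns from dirty ones without reading every boot by hand is to check whether the last few journal lines of a boot contain the markers from the normal-shutdown log above. A minimal Python sketch, assuming the lines were exported to text with something like `journalctl -b -1 --no-pager`; the function name and the marker list are illustrative choices, not anything Proxmox provides:

```python
# Markers taken from the normal-shutdown excerpt above. Assumption: a clean
# shutdown leaves at least one of these within the last few journal lines.
CLEAN_MARKERS = ("systemd-shutdown", "Journal stopped")


def ended_cleanly(journal_lines, tail=10):
    """Return True if any of the last `tail` non-empty lines has a clean-shutdown marker."""
    lines = [ln for ln in journal_lines if ln.strip()]
    return any(marker in ln for ln in lines[-tail:] for marker in CLEAN_MARKERS)


clean = [
    "Nov 09 13:44:39 athena systemd-shutdown: Syncing filesystems and block devices.",
    "Nov 09 13:44:39 athena systemd-journald[519]: Journal stopped",
]
dirty = [
    "Nov 05 21:21:56 athena pvedaemon[1221]: <root@pam> successful auth for user 'root@pam'",
    "Nov 05 21:22:13 athena kernel: usb 2-2: USB disconnect, device number 2",
]
print(ended_cleanly(clean), ended_cleanly(dirty))
```

Any boot for which this returns False is a candidate for a hang or hard reset, and its last lines are the ones worth comparing.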

I have seen a number of different final log entries after a dirty shutdown. I will include them all below, but the most common ones are related to the replication state file:

Code:
Nov 06 19:54:57 athena pvescheduler[1488]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'

Code:
...
Nov 06 20:29:00 athena pvescheduler[6500]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 06 20:30:00 athena pvescheduler[6658]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 06 20:31:00 athena pvescheduler[6815]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 06 20:31:37 athena postfix/qmgr[1227]: 1CF5B1A036F: from=<root@athena.local>, size=1110, nrcpt=1 (queue active)
Nov 06 20:32:00 athena pvescheduler[6974]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 06 20:32:07 athena postfix/smtp[6900]: connect to gmail-smtp-in.l.google.com[142.251.2.26]:25: Connection timed out
Nov 06 20:32:07 athena postfix/smtp[6900]: connect to gmail-smtp-in.l.google.com[2607:f8b0:4023:c06::1a]:25: Network is unreachable

Code:
...
Nov 06 21:45:25 athena systemd[1]: apt-daily.service: Deactivated successfully.
Nov 06 21:45:25 athena systemd[1]: Finished apt-daily.service - Daily apt download activities.
Nov 06 21:45:27 athena systemd[1]: session-7.scope: Deactivated successfully.
Nov 06 21:45:27 athena systemd-logind[913]: Session 7 logged out. Waiting for processes to exit.
Nov 06 21:45:27 athena systemd-logind[913]: Removed session 7.
Nov 06 21:45:27 athena pvedaemon[1263]: <root@pam> end task UPID:athena:0000365B:0004E4AD:672C53F4:vncshell::root@pam: OK
Nov 06 21:46:00 athena pvescheduler[14088]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'

Code:
...
Nov 08 22:01:01 athena pvescheduler[25277]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:02:01 athena pvescheduler[25435]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:03:01 athena pvescheduler[25592]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:04:01 athena pvescheduler[25749]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:05:01 athena pvescheduler[25906]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:06:01 athena pvescheduler[26061]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:07:01 athena pvescheduler[26219]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Nov 08 22:08:01 athena pvescheduler[26377]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
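
In the excerpts above the replication message repeats at one-minute intervals, which would suggest it is periodic output from pvescheduler rather than an event tied to the moment of the hang (an inference from these excerpts only). A small Python sketch to measure the spacing, assuming syslog-style timestamps as shown; the year is not in the log lines, so 2024 is supplied as an assumption, and the function name is illustrative:

```python
from datetime import datetime

NEEDLE = "replication: invalid json data"


def replication_gaps(lines, year=2024):
    """Seconds between consecutive pvescheduler replication errors."""
    stamps = []
    for ln in lines:
        if NEEDLE in ln:
            # The "Nov 08 22:06:01" prefix is 15 characters wide; the year is
            # an assumption because the log lines do not carry one.
            stamps.append(datetime.strptime(f"{year} {ln[:15]}", "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


sample = [
    "Nov 08 22:06:01 athena pvescheduler[26061]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'",
    "Nov 08 22:07:01 athena pvescheduler[26219]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'",
    "Nov 08 22:08:01 athena pvescheduler[26377]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'",
]
print(replication_gaps(sample))
```

If the gaps stay constant right up to the last line of a dirty boot, the message only marks the last minute the scheduler was still running, which narrows down when the hang happened.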


Here are the rest of the dirty shutdowns that are (seemingly) not related to the above:
Code:
...
Nov 05 21:19:11 athena postfix/smtp[5994]: connect to alt1.gmail-smtp-in.l.google.com[108.177.104.27]:25: Connection timed out
Nov 05 21:19:11 athena postfix/smtp[5994]: connect to alt1.gmail-smtp-in.l.google.com[2607:f8b0:4003:c04::1a]:25: Network is unreachable
Nov 05 21:19:41 athena postfix/smtp[5994]: connect to alt2.gmail-smtp-in.l.google.com[142.250.152.26]:25: Connection timed out
Nov 05 21:19:41 athena postfix/smtp[5994]: 7B5A41A0D1D: to=<user@server.com>, relay=none, delay=2382, delays=2291/0.01/90/0, dsn=4.4.1, status=deferred (connect to alt2.gmail-smtp-in.l.google.com[142.250.152.26]:25: Connection timed out)
Nov 05 21:19:59 athena chronyd[1018]: Selected source 45.61.187.39 (2.debian.pool.ntp.org)
Nov 05 21:21:56 athena pvedaemon[1221]: <root@pam> successful auth for user 'root@pam'
Nov 05 21:22:13 athena kernel: usb 2-2: USB disconnect, device number 2
-- Reboot --

Code:
...
Nov 06 16:35:16 athena kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Nov 06 16:35:16 athena kernel: tap100i0: entered allmulticast mode
Nov 06 16:35:16 athena kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Nov 06 16:35:16 athena kernel: fwbr100i0: port 2(tap100i0) entered forwarding state
Nov 06 16:35:16 athena pvedaemon[1219]: <root@pam> end task UPID:athena:0000132F:00022284:672C0B43:qmstart:100:root@pam: OK
Nov 06 16:35:17 athena pvedaemon[5023]: starting vnc proxy UPID:athena:0000139F:00022374:672C0B45:vncproxy:100:root@pam:
Nov 06 16:35:17 athena pvedaemon[1219]: <root@pam> starting task UPID:athena:0000139F:00022374:672C0B45:vncproxy:100:root@pam:

Code:
...
Nov 06 18:29:48 athena kernel:  dm_bufio libcrc32c xhci_pci xhci_pci_renesas crc32_pclmul r8169 igc i2c_piix4 xhci_hcd ahci realtek libahci wmi gpio_amdpt
Nov 06 18:29:48 athena kernel: CPU: 0 PID: 1256 Comm: pvedaemon worke Tainted: P      D    O       6.8.12-3-pve #1
Nov 06 18:29:48 athena kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A39/B350M GAMING PRO (MS-7A39), BIOS 2.P7 09/02/2024
Nov 06 18:29:48 athena kernel: RIP: 0010:smp_call_function_many_cond+0x133/0x500
Nov 06 18:29:48 athena kernel: Code: 7f 08 48 63 d0 e8 bd 4f 5d 00 3b 05 b7 32 38 02 73 25 48 63 d0 49 8b 37 48 03 34 d5 e0 dc ea b3 8b 56 08 83 e2 01 74 0a f3 90 <8b> 4e 08 83 e1 01 75 f6 83 c0 01 eb c1 48 83 c4 48 5b 41 5c 41 5d
Nov 06 18:29:48 athena kernel: RSP: 0018:ffffb83340eb7c78 EFLAGS: 00000202
Nov 06 18:29:48 athena kernel: RAX: 0000000000000003 RBX: 0000000000000246 RCX: 0000000000000001
Nov 06 18:29:48 athena kernel: RDX: 0000000000000001 RSI: ffff9b36ae3bca40 RDI: 0000000000000000

Code:
Nov 07 10:09:03 athena kernel:  ? __pfx_worker_thread+0x10/0x10
Nov 07 10:09:03 athena kernel:  kthread+0xf2/0x120
Nov 07 10:09:03 athena kernel:  ? __pfx_kthread+0x10/0x10
Nov 07 10:09:03 athena kernel:  ret_from_fork+0x47/0x70
Nov 07 10:09:03 athena kernel:  ? __pfx_kthread+0x10/0x10
Nov 07 10:09:03 athena kernel:  ret_from_fork_asm+0x1b/0x30
Nov 07 10:09:03 athena kernel:  </TASK>

Code:
...
Nov 07 17:08:00 athena kernel: Code: 00 00 00 00 00 66 90 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 89 c2 85 c0 75 2c 64 48 8b 04 25 10 00 00
Nov 07 17:08:00 athena kernel: RSP: 002b:00007ffe68720298 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
Nov 07 17:08:00 athena kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000755284cc3293
Nov 07 17:08:00 athena kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
Nov 07 17:08:00 athena kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Nov 07 17:08:00 athena kernel: R10: 0000755284bb1e50 R11: 0000000000000246 R12: 0000000000000001
Nov 07 17:08:00 athena kernel: R13: 000059956cc579b0 R14: 0000000000000001 R15: 0000000000000001
Nov 07 17:08:00 athena kernel:  </TASK>

Code:
...
Nov 08 19:27:46 athena kernel: RSP: 002b:00007ffd816ae4b8 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
Nov 08 19:27:46 athena kernel: RAX: ffffffffffffffda RBX: ffffffffffffda38 RCX: 00007c3588d5a8f7
Nov 08 19:27:46 athena kernel: RDX: 0000000000000000 RSI: 0000000000002000 RDI: 00007c3570400000
Nov 08 19:27:46 athena kernel: RBP: 0000000000000005 R08: 0000000000002000 R09: 0000000000000000
Nov 08 19:27:46 athena kernel: R10: 4c8b0775b4876907 R11: 0000000000000206 R12: 00000000000002c8
Nov 08 19:27:46 athena kernel: R13: 00005be59be0cbfc R14: 0000000000000040 R15: 0000000000000050
Nov 08 19:27:46 athena kernel:  </TASK>
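
The kernel excerpts above are fragments of longer traces, so it may help to pull out the few fields that identify each one: the task name (Comm), the taint flags, the kernel version, and the RIP symbol. According to the kernel's tainted-kernels documentation, P means a proprietary module was loaded, O an out-of-tree module, and D that the kernel already hit an oops or BUG earlier in that boot. A minimal Python sketch, assuming the line layout seen in the Nov 06 18:29:48 excerpt; the regexes and function name are illustrative and only handle tainted kernels with a symbolic RIP:

```python
import re

# Matches e.g. "RIP: 0010:smp_call_function_many_cond+0x133/0x500"
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([A-Za-z_][\w.]*)\+0x")
# Matches e.g. "Comm: pvedaemon worke Tainted: P D O 6.8.12-3-pve"
CPU_RE = re.compile(r"Comm: (.+?) Tainted: (.+?) (\d+\.\d+\.\S+)")


def trace_summary(lines):
    """Collect the first Comm/taint/kernel/RIP fields found in a kernel trace."""
    info = {}
    for ln in lines:
        m = RIP_RE.search(ln)
        if m and "rip" not in info:
            info["rip"] = m.group(1)
        m = CPU_RE.search(ln)
        if m and "comm" not in info:
            info["comm"] = m.group(1)
            info["taint"] = m.group(2).split()
            info["kernel"] = m.group(3)
    return info


trace = [
    "Nov 06 18:29:48 athena kernel: CPU: 0 PID: 1256 Comm: pvedaemon worke Tainted: P      D    O       6.8.12-3-pve #1",
    "Nov 06 18:29:48 athena kernel: RIP: 0010:smp_call_function_many_cond+0x133/0x500",
]
print(trace_summary(trace))
```

Because the D flag is already set in that excerpt, the first oops of that boot happened earlier than the lines quoted here, so the start of the trace is the part worth posting.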

A less common problem: a few times I found the CPU diagnostic light lit on the motherboard and the system unresponsive. All fans were running, but there was no network connectivity or display output.

I have tried:

This issue: I followed the instructions to delete /var/lib/pve-manager/pve-replication-state.json, but it has not solved the instability.
In fact, after I deleted the file, it showed up again. I do not have replication set up.

Stress test with stress-ng overnight: system stayed stable well into the morning before I ended it.

MemTest86+ with the included Proxmox install and a newer one from their website: Both passed.

Booting a different OS (Linux Mint) from USB: it stayed stable for a long time, at least 8 hours. The network ports work there, so I suspect it is not a hardware issue.

I am currently trying to find a CPU stress tester so I can rule out any hardware issues; I am aware of this issue with the Ryzen 1000 series.
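
Regarding the replication state file mentioned above: the log message says only that the file does not parse as JSON, so it may be worth checking what the recreated file actually contains (an empty file, for example, is not valid JSON). A minimal Python sketch using only the standard library; the path comes from the log message, while the function name and return shape are illustrative:

```python
import json
from pathlib import Path

# Path taken from the pvescheduler log message.
STATE_FILE = "/var/lib/pve-manager/pve-replication-state.json"


def check_json(path):
    """Return (True, parsed) if the file parses as JSON, else (False, reason)."""
    try:
        text = Path(path).read_text()
    except OSError as exc:
        return False, f"cannot read: {exc}"
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON ({exc.msg} at char {exc.pos}); length={len(text)} chars"


if __name__ == "__main__":
    print(check_json(STATE_FILE))
```

A zero-length or truncated file here could point to writes not reaching the disk before a hang, although that is a guess this check cannot confirm on its own.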

Any help would be appreciated. I am at my wits' end.
 
EDIT: The storage list above is wrong. The actual drives are:

1 Crucial SATA SSD 240.06 GB (boot)
1 Western Digital SATA HDD 4 TB
2 Western Digital SATA HDD 2 TB each

All drives are healthy according to SMART data.
 