System Crash: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'

mish

New Member
Feb 28, 2021
For the last few months my home server has been crashing frequently, and the last lines in syslog are invariably the message quoted in the title:

Bash:
Jul 18 05:24:59 mish-server pvescheduler[117070]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:25:59 mish-server pvescheduler[117209]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:26:59 mish-server pvescheduler[117349]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:27:59 mish-server pvescheduler[117487]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:28:59 mish-server pvescheduler[117622]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:29:59 mish-server pvescheduler[117763]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:30:59 mish-server pvescheduler[117903]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:31:59 mish-server pvescheduler[118043]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
Jul 18 05:32:59 mish-server pvescheduler[118183]: replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'

I've been putting up with this mostly because the server is just for my own use, as a hobby, so it hasn't been a huge inconvenience to go over and power-cycle it every few days. However, I would now very much like to finally get this solved :)

The server is a single node, so I am not using any replication, and the file mentioned, /var/lib/pve-manager/pve-replication-state.json, is completely empty.

On a single node server, should this file still contain information?

If it shouldn't, how can I stop pvescheduler from repeatedly trying to find the file until it crashes?

Is this even the root cause of the crashing? My syslog is filled with these messages, so it's quite possible they are not the reason the server crashes. If so, what would be a better log to look at for this sort of information?

Either way it would be great to stop this log spam.
 
can you post the output of
Code:
cat -A /var/lib/pve-manager/pve-replication-state.json
?
if the file is completely empty, it shouldn't run into that error

if you're not running replication, you can also simply delete that file
and no, this error should not cause a crash/reboot

what is the content of the logs before the crash? (aside from this logline?)
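If the journal is persistent, the final entries of the previous boot can be pulled directly instead of scrolling through syslog. A minimal sketch, assuming systemd-journald keeps logs across reboots on this host; the function names are made up for illustration and are not Proxmox tools:

```shell
# Sketch: show the tail of the journal from the boot *before* the current one.
# Assumes a persistent journal; without one, -b -1 has nothing to show.
prev_boot_tail() {
    local lines="${1:-200}"
    # -b -1 selects the previous boot, -n limits output to the last N lines
    journalctl -b -1 -n "$lines" --no-pager
}

# Same idea, restricted to kernel messages (-k), which is where I/O errors
# or hung-task warnings would show up if they made it to disk.
prev_boot_kernel_tail() {
    local lines="${1:-200}"
    journalctl -k -b -1 -n "$lines" --no-pager
}
```

Calling `prev_boot_tail 50` right after a forced reboot prints the last 50 entries written before the hang. If the machine froze hard, the final seconds may never have reached the disk, so an unremarkable tail does not rule out a hardware problem.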
 
Hi, thanks for replying!

The output is empty:
Bash:
root@mish-server:~# cat -A /var/lib/pve-manager/pve-replication-state.json
$
root@mish-server:~#

I've now deleted the file. Even if it's not the cause of the crash, maybe that will stop the log spam?
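A side note on the output above: `cat -A` prints `$` at the end of every line, so a lone `$` means the file holds a single newline character rather than zero bytes. Whether pvescheduler treats those two cases differently is not established in this thread, but the difference is easy to check. A small sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: report whether a file is missing, truly zero-length,
# whitespace-only (e.g. a single newline), or holds other content.
classify_state_file() {
    local file="$1"
    local remaining
    if [ ! -e "$file" ]; then
        echo "missing"
    elif [ ! -s "$file" ]; then
        echo "zero-bytes"
    else
        # Count the bytes left after stripping whitespace. Counting (rather
        # than capturing the content) also copes with NUL bytes in the file.
        remaining=$(tr -d '[:space:]' < "$file" | wc -c)
        if [ "$((remaining))" -eq 0 ]; then
            echo "whitespace-only"
        else
            echo "has-content"
        fi
    fi
}
```

Run as `classify_state_file /var/lib/pve-manager/pve-replication-state.json`; for the file shown above it would print `whitespace-only`.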

Back to the more important issue, with syslog I'm seeing output from SMART shortly before the crashes sometimes, but nothing that looks overly problematic:

Bash:
Jul 19 02:17:01 mish-server CRON[139019]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 19 02:43:59 mish-server smartd[906]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 61 to 63
Jul 19 02:43:59 mish-server smartd[906]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 39 to 37

Could the `Airflow_Temperature_Cel` be too high and causing a shut down? I had to google about SMART output and it seems that the closer to 0 the value is, the hotter the device is, with 255 being the coldest. Could the hard drives be getting too hot to the point of the server crashing?
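The smartd "changed from X to Y" messages typically track the normalized VALUE column rather than the raw reading, and how the normalized value maps to degrees is vendor-specific. One way to sidestep the guesswork is to look at the raw column of `smartctl -A` next to the normalized one. A sketch, assuming the usual ten-column attribute table; the helper name is hypothetical:

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print
# "ID NAME NORMALIZED RAW" for the two temperature attributes (190, 194).
# Assumed column layout:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_temps() {
    # $4 is the normalized value, $10 the first token of the raw value
    awk '$1 == 190 || $1 == 194 { print $1, $2, $4, $10 }'
}
```

Usage would be `smartctl -A /dev/sdb | smart_temps`. On many drives the raw value of attribute 194 is the temperature in degrees Celsius, but that is still a per-vendor convention, so treat the number as a hint.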

I don't know if this is related but whenever the server crashes, it is still on. When I go to the computer the fans are running and the lights are on, it's just that I can't reach it over any network. I suppose I should try plugging in a monitor but this is very inconvenient as my monitors are mounted to arms far out of reach of a cable.

Other than syslog and journal, where should I be looking?

For the latest crash, the final output of the last journal is this:
Bash:
Jul 19 02:57:13 mish-server systemd[1]: systemd-logind.service: Watchdog timeout (limit 3min)!
Jul 19 02:57:13 mish-server systemd[1]: systemd-logind.service: Killing process 924 (systemd-logind) with signal SIGABRT.
Jul 19 02:57:13 mish-server systemd[1]: systemd-udevd.service: Watchdog timeout (limit 3min)!
Jul 19 02:57:13 mish-server systemd[1]: systemd-udevd.service: Killing process 544 (systemd-udevd) with signal SIGABRT.

This doesn't happen with the other crashes, though. It is just the last log entry written before I rebooted the server again this morning.
 
Could the `Airflow_Temperature_Cel` be too high and causing a shut down? I had to google about SMART output and it seems that the closer to 0 the value is, the hotter the device is, with 255 being the coldest. Could the hard drives be getting too hot to the point of the server crashing?
those smartd messages are sadly really unreliable, since those smart attributes are not really standardized across vendors (and often not even across models!)
i mean technically it's possible that the server gets stuck if the root drive does not respond anymore, but i think more common would be only a 'read-only' filesystem
(and it should still be reachable over the network)

I don't know if this is related but whenever the server crashes, it is still on. When I go to the computer the fans are running and the lights are on, it's just that I can't reach it over any network. I suppose I should try plugging in a monitor but this is very inconvenient as my monitors are mounted to arms far out of reach of a cable.
mhmm plugging in a monitor or similar will still probably be the easiest way to check if it's hanging completely or if it's something else

For the latest crash, the final output of the last journal is this:
looks definitely weird, and probably should not happen, but since it's not always there, maybe it's unrelated
 
mhmm plugging in a monitor or similar will still probably be the easiest way to check if it's hanging completely or if it's something else
Hmm yes I was worried you might say that. Okay next time the server crashes I will have to figure out the easiest way of plugging in a display. thanks for the help :) will update once I've done that
 
So I ran my server with a monitor plugged in, and it managed to run a whole 30 minutes before crashing, a new record. Normally it makes it at least 24 hours.

Anyway, unfortunately nothing new appeared on the display after boot, and none of the boot messages were out of the ordinary either.

Looking at syslog I don't see anything useful except more SMART readings that don't seem to be dangerous in any way.

It is especially hot here today, 38 degrees C, so I wonder if the problem is overheating. I have two graphics cards in the server for Plex transcoding. I might try removing one to see if that has any impact, and in general monitor the temperature of the components if I can. I guess I can find a good Linux tool for this or show it on my Grafana dashboard.

I tried running memtest86+ at boot but there was no display, so I guess I should try running the non-FOSS memtest86 too.
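For a quick look at temperatures without installing anything, the kernel's thermal zones can be read straight from sysfs. A rough sketch, assuming the board exposes zones under /sys/class/thermal (not every system does, and GPU temperatures are usually not included); the helper name is hypothetical:

```shell
# Hypothetical helper: print "zone type degrees" for each thermal zone.
# The kernel reports these temperatures in millidegrees Celsius.
# The base directory is a parameter so the function can be pointed elsewhere.
read_thermal_zones() {
    local base="${1:-/sys/class/thermal}"
    local zone type milli
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones refuse reads; skip those instead of printing garbage.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        echo "$(basename "$zone") $type $((milli / 1000))"
    done
}
```

Appending the output to a file once a minute, together with `date`, would leave a record of the last readings before a hang.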

Will report back again with any updates, in case this thread can help someone else in the future.

In the meantime any suggestions from anyone who reads this are much appreciated :)
 
I'm also having the same issue. I am not sure if this is a Docker issue, as I also have Docker installed alongside Proxmox. In some cases /var/lib/pve-manager/pve-replication-state.json gets filled with a lot of garbage, and I have to empty it for the server to come back online.

The issue usually plays out like this:
1. My Docker proxy containers stop working
2. I am unable to log in to the server as root
3. I long-press the power button to shut down and restart the server
4. I log in to rescue mode to "fsck" the file system
5. I empty /var/lib/pve-manager/pve-replication-state.json
6. I reboot
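Step 5 can be done in a way that keeps the garbage around for later inspection, which might help work out what is writing it. A minimal sketch with a hypothetical helper name; as noted earlier in the thread, resetting this file is only suggested for hosts that do not use replication:

```shell
# Hypothetical helper: copy the state file aside, then truncate it to zero
# bytes. Prints the path of the backup copy on success.
reset_state_file() {
    local file="$1"
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    local backup="$file.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$file" "$backup" || return 1
    : > "$file" || return 1   # truncate in place, keeping owner and mode
    echo "$backup"
}
```

Usage would be `reset_state_file /var/lib/pve-manager/pve-replication-state.json`, run from the rescue environment or after the file system check.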
 
Hello, I think I am the third person having a similar issue. I realized that my file system had entered a read-only state. The last log entries were the following:
  1. replication: invalid json data in '/var/lib/pve-manager/pve-replication-state.json'
  2. SMART Prefailure Attribute: 194 Temperature_Celsius changed from 62 to 63
  3. Failed SMART usage Attribute: 169 Bad_Block_Count.
Is there any chance that this error puts the SSD under stress?
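A read-only root file system can be confirmed from the mount options, as long as a shell is still reachable. A small sketch that inspects the fourth field of /proc/mounts; the helper name and exit-code convention are made up for illustration:

```shell
# Hypothetical helper: check whether a mount point is mounted read-only.
# Exit status: 0 = read-only, 1 = not read-only, 2 = mount point not found.
is_mounted_ro() {
    local mountpoint="$1"
    local mounts="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4; found = 1 }   # the last matching entry wins
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            # Match the exact option "ro", not e.g. "errors=remount-ro"
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }' "$mounts"
}
```

For example, `is_mounted_ro / && echo "root is read-only"`. If the root file system did flip to read-only, that would also explain a state file that can no longer be rewritten cleanly, though this thread does not confirm that link.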
 
