Proxmox Node lvm read-only after some days

RogerSik

We have some Proxmox nodes running 7.4-17 and added a new server, a Lenovo System x3650 M5. This server has no load, yet it crashes after some days (currently, strangely, on weekends) with a read-only filesystem. The node itself keeps running, but no new commands can be executed and no files or data can be saved.

I have now upgraded the firmware of the storage controller to the newest version, but I was hoping to get more help here on what else I could check. What currently annoys me is that I can't force the downtime / read-only mode, so I don't know whether the firmware upgrade fixed the issue.

The smartctl output is fine and I'm running 2 Proxmox VMs with Ubuntu and stress-ng to put some load on this new node. We are using Ceph, but apart from this node being integrated into our cluster, there is no real usage on it.
 
Can you list your entire setup in detail and post the syslogs from the period in question from all nodes? Are you sure it always happens on weekends and not just after X days? If it always happens on weekends, the question would be what happens then.
 
I'm sure it's happening on weekends because we have had this setup for about 4 weeks now and this error only occurs over the weekend. I check the Proxmox node every day with

Code:
touch a
rm -f a

and the
Code:
touch: cannot touch 'a': Read-only file system
has now happened 3 times, each time on a Monday.
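The manual touch/rm probe above can be wrapped in a small script and run from cron, so the flip to read-only is noticed immediately instead of on Monday morning. This is just a sketch; the directory argument is whatever filesystem you want to watch:

```shell
# check_rw: probe whether a directory is still writable.
# Returns 0 if a test file can be created, 1 if the filesystem
# refuses writes (e.g. it was remounted read-only after an error).
check_rw() {
    dir="$1"
    probe="$dir/.rw-check.$$"
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    echo "ALERT: $dir appears to be read-only" >&2
    return 1
}
```

Running e.g. `check_rw /` every few minutes from a cron job and mailing the alert line would catch the transition as it happens.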

My setup:
6x Nodes
* 4 x Intel(R) Xeon(R) CPU E3-1220 v5 @ 3.00GHz (1 Socket)
* 12 x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (1 Socket)
* 16 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (1 Socket)
* 16 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (1 Socket)
* 12 x Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz (1 Socket)

New node:
* Lenovo System x3650 M5 (24xSFF) Storage Rack Server
* 72 x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz (2 Sockets)

Using SSDs as local storage and for Ceph.

Problematic node:

Code:
systemctl status syslog
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' resumed (module 'builtin:omfile') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' resumed (module 'builtin:omfile') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' resumed (module 'builtin:omfile') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' resumed (module 'builtin:omfile') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]
Nov 27 10:03:11 zlp7 rsyslogd[1542]: action 'action-9-builtin:omfile' suspended (module 'builtin:omfile'), next retry is Mon Nov 27 10:03:41 2023, retry nbr 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]

Unproblematic node:
Code:
systemctl status syslog
Nov 03 00:00:13 zlp6 systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 1494 (rsyslogd) on client request.
Nov 03 00:10:13 zlp6 rsyslogd[1494]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1494" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Nov 05 00:00:13 zlp6 systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 1494 (rsyslogd) on client request.
Nov 05 00:10:13 zlp6 rsyslogd[1494]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1494" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Nov 12 00:00:12 zlp6 systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 1494 (rsyslogd) on client request.
Nov 12 00:10:12 zlp6 rsyslogd[1494]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1494" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Nov 19 00:00:13 zlp6 systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 1494 (rsyslogd) on client request.
Nov 19 00:10:13 zlp6 rsyslogd[1494]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1494" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Nov 26 00:00:13 zlp6 systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 1494 (rsyslogd) on client request.
Nov 26 00:10:13 zlp6 rsyslogd[1494]: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="1494" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
 
Problematic node
Code:
cat /var/log/syslog
 

Attachments

  • syslogs-zlp7.txt
    42.5 KB
With as many errors as the controller is logging, it is probably failing, or one of the disks is damaged in a way that causes the controller to crash.

Possible measures:
  • Re-seat the controller
  • Cross-swap the controller with another server and check whether the error migrates with it
  • Buy a new controller on eBay or something like that
 
We managed to find out how to trigger it:
Code:
/sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo --verbose --quiet-unsupported

After that command, the filesystem goes read-only. This command is also executed once a week by the systemd timer /usr/lib/systemd/system/fstrim.timer.
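To confirm which mount actually flipped to read-only after the trim (instead of probing with touch), the mount options in /proc/mounts can be inspected; a small sketch, assuming a Linux /proc:

```shell
# ro_mounts: print every mountpoint whose option list (field 4 of
# /proc/mounts) contains "ro". A filesystem that the kernel
# remounted read-only after an I/O error will show up here, next to
# mounts that are read-only by design (e.g. some /sys entries).
ro_mounts() {
    awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }' /proc/mounts
}
```

On a healthy node this should list only the intentionally read-only mounts; an LVM data volume appearing here after fstrim is the failure signature.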
 
Our solution was to buy a different SSD model. The SSD wasn't defective and neither was the controller. It seems the controller was simply not happy with that model.
 
