Proxmox VE sometimes becomes unavailable while recording a stream via cvlc in an Ubuntu VM

gantim

Sep 25, 2024
My Proxmox cluster consists of three nodes. The first is a Dell notebook without a battery that usually runs only Home Assistant. The second is a Lenovo ThinkCentre M715 Gen 2 that currently runs only an Ubuntu VM. The third is a VM on my Synology NAS that exists only to provide a third vote for quorum and otherwise does nothing. I set up the cluster just to see how it works; in the long term the ThinkCentre will probably be my only Proxmox VE host.

I wrote a web interface running on my Synology that triggers TV recordings from a Kathrein TechnoTrend TT-Smart C2821 receiver. The recordings are made by calling cvlc in the Ubuntu VM on the ThinkCentre, which writes the IP stream to a file. The call is actually scheduled with the at command on that Ubuntu VM, but I don't think that causes the problem.

Sometimes the Proxmox host goes down: neither the Ubuntu VM nor the Proxmox host is reachable. This always happens while a TV recording is running, every few weeks, maybe on every 30th recording or every 10th hour of recording or something like that. Once it happened again within 24 hours; other times it runs for weeks while recording flawlessly. I have no idea what triggers it.

After asking in a German Ubuntu forum I found out that an invalid file handle shows up in the strace output. I changed the setup so that the recordings are no longer stored directly on the NAS but locally in the VM, and are moved to the NAS destination afterwards. I thought it helped, but it didn't (it happened again yesterday). But even if something fails, it fails inside the VM! It should not be able to take the Proxmox host offline!

The ThinkCentre runs headless. I measure its power consumption: it usually draws around 9 W, and when it becomes unavailable it draws around 20 W. I then toggle the power remotely (the measuring plug can do this), and after booting it is available again ...

Any idea which logging would help to analyze the problem? How can the Proxmox host go down at all, regardless of what cvlc in the VM is doing wrong?
 
I skimmed through your post and conclude two things. Firstly, we don't have concrete (100%) evidence that your problem is linked to those recordings. Secondly, this issue may not be linked to Proxmox at all, but to the running of that Ubuntu VM.

I have 2 questions:

1. Has that PVE server EVER crashed on other occasions? (Different VM? Backup jobs? etc.)
2. Since you are only running one VM on that server, have you tested running that Ubuntu bare-metal on the server? This would be a good test for ruling out Proxmox as the cause.

Good luck.
 
It is true that the cause is not certain; it is just my observation that the recording is always active when it happens. In my experience it may occur only once in a month (with roughly one 3-hour recording a day on average, I guess; sometimes I just recorded some nonsense to provoke the error), other times two days in a row. Without recordings it ran for at least weeks, maybe months, without an outage.

No other activity was ever involved when the server crashed, and it never crashed when no recording was active. I guess it has happened about 10-15 times.

I'd like to keep the installation as-is for now. Changing the software on that headless ThinkCentre server is some effort, and as it should run all the VMs in the long term, I'd have to revert all this afterwards. Some time ago I added an hourly cronjob that appends the output of top -bn1 to a file, both in the Ubuntu VM and on the Proxmox host, but neither added any detail: both simply stopped writing output when the server hung. And if I change to bare metal, when can I be sure that it no longer occurs? The frequency is low, and it may get even lower then.

I'd like to have some logging or post-mortem analysis (I looked into syslog etc. but didn't see anything that seems to have caused the problem) without changing too much. Maybe some logging every minute, or a trace of every file action to see the last thing that happened before the crash, or something like that. We'll see next time it happens in the next strace of that cvlc. In the last analysis, when I thought I had found the cause, I saw an invalid handle when writing the recording to the NAS. Then I switched to writing the recording and the strace file to /tmp, which was an unfortunate idea as both disappeared on reboot. So I hope I can see something in the strace file the next time it happens; it now gets written to the home directory of the user that does the recording (as does the recording itself, which is moved to the NAS afterwards with ionice).
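One way the per-minute logging could look is sketched below. It is only a minimal example under assumptions: it reads /proc/loadavg and /proc/meminfo (so Linux only), the function names and the chosen fields are made up for illustration, and it has not been tried on the machine from this thread. The one point that matters is the fsync after each line, so the last entry before a hard hang has a better chance of being on disk after the power cycle.

```python
import os
import time


def read_snapshot(loadavg_path="/proc/loadavg", meminfo_path="/proc/meminfo"):
    """Return one log line with load averages and available memory (Linux only)."""
    with open(loadavg_path) as f:
        load = f.read().split()[:3]  # 1, 5 and 15 minute load averages
    mem_available_kb = "?"
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                mem_available_kb = line.split()[1]
                break
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} load={','.join(load)} mem_available_kb={mem_available_kb}"


def append_durably(path, line):
    """Append a line and force it to disk so it survives a hard power cycle."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())


def run(log_path, interval_s=60):
    """Write one snapshot per interval until the process (or the host) stops."""
    while True:
        append_durably(log_path, read_snapshot())
        time.sleep(interval_s)
```

Calling run() with a path on a persistent disk (not /tmp, for the reason described above) on both the host and the VM would give a timestamp for the last minute each of them was still alive.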

What I don't like at all about this problem is that I can't reliably reproduce/provoke it. All I can do is make many recordings and hope it doesn't happen during one I'd really like to have.

As I wrote: my assumption is that the Ubuntu VM is doing something horrible which causes the Proxmox host to go down, and I think this isn't supposed to happen. Whatever a VM does, it should not be able to make the Proxmox host completely unresponsive. OK, the Ubuntu VM can make use of all 6 cores (and 16 of 32 GB RAM), I could reduce this. But how could I trace resources so that I can see what happens on the Proxmox side if the issue occurs again?
 
"the recording is always active when it happens"
Accepted - assuming that no other parallel activity was happening then - you've at least got circumstantial evidence.

"Changing the software on that headless ThinkCentre server is some effort, and as it should run all the VMs in the long term, I'd have to revert all this afterwards."
Swap out the OS disk physically, so you'd still have the original one as is, or simply boot from a different drive (even a USB flash drive). You'd still have to "pacify" the other 2 nodes of the cluster, and depending on your setup this has the potential for hassle, but it is doable. But I fully understand you not wanting to mess with the server as-is.

Back to your issue, I'm thinking out loud:

I assume you've checked thermals & RAM. Clean out the fan on that Mini PC.

What disk are you using? Have you tested it? SMART data (usually useless!)? Swap it out?

When that recording is happening, is it a network-intensive operation? I'm not familiar with your specific setup, but you should know. I'm guessing you are using some sort of onboard NIC. I've seen in the past (specifically on a Mini PC probably similar to that ThinkCentre) that under stress the NIC can cause a total PCI foul-up, to the point that the CPU can't communicate with anything. I always put it down to either a bad NIC, MB/chipset design, or PSU underpower / fluctuation issues. (In general many Mini PCs have sub-standard PSUs.)

Are you using one NIC only for everything: host/cluster/VMs?

What is the history of the HW? Was the CPU ever changed, etc.?

Things maybe worth trying:

1. Full stress test (incl. network).
2. Check / change the PSU brick (I guess it is an external one - maybe even get one with a higher wattage rating).
3. Add a NIC (USB adapter maybe) & segregate that Ubuntu VM to only use that NIC.
4. Maybe even disable the onboard NIC & only use the added one.
5. Add an additional disk just for the Ubuntu VM.

If my mind tells me anything else - I will try & come back here.

Good luck.
 
"the Ubuntu VM can make use of all 6 cores (and 16 of 32 GB RAM), I could reduce this."
If the host only has 6 cores then this can cause high latency. Maybe if Proxmox has 6 hyper-threads it might work. Please test with only 4 cores for the VM.
Do you use ZFS, and if so, did you reduce the ARC size (otherwise it takes 50%)? Is your VM using PCI(e) passthrough? That causes all VM memory to be pinned in host memory. Add to that the 50% ZFS ARC and you'll get into trouble with swapping, which might cause a rare ZFS deadlock (and possibly other weird problems). Please test with 12 GB and 4 cores for the VM.
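To put numbers on the worst case described above, here is a small back-of-the-envelope sketch. It only applies if ZFS is actually in use, and it assumes the ARC grows to the 50% cap mentioned above and that all VM memory is pinned; the function name is made up for illustration.

```python
def host_memory_budget_gb(host_ram_gb, vm_ram_gb, arc_fraction):
    """Worst-case headroom: host RAM minus pinned VM RAM minus a full ZFS ARC."""
    arc_gb = host_ram_gb * arc_fraction
    headroom_gb = host_ram_gb - vm_ram_gb - arc_gb
    return arc_gb, headroom_gb


# Figures from this thread: 32 GB host, VM with 16 GB or the suggested 12 GB.
for vm_gb in (16, 12):
    arc, headroom = host_memory_budget_gb(32, vm_gb, 0.5)
    print(f"VM {vm_gb} GB + ARC {arc:.0f} GB leaves {headroom:.0f} GB for the host")
```

With 16 GB for the VM nothing is left for the host itself in that worst case, while 12 GB leaves 4 GB.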
 
cvlc just writes the stream it gets from the Kathrein receiver to disk, so it shouldn't cause any peaks or resource problems. For an HD channel it writes about 2 GB in half an hour, otherwise about 1 GB per hour. Nothing that should be anywhere near a limit.
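As a rough cross-check of those figures, the average bitrate can be worked out as in the sketch below. It assumes decimal gigabytes (10^9 bytes); the helper name is made up for illustration.

```python
def avg_bitrate_mbit_s(gigabytes, hours):
    """Average bitrate in Mbit/s for a file of the given size and duration."""
    bits = gigabytes * 1e9 * 8  # assumption: decimal GB
    return bits / (hours * 3600) / 1e6


hd = avg_bitrate_mbit_s(2, 0.5)  # HD: about 2 GB in half an hour
sd = avg_bitrate_mbit_s(1, 1.0)  # otherwise: about 1 GB per hour
print(f"HD: {hd:.1f} Mbit/s, SD: {sd:.1f} Mbit/s, 3x HD: {3 * hd:.1f} Mbit/s")
```

That is roughly 8.9 Mbit/s for HD and 2.2 Mbit/s otherwise, so even three HD streams in parallel stay under 30 Mbit/s on average.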

I did a test recording of 2 HD streams: it draws a bit more power, about 13-18 W, with 15-20% CPU on the host and 5-10% in the Ubuntu VM.

The ThinkCentre is a refurbished device that cost less than 100 €. I ran some tests when I got it and everything seemed OK. The RAM is new (Lexar, 2x 16 GB single modules) and the disk is a new Lexar NQ790 1 TB. I do not use ZFS, so RAM should not be a problem.

I'll see what I can do over the next few days, thank you for the input!
 
Looking at the parts link you posted above, it appears you have the AMD Ryzen 5 2400GE. This has 4C/8T, so I'm not sure what you meant by:

"the Ubuntu VM can make use of all 6 cores"

In any event I'm going to agree with leeksteken above (& go a bit further): give that VM 4 threads, leaving another 4 threads for the host itself.

One other thing I notice is that the Lexar NVME you are using is a 4X4 - it may be hogging those PCIe lanes. This will be MB dependent so IDK.
 
There are some things I found out or prepared.
First: by doing several recordings at once I may be able to reproduce it with much higher probability. With three HD recordings at the same time it hung after a few minutes. I hope it behaves this way again when I do the next test.
Second: I can run my recording directly on the Proxmox host. After the tests I will remove the vlc packages (including lots of dependencies), the SMB share to the TV recordings on my NAS, and maybe the at command and the access from my web server, but this is all that is needed to run it. There are still small things to do to make it fully work (vlc refuses to run as root, but writing to the SMB share on Proxmox only works as root), but this way I don't need to install Ubuntu directly on the host and can do the bare-metal test as suggested, on the Debian underneath Proxmox.

Another thing I want to test is whether the source of the stream matters. I can record from the public internet servers of ARD and ZDF, from my AVM DVB-C repeater (but only 1 recording at a time, as it has only 1 tuner) and from the Kathrein TV receiver which I have used for all tests so far. Maybe it relates to the network and behaves differently if the source is another one.

What still puzzles me is:
Why can a VM pull the host down? Shouldn't all accesses be virtualized in a way that the VM may crash or hang, but the host should continue to run?

OK, if PCI is completely blocked nothing is possible anymore. That could explain it. But it occurred more often when not using that disk, but recording directly to the NAS. May be random. I will see and report.
 
The automatic transfer of the recording to the NAS does not work correctly yet, but that is not necessary for the test. I started 4 recordings directly on Proxmox, and it behaved the same as on the VM. After some time the power consumption stopped changing (it usually goes up and down by 1-3 W when running normally): it stayed constantly at 19 W and the machine could no longer be pinged.

As this happened while the Ubuntu VM was running idle, I am now repeating the test with the VM stopped (no VM active on the Proxmox host) to exclude other causes.
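The flat power reading could in principle be used to notice the hang automatically. The sketch below is only an illustration: how to read values from the measuring plug is not covered, and all thresholds (9 W idle, at least 5 W above idle, at most 0.5 W spread) are assumptions derived loosely from the figures in this thread.

```python
def looks_hung(samples_w, idle_w=9.0, min_elevated_w=5.0, max_spread_w=0.5):
    """Heuristic: power is well above idle and no longer fluctuates.

    samples_w is a list of recent power readings in watts.
    All thresholds are assumptions, not measured limits.
    """
    if len(samples_w) < 2:
        return False
    spread = max(samples_w) - min(samples_w)
    mean = sum(samples_w) / len(samples_w)
    return spread <= max_spread_w and mean >= idle_w + min_elevated_w


# Hypothetical readings for illustration only.
print(looks_hung([19.0, 19.1, 19.0, 19.0]))  # flat and elevated
print(looks_hung([15.0, 18.0, 13.5, 16.2]))  # fluctuating while recording
```

A check like this on the machine that reads the plug could record the time of the hang, or trigger the remote power toggle.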

Next try could be to use ffmpeg or another tool to record the stream instead of cvlc to see if that is behaving better.
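A test with ffmpeg could be scripted roughly as below, starting several recordings in parallel as in the earlier tests. This is an untested sketch: it assumes ffmpeg is installed and that the stream can be copied into a .ts file without re-encoding, and the helper names, output file names and stream URLs are made up for illustration. Only the common ffmpeg options -nostdin, -i, -t and -c copy are used.

```python
import subprocess


def ffmpeg_record_cmd(stream_url, out_path, duration_s):
    """Build an ffmpeg command that copies a stream to a file without re-encoding."""
    return [
        "ffmpeg",
        "-nostdin",             # do not read from the terminal (useful under at/cron)
        "-i", stream_url,       # input stream URL
        "-t", str(duration_s),  # stop after this many seconds
        "-c", "copy",           # copy audio/video as-is, no transcoding
        out_path,
    ]


def start_parallel_recordings(stream_urls, out_dir, duration_s):
    """Start one ffmpeg process per URL and return the process handles."""
    procs = []
    for i, url in enumerate(stream_urls):
        out_path = f"{out_dir}/test_{i}.ts"  # hypothetical file naming
        procs.append(subprocess.Popen(ffmpeg_record_cmd(url, out_path, duration_s)))
    return procs
```

Calling start_parallel_recordings() with three or four stream URLs and then waiting on the returned processes would mirror the cvlc test, so a hang with ffmpeg as well would point away from cvlc itself.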
 
