data lost for time window

HaVecko

Feb 15, 2024
Time window data lost
Hello to all,
I have a problem with VMs on Proxmox. I use Nakivo for backups, and it stopped working. The Nakivo support team repaired it, but I then found that some servers could not be backed up. I tried rebooting one of them (Linux), and after that Nakivo started backing it up again. I did the same with another server (Windows Server). There was a problem: the server had no data from the date the problem started. There was a "data hole", totally empty from the date the issue started to the date I rebooted the server (20 days). During that time the server's data could be used normally.
Does anyone have any experience with this situation?
Thank you for any response.
 
That sounds like the guest filesystem wasn’t flushing data properly to disk, so Nakivo snapshots weren’t consistent. I’ve seen cases where rebooting forces pending writes to be committed, which might explain why the backups worked again after restart. You may want to check your VM’s QEMU guest agent and disk caching settings in Proxmox, and maybe look at the hypervisor logs during that 20-day window.
 
Hi @HaVecko, welcome to the forum.

To preface: I have no direct experience with Nakivo and don’t know how it integrates with PVE. Everything below is speculation based only on what you’ve shared.

That said, I do know a bit about how certain other backup solutions integrate with PVE. If Nakivo works in a similar way, by inserting a QEMU disk filter to capture writes, then it’s theoretically possible that during the issue, writes were captured in a staging area but never flushed to disk. We don’t know how the repair was performed or whether QEMU was left in a consistent state. Most likely it was, since you had to reboot the VM.

To get to an RCA you’d need: a full understanding of Nakivo’s integration model, logs leading up to the original issue, the state of the VM and hypervisor prior to repair, exact steps taken to perform the repair, and the state of both VM and hypervisor after the repair but before the reboot, along with system logs covering the entire timeline.

The chances of having all of the above are slim. This kind of investigation would also go well beyond the scope of free forum support. I’d recommend engaging both your backup vendor and hypervisor vendor.

That sounds like the guest filesystem wasn’t flushing data properly to disk
I can’t imagine any filesystem in existence holding cached writes for 20 days. If such a filesystem did exist, it should be banned and erased from any system immediately.


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi hassaan2090, thank you for your answer. I need to study more about Proxmox and QEMU.

Hi @bbgeek17, I agree with you. It is weird. The truth is, data was lost on both Linux and Windows. I had Cacti on one Linux server, and all graphs have a hole from 20th August until it was repaired. I had checked some graphs during the critical period and they looked fine. Another symptom: a Windows server joined to AD lost communication with AD. We had to remove it from AD and join it again. The weird thing is that it happened on different OSes too. I think that points to Proxmox.
Nakivo - I'm not sure how it works, but whenever it starts a backup, Proxmox locks the VM and makes a snapshot, and the VM stays locked until the backup ends.
(Sorry for my English, I'm not a native speaker.)

Question: is it possible to monitor Proxmox to predict this behavior?
 
I am a little confused about what data was lost: actual data inside the VM, i.e. files/databases/updates, or monitoring stats from an external service?
I think that points to Proxmox.
Nakivo - I'm not sure how it works, but whenever it starts a backup, Proxmox locks the VM and makes a snapshot, and the VM stays locked until the backup ends.
You should probably invest some time to understand how this critical process works and how it affects your production.
Question: is it possible to monitor Proxmox to predict this behavior?
Since nobody knows what exactly happened, I’d say it’s impossible to design a monitoring procedure to reliably catch this event.

Correct me if I misunderstood: you lost user data written in the last 20 days but only became aware of it after the VM reboot. Does that mean you never accessed the data during that time, or that you did access it and it appeared to be there until the reboot?

Please understand, nobody else has reported such an issue on the forum. I’m not privy to reports via the Enterprise channel, but presumably if it had been reported there, we would have seen a fix by now. You do have a somewhat unique setup, using Nakivo, which we don’t see often here.

Chances are this is related to the specific combination of Nakivo and Proxmox. My recommendation to work with both vendors still stands. Of course, you’d need active subscriptions with both to pursue that path.

Cheers.


 
I am a little confused about what data was lost: actual data inside the VM, i.e. files/databases/updates, or monitoring stats from an external service?
We lost data inside the VM: files and data from the DB too (MS SQL Express). The accounting software runs on that server, and the accountant didn't notice anything until the reboot. A colleague lost data from an Outlook PST or OST file. It was as if we were on 20th August, not 11th September, as if everything had been restored from a backup. Everything except the AD connection worked correctly after the reboot. Nakivo runs on a Synology; it is an independent system. The Nakivo software was stuck and did not report on backups. I was on holiday, so I only noticed later. I'm not sure how Nakivo works, but it only needs credentials for Proxmox. It is not installed inside Proxmox. Only the qemu-agent matters on the hosts.

Correct me if I misunderstood: you lost user data written in the last 20 days but only became aware of it after the VM reboot. Does that mean you never accessed the data during that time, or that you did access it and it appeared to be there until the reboot?

Everything was working well the whole time. The accountant used the software. All data was visible and up to date. We didn't realize there was any problem until we had to reboot the servers because Nakivo had a problem backing them up. After the reboot the data disappeared.

It is not possible to replicate this problem... I hope. Syslog (Graylog) doesn't have any data from this time because it was impacted too.
 
With the above additional information I am even more convinced that there was likely a QEMU snapshot/filter/staging area that was lost (not replayed) on reboot. It is not surprising that your Windows server got unsynced when its data went back; it's a security precaution in AD.


 
I'm thinking about a mechanism to prevent this in the future or detect it as soon as possible, preferably through some proactive monitoring. Can you think of anything? Do you have any ideas? How can I check whether a VM is in this mode? The VM wasn't locked in the GUI; how can I investigate the state of the VM?
 
Hi @HaVecko,

I don’t have any concrete suggestions for you, mainly because we simply don’t know what happened, only a theory.

As a reminder, I’ve never worked with your particular backup vendor. It’s entirely possible that the whole theory is wrong and nothing is as it seems. You also haven’t provided any hard technical facts. The analysis is likely beyond free forum help and more in the realm of hiring a PVE technical partner for some professional services.

If I were in your position, I’d take the following steps:

a) Request a full, detailed RCA from your vendors (assuming you have the appropriate support contracts).
b) If you don’t, collect and analyze any logs that may still exist from the time of the issue.
c) Understand exactly how your backup process integrates with PVE - there are multiple ways this can be done.
d) Document the system state during backup and build a process to trigger alerts if some or all of those artifacts linger afterward.
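
Step (d) could be sketched roughly as below. This is only an illustration: what counts as an "artifact" (a lock entry in the VM config, a temporary file, a snapshot name) depends on how Nakivo integrates with PVE, which is not known here. The artifact names and the way they are collected are assumptions.

```python
def lingering_artifacts(before, during, after):
    """Return artifacts that appeared during the backup and are still present afterward.

    Each argument is a collection of strings describing the observed system state.
    How those strings are collected is outside this sketch (assumption).
    """
    introduced = set(during) - set(before)
    return sorted(introduced & set(after))


# Hypothetical observations; the artifact names below are made up for illustration.
before = {"disk:scsi0"}
during = {"disk:scsi0", "lock:backup", "tmpfile:vm-100-staging"}
after = {"disk:scsi0", "tmpfile:vm-100-staging"}

leftover = lingering_artifacts(before, during, after)
if leftover:
    print("ALERT: backup artifacts still present:", ", ".join(leftover))
```

The idea is to record the state once before a backup, once while it runs, and once after it finishes, and to alert on anything the backup introduced that did not go away.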

Other than that, you may need to wait until your backups fail again and continue the investigation from there.


 
It is documented on the Nakivo web page; I didn't have this information before :rolleyes:
https://helpcenter.nakivo.com/Knowl...Replication-Issues/Proxmox-Backup-Failure.htm
Sounds like my theory was right :-)

That said, I believe PBS, which uses a similar approach (or should I say was the first to use this approach), has this situation under control. IMHO a backup solution that leaves your data at risk when the host runs out of local space should find a way to monitor that critical resource and abort, or alert with high priority.
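
As a rough illustration of watching that resource yourself, here is a minimal sketch that checks free space on the host filesystem used for temporary backup data. The path and the 20% threshold are assumptions; the location a given backup tool actually stages data in has to be taken from its documentation.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def space_alert(path, min_free=0.20):
    """Return a warning string if free space is below `min_free`, else None."""
    frac = free_fraction(path)
    if frac < min_free:
        return f"LOW SPACE on {path}: {frac:.1%} free (threshold {min_free:.0%})"
    return None


# "/" is a placeholder; point this at the filesystem the backup tool stages data on.
message = space_alert("/")
if message:
    print(message)
```

Run from cron or a monitoring agent on the PVE host, a check like this would at least raise a warning before the staging location fills up.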

 
It occurred to me, now that you have vendor confirmation, you can build a test PVE (perhaps even with nested virtualization). Build a VM where you can generate traffic that will overwhelm the backup throughput and temporary location. This should place your VM into a precarious state. You can then build your monitoring/alerting tools around that.
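
A minimal sketch of such a traffic generator, meant to run inside the test VM only. The function name, file path, and sizes are assumptions; it simply writes random data and syncs it so the writes actually reach the virtual disk.

```python
import os


def generate_writes(path, total_bytes, chunk_size=1024 * 1024):
    """Write `total_bytes` of random data to `path` in chunks, syncing each chunk to disk.

    Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            size = min(chunk_size, total_bytes - written)
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())  # force the data out of the guest page cache
            written += size
    return written
```

For example, `generate_writes("/var/tmp/load.bin", 10 * 1024**3)` would write about 10 GiB. Running it repeatedly while a test backup is in progress, and watching free space on the host at the same time, should show how close the setup gets to the failure condition.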

