Can I trust Proxmox's backup systems?

Dec 16, 2022
Hi everyone,
I manage a proxmox server hosted by OVH. This server is configured in raid 1. The log showed an error on a disk that was automatically put offline.

Code:
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md3: Disk failure on nvme0n1p3, disabling device.
md/raid1:md3: Operation continuing on 1 devices.
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: Disk failure on nvme0n1p5, disabling device.
md/raid1:md5: Operation continuing on 1 devices.
Feb 04 11:31:23 ns3220199 kernel: blk_update_request: I/O error, dev nvme0n1, sector 46634688 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md3: nvme0n1p3: rescheduling sector 9446600
Feb 04 11:31:23 ns3220199 kernel: Read-error on swap-device (259:1:46634696)
Feb 04 11:31:23 ns3220199 kernel: Read-error on swap-device (259:1:680216)
Feb 04 11:31:23 ns3220199 kernel: blk_update_request: I/O error, dev nvme0n1, sector 34900056 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md3: nvme0n1p3: rescheduling sector 31719512
Feb 04 11:31:23 ns3220199 kernel: blk_update_request: I/O error, dev nvme0n1, sector 602332672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: nvme0n1p5: rescheduling sector 554882560
Feb 04 11:31:23 ns3220199 kernel: blk_update_request: I/O error, dev nvme0n1, sector 242041424 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: nvme0n1p5: rescheduling sector 194591312
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: nvme0n1p5: rescheduling sector 152205568
Feb 04 11:31:23 ns3220199 kernel: Read-error on swap-device (259:1:46634264)
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md3: redirecting sector 37751944 to other mirror: nvme1n1p3
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md3: redirecting sector 9446600 to other mirror: nvme1n1p3
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: redirecting sector 860666656 to other mirror: nvme1n1p5
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md3: redirecting sector 31719512 to other mirror: nvme1n1p3
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: redirecting sector 860666472 to other mirror: nvme1n1p5
Feb 04 11:31:23 ns3220199 systemd: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: redirecting sector 554882560 to other mirror: nvme1n1p5
Feb 04 11:31:23 ns3220199 systemd: systemd-journald.service: Failed with result 'watchdog'.
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: redirecting sector 194591312 to other mirror: nvme1n1p5
Feb 04 11:31:23 ns3220199 systemd: systemd-journald.service: Consumed 19.459s CPU time.
Feb 04 11:31:23 ns3220199 kernel: md/raid1:md5: redirecting sector 152205568 to other mirror: nvme1n1p5
Feb 04 11:31:23 ns3220199 systemd: systemd-journald.service: Scheduled restart job, restart counter is at 1.
Feb 04 11:31:23 ns3220199 kernel: Read-error on swap-device (259:1:1564944)
Feb 04 11:31:23 ns3220199 kernel: Read-error on swap-device (259:1:993760)
Feb 04 11:31:23 ns3220199 kernel: Read-error on swap-device (259:1:1283472)
Feb 04 11:31:23 ns3220199 systemd: Stopping Flush Journal to Persistent Storage...
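
For anyone hitting a similar log: a quick way to confirm whether an md mirror is actually running degraded is the [UU]/[_U] status field in /proc/mdstat. The snippet below is only a sketch that parses a canned sample resembling this failure (array name and sizes are illustrative); on a live server you would read /proc/mdstat itself.

```shell
# Sketch only: detect a degraded RAID1 from mdstat-style output.
# The sample mimics the state after nvme0n1 was kicked out of md3;
# on a real host, replace it with: mdstat=$(cat /proc/mdstat)
mdstat='md3 : active raid1 nvme1n1p3[1]
      1020767232 blocks super 1.2 [2/1] [_U]'

# A healthy two-leg RAID1 shows [UU]; an underscore marks a failed leg.
if printf '%s\n' "$mdstat" | grep -q '\[[U_]*_[U_]*\]'; then
  state=DEGRADED
else
  state=OK
fi
echo "$state"
```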

OVH has yet to respond to my request. I tried restarting the server; the disk came back online, and a SMART test detects no faults. Has this ever happened to you with this provider? It is a dedicated server.

After the restart, one virtual machine was corrupted and wouldn't start, and some backups failed as well. I use three backup strategies:
  • A short-term strategy that keeps the last 3 full snapshots on OVH NFS cloud storage (3 out of 3 corrupted).
  • A long-term strategy that backs up with Proxmox Backup Server to a corporate NAS (restore still running; EDIT: IT WORKS).
  • A third strategy using IDrive Mirror (restore test in progress; EDIT: IT WORKS).

Fortunately, I had taken a snapshot of the corrupted machine and reverted to it. At this point, I wonder whether I can trust Proxmox's backup systems.
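
Since all three NFS copies only turned out to be corrupted at restore time, one cheap guard is to record a checksum right after each backup job and re-check it before you ever need the archive. A minimal sketch (the archive here is a stand-in file, and the name is made up, not from my setup; real vzdump archives live under the storage's dump/ directory):

```shell
# Sketch: store a digest at backup time, re-verify it later.
tmp=$(mktemp -d)
printf 'stand-in for a vzdump archive' > "$tmp/vzdump-qemu-100.vma.zst"

# Right after the backup job: write the checksum next to the archive.
(cd "$tmp" && sha256sum vzdump-qemu-100.vma.zst > SHA256SUMS)

# Periodically, and before any restore: -c recomputes and compares,
# so a flipped bit or truncated file fails here instead of mid-restore.
if (cd "$tmp" && sha256sum -c SHA256SUMS >/dev/null); then
  verify=OK
else
  verify=CORRUPT
fi
echo "$verify"
rm -rf "$tmp"
```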

When the corrupted machine started, Windows asked me to choose a keyboard layout and then tried to repair itself. From the recovery terminal I worked out that the storage driver was missing. I followed the steps below to install the correct driver in Windows Server 2019 and then retried the boot repair, without success.


EDIT: I found this topic: https://forum.proxmox.com/threads/corrupt-filesystem-after-snapshot.32232/
So an NFS mount used for backups may corrupt my VMs :eek:
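
If a flaky NFS mount is the suspect, one thing worth checking is that the backup share is mounted "hard" rather than "soft": with a soft mount, a network hiccup returns I/O errors to the writing process mid-backup instead of retrying, which is one way truncated or corrupt archives can appear. An illustrative /etc/fstab line (server address and export path are placeholders, not my actual setup):

```
# hard = retry indefinitely instead of failing the write;
# timeo/retrans tune how aggressively retries happen
10.0.0.5:/backups  /mnt/pve/nfs-backup  nfs  hard,tcp,timeo=600,retrans=3  0  0
```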



Any ideas?
Thanks!

  1. Attach the virtio-win driver ISO to the VM.
  2. Use Troubleshoot -> Advanced Options -> Command Prompt.
  3. Identify your drive letter mappings via wmic logicaldisk get deviceid, volumename, description
    • In my case the virtio-win install ISO (CD-ROM Disc) was assigned to E:
  4. Load the driver via the CLI, e.g. drvload e:\viostor\2k19\amd64\viostor.inf
    • After loading the driver, run wmic logicaldisk get deviceid, volumename, description again.
    • F: was where the Windows install was mounted in my case.
  5. Use DISM to inject the storage controller driver
    • E.g. dism /image:f:\ /add-driver /driver:e:\viostor\2k19\amd64\viostor.inf
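
Put together, the whole sequence from the recovery Command Prompt looks like this (drive letters E: and F: are what I saw; yours may differ, so check the wmic output each time):

```
rem From Troubleshoot -> Advanced Options -> Command Prompt
wmic logicaldisk get deviceid, volumename, description
rem virtio-win ISO showed up on E:, no Windows volume visible yet

drvload e:\viostor\2k19\amd64\viostor.inf
wmic logicaldisk get deviceid, volumename, description
rem Windows install now visible on F:

dism /image:f:\ /add-driver /driver:e:\viostor\2k19\amd64\viostor.inf
```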
 
