Node Unexpected Restarting

aluidens

New Member
Jul 4, 2025
## Proxmox VE Issues Report – Node Unexpected Restarting

**Environment**:
- **Proxmox VE Version**: VE 8.4.1
- **Cluster**: Yes (2+ nodes; one node currently removed for repair)
- **Node Affected**: `Node-01`
- **Critical VM**: Gateway router (VM 100)

---

### Summary of Issues

#### 1. Unexpected System Reboots
- **Behavior**: The node was rebooting unexpectedly without clear logs indicating a crash or panic.
- **Suspected Cause**: A replication job for VM 100 was targeting a node that had been removed from the cluster for hardware repairs.
- **Resolution**: After removing the replication job, the reboots stopped.
- **Concern**: Is this expected behavior in Proxmox VE when a replication target becomes unavailable?
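
For anyone hitting the same situation, a minimal sketch of inspecting and pausing replication jobs from the shell instead of deleting them. It assumes the stock `pvesr` tool; the job ID `100-0` in the usage line is a hypothetical example (IDs take the form `<vmid>-<n>`), and the function names are only illustrative wrappers.

```shell
# Show configured replication jobs and their last/next sync state.
show_replication_jobs() {
    pvesr list
    pvesr status
}

# Disable a job while its target node is away (the job config is kept).
pause_replication_job() {
    pvesr disable "$1"
}

# Re-enable the job once the target node is back in the cluster.
resume_replication_job() {
    pvesr enable "$1"
}

# Usage (hypothetical job ID): pause_replication_job 100-0
```

Disabling rather than deleting keeps the schedule and target, so the job can be resumed after the repair.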

#### 2. Storage Offline Warnings
- **Storage**: `NAS-01` (NFS)
- **Status**: Offline due to pending HDD replacement.
- **Logs**: Repeated messages every 10 seconds:
```
storage 'NAS-01' is not online
```
- **Action Taken**: Disabled the storage via GUI to suppress log spam.
- **Question**: Is there a more graceful way to handle temporarily unavailable storage without disabling it entirely?
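
The GUI action above has a CLI counterpart, sketched here under the assumption that the storage ID is `NAS-01` as in this report. It does not avoid disabling the storage; it only makes the toggle scriptable for a planned maintenance window. The function names are illustrative.

```shell
# Show the current state of one storage as Proxmox sees it.
storage_state() {
    pvesm status --storage "$1"
}

# Mark the storage as disabled so it is no longer polled.
disable_storage() {
    pvesm set "$1" --disable 1
}

# Re-enable it once the NAS is back.
enable_storage() {
    pvesm set "$1" --disable 0
}

# Usage: disable_storage NAS-01
```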

#### 3. Mail Delivery Failures
- **Postfix**: Attempting to send mail despite no mail relay being configured.
- **Logs**:
```
postfix/smtp: connect to sgmx1.domainname.net:25: Connection timed out
postfix/smtp: Network is unreachable
```
- **Action Taken**: Postfix disabled as mail delivery is not required.
- **Question**: Is there a recommended way to suppress mail delivery attempts in non-mail environments?
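
A rough sketch of the steps taken, assuming the default Postfix install that ships with Proxmox VE. Whether stopping Postfix is the recommended approach is exactly the open question here, so treat this as one option rather than the answer; the function names are illustrative.

```shell
# List mail currently stuck in the Postfix queue.
show_mail_queue() {
    postqueue -p
}

# Delete everything in the queue (irreversible).
purge_mail_queue() {
    postsuper -d ALL
}

# Stop Postfix now and keep it from starting at boot.
disable_postfix() {
    systemctl disable --now postfix
}

# Usage: show_mail_queue; purge_mail_queue; disable_postfix
```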

---
### Additional Notes
- No signs of `kernel panic`, `oom-killer`, or `watchdog` in logs.
- `journalctl` showed no clean shutdown sequence before any of the reboots.
- The gateway VM (VM 100) was marked for replication but not HA.
- The removed node was not gracefully decommissioned before hardware repair.
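
For reference, the log checks above can be reproduced roughly like this. It is a sketch that assumes a persistent journal (otherwise `journalctl -b -1` has nothing to show); the function names are illustrative.

```shell
# Boots known to the journal; an unclean reboot shows up as a boot
# whose log simply stops without shutdown messages.
list_boots() {
    journalctl --list-boots --no-pager
}

# Last lines logged before the previous boot ended (default: 100 lines).
tail_previous_boot() {
    journalctl -b -1 -n "${1:-100}" --no-pager
}

# Filter stdin for the crash markers listed in the notes above.
scan_crash_markers() {
    grep -iE 'kernel panic|oom-killer|watchdog'
}

# Usage: tail_previous_boot 500 | scan_crash_markers
```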

---
### ❓ Questions for Proxmox Support
1. **Is it expected for a node to reboot if a replication target becomes unreachable, even without HA enabled?**
2. **What is the best practice for handling critical VMs (e.g., routers) that are part of replication jobs?**
3. **Can replication jobs be automatically paused or disabled when a target node is removed from the cluster?**
4. **Is there a way to suppress storage polling errors without disabling the storage entirely?**
5. **Should Proxmox attempt mail delivery by default, and how can this be cleanly disabled cluster-wide?**
 
Regarding the unexpected system reboots:

I suggest throttling throughput on your replication job.
If you have just one NIC, replication (or backups) could be saturating your network, interfering with the cluster communication. This can cause the node to fence itself and shut down.
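
A minimal sketch of the throttling, assuming the job is managed with `pvesr`. The job ID `100-0` and the limit of `50` are hypothetical values; `--rate` is given in MB/s, so pick something comfortably below what the link can carry.

```shell
# Cap the bandwidth of one replication job.
#   $1 = job ID (<vmid>-<n>), $2 = rate limit in MB/s
throttle_replication() {
    pvesr update "$1" --rate "$2"
}

# Usage (hypothetical values): throttle_replication 100-0 50
```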
 