Node Unexpected Restarting

aluidens

Jul 4, 2025
## Proxmox VE Issues Report – Node Unexpected Restarting

**Environment**:
- **Proxmox VE Version**: 8.4.1
- **Cluster**: Yes (2+ nodes; one node currently removed for repair)
- **Node Affected**: `Node-01`
- **Critical VM**: Gateway router (VM 100)

---

### Summary of Issues

#### 1. Unexpected System Reboots
- **Behavior**: The node was rebooting unexpectedly without clear logs indicating a crash or panic.
- **Suspected Cause**: A replication job for VM 100 was targeting a node that had been removed from the cluster for hardware repairs.
- **Resolution**: After removing the replication job, the reboots stopped.
- **Concern**: Is this expected behavior in Proxmox VE when a replication target becomes unavailable?
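
The replication state can also be inspected from the shell. The following is a minimal sketch, not a verified procedure. It assumes the job ID is `100-0` (Proxmox names replication jobs `<vmid>-<n>`, but the actual ID is not recorded in this report), and `run` is a hypothetical helper that only prints each command unless `DRY_RUN=0` is set. Note that `pvesr disable` keeps the job definition, whereas the job in this report was removed outright.

```shell
#!/bin/sh
# Sketch only. JOB_ID is an assumption: replication jobs are named <vmid>-<n>,
# so the job for VM 100 is guessed to be 100-0. Confirm it with `pvesr list`.
JOB_ID="${JOB_ID:-100-0}"

# Hypothetical helper: print each command, execute it only when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# List configured jobs, show their state, then disable the stale job.
inspect_and_disable_job() {
    run pvesr list
    run pvesr status
    run pvesr disable "$1"
}

inspect_and_disable_job "$JOB_ID"
```

Run as-is, this only prints the three `pvesr` commands for review; setting `DRY_RUN=0` on the node executes them.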

#### 2. Storage Offline Warnings
- **Storage**: `NAS-01` (NFS)
- **Status**: Offline due to pending HDD replacement.
- **Logs**: Repeated messages every 10 seconds:
```
storage 'NAS-01' is not online
```
- **Action Taken**: Disabled the storage via GUI to suppress log spam.
- **Question**: Is there a more graceful way to handle temporarily unavailable storage without disabling it entirely?
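
To put a number on the log spam and to show the shell equivalent of the GUI toggle, here is a small sketch. `count_offline_msgs` is a hypothetical helper that counts the message quoted above in journal text read from stdin. The `pvesm set ... --disable` lines appear only as comments and are assumed to match what the GUI checkbox does.

```shell
#!/bin/sh
# Count "not online" messages for one storage in journal text read from stdin.
# Always prints a number; `|| true` hides grep's non-zero exit on zero matches.
count_offline_msgs() {
    grep -c -F "storage '$1' is not online" || true
}

# Typical use on the node:
#   journalctl --since "1 hour ago" --no-pager | count_offline_msgs NAS-01
#
# CLI equivalent of the GUI toggle (assumed to be the same setting):
#   pvesm set NAS-01 --disable 1   # stop polling while the NAS is down
#   pvesm set NAS-01 --disable 0   # re-enable after the HDD replacement
```

At one message every 10 seconds, an hour of journal should yield a count of roughly 360.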

#### 3. Mail Delivery Failures
- **Postfix**: Attempting to send mail despite no mail relay being configured.
- **Logs**:
```
postfix/smtp: connect to sgmx1.domainname.net:25: Connection timed out
postfix/smtp: Network is unreachable
```
- **Action Taken**: Disabled Postfix, as mail delivery is not required.
- **Question**: Is there a recommended way to suppress mail delivery attempts in non-mail environments?
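
To see which hosts Postfix was trying to reach, the `connect to` lines can be reduced to a list of destinations. This is a sketch under assumptions: `relay_targets` is a hypothetical helper, and it expects lines shaped like the excerpt above. The `postqueue` and `systemctl` commands are shown only as comments and are one way Postfix could be inspected and disabled, not necessarily how it was done here.

```shell
#!/bin/sh
# Print the unique hosts from "connect to <host>..." lines read on stdin.
# The host is everything after "connect to " up to the first ':' or '['.
relay_targets() {
    sed -n 's/.*connect to \([^:[]*\).*/\1/p' | sort -u
}

# Typical use on the node:
#   journalctl -u 'postfix*' --no-pager | relay_targets
#
# Inspect the queue, then stop and disable the service:
#   postqueue -p
#   systemctl disable --now postfix
```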

---
### Additional Notes
- No signs of `kernel panic`, `oom-killer`, or `watchdog` in logs.
- `journalctl` showed no clean shutdown sequence before any of the reboots.
- The gateway VM (VM 100) was marked for replication but not HA.
- The removed node was not gracefully decommissioned before hardware repair.
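
The journal check above can be repeated with a rough heuristic. The sketch below assumes a persistent journal (so the previous boot is still readable) and treats journald's `Journal stopped` message as the marker of an orderly shutdown. `ended_cleanly` is a hypothetical helper, and the marker is an assumption, not a guaranteed signal.

```shell
#!/bin/sh
# Succeed if the journal text on stdin contains journald's normal stop message.
# Heuristic only: a missing marker suggests, but does not prove, a hard reset.
ended_cleanly() {
    grep -q -F 'Journal stopped'
}

# Typical use on the node:
#   journalctl --list-boots          # one line per recorded boot
#   journalctl -b -1 -n 200 --no-pager | ended_cleanly \
#       && echo "previous boot ended cleanly" \
#       || echo "no clean shutdown logged for previous boot"
```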

---
### ❓ Questions for Proxmox Support
1. **Is it expected for a node to reboot if a replication target becomes unreachable, even without HA enabled?**
2. **What is the best practice for handling critical VMs (e.g., routers) that are part of replication jobs?**
3. **Can replication jobs be automatically paused or disabled when a target node is removed from the cluster?**
4. **Is there a way to suppress storage polling errors without disabling the storage entirely?**
5. **Should Proxmox attempt mail delivery by default, and how can this be cleanly disabled cluster-wide?**