Add incident report: 2026-04-03 Immich outage via VPS OOM

05_Resources/Home Lab Incidents.md

# Home Lab Incident Reports

---

## 2026-04-03 — Immich public outage via VPS OOM
### What happened
- Immich remained reachable from inside the homelab
- Public access failed, first with HTTP 500 and later with HTTP 502
- Pangolin had been OOM-killed on the VPS
- With Pangolin down, Traefik could no longer resolve the `pangolin` hostname via Docker's internal DNS, so it could not fetch its dynamic configuration

### Evidence
```bash
# Check for OOM events
sudo dmesg -T | grep -i -E 'oom|out of memory|killed process'

# Pangolin log showed SIGKILL
docker logs pangolin

# Traefik log showed:
# Get "http://pangolin:3001/api/v1/traefik-config"
# lookup pangolin on 127.0.0.11:53
# read: connection refused
```
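To make the OOM check above easier to scan, the kernel log can be reduced to just the killed processes. This is a minimal sketch, not part of the original investigation: the function name `oom_victims` is hypothetical, and it assumes the kernel's usual `Killed process <pid> (<name>)` message form, which can differ between kernel versions.

```bash
# Print "PID NAME" for every process the kernel reports as killed.
# Reads kernel log text on stdin.
# Assumption: messages look like "... Killed process 1234 (name) ...".
oom_victims() {
  sed -n -E 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

Usage: `sudo dmesg -T | oom_victims`. Lines that do not match the pattern are dropped, so an empty result only means no kill message in that form was found in the current kernel ring buffer.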
### Root cause
- **Primary cause:** memory exhaustion on the VPS, which led the kernel to OOM-kill Pangolin
- **Secondary cause:** Docker service discovery / network state remained broken after the OOM event, so Traefik could not resolve `pangolin`

### Fix
```bash
# Tear down the containers and the Compose network, then recreate them
docker compose down
docker compose up -d
```
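After a restart like the one above, it may be worth confirming that name resolution inside the Docker network has actually recovered. The following is a sketch under stated assumptions: the container name `traefik` is a guess (adjust it to the real name), the Traefik image is assumed to ship `getent` (`nslookup` would serve the same purpose), and both function names are hypothetical.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Run COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Only sleep if another attempt will follow.
    if ((i < attempts)); then sleep "$delay"; fi
  done
  return 1
}

# Assumption: the Traefik container is named "traefik" and has getent.
traefik_resolves_pangolin() {
  docker exec traefik getent hosts pangolin >/dev/null
}
```

Usage: `retry 10 3 traefik_resolves_pangolin && echo "DNS OK"`. Retrying allows for the few seconds the containers need to come up before the lookup can succeed.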
### Follow-up actions
- [ ] Add swap to the VPS to reduce the risk of an OOM cascade
- [ ] Add memory monitoring and alerting on the VPS
- [ ] Consider adding Traefik health-check/config-refresh cron job as a resilience measure
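
As a starting point for the memory monitoring item, a small check based on `/proc/meminfo` could be run from cron. This is only a sketch: the function names and the threshold are hypothetical, it assumes a Linux kernel that reports `MemAvailable`, and how the alert is delivered (mail, webhook, etc.) is left open.

```bash
# mem_available_pct [MEMINFO_FILE]
# Print available memory as a whole-number percentage of total memory.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' \
    "${1:-/proc/meminfo}"
}

# check_memory THRESHOLD_PCT [MEMINFO_FILE]
# Return 1 and print a warning to stderr when available memory is below
# THRESHOLD_PCT; return 2 if the percentage could not be determined.
check_memory() {
  local threshold=$1 pct
  pct=$(mem_available_pct "${2:-/proc/meminfo}") || return 2
  if ((pct < threshold)); then
    echo "LOW MEMORY: ${pct}% available (threshold ${threshold}%)" >&2
    return 1
  fi
}
```

Usage: `check_memory 15 || <send alert>`, for example from a cron entry that runs every few minutes. The optional file argument exists so the check can be exercised against a saved copy of `/proc/meminfo`.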