diff --git a/05_Resources/Home Lab Incidents.md b/05_Resources/Home Lab Incidents.md
new file mode 100644
index 0000000..dbdce46
--- /dev/null
+++ b/05_Resources/Home Lab Incidents.md
@@ -0,0 +1,40 @@
+# Home Lab Incident Reports
+
+---
+
+## 2026-04-03 — Immich public outage via VPS OOM
+
+### What happened
+- Immich still worked internally in the homelab
+- Public access failed first with HTTP 500, later with 502
+- Pangolin had been OOM-killed on the VPS
+- After that, Traefik could no longer resolve pangolin via Docker internal DNS, so it could not fetch dynamic config
+
+### Evidence
+```bash
+# Check for OOM events
+sudo dmesg -T | grep -i -E 'oom|out of memory|killed process'
+
+# Pangolin log showed SIGKILL
+docker logs pangolin
+
+# Traefik log showed:
+# Get "http://pangolin:3001/api/v1/traefik-config"
+# lookup pangolin on 127.0.0.11:53
+# read: connection refused
+```
+
+### Root cause
+- **Primary cause:** VPS memory exhaustion
+- **Secondary cause:** broken Docker service discovery / network state after the OOM event
+
+### Fix
+```bash
+docker compose down
+docker compose up -d
+```
+
+### Follow-up actions
+- [ ] Add swap to VPS to prevent OOM cascade
+- [ ] Add memory monitoring/alerting on VPS
+- [ ] Consider adding Traefik health-check/config-refresh cron job as a resilience measure
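
The "memory monitoring/alerting" follow-up could start from something as small as the sketch below. It is not part of the incident notes: the function names `mem_available_mib` and `is_memory_low` and the 200 MiB threshold are hypothetical, and it assumes a Linux host that exposes `MemAvailable` in `/proc/meminfo`.

```bash
#!/bin/sh
# Hypothetical low-memory check for the VPS (names and threshold are assumptions).

# Print MemAvailable in MiB from a meminfo-style file (default: /proc/meminfo).
mem_available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) when available memory is below the given MiB threshold.
# Usage: is_memory_low THRESHOLD_MIB [MEMINFO_FILE]
is_memory_low() {
    threshold_mib=$1
    avail_mib=$(mem_available_mib "${2:-/proc/meminfo}")
    [ -n "$avail_mib" ] && [ "$avail_mib" -lt "$threshold_mib" ]
}

# Example call, e.g. from a cron-driven script:
#   is_memory_low 200 && echo "WARNING: low memory on the VPS"
```

Reading `MemAvailable` rather than `MemFree` avoids false alarms from memory the kernel is only using as reclaimable cache. How the warning is delivered (mail, webhook, log) is left open here.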