From 2f137fccfaa30df5f70cbe58a50876b2cd69f7ab Mon Sep 17 00:00:00 2001
From: Orik
Date: Fri, 3 Apr 2026 07:19:54 +0000
Subject: [PATCH] Add incident report: 2026-04-03 Immich outage via VPS OOM

---
 05_Resources/Home Lab Incidents.md | 40 ++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 05_Resources/Home Lab Incidents.md

diff --git a/05_Resources/Home Lab Incidents.md b/05_Resources/Home Lab Incidents.md
new file mode 100644
index 0000000..dbdce46
--- /dev/null
+++ b/05_Resources/Home Lab Incidents.md
@@ -0,0 +1,40 @@
+# Home Lab Incident Reports
+
+---
+
+## 2026-04-03 — Immich public outage via VPS OOM
+
+### What happened
+- Immich still worked internally in the homelab
+- Public access failed first with HTTP 500, later with 502
+- Pangolin had been OOM-killed on the VPS
+- After that, Traefik could no longer resolve pangolin via Docker internal DNS, so it could not fetch dynamic config
+
+### Evidence
+```bash
+# Check for OOM events
+sudo dmesg -T | grep -i -E 'oom|out of memory|killed process'
+
+# Pangolin log showed SIGKILL
+docker logs pangolin
+
+# Traefik log showed:
+# Get "http://pangolin:3001/api/v1/traefik-config"
+# lookup pangolin on 127.0.0.11:53
+# read: connection refused
+```
+
+### Root cause
+- **Primary cause:** VPS memory exhaustion
+- **Secondary cause:** broken Docker service discovery / network state after the OOM event
+
+### Fix
+```bash
+docker compose down
+docker compose up -d
+```
+
+### Follow-up actions
+- [ ] Add swap to VPS to prevent OOM cascade
+- [ ] Add memory monitoring/alerting on VPS
+- [ ] Consider adding Traefik health-check/config-refresh cron job as a resilience measure
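
The "memory monitoring/alerting" follow-up could start from something like the sketch below. It is not part of the incident fix. It assumes a Linux VPS whose `/proc/meminfo` exposes `MemTotal` and `MemAvailable`, and the function names `mem_available_pct` and `check_memory` as well as the 10% default threshold are hypothetical choices made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: warn when available memory drops below a threshold.
# Both read a meminfo-style file (default: /proc/meminfo, values in kB).

# Print available memory as an integer percentage of total memory.
mem_available_pct() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0) exit 1          # missing or invalid MemTotal
            printf "%d\n", (avail * 100) / total
        }
    ' "$meminfo"
}

# Return 0 if memory is fine, 1 if below the threshold, 2 on read errors.
# $1: threshold in percent (assumed default: 10), $2: meminfo path.
check_memory() {
    local threshold="${1:-10}"
    local meminfo="${2:-/proc/meminfo}"
    local pct
    pct="$(mem_available_pct "$meminfo")" || return 2
    if [ "$pct" -lt "$threshold" ]; then
        echo "WARN: only ${pct}% memory available (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: ${pct}% memory available"
}
```

A script that sources these functions and runs `check_memory 15` from cron would get a non-zero exit status once available memory falls under 15%, which can then be wired to whatever alerting channel is in use.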