Home Lab Incident Reports
2026-04-03 — Immich public outage via VPS OOM
What happened
- Immich still worked internally in the homelab
- Public access failed first with HTTP 500, later with 502
- Pangolin had been OOM-killed on the VPS
- After that, Traefik could no longer resolve the pangolin hostname via Docker's internal DNS, so it could not fetch its dynamic config
Evidence
# Check for OOM events
sudo dmesg -T | grep -i -E 'oom|out of memory|killed process'
# Pangolin log showed SIGKILL
docker logs pangolin
# Traefik log showed:
# Get "http://pangolin:3001/api/v1/traefik-config"
# lookup pangolin on 127.0.0.11:53
# read: connection refused
Root cause
- Primary cause: VPS memory exhaustion
- Secondary cause: Docker service discovery was left broken after the OOM event (lookups for pangolin on 127.0.0.11:53 were refused), so Traefik could not fetch its dynamic config
Fix
# Tear down the stack, including its Compose network, then recreate it
docker compose down
docker compose up -d
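To confirm that the restart cleared the failure, one option is to scan recent Traefik logs for the DNS error recorded under Evidence. The helper below is a minimal sketch and was not part of the original fix; the function name count_dns_failures is hypothetical, and the container name traefik in the usage line is an assumption.

```shell
# Count log lines (read from stdin) that contain the failed Docker DNS lookup
# for pangolin seen during this incident. Prints 0 when none are found.
# The function name is hypothetical.
count_dns_failures() {
    # grep -c exits non-zero when there are no matches, so mask that status.
    grep -c -F 'lookup pangolin on 127.0.0.11:53' || true
}
```

For example, `docker logs --since 10m traefik 2>&1 | count_dns_failures` prints the number of matching lines from the last ten minutes; a count of 0 suggests service discovery is working again, assuming the Traefik container is actually named traefik.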
Follow-up actions
- Add swap to the VPS to reduce the risk of another OOM cascade
- Add memory monitoring/alerting on the VPS
- Consider adding a Traefik health-check/config-refresh cron job as a resilience measure
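The memory monitoring item could start as a small check run from cron. The sketch below reads /proc/meminfo, notes whether any swap is configured, and fails when available memory drops below a threshold. All function names, the MIN_AVAILABLE_PCT variable, and the 10% default are assumptions for illustration, not values taken from the incident.

```shell
# Sketch of a memory check for cron. All names and the 10% default are
# assumptions, not values from the incident.
MIN_AVAILABLE_PCT="${MIN_AVAILABLE_PCT:-10}"

# Print available memory as a whole-number percentage of total memory,
# read from a /proc/meminfo-style file (path in $1, default /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Print SwapTotal in kB; 0 means no swap is configured.
swap_total_kb() {
    awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Print a status line. Returns 1 when available memory is below the
# threshold and 2 when the memory figures could not be read.
check_memory() {
    pct="$(mem_available_pct "$1")"
    swap="$(swap_total_kb "$1")"
    if [ -z "$pct" ]; then
        echo "ERROR: could not read memory info" >&2
        return 2
    fi
    if [ "${swap:-0}" -eq 0 ]; then
        echo "NOTE: no swap configured"
    fi
    if [ "$pct" -lt "$MIN_AVAILABLE_PCT" ]; then
        echo "WARNING: only ${pct}% of memory available"
        return 1
    fi
    echo "OK: ${pct}% of memory available"
}
```

A crontab entry could source this file and call check_memory every few minutes, forwarding a non-zero exit status to whatever alerting channel is in use.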