Home Lab Incident Reports


2026-04-03 — Immich public outage via VPS OOM

What happened

  • Immich still worked internally in the homelab
  • Public access failed first with HTTP 500, later with 502
  • Pangolin had been OOM-killed on the VPS
  • After that, Traefik could no longer resolve the pangolin hostname via Docker's internal DNS, so it could not fetch its dynamic config

Evidence

# Check for OOM events
sudo dmesg -T | grep -i -E 'oom|out of memory|killed process'

# Pangolin log showed SIGKILL
docker logs pangolin

# Traefik log showed:
#   Get "http://pangolin:3001/api/v1/traefik-config"
#   lookup pangolin on 127.0.0.11:53
#   read: connection refused

Root cause

  • Primary cause: VPS memory exhaustion
  • Secondary cause: broken Docker service discovery / network state after the OOM event

Fix

# Recreate the stack's containers and Docker network
docker compose down
docker compose up -d
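
To confirm the restart actually cleared the failure, one option is to scan recent Traefik logs for the DNS lookup error seen during the incident. This is only a minimal sketch: the container name `traefik` and the helper name `has_pangolin_dns_error` are assumptions, not taken from the actual setup. The search string comes from the Traefik log lines quoted under Evidence.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 if the log text on stdin contains the
# DNS lookup failure seen during the incident, non-zero otherwise.
has_pangolin_dns_error() {
  # -F matches the string literally, -q suppresses output
  grep -q -F 'lookup pangolin on 127.0.0.11:53'
}

# Example usage (assumes the Traefik container is named "traefik"):
#   if docker logs --since 5m traefik 2>&1 | has_pangolin_dns_error; then
#     echo "Traefik still cannot resolve pangolin"
#   fi
```

The check reads from stdin, so it can be tried against saved log output without touching the running stack.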

Follow-up actions

  • Add swap to the VPS to reduce the risk of another OOM cascade
  • Add memory monitoring/alerting on the VPS
  • Consider adding a Traefik health-check/config-refresh cron job as a resilience measure
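
The memory monitoring item could start as a small script run from cron. The sketch below assumes a Linux VPS that exposes `MemTotal` and `MemAvailable` in `/proc/meminfo`. The function names and the 10 percent default threshold are hypothetical choices, and how the warning is delivered (mail, webhook, and so on) is left open.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the percentage of memory still available,
# computed from MemAvailable and MemTotal in a meminfo-format file.
mem_available_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "$meminfo"
}

# Hypothetical check: prints a warning to stderr and returns 1 when the
# available memory is below the threshold (default 10 percent, arbitrary).
# Returns 2 if the percentage could not be determined.
check_memory() {
  local threshold="${1:-10}" meminfo="${2:-/proc/meminfo}" pct
  pct="$(mem_available_pct "$meminfo")"
  [ -n "$pct" ] || return 2
  if [ "$pct" -lt "$threshold" ]; then
    echo "WARNING: only ${pct}% memory available" >&2
    return 1
  fi
}
```

Calling `check_memory 15` every few minutes from cron would flag the VPS when less than 15 percent of memory is available, ideally before the kernel OOM killer steps in.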