System status

Live health for TradesMen Messenger.

Public, no login required. Probes refresh every 30 seconds — the same checks our on-call team gets paged from.

A core system is down.
On-call has been paged. We update this page as the picture changes.
Updated just now

Components

Each tile probes the underlying system in real time. Latency is measured per request and ages out automatically.

PostgreSQL

Operational

SELECT 1 in 12 ms

12 ms

Redis

Operational

PING in 0 ms

0 ms

Realtime / WebSocket

Down

No websocket heartbeat seen yet.

no heartbeat yet

Background workers

Down

No workers have reported a heartbeat yet.

0 workers · 0 failed jobs

Encrypted media storage

Down

Local media volume is missing or read-only.

local driver

Push notifications

Degraded

No push providers enabled.

APNs off · FCM off · 0 queued

Live traffic

Public counters from the last 24 hours. We never expose per-user data on this page.

0 online right now
0 active devices · last 24h
0 messages relayed · last 24h
0 calls placed · last 24h

Recent incidents (last 90 days)

No incidents reported.

Nothing has crossed the incident threshold in the last 90 days.

Incident titles and internal notes are private. Severity, status, and timestamps are published.

Operating principles

How we run the service.

No magic-number SLAs we can't meet. Here's how we actually decide what counts as healthy, when we page on-call, and how we communicate during an incident.

Component-level objectives

Component                          | Latency target (p95)        | Page on-call when
---------------------------------- | --------------------------- | -------------------------------------------------
API (PostgreSQL)                   | < 200 ms                    | Probe fails 2 consecutive checks
Cache (Redis)                      | < 50 ms                     | PING fails or RTT > 200 ms for 2 minutes
WebSocket                          | Heartbeat < 60 s            | Heartbeat older than 90 s
Workers (push, retention, exports) | Heartbeat < 60 s per worker | Any worker stale > 5 min, or failed-job depth > 0
Storage (encrypted media)          | Reachability check < 1 s    | Probe fails 2 consecutive checks
Push (APNs / FCM)                  | Provider config valid       | Outbox backlog grows for 5 min straight
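
To make two of the paging rules above concrete, here is a minimal sketch of how "fails 2 consecutive checks" and "heartbeat older than 90 s" could be evaluated. The function names are hypothetical and this is not the service's actual alerting code. Treating a heartbeat that was never seen as stale is an assumption, chosen to match the "Down" tiles above.

```javascript
// Illustrative only: names are hypothetical, thresholds come from the table above.

// Returns a recorder that reports true once `threshold` probes have failed in a row.
function createFailureTracker(threshold = 2) {
  let consecutive = 0;
  return function record(probeOk) {
    consecutive = probeOk ? 0 : consecutive + 1; // any success resets the streak
    return consecutive >= threshold;             // true means "page on-call"
  };
}

// True when the last heartbeat is older than maxAgeMs (90 s for the WebSocket row).
// Assumption: a heartbeat that was never seen (null/undefined) counts as stale.
function isHeartbeatStale(lastHeartbeatMs, nowMs, maxAgeMs = 90_000) {
  if (lastHeartbeatMs == null) return true;
  return nowMs - lastHeartbeatMs > maxAgeMs;
}

const record = createFailureTracker(2);
console.log(record(false)); // false: first failure, no page yet
console.log(record(false)); // true: second consecutive failure
console.log(isHeartbeatStale(0, 90_001)); // true: 90.001 s old
```

A single successful probe resets the streak, so one flaky check on its own never pages.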

Service objectives

  • Availability: we target > 99.5% over a rolling 30-day window for the API, the WebSocket, and message delivery. We will publish actuals only once we have the operational history to back them up.
  • Push delivery: > 99% of pushes reach APNs/FCM within 5 seconds of enqueue under healthy conditions.
  • Recovery: data-export jobs complete within 24 hours; failed jobs surface in the admin dashboard before they age out.
  • Backups: encrypted snapshots taken at least daily; restore is drilled.
  • Communication: the banner at the top of this page updates within one polling interval (30 s). Public incident summaries are published after resolution.
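
As a worked example of what the availability target above allows, the sketch below converts a target into a downtime budget. The function name is hypothetical. The arithmetic is simply (1 − target) × window length, so a target above 99.5% over 30 days leaves just under 216 minutes (3.6 hours) of downtime.

```javascript
// Illustrative arithmetic only; not an official SLA calculator.
function downtimeBudgetMinutes(availabilityTarget, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  // Round to two decimals to hide floating-point noise.
  return Math.round((1 - availabilityTarget) * windowMinutes * 100) / 100;
}

console.log(downtimeBudgetMinutes(0.995, 30)); // 216
```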

How status works

Each component runs its own deep health check. The probes published here are the same ones our on-call rotation gets paged from — there is no second "marketing" status board.

  • PostgreSQL — primary persistence; SELECT 1 with measured round-trip.
  • Redis — cache and presence; PING with measured round-trip.
  • WebSocket — heartbeat from the realtime worker (60 s SLA).
  • Workers — push, cleanup, data export, retention; alongside the failed-jobs depth.
  • Storage — encrypted media volume / S3 bucket reachability.
  • Push — APNs / FCM provider config and outbox backlog.

Banner state rolls up to the worst component state: any "down" → down, any "degraded" → degraded, otherwise operational.
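
The rollup rule above can be sketched in a few lines. The names below are hypothetical, and returning "operational" for an empty component list is an assumption rather than documented behaviour.

```javascript
// Illustrative only: state names come from this page, everything else is hypothetical.
const STATE_RANK = new Map([
  ["operational", 0],
  ["degraded", 1],
  ["down", 2],
]);

// Returns the worst state present in the list.
function rollUpState(componentStates) {
  let worst = "operational"; // assumption: an empty list rolls up to operational
  for (const state of componentStates) {
    if (!STATE_RANK.has(state)) throw new Error(`unknown state: ${state}`);
    if (STATE_RANK.get(state) > STATE_RANK.get(worst)) worst = state;
  }
  return worst;
}

// The six component states shown in the tiles above.
console.log(
  rollUpState(["operational", "operational", "down", "down", "down", "degraded"])
); // "down"
```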

Incident response

If something is degraded, here's what happens behind the scenes:

  • On-call is paged through internal alerting — not via email.
  • The status board updates automatically as the picture changes.
  • We follow our breach-response policy for incidents that touch user data.
  • Post-incident, a public summary is added to the timeline above.

Need to reach security urgently? Email security@example.com.

For integrators

Wiring this into your own dashboard.

If you operate alongside us — running a crew, a fleet, or a status hub of your own — pull from the JSON feed. The contract below is stable.

Polling endpoint

GET /status.json returns the same data this page renders. No auth is required, responses are sent with Cache-Control: no-store, and the JSON has this shape:

{
  "overall_state": "operational | degraded | down",
  "components": [
    { "component": "...", "label": "...", "state": "...",
      "detail": "...", "latency_ms": 12, "last_checked_at": "..." }
  ],
  "snapshot": {
    "online_now": 0,
    "active_devices_last_24h": 0,
    "messages_last_24h": 0,
    "calls_last_24h": 0
  },
  "incidents": [
    { "severity": "...", "status": "...",
      "discovered_at": "...", "contained_at": "...", "resolved_at": "..." }
  ],
  "generated_at": "ISO-8601 UTC"
}

Poll no more than once every 30 seconds. That is the cadence the page itself uses, and it is the upstream check interval too.
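
A minimal polling sketch that respects this cadence might look like the following. Only /status.json and the 30-second floor come from this page. The function names, the options object, and the injectable fetchImpl are assumptions made for illustration, and this is not the code shipped in /assets/js/app.js.

```javascript
// Illustrative only: helper names and options are hypothetical.
const MIN_POLL_INTERVAL_MS = 30_000;

// Clamp a requested interval so we never poll faster than the upstream checks run.
function effectiveIntervalMs(requestedMs) {
  const n = Number(requestedMs);
  return Number.isFinite(n) ? Math.max(n, MIN_POLL_INTERVAL_MS) : MIN_POLL_INTERVAL_MS;
}

// Polls <baseUrl>/status.json until the returned stop() function is called.
// `fetchImpl` is injectable so the loop can be exercised without a network.
function startStatusPolling(baseUrl, onStatus, onError = () => {}, options = {}) {
  const { intervalMs = MIN_POLL_INTERVAL_MS, fetchImpl = globalThis.fetch } = options;
  const delay = effectiveIntervalMs(intervalMs);
  const url = new URL("/status.json", baseUrl).toString();
  let timer = null;
  let stopped = false;

  async function tick() {
    try {
      const res = await fetchImpl(url, { cache: "no-store" });
      if (!res.ok) throw new Error(`status feed returned HTTP ${res.status}`);
      onStatus(await res.json());
    } catch (err) {
      onError(err); // keep polling after a failed request
    } finally {
      if (!stopped) timer = setTimeout(tick, delay);
    }
  }

  tick(); // first request goes out immediately
  return function stop() {
    stopped = true;
    if (timer !== null) clearTimeout(timer);
  };
}
```

Scheduling the next request only after the previous one settles means a slow response never causes overlapping polls.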

Embedding the banner

The simplest integration is a periodic fetch('/status.json'). The JS already shipped at /assets/js/app.js serves as a reference implementation: its poller pauses while the tab is hidden and resumes on focus, so it doesn't burn battery on background tabs.

  • Render overall_state as a coloured pill.
  • Render components[] as tiles; sort by state so failures bubble up.
  • Treat the absence of incidents[] as "no incidents in the last 90 days".

RSS / webhook subscriptions for status changes are planned. Until then, poll the JSON feed.