Live health for TradesMen Messenger.
Public, no login required. Probes refresh every 30 seconds — the same checks our on-call team gets paged from.
Components
Each tile probes the underlying system in real time. Latency is measured per request and ages out automatically.
PostgreSQL
Operational: SELECT 1 in 12 ms
Redis
Operational: PING in 0 ms
Realtime / WebSocket
Down: No websocket heartbeat seen yet.
Background workers
Down: No workers have reported a heartbeat yet.
Encrypted media storage
Down: Local media volume is missing or read-only.
Push notifications
Degraded: No push providers enabled.
Live traffic
Public counters from the last 24 hours. We never expose per-user data on this page.
Recent incidents (last 90 days)
No incidents reported.
Nothing has crossed the incident threshold in the last 90 days.
Incident titles and internal notes are private. Severity, status, and timestamps are published.
How we run the service.
No magic-number SLAs we can't meet. Here's how we actually decide what counts as healthy, when we page on-call, and how we communicate during an incident.
Component-level objectives
| Component | Latency target (p95) | Page-on-call when |
|---|---|---|
| API (PostgreSQL) | < 200 ms | Probe fails 2 consecutive checks |
| Cache (Redis) | < 50 ms | PING fails or RTT > 200 ms for 2 minutes |
| WebSocket | Heartbeat < 60 s | Heartbeat older than 90 s |
| Workers (push, retention, exports) | Heartbeat < 60 s per worker | Any worker stale > 5 min, or failed-job depth > 0 |
| Storage (encrypted media) | Reachability check < 1 s | Probe fails 2 consecutive checks |
| Push (APNs / FCM) | Provider config valid | Outbox backlog grows for 5 min straight |
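The paging conditions in the table can be read as small predicates. Below is a minimal sketch in Python of two of them: the "fails 2 consecutive checks" rule and the WebSocket "heartbeat older than 90 s" rule. The names `ConsecutiveFailureRule` and `heartbeat_should_page` are hypothetical and are not taken from the service's actual alerting code.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureRule:
    """Hypothetical helper: signals a page after `threshold` failed checks in a row."""

    threshold: int = 2  # "Probe fails 2 consecutive checks" from the table
    failures: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one probe result; return True while the failure streak meets the threshold."""
        # A single success resets the streak, so an isolated blip never pages.
        self.failures = 0 if probe_ok else self.failures + 1
        return self.failures >= self.threshold


def heartbeat_should_page(age_seconds: float, page_after: float = 90.0) -> bool:
    """WebSocket row: page when the last heartbeat is older than 90 s."""
    return age_seconds > page_after
```

With probes every 30 seconds, a two-check rule means a hard failure would be flagged roughly 30 to 60 seconds after it starts, assuming the alerting loop runs on the same cadence as the probes.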
Service objectives
- Availability: we target > 99.5% over a rolling 30-day window for the API, the WebSocket, and message delivery. We publish actuals only after we have the operational history to back them up.
- Push delivery: > 99% of pushes reach APNs/FCM within 5 seconds of enqueue under healthy conditions.
- Recovery: data-export jobs complete within 24 hours; failed jobs surface in the admin dashboard before they age out.
- Backups: encrypted snapshots taken at least daily; restore is drilled.
- Communication: the banner at the top of this page updates within one polling interval (30 s). Public incident summaries are published after resolution.
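To make the availability target concrete, it can be converted into a downtime budget. This is a small worked calculation, not a published commitment; the helper name `downtime_budget_minutes` is hypothetical.

```python
def downtime_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a window can absorb while still meeting the target."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return window_minutes * (1.0 - availability_target)


budget = downtime_budget_minutes(0.995)
print(f"{budget:.0f} minutes (~{budget / 60:.1f} hours) per 30 days")
# prints "216 minutes (~3.6 hours) per 30 days"
```

Because the target is stated as "> 99.5%", 216 minutes is an upper bound on tolerated downtime rather than an allowance.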
How status works
Each component runs its own deep health check. The probes published here are the same ones our on-call rotation gets paged from — there is no second "marketing" status board.
- PostgreSQL: primary persistence; SELECT 1 with measured round-trip.
- Redis: cache and presence; PING with measured round-trip.
- WebSocket: heartbeat from the realtime worker (60 s target).
- Workers: per-worker heartbeats for push, cleanup, data export, and retention, alongside the failed-jobs depth.
- Storage: encrypted media volume / S3 bucket reachability.
- Push: APNs / FCM provider config and outbox backlog.
Banner state rolls up to the worst component state: any "down" → down, any "degraded" → degraded, otherwise operational.
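The roll-up rule above is a worst-state-wins reduction. A minimal sketch in Python follows; the function name `overall_state` mirrors the JSON field but is otherwise hypothetical. It also assumes that any state string other than "down" or "degraded" counts as operational, which is the literal reading of "otherwise operational".

```python
def overall_state(component_states):
    """Roll component states up to the worst one: down > degraded > operational."""
    states = set(component_states)
    if "down" in states:
        return "down"
    if "degraded" in states:
        return "degraded"
    # Assumption: anything else, including an empty list, is treated as operational.
    return "operational"


# The six tiles shown at the top of this page: 2 operational, 3 down, 1 degraded.
tiles = ["operational", "operational", "down", "down", "down", "degraded"]
print(overall_state(tiles))  # prints "down"
```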
Incident response
If something is degraded, here's what happens behind the scenes:
- On-call is paged through internal alerting — not via email.
- The status board updates automatically as the picture changes.
- We follow our breach-response policy for incidents that touch user data.
- Post-incident, a public summary is added to the timeline above.
Need to reach security urgently? Email security@example.com.
Wiring this into your own dashboard.
If you operate alongside us — running a crew, a fleet, or a status hub of your own — pull from the JSON feed. The contract below is stable.
Polling endpoint
GET /status.json returns the same data this page renders. No auth, Cache-Control: no-store, JSON shape:
{
  "overall_state": "operational | degraded | down",
  "components": [
    { "component": "...", "label": "...", "state": "...",
      "detail": "...", "latency_ms": 12, "last_checked_at": "..." }
  ],
  "snapshot": {
    "online_now": 0,
    "active_devices_last_24h": 0,
    "messages_last_24h": 0,
    "calls_last_24h": 0
  },
  "incidents": [
    { "severity": "...", "status": "...",
      "discovered_at": "...", "contained_at": "...", "resolved_at": "..." }
  ],
  "generated_at": "ISO-8601 UTC"
}
Poll at most every 30 seconds — that's the cadence the page itself uses, and it's the upstream check interval too.
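For a server-side consumer, a polling client could look roughly like the sketch below. It uses only the Python standard library. The base URL passed in is a placeholder you would supply, and the `fetch` and `sleep` parameters are hypothetical hooks added here so the loop can be exercised without a network.

```python
import json
import time
import urllib.request

POLL_INTERVAL_S = 30  # the page's own cadence; polling faster yields no fresher data


def fetch_status(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/status.json and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/status.json", timeout=timeout) as resp:
        return json.load(resp)


def poll(base_url, on_status, fetch=fetch_status, sleep=time.sleep, max_polls=None):
    """Call on_status(payload) once per interval; a failed poll skips that cycle."""
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            on_status(fetch(base_url))
        except (OSError, ValueError) as exc:
            # urllib's URLError subclasses OSError; malformed JSON raises ValueError.
            print(f"status poll failed: {exc}")
        polls += 1
        sleep(POLL_INTERVAL_S)
```

A call such as `poll(base_url, lambda s: print(s["overall_state"]))` would print the banner state every 30 seconds until interrupted. Treating a failed poll as "unknown" rather than "down" is a design choice of this sketch: a network error between you and the feed says nothing about the service itself.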
Embedding the banner
The simplest integration is a periodic fetch('/status.json'); the JS already shipped at /assets/js/app.js serves as a reference implementation. That poller pauses while the tab is hidden and resumes on focus, so you don't burn battery on background tabs.
- Render overall_state as a coloured pill.
- Render components[] as tiles; sort by state so failures bubble up.
- Treat the absence of incidents[] as "no incidents in the last 90 days".
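The rendering guidance above amounts to two small data transforms, sketched here in Python for a server-rendered dashboard. The helper names `tiles_for_display` and `incident_summary` are hypothetical, as are the component identifiers in the usage lines, since the feed's actual values are not listed on this page. The sketch also assumes that an unrecognised state should sort to the top rather than be hidden.

```python
SEVERITY_ORDER = {"down": 0, "degraded": 1, "operational": 2}


def tiles_for_display(payload: dict) -> list:
    """Return components[] sorted so failures come first (order kept within a state)."""
    components = payload.get("components", [])
    # Assumption: unknown states rank with "down" so they stay visible.
    return sorted(components, key=lambda c: SEVERITY_ORDER.get(c.get("state"), 0))


def incident_summary(payload: dict) -> str:
    """Treat a missing or empty incidents[] as 'no incidents in the last 90 days'."""
    incidents = payload.get("incidents") or []
    if not incidents:
        return "No incidents in the last 90 days"
    return f"{len(incidents)} incident(s) in the last 90 days"


example = {"components": [
    {"component": "postgres", "state": "operational"},
    {"component": "push", "state": "degraded"},
    {"component": "storage", "state": "down"},
]}
print([c["component"] for c in tiles_for_display(example)])
# prints ['storage', 'push', 'postgres']
```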
RSS / webhook subscriptions for status changes are planned. Until then, polling is the only integration path.