Heartbeats — Dead-Man’s-Switch

If your trading bot crashes between placing orders and the next intended cancel, those orders sit on the book exposed to adverse selection until you notice and intervene. The heartbeat dead-man’s-switch protects unattended bots by auto-cancelling all of your resting orders if the server stops receiving heartbeats for longer than a deadline you choose. This mirrors the dead-man’s-switch behaviour of Polymarket’s POST /heartbeats: the same path, the same interval_ms-style cadence, and the same “miss a heartbeat → all your resting orders are auto-cancelled” guarantee. A Polymarket (PM) SDK’s heartbeat loop therefore keeps your orders protected when pointed at PolySimulator.
The 200 response body is not wire-identical to Polymarket’s. PolySimulator returns {"ok": true, "expires_at_ms": <int>}, whereas Polymarket’s /heartbeats returns a {"status": <string>} body (and its newer SDKs thread a heartbeat_id through each call). If your bot reads PM’s status or heartbeat_id field off the response, it will get None (with dict.get) or raise a KeyError (with direct indexing) against PolySimulator; read ok and expires_at_ms instead. The dead-man’s-switch fires on the absence of a heartbeat, so a bot that ignores the response body entirely and just keeps pinging on a timer is fully protected on both platforms.
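A bot that runs against both platforms can read the response defensively instead of indexing into it. The helper below is a minimal sketch: the function name heartbeat_expiry_ms is hypothetical, and the PM-style body uses a placeholder status value because only the field name is described above.

```python
from typing import Optional


def heartbeat_expiry_ms(body: dict) -> Optional[int]:
    """Return expires_at_ms for a PolySimulator-shaped body, else None."""
    expiry = body.get("expires_at_ms")
    if body.get("ok") is True and isinstance(expiry, int):
        return expiry
    # A Polymarket-shaped body ({"status": ...}) carries no expiry timestamp,
    # so the caller should simply keep pinging on its timer.
    return None


polysim_body = {"ok": True, "expires_at_ms": 1715518800123}
pm_style_body = {"status": "placeholder"}  # placeholder value, shape only

print(heartbeat_expiry_ms(polysim_body))   # 1715518800123
print(heartbeat_expiry_ms(pm_style_body))  # None
```

Because the switch fires on absence, the return value is only useful for logging or for sanity-checking the local timer; the loop should not stop pinging when it gets None.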

Endpoints

PolySimulator exposes two paths that route through the same handler:
| Path | Notes |
| --- | --- |
| POST /heartbeats | Polymarket-shape root path (no /v1/), for SDKs ported from PM without URL rewrite. |
| POST /v1/heartbeats | PolySimulator canonical /v1/ alias, preferred for new code. |

Both accept the same body and emit the same response.

Request

```json
{
  "interval_ms": 5000,
  "client_label": "alpha-bot"
}
```

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| interval_ms | int | Yes | Deadline between heartbeats, in milliseconds. Bounded [1000, 60000]. |
| client_label | string | No | Free-form label so a single API key can register multiple independent heartbeats from different bot processes. Defaults to "" (single-stream). Max 64 chars. |

Response — 200 OK

```json
{
  "ok": true,
  "expires_at_ms": 1715518800123
}
```

| Field | Type | Notes |
| --- | --- | --- |
| ok | bool | Always true on a successful registration / refresh. |
| expires_at_ms | int | Wall-clock (Unix milliseconds) at which the dead-man’s-switch will fire if no further heartbeat arrives. |

The expires_at_ms value is last_heartbeat_at_ms + interval_ms + grace, where:
  • grace = max(1000ms, 0.25 × interval_ms) — absorbs network jitter and the 1-second sweeper tick so bots pinging at exact intervals don’t trigger spurious cancels.
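As a worked example of the formula above, the sketch below computes the grace period and the resulting expiry. The function names are illustrative, and rounding 0.25 × interval_ms down to a whole millisecond is an assumption; the rounding rule is not specified here.

```python
def grace_ms(interval_ms: int) -> int:
    """grace = max(1000 ms, 0.25 * interval_ms), rounded down (assumed)."""
    return max(1000, interval_ms // 4)


def expires_at_ms(last_heartbeat_at_ms: int, interval_ms: int) -> int:
    """Expiry = last heartbeat + interval + grace, all in Unix milliseconds."""
    return last_heartbeat_at_ms + interval_ms + grace_ms(interval_ms)


print(grace_ms(5000))                        # 1250
print(grace_ms(2000))                        # 1000 (the 1000 ms floor applies)
print(expires_at_ms(1715518793873, 5000))    # 1715518800123
```

With interval_ms = 5000, a bot therefore has 6250 ms from its last accepted heartbeat before the switch fires.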

How to use it (the heartbeat loop)

The expected pattern is to ping at half your interval_ms so you stay comfortably ahead of expiry even with one missed beat.
```python
import asyncio
import httpx

API_KEY = "ps_live_..."
INTERVAL_MS = 5000          # 5-second deadline
BEAT_EVERY = INTERVAL_MS / 2 / 1000  # ping every 2.5s

async def heartbeat_loop():
    async with httpx.AsyncClient(
        base_url="https://api.polysimulator.com",
        headers={"X-API-Key": API_KEY},
        # Fail fast: httpx's default timeout is 5s, which would let a single
        # hung request consume the whole INTERVAL_MS deadline used here.
        timeout=BEAT_EVERY,
    ) as client:
        while True:
            try:
                resp = await client.post(
                    "/v1/heartbeats",
                    json={"interval_ms": INTERVAL_MS},
                )
                resp.raise_for_status()
            except Exception as e:
                # NOTE: if the heartbeat call itself fails (network blip
                # or a 5xx), DON'T treat that as "bot is dead": the
                # server-side switch fires only on absence, not error.
                # Just log and try again on the next tick.
                print(f"heartbeat failed: {e!r}")
            await asyncio.sleep(BEAT_EVERY)

asyncio.run(heartbeat_loop())
```

What happens when a heartbeat is missed?

A background sweeper runs every ~1 second. When it finds a registration whose expires_at_ms is in the past, it:
  1. Removes the registration from the Redis registry.
  2. Cancels all pending limit orders for the API key’s account (same logic as POST /v1/cancel-all — refunds BUY notional, returns SELL shares to position).
  3. Logs a structured warning: heartbeat: dead-man's-switch triggered for user_id=... tier=... — cancelled N orders.
  4. Increments polysim_heartbeat_dead_mans_switch_triggered_total{tier=...} once per registration that fired (the metric counts registrations, not individual orders).
To resume, the bot just calls POST /v1/heartbeats again — a new registration is created from scratch.
The dead-man’s-switch cancels every resting order for the API key’s account, including orders placed by other processes sharing the same key. If you run multiple bot strategies on one key, use client_label to register independent heartbeats — but be aware that the cancel-all still cancels EVERY pending order, not just the labelled subset. Use distinct API keys per strategy if you need strategy-level isolation.
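The sweep steps and the account-wide cancel can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the server’s code: the dict-based registry, the sweep function, and the sample identifiers are all hypothetical, and refunds and the warning log are left out.

```python
def sweep(registrations: dict, pending_orders: dict, now_ms: int) -> dict:
    """Run one sweeper tick; return {registration_key: cancelled_order_count}."""
    fired = {}
    expired = [k for k, reg in registrations.items() if reg["expires_at_ms"] < now_ms]
    for key in expired:
        reg = registrations.pop(key)          # step 1: remove the registration
        orders = pending_orders[reg["user_id"]]
        fired[key] = len(orders)              # step 4: counted once per registration
        orders.clear()                        # step 2: cancel-all, regardless of label
    return fired


# Two labelled heartbeats on one key; only "alpha" has expired.
registrations = {
    ("key-1", "alpha"): {"user_id": "u1", "expires_at_ms": 1_000},
    ("key-1", "beta"): {"user_id": "u1", "expires_at_ms": 9_000},
}
# order_id -> label of the process that placed it (kept only for illustration)
pending_orders = {"u1": {"o1": "alpha", "o2": "alpha", "o3": "beta"}}

fired = sweep(registrations, pending_orders, now_ms=5_000)
print(fired)                 # {('key-1', 'alpha'): 3}
print(pending_orders)        # {'u1': {}}
print(list(registrations))   # [('key-1', 'beta')]
```

The "beta" registration is still alive, yet its order o3 was cancelled along with the others, which is why separate API keys are the safer choice for strategy-level isolation.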

Bounds and error responses

| Field | Bound | Out-of-range response |
| --- | --- | --- |
| interval_ms | [1000, 60000] | 422 Unprocessable Entity (Pydantic validation envelope) |
| client_label | ≤ 64 chars | 422 Unprocessable Entity |

The [1000, 60000] bound is deliberate:
  • Below 1000ms would burn rate-limit quota (2 RPS per registration just for heartbeats) with no safety benefit beyond what 1-second sweeps already provide.
  • Above 60000ms defeats the point — a crashed bot would stay exposed for a full minute before its orders cancel.
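A client can mirror these bounds locally so that a misconfigured bot fails at startup rather than on its first 422. The helper below is a sketch; build_heartbeat_payload is a hypothetical name, and the server remains the source of truth for validation.

```python
MIN_INTERVAL_MS = 1000
MAX_INTERVAL_MS = 60000
MAX_LABEL_CHARS = 64


def build_heartbeat_payload(interval_ms: int, client_label: str = "") -> dict:
    """Validate against the documented bounds and return the request body."""
    if not MIN_INTERVAL_MS <= interval_ms <= MAX_INTERVAL_MS:
        raise ValueError(
            f"interval_ms must be in [{MIN_INTERVAL_MS}, {MAX_INTERVAL_MS}], got {interval_ms}"
        )
    if len(client_label) > MAX_LABEL_CHARS:
        raise ValueError(f"client_label must be at most {MAX_LABEL_CHARS} chars")
    payload = {"interval_ms": interval_ms}
    if client_label:
        # Omit the field for the default single-stream registration.
        payload["client_label"] = client_label
    return payload


print(build_heartbeat_payload(5000, "alpha-bot"))
# {'interval_ms': 5000, 'client_label': 'alpha-bot'}
```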

Storage

Heartbeat registrations live in Redis:
  • A sorted set heartbeats:expiry indexes every registration by its expires_at_ms. The sweeper scans this set every second with ZRANGEBYSCORE 0 now to find expired registrations.
  • A per-registration hash heartbeats:reg:{api_key_id}:{client_label} holds the metadata (user_id, interval_ms, last_heartbeat_at_ms, tier) needed by the sweeper to invoke cancel-all.
Both writes happen atomically in a single pipeline so a sweep can’t see half-written state.
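The two-key layout can be sketched with redis-py’s transactional pipeline, which wraps the queued commands in MULTI/EXEC. This is an illustration under assumptions: the function names are hypothetical, the sorted-set member format "{api_key_id}:{client_label}" is a guess (only the hash key format is given above), and r is expected to be a redis.Redis client created by the caller.

```python
def write_registration(r, api_key_id: str, client_label: str, user_id: str,
                       tier: str, interval_ms: int, now_ms: int) -> int:
    """Write the expiry index entry and the metadata hash in one transaction."""
    grace = max(1000, interval_ms // 4)
    expires_at_ms = now_ms + interval_ms + grace
    member = f"{api_key_id}:{client_label}"      # assumed member format

    pipe = r.pipeline(transaction=True)          # queued, then sent as MULTI/EXEC
    pipe.zadd("heartbeats:expiry", {member: expires_at_ms})
    pipe.hset(
        f"heartbeats:reg:{member}",
        mapping={
            "user_id": user_id,
            "interval_ms": interval_ms,
            "last_heartbeat_at_ms": now_ms,
            "tier": tier,
        },
    )
    pipe.execute()                               # both writes apply together
    return expires_at_ms


def expired_members(r, now_ms: int) -> list:
    """Equivalent of ZRANGEBYSCORE heartbeats:expiry 0 now."""
    return r.zrangebyscore("heartbeats:expiry", 0, now_ms)
```

Keeping the expiry in a sorted-set score means the sweeper’s scan cost depends on how many registrations have expired, not on how many exist.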

Durability across api-worker restarts

State is in Redis, so restarting the api process (deploy, OOM, graceful reload) preserves every active heartbeat. A bot’s next refresh after the restart simply bumps the expiry forward — no need to re-register from scratch.

Multi-worker correctness

In a multi-worker deployment (the production topology), a refresh landing on worker A and a sweep running on worker B are coordinated via Redis:
  • The sweeper acquires a 60-second leader lock; only one worker sweeps at any time. A crash on the leader promotes a follower within at most 60 s without manual intervention.
  • The sweep itself uses WATCH / MULTI around ZREM so a refresh that lands between the candidate scan and the claim aborts the transaction and keeps the bot alive — no false dead-man fires from race conditions across workers.
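The race that WATCH / MULTI guards against can be modelled without a Redis server. The sketch below is an in-memory stand-in, not Redis calls: a version counter plays the role of WATCH, and every class, method, and member name is hypothetical. Which key the real sweeper watches is also an assumption.

```python
class WatchAborted(Exception):
    """The watched data changed before the transaction committed."""


class ExpiryIndex:
    """In-memory stand-in for the expiry sorted set, with a version counter."""

    def __init__(self):
        self.scores = {}
        self.version = 0

    def refresh(self, member: str, expires_at_ms: int) -> None:
        self.scores[member] = expires_at_ms
        self.version += 1                 # any write invalidates open watches

    def commit_remove(self, member: str, watched_version: int) -> None:
        # Analogue of MULTI / ZREM / EXEC: abort if anything changed since WATCH.
        if self.version != watched_version:
            raise WatchAborted(member)
        del self.scores[member]
        self.version += 1


def try_claim(index: ExpiryIndex, member: str, now_ms: int, race=None) -> bool:
    """Return True only if this sweeper may fire the switch for member."""
    watched = index.version               # analogue of WATCH
    score = index.scores.get(member)
    if score is None or score > now_ms:
        return False                      # already refreshed or already claimed
    if race is not None:
        race()                            # simulate a refresh on another worker
    try:
        index.commit_remove(member, watched)
    except WatchAborted:
        return False                      # refresh won the race; bot stays alive
    return True


index = ExpiryIndex()
index.refresh("key-1:", 1_000)
index.refresh("key-2:", 1_000)

print(try_claim(index, "key-1:", now_ms=2_000))   # True
print(try_claim(index, "key-2:", now_ms=2_000,
                race=lambda: index.refresh("key-2:", 9_000)))   # False
print(index.scores)                               # {'key-2:': 9000}
```

The second claim aborts because the refresh bumped the version between the scan and the commit, so the registration survives with its new expiry.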

Observability

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| polysim_heartbeat_dead_mans_switch_triggered_total | Counter | tier | Incremented once per timed-out registration (NOT once per cancelled order). |

There’s also a structured warning log on every fire — search for heartbeat: dead-man's-switch triggered in your logs.