Tempo Runtime — Node → Go, Cosmos → Cassandra
Customer hot path — sub-50ms p95, 100K reads/sec/region
The customer-facing read service of the Tempo CMS ecosystem. Given tenant + channel + pageType (+ pageId + zone + targeting + Expo variant), returns the prioritized list of modules that render for that customer in that moment. Mid-rewrite from Node + Cosmos to Go + Cassandra.
The problem
Node + Cosmos was costly and slow to cold-start (3 minutes), and its ingestion pipeline was a periodic full-DB scan that took 3–5 minutes to propagate a publish. A complete runtime outage blanks the homepage, so the rewrite had to be canary-driven, dual-written, shadow-compared, and gradually ramped by tenant group, never a flag-day cutover.
Three-tier cache topology
Ristretto L1 (in-process LFU, microseconds) serves the hottest homepage keys without crossing the network. Meghacache L2 (18 nodes across 3 DCs, single-digit ms) absorbs most of what L1 misses. The Cassandra origin (single-digit to low tens of ms, LOCAL_ONE consistency) is touched only on TTL expiry or a genuine cache miss.
The Meghacache cluster is shared with the legacy Node runtime during cutover, with a version-namespaced key schema so the two services can’t poison each other’s reads.
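The read path described above can be sketched as a read-through over ordered tiers. This is a minimal illustration under stated assumptions, not the production code: the Tier interface, mapTier, versionedKey, and the "v2" key prefix are hypothetical names, the "web" and "homepage" values are placeholders, in-memory maps stand in for Ristretto, Meghacache, and Cassandra, and TTL expiry is not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned by a tier that does not hold the key.
var ErrNotFound = errors.New("not found")

// Tier is a hypothetical minimal interface over one cache level or the origin.
type Tier interface {
	Get(key string) ([]byte, error)
	Set(key string, val []byte)
}

// mapTier is an in-memory stand-in used only to make the sketch runnable.
type mapTier struct {
	data map[string][]byte
	hits int
}

func newMapTier() *mapTier { return &mapTier{data: map[string][]byte{}} }

func (m *mapTier) Get(key string) ([]byte, error) {
	v, ok := m.data[key]
	if !ok {
		return nil, ErrNotFound
	}
	m.hits++
	return v, nil
}

func (m *mapTier) Set(key string, val []byte) { m.data[key] = val }

// versionedKey prefixes the key with a schema version so two runtimes sharing
// one cache cluster never read each other's entries. The layout is assumed.
func versionedKey(version, tenant, channel, pageType string) string {
	return fmt.Sprintf("%s:%s:%s:%s", version, tenant, channel, pageType)
}

// readThrough tries each tier in order and, on a hit, backfills the faster
// tiers that missed.
func readThrough(tiers []Tier, key string) ([]byte, error) {
	for i, t := range tiers {
		val, err := t.Get(key)
		if errors.Is(err, ErrNotFound) {
			continue
		}
		if err != nil {
			return nil, err
		}
		for _, faster := range tiers[:i] {
			faster.Set(key, val)
		}
		return val, nil
	}
	return nil, ErrNotFound
}

func main() {
	l1, l2, origin := newMapTier(), newMapTier(), newMapTier()
	key := versionedKey("v2", "WM_GLASS", "web", "homepage")
	origin.Set(key, []byte("modules"))
	tiers := []Tier{l1, l2, origin}
	for i := 0; i < 2; i++ {
		if _, err := readThrough(tiers, key); err != nil {
			fmt.Println("read failed:", err)
			return
		}
	}
	// First read falls through to the origin; second is served by L1.
	fmt.Println(l1.hits, l2.hits, origin.hits) // prints "1 0 1"
}
```

Because the version is part of the key, a lookup under a different prefix simply misses rather than returning a payload written by the other runtime.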
Cassandra schema designed for one-shot reads
Cosmos was a per-container JSON document store. Cassandra is wide-column with explicit partition and clustering keys. The new schema partitions by (tenant, channel, pageType) and clusters by module version, so a single read returns the whole page payload without scatter-gather across nodes.
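As a rough illustration of that key layout, the sketch below holds a CQL table definition and the matching single-partition query as Go constants. Only the partition key (tenant, channel, pageType) and the clustering by module version come from the text; the keyspace, table name, column names, and column types are assumptions made for the example.

```go
package main

import "fmt"

// createTable is a hypothetical CQL definition. The double parentheses make
// (tenant, channel, page_type) a composite partition key; module_version is
// the clustering column. Names and types are assumed.
const createTable = `
CREATE TABLE IF NOT EXISTS tempo.page_modules (
    tenant         text,
    channel        text,
    page_type      text,
    module_version bigint,
    payload        blob,
    PRIMARY KEY ((tenant, channel, page_type), module_version)
);`

// selectPage restricts every partition-key column by equality, so the query
// targets exactly one partition instead of fanning out across nodes.
const selectPage = `
SELECT module_version, payload
FROM tempo.page_modules
WHERE tenant = ? AND channel = ? AND page_type = ?;`

// PageKey mirrors the partition key.
type PageKey struct {
	Tenant, Channel, PageType string
}

// bindArgs returns the bind values in the order selectPage expects them.
func (k PageKey) bindArgs() []interface{} {
	return []interface{}{k.Tenant, k.Channel, k.PageType}
}

func main() {
	key := PageKey{Tenant: "WM_GLASS", Channel: "web", PageType: "homepage"}
	fmt.Println(createTable)
	fmt.Println(selectPage)
	fmt.Println(key.bindArgs()...)
}
```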
Kafka-driven sync replaces periodic full scans
Tempo Service emits to KAFKA-V2-TEMPO-MOD-PROD on every commit. The Content Sync Service consumes that stream and writes to Cassandra (Go runtime) or Cosmos (legacy Node runtime). Publish propagation dropped from 3–5 minutes (full DB scan) to seconds.
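The shape of an event-driven sync can be sketched as a consumer loop that applies each publish event as it arrives. This is an illustration only: a Go channel stands in for the Kafka topic, Store stands in for the Cassandra or Cosmos writer and keeps only the latest version per key, and the PublishEvent fields and the version guard that makes redelivery harmless are assumptions rather than the actual Content Sync design.

```go
package main

import "fmt"

// PublishEvent is a hypothetical shape for one message on the topic.
type PublishEvent struct {
	Tenant, Channel, PageType string
	Version                   int64
	Payload                   string
}

// Store is an in-memory stand-in for the database writer.
type Store struct {
	rows map[string]PublishEvent
}

func NewStore() *Store { return &Store{rows: map[string]PublishEvent{}} }

// Apply upserts the event unless an equal or newer version is already stored,
// so replays and redeliveries do not regress data. It reports whether it wrote.
func (s *Store) Apply(ev PublishEvent) bool {
	key := ev.Tenant + "|" + ev.Channel + "|" + ev.PageType
	if cur, ok := s.rows[key]; ok && cur.Version >= ev.Version {
		return false
	}
	s.rows[key] = ev
	return true
}

// consume drains the stream until it is closed and returns the write count.
func consume(events <-chan PublishEvent, s *Store) int {
	written := 0
	for ev := range events {
		if s.Apply(ev) {
			written++
		}
	}
	return written
}

func main() {
	events := make(chan PublishEvent, 3)
	events <- PublishEvent{"WM_GLASS", "web", "homepage", 1, "a"}
	events <- PublishEvent{"WM_GLASS", "web", "homepage", 2, "b"}
	events <- PublishEvent{"WM_GLASS", "web", "homepage", 2, "b"} // redelivery
	close(events)

	fmt.Println(consume(events, NewStore())) // prints 2
}
```

The point of the sketch is that work is proportional to the number of publishes, not to the size of the database, which is where the drop from minutes to seconds comes from.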
Tenant-group sharding for bounded blast radius
The Go deployment is split into three tenant groups via separate KITT files — TG1 (WM_GLASS), TG2 (CA + MX), TG3 (SAMs + long-tail). A bad rollout is bounded by tenant group rather than affecting all 38 tenants simultaneously.
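The grouping itself lives in deployment configuration (the KITT files), but the mapping can be illustrated in a few lines. In this sketch the tenant identifier spellings other than WM_GLASS, the groupFor and servedByGo helpers, and the idea of a per-group cutover switch are assumptions for the example.

```go
package main

import "fmt"

// TenantGroup names one independently deployed slice of the fleet.
type TenantGroup string

const (
	TG1 TenantGroup = "TG1"
	TG2 TenantGroup = "TG2"
	TG3 TenantGroup = "TG3"
)

// explicitGroups pins the named tenants. Identifier spellings are assumed.
var explicitGroups = map[string]TenantGroup{
	"WM_GLASS": TG1,
	"CA":       TG2,
	"MX":       TG2,
	"SAMS":     TG3,
}

// groupFor returns the tenant's group; unlisted long-tail tenants fall to TG3.
func groupFor(tenant string) TenantGroup {
	if g, ok := explicitGroups[tenant]; ok {
		return g
	}
	return TG3
}

// servedByGo reports whether a tenant's group has been cut over. Rolling back
// one group leaves tenants in the other groups untouched.
func servedByGo(tenant string, cutOver map[TenantGroup]bool) bool {
	return cutOver[groupFor(tenant)]
}

func main() {
	cutOver := map[TenantGroup]bool{TG3: true} // hypothetical ramp state
	for _, t := range []string{"WM_GLASS", "MX", "SAMS", "SOME_SMALL_TENANT"} {
		fmt.Println(t, groupFor(t), servedByGo(t, cutOver))
	}
}
```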
Shadow-compare migration
The Node service POSTs every request + response to the Go canary asynchronously, gated by CCM feature flags (isCallToGoLangEnabled, CallGoLangByQueryParam, CallGolangCompareTimeOut). A daily checksum + record-count diff job compares Cosmos vs Cassandra and alerts when drift exceeds 0.1%. Canary started at uswest-stage-az-301; tenant-group cutover is in progress.
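The drift check reduces to a ratio compared against a threshold. The sketch below is one way to compute it, assuming each store can be summarized as a map from record key to checksum; DiffReport, compare, and that map representation are hypothetical, and only the 0.1% threshold comes from the text.

```go
package main

import "fmt"

// driftThreshold is 0.1% expressed as a fraction.
const driftThreshold = 0.001

// DiffReport summarizes one comparison run.
type DiffReport struct {
	Total      int // distinct keys seen in either store
	Mismatched int // keys missing on one side or with differing checksums
}

// Drift returns the mismatched fraction, or 0 when nothing was compared.
func (r DiffReport) Drift() float64 {
	if r.Total == 0 {
		return 0
	}
	return float64(r.Mismatched) / float64(r.Total)
}

// ShouldAlert is true only when drift strictly exceeds the threshold.
func (r DiffReport) ShouldAlert() bool { return r.Drift() > driftThreshold }

// compare walks both key-to-checksum maps. A key present on only one side
// counts as a mismatch, which also covers record-count differences.
func compare(legacy, next map[string]string) DiffReport {
	r := DiffReport{}
	for k, sum := range legacy {
		r.Total++
		if other, ok := next[k]; !ok || other != sum {
			r.Mismatched++
		}
	}
	for k := range next {
		if _, ok := legacy[k]; !ok {
			r.Total++
			r.Mismatched++
		}
	}
	return r
}

func main() {
	cosmos := map[string]string{"a": "c1", "b": "c2", "c": "c3"}
	cassandra := map[string]string{"a": "c1", "b": "XX", "d": "c4"}
	r := compare(cosmos, cassandra)
	fmt.Println(r.Total, r.Mismatched, r.ShouldAlert())
}
```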
Defense in depth at the edge
Echonyx OPUS sits at the edge with auth: resign and explicitly allow-lists only GET /api/v[12]/tempo/layouts; the rest of the runtime surface is denied at the edge. Istio sidecars enforce mTLS for all internal traffic. The selection algorithm (tenant filter → channel filter → page resolution → zone filter → targeting evaluation → Expo overlay) runs inside that hardened perimeter.
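The selection stages named above can be read as a sequence of filters followed by an overlay and a sort. The following is a simplified sketch under assumptions, not the ported algorithm: the Module and Request fields, the single-segment targeting rule, "lower priority renders first", and modeling the Expo overlay as per-module priority overrides are all hypothetical, and pageId resolution is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// Module is a hypothetical, flattened view of one renderable module.
type Module struct {
	ID, Tenant, Channel, PageType, Zone string
	Segment                             string // empty means "everyone" (assumed)
	Priority                            int    // lower renders first (assumed)
}

// Request carries the lookup dimensions. ExpoOverrides maps a module ID to
// the priority assigned by the experiment variant (assumed overlay model).
type Request struct {
	Tenant, Channel, PageType, Zone, Segment string
	ExpoOverrides                            map[string]int
}

// selectModules applies the stages in order and returns modules by priority.
func selectModules(all []Module, req Request) []Module {
	var out []Module
	for _, m := range all {
		if m.Tenant != req.Tenant { // tenant filter
			continue
		}
		if m.Channel != req.Channel { // channel filter
			continue
		}
		if m.PageType != req.PageType { // page resolution (pageType only)
			continue
		}
		if req.Zone != "" && m.Zone != req.Zone { // zone filter, optional
			continue
		}
		if m.Segment != "" && m.Segment != req.Segment { // targeting
			continue
		}
		if p, ok := req.ExpoOverrides[m.ID]; ok { // Expo overlay
			m.Priority = p
		}
		out = append(out, m)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Priority < out[j].Priority })
	return out
}

func main() {
	modules := []Module{
		{ID: "A", Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Zone: "hero", Priority: 2},
		{ID: "B", Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Zone: "hero", Segment: "new", Priority: 1},
		{ID: "C", Tenant: "CA", Channel: "web", PageType: "homepage", Zone: "hero", Priority: 0},
		{ID: "D", Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Zone: "hero", Priority: 3},
	}
	req := Request{
		Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Zone: "hero",
		Segment: "returning", ExpoOverrides: map[string]int{"D": 1},
	}
	for _, m := range selectModules(modules, req) {
		fmt.Println(m.ID, m.Priority)
	}
}
```

With this input, C is dropped by the tenant filter, B by targeting, and the overlay promotes D ahead of A.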
What I shipped
- Contributed to the Go + Echo rewrite of the customer-facing runtime, including the three-tier cache (Ristretto L1 → Meghacache L2 → Cassandra) and the selection algorithm port.
- Helped design the Cassandra schema partitioning by (tenant, channel, pageType) so a single read returns the whole page payload without scatter-gather.
- Validated the Kafka-driven Content Sync ingestion against the legacy periodic full-DB scan using daily checksum + record-count diffs, alerting at > 0.1% drift.
- Implemented the shadow-compare wiring inside the Node service: every request + response is asynchronously POSTed to the Go canary, gated by CCM feature flags, for safe canary ramping.
- Shipped the tenant-group sharded deployment (TG1 / TG2 / TG3) so a bad rollout is bounded by tenant group, not by the whole fleet.