Neelesh K.
Tempo · Runtime · Contributor on Go rewrite + migration · 2025 – present · Active rollout · Customer hot path

Tempo Runtime — Node → Go, Cosmos → Cassandra

Customer hot path — sub-50ms p95, 100K reads/sec/region

p95 latency (Go): 35–45 ms (vs Node 45–50 ms)
Reads/sec/region: 100K (Cassandra sizing target)
Memory ↓: 82% (Go ~180 MB vs Node ~1 GB)
Cold start ↓: 18× (~10s vs ~180s)
Throughput ↑: 20% (~55 TPS/core vs Node 45)
Image size ↓: 80% (Go ~100 MB vs Node ~500 MB)

Tempo Runtime is the customer-facing read service of the Tempo CMS ecosystem. Given tenant + channel + pageType (+ pageId + zone + targeting + Expo variant), it returns the prioritized list of modules that render for that customer at that moment. The service is mid-rewrite from Node + Cosmos to Go + Cassandra.
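To make the lookup concrete, here is a minimal sketch of the filter-then-prioritize step. The `Module` and `Request` types, their field names, and the "higher number wins" priority rule are assumptions for illustration, not the production data model. Targeting evaluation and the Expo overlay are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Module is a hypothetical, simplified view of one renderable module.
type Module struct {
	ID                              string
	Tenant, Channel, PageType, Zone string
	Priority                        int // assumption: higher value renders first
}

// Request carries the lookup inputs; Zone is treated as an optional filter here.
type Request struct {
	Tenant, Channel, PageType, Zone string
}

// Select keeps the modules that match the request and returns them
// ordered by descending priority. Targeting and Expo overlay are not modeled.
func Select(req Request, all []Module) []Module {
	var out []Module
	for _, m := range all {
		if m.Tenant != req.Tenant || m.Channel != req.Channel || m.PageType != req.PageType {
			continue
		}
		if req.Zone != "" && m.Zone != req.Zone {
			continue
		}
		out = append(out, m)
	}
	// Stable sort keeps input order for modules with equal priority.
	sort.SliceStable(out, func(i, j int) bool { return out[i].Priority > out[j].Priority })
	return out
}

func main() {
	modules := []Module{
		{ID: "banner", Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Zone: "top", Priority: 5},
		{ID: "hero", Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Zone: "top", Priority: 9},
		{ID: "other", Tenant: "CA", Channel: "web", PageType: "homepage", Zone: "top", Priority: 10},
	}
	req := Request{Tenant: "WM_GLASS", Channel: "web", PageType: "homepage"}
	for _, m := range Select(req, modules) {
		fmt.Println(m.ID, m.Priority)
	}
}
```

The channel and page type values ("web", "homepage") are placeholders; only the tenant name WM_GLASS comes from the case study.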

The problem

The Node + Cosmos stack was costly and slow to cold-start (about 3 minutes), and its ingestion pipeline was a periodic full-DB scan that took 3–5 minutes to propagate a publish. A complete runtime outage blanks the homepage, so the rewrite had to be canary-driven, dual-write, shadow-compared, and gradually ramped by tenant group, never a flag-day cutover.

Three-tier cache topology

Ristretto L1 (in-process LFU, microseconds) serves the hottest homepage keys without crossing the network. Meghacache L2 (18 nodes across 3 DCs, single-digit ms) catches L1 misses. The Cassandra origin (single-digit to low tens of ms, LOCAL_ONE consistency) is touched only on TTL expiry or a genuine cache miss.
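The lookup order can be sketched as a generic read-through chain. This is an illustration only: the `Tier` interface, `mapTier`, and `ReadThrough` are hypothetical names, the tiers are plain in-memory maps rather than real Ristretto, Meghacache, or Cassandra clients, and TTL expiry is not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMiss signals that a tier does not hold the key.
var ErrMiss = errors.New("cache miss")

// Tier is a hypothetical abstraction over one layer of the lookup chain.
type Tier interface {
	Get(key string) ([]byte, error)
	Set(key string, val []byte)
}

// mapTier is an in-memory stand-in for a real cache or database client.
type mapTier struct {
	name string
	data map[string][]byte
}

func newMapTier(name string) *mapTier {
	return &mapTier{name: name, data: map[string][]byte{}}
}

func (m *mapTier) Get(key string) ([]byte, error) {
	if v, ok := m.data[key]; ok {
		return v, nil
	}
	return nil, ErrMiss
}

func (m *mapTier) Set(key string, val []byte) { m.data[key] = val }

// ReadThrough checks the tiers in order (fastest first). On a hit it
// backfills every faster tier that missed and returns the value together
// with the index of the tier that served it.
func ReadThrough(key string, tiers ...Tier) ([]byte, int, error) {
	for i, t := range tiers {
		val, err := t.Get(key)
		if errors.Is(err, ErrMiss) {
			continue // fall through to the next, slower tier
		}
		if err != nil {
			return nil, i, err
		}
		for j := 0; j < i; j++ {
			tiers[j].Set(key, val)
		}
		return val, i, nil
	}
	return nil, -1, ErrMiss
}

func main() {
	l1, l2, origin := newMapTier("L1"), newMapTier("L2"), newMapTier("origin")
	origin.Set("page:home", []byte(`["hero","carousel"]`))

	_, tier, _ := ReadThrough("page:home", l1, l2, origin)
	fmt.Println("first read served by tier index", tier)
	_, tier, _ = ReadThrough("page:home", l1, l2, origin)
	fmt.Println("second read served by tier index", tier)
}
```

Backfilling on the way out is what lets a single origin read warm both caches, so repeat reads for the same key stay in-process.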

The Meghacache cluster is shared with the legacy Node runtime during cutover, with a version-namespaced key schema so the two services can’t poison each other’s reads.
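A version-namespaced key can be as simple as a runtime-specific prefix. The prefix values and the colon-separated layout below are assumptions for illustration; the actual key schema is not shown in this case study.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical namespace prefixes, one per runtime generation.
const (
	nsNode = "node-v1"
	nsGo   = "go-v2"
)

// CacheKey builds a namespaced L2 key. It assumes none of the
// components contain the ':' separator.
func CacheKey(ns, tenant, channel, pageType string) string {
	return strings.Join([]string{ns, tenant, channel, pageType}, ":")
}

func main() {
	// The same logical page maps to two distinct keys, so a value written
	// by one runtime is never read back by the other.
	fmt.Println(CacheKey(nsNode, "WM_GLASS", "web", "homepage"))
	fmt.Println(CacheKey(nsGo, "WM_GLASS", "web", "homepage"))
}
```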

Cassandra schema designed for one-shot reads

Cosmos was a per-container JSON document store. Cassandra is wide-column with explicit partition and clustering keys. The new schema partitions by (tenant, channel, pageType) and clusters by module version, so a single read returns the whole page payload without scatter-gather across nodes.
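A rough sketch of what such a table and its read could look like, held as CQL strings in Go. The keyspace, table, and column names, the `bigint` version type, and the `blob` payload column are all assumptions; only the partition key (tenant, channel, pageType) and the clustering by module version come from the description above.

```go
package main

import "fmt"

// createTable is hypothetical DDL: a composite partition key keeps one
// page's rows on the same replica set, and module_version orders rows
// inside that partition.
const createTable = `
CREATE TABLE IF NOT EXISTS tempo.page_modules (
    tenant         text,
    channel        text,
    page_type      text,
    module_version bigint,
    payload        blob,
    PRIMARY KEY ((tenant, channel, page_type), module_version)
)`

// selectPage restricts every partition key column, so the query is
// answered from a single partition instead of fanning out.
const selectPage = `SELECT module_version, payload
FROM tempo.page_modules
WHERE tenant = ? AND channel = ? AND page_type = ?`

// PageKey holds the three partition key components.
type PageKey struct {
	Tenant, Channel, PageType string
}

// Args returns the bind values in the order selectPage expects them.
func (k PageKey) Args() []any {
	return []any{k.Tenant, k.Channel, k.PageType}
}

func main() {
	key := PageKey{Tenant: "WM_GLASS", Channel: "web", PageType: "homepage"}
	fmt.Println(createTable)
	fmt.Println(selectPage)
	fmt.Println(key.Args()...)
}
```

With any Cassandra driver, `selectPage` and `key.Args()` would be passed to the driver's query call; that wiring is omitted here to keep the sketch dependency-free.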

Kafka-driven sync replaces periodic full scans

Tempo Service emits to KAFKA-V2-TEMPO-MOD-PROD on every commit. The Content Sync Service consumes that stream and writes to Cassandra (Go runtime) or Cosmos (legacy Node runtime). The propagation budget went from 3–5 minutes (full DB scan) to seconds.
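The event-driven shape can be sketched with a channel standing in for the Kafka topic. `PublishEvent`, `Sink`, `memSink`, and `Sync` are hypothetical names, the event fields are assumed, and the sketch applies every event to every configured sink (a dual-write reading of the description), which may not match how the real service routes writes.

```go
package main

import "fmt"

// PublishEvent is a hypothetical shape for one commit event.
type PublishEvent struct {
	Tenant, Channel, PageType string
	Version                   int64
	Payload                   []byte
}

// Sink is a hypothetical write target (Cassandra or Cosmos in the real system).
type Sink interface {
	Upsert(ev PublishEvent) error
}

// memSink is an in-memory stand-in keyed by tenant/channel/pageType.
type memSink struct {
	rows map[string]PublishEvent
}

func newMemSink() *memSink { return &memSink{rows: map[string]PublishEvent{}} }

func (s *memSink) Upsert(ev PublishEvent) error {
	s.rows[ev.Tenant+"/"+ev.Channel+"/"+ev.PageType] = ev
	return nil
}

// Sync drains the event stream and applies each event to every sink as it
// arrives, instead of rescanning the whole source database on a timer.
// It returns the number of events fully applied and stops on the first error.
func Sync(events <-chan PublishEvent, sinks ...Sink) (int, error) {
	applied := 0
	for ev := range events {
		for _, s := range sinks {
			if err := s.Upsert(ev); err != nil {
				return applied, fmt.Errorf("upsert %s/%s/%s: %w", ev.Tenant, ev.Channel, ev.PageType, err)
			}
		}
		applied++
	}
	return applied, nil
}

func main() {
	events := make(chan PublishEvent, 2)
	events <- PublishEvent{Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Version: 1}
	events <- PublishEvent{Tenant: "WM_GLASS", Channel: "web", PageType: "homepage", Version: 2}
	close(events)

	cassandra, cosmos := newMemSink(), newMemSink()
	n, err := Sync(events, cassandra, cosmos)
	fmt.Println(n, err, len(cassandra.rows), len(cosmos.rows))
}
```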

Tenant-group sharding for bounded blast radius

The Go deployment is split into three tenant groups via separate KITT files — TG1 (WM_GLASS), TG2 (CA + MX), TG3 (SAMs + long-tail). A bad rollout is bounded by tenant group rather than affecting all 38 tenants simultaneously.
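The blast-radius argument can be illustrated with a small lookup. In the real system the grouping lives in separate KITT deployment files, not in application code; the map below, the tenant identifiers other than WM_GLASS, and the TG3 default for unlisted tenants are assumptions made for the sketch.

```go
package main

import "fmt"

// tenantGroup is a hypothetical tenant-to-group mapping.
var tenantGroup = map[string]string{
	"WM_GLASS": "TG1",
	"CA":       "TG2",
	"MX":       "TG2",
	"SAMS":     "TG3",
}

// GroupFor returns the tenant's group, treating any unlisted tenant
// as part of the long-tail group TG3.
func GroupFor(tenant string) string {
	if g, ok := tenantGroup[tenant]; ok {
		return g
	}
	return "TG3"
}

// AffectedTenants lists which of the given tenants a bad rollout to
// one group would touch.
func AffectedTenants(group string, tenants []string) []string {
	var hit []string
	for _, t := range tenants {
		if GroupFor(t) == group {
			hit = append(hit, t)
		}
	}
	return hit
}

func main() {
	all := []string{"WM_GLASS", "CA", "MX", "SAMS", "SOME_SMALL_TENANT"}
	// A bad TG2 rollout touches only the TG2 tenants in this list.
	fmt.Println(AffectedTenants("TG2", all))
}
```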

Shadow-compare migration

The Node service asynchronously POSTs every request + response to the Go canary, gated by CCM feature flags (isCallToGoLangEnabled, CallGoLangByQueryParam, CallGolangCompareTimeOut). A daily checksum + record-count diff job compares Cosmos against Cassandra and alerts when drift exceeds 0.1%. The canary started at uswest-stage-az-301; tenant-group cutover is in progress.
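One way to picture the daily diff job is below. The definition of drift used here (records that are missing on either side or whose checksums differ, divided by the union of record keys) is an assumption; the case study only states the 0.1% alert threshold. `Drift` and `ShouldAlert` are hypothetical names.

```go
package main

import "fmt"

// driftThreshold is the 0.1% alert threshold expressed as a fraction.
const driftThreshold = 0.001

// Drift compares two key -> checksum maps and returns the fraction of
// records that are missing on either side or whose checksums differ.
// The denominator is the number of distinct keys across both stores.
func Drift(cosmos, cassandra map[string]string) float64 {
	bad := 0
	for key, sum := range cosmos {
		if other, ok := cassandra[key]; !ok || other != sum {
			bad++ // missing in Cassandra or checksum mismatch
		}
	}
	extra := 0
	for key := range cassandra {
		if _, ok := cosmos[key]; !ok {
			extra++ // present only in Cassandra
		}
	}
	total := len(cosmos) + extra
	if total == 0 {
		return 0
	}
	return float64(bad+extra) / float64(total)
}

// ShouldAlert reports whether drift is strictly above the threshold.
func ShouldAlert(drift float64) bool {
	return drift > driftThreshold
}

func main() {
	cosmos := map[string]string{"a": "c1", "b": "c2", "c": "c3", "d": "c4"}
	cassandra := map[string]string{"a": "c1", "b": "c2", "c": "XX", "d": "c4"}
	d := Drift(cosmos, cassandra)
	fmt.Printf("drift=%.4f alert=%v\n", d, ShouldAlert(d))
}
```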

Defense in depth at the edge

Echonyx OPUS sits at the edge with auth: resign and explicitly allow-lists only GET /api/v[12]/tempo/layouts; the rest of the runtime surface is denied at the edge. Istio sidecars enforce mTLS for all internal traffic. The selection algorithm (tenant filter → channel filter → page resolution → zone filter → targeting evaluation → Expo overlay) runs inside that hardened perimeter.
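The allow-list rule is easy to express as a deny-by-default check. This sketch is not the Echonyx configuration: it assumes the pattern is anchored to the exact path with no sub-paths or trailing slash, and `Allowed` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// layoutsPath matches only /api/v1/tempo/layouts and /api/v2/tempo/layouts.
// Anchoring with ^ and $ is an assumption about how strict the edge rule is.
var layoutsPath = regexp.MustCompile(`^/api/v[12]/tempo/layouts$`)

// Allowed reports whether a request passes the allow-list.
// Anything that is not an exact GET match is denied.
func Allowed(method, path string) bool {
	return method == "GET" && layoutsPath.MatchString(path)
}

func main() {
	fmt.Println(Allowed("GET", "/api/v1/tempo/layouts"))  // allowed version
	fmt.Println(Allowed("GET", "/api/v3/tempo/layouts"))  // unknown version
	fmt.Println(Allowed("POST", "/api/v2/tempo/layouts")) // wrong method
}
```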

What I shipped

  • Contributed to the Go + Echo rewrite of the customer-facing runtime, including the three-tier cache (Ristretto L1 → Meghacache L2 → Cassandra) and the selection algorithm port.
  • Helped design the Cassandra schema partitioning by (tenant, channel, pageType) so a single read returns the whole page payload without scatter-gather.
  • Validated the Kafka-driven Content Sync ingestion against the legacy periodic full-DB scan with daily checksum + record-count diffs, alerting when drift exceeds 0.1%.
  • Implemented the shadow-compare wiring inside the Node service — every request + response asynchronously POSTed to the Go canary behind a CCM feature flag for safe canary ramping.
  • Worked on the tenant-group sharded deployment (TG1 / TG2 / TG3) so a bad rollout is bounded by tenant group, not by the whole fleet.

Stack

Go 1.24 · Echo · Apache Cassandra · Ristretto (L1) · Meghacache (L2) · Kafka (Content Sync) · OpenTelemetry · Istio · WCNP / KITT (tenant-group sharded) · Echonyx edge