Phase 12
Goal: Introduce a lightweight event bus layer using NATS that future-proofs Cruvero's extension surface without disrupting the Temporal-first architecture. Temporal remains the spine for all workflow orchestration, durability, and state management. NATS operates strictly at the edges — providing async event fan-out, MCP server lifecycle management, and embedding pipeline intake.
Status: Completed (2026-02-09)
Why Now
With Phases 1–11 complete, Cruvero has a fully operational runtime: durable agent loops, multi-agent supervision, memory, causal tracing, cost tracking, sandboxed tools, guardrails, streaming, evaluation, and enterprise hardening. What's missing is a scalable extension seam — a way for external consumers (dashboards, bots, audit pipelines, webhook relays) to react to agent events without polling Temporal queries or coupling to the SSE endpoint.
Current limitations this phase addresses:
- SSE endpoint is single-consumer and ephemeral — the `GET /api/stream` endpoint works for the Web UI but can't serve multiple independent consumers, has no replay, and loses events on disconnect.
- MCP server discovery is static — the `CRUVERO_MCP_SERVERS` env var requires a worker restart to add or remove servers. With 14+ MCP servers and growing, dynamic discovery is overdue.
- Embedding pipeline is synchronous — Phase 8D's embedding work blocks workflow activities while waiting for embedding generation. Batching and backpressure need a queue.
- Audit/telemetry writes are synchronous — Phase 9C audit events and Phase 8C telemetry write directly to Postgres in activities, creating write amplification under load.
- Teams/Telegram bots need event subscriptions — The planned control plane bots need to subscribe to workflow events without polling.
Architecture Principle: Temporal is the Spine, NATS is the Nervous System
This is the critical constraint. NATS must never sit in the deterministic workflow path. Activities can publish to NATS, but workflow completion must never depend on NATS availability. If NATS is down, workflows continue unaffected — events are simply not fanned out until recovery.
┌──────────────────────────────────────────────────────┐
│ Temporal Cluster │
│ ┌────────────┐ ┌────────────┐ ┌────────────────┐ │
│ │ Agent WF │ │ Graph WF │ │ Supervisor WF │ │
│ └─────┬──────┘ └─────┬──────┘ └───────┬────────┘ │
│ │ │ │ │
│ ┌─────▼───────────────▼─────────────────▼────────┐ │
│ │ Worker Activities │ │
│ │ (LLM, Tools, Memory, Traces, Cost) │ │
│ └─────────────────────┬──────────────────────────┘ │
└────────────────────────┼──────────────────────────────┘
│ fire-and-forget publish
▼
┌──────────────────────┐
│ NATS / JetStream │
│ │
│ Subjects: │
│ cruvero.events.* │
│ cruvero.mcp.* │
│ cruvero.embed.* │
│ cruvero.audit.* │
└──────┬───────────────┘
│
┌────────────┼────────────────┐
▼ ▼ ▼
┌─────────┐ ┌──────────┐ ┌──────────────┐
│ Web UI │ │ Telegram │ │ Audit Writer │
│ (SSE) │ │ Bot │ │ (batch) │
└─────────┘ └──────────┘ └──────────────┘
What Gets Refactored (Existing Code)
Since all phases are complete, this phase touches existing code at well-defined seams:
1. Event Publishing in Activities (internal/agent/activities.go)
- After each activity completes (LLM decision, tool execution, observation, trace save), publish an event to NATS.
- Non-blocking, fire-and-forget. If NATS is unavailable, log a warning and continue.
- No changes to workflow determinism or Temporal state.
2. SSE Endpoint Refactor (cmd/ui/main.go)
- Current: SSE polls Temporal queries on interval.
- Refactored: SSE subscribes to the NATS `cruvero.events.<workflow_id>` subject.
- Fallback: If NATS is unavailable, degrade to the current polling behavior.
- Benefit: Real-time events, multiple UI tabs, no polling overhead.
3. MCP Bridge Dynamic Discovery (internal/tools/mcp_bridge.go)
- Current: Static env config loaded once at startup.
- Refactored: MCP servers announce capabilities on `cruvero.mcp.announce` at startup.
- Bridge subscribes and dynamically updates available tools.
- Env config remains as a fallback for servers that don't support NATS announce.
4. Embedding Pipeline (internal/memory/embedding.go)
- Current: Synchronous embedding generation in activity.
- Refactored: Activity publishes the embedding request to the `cruvero.embed.requests` JetStream stream.
- A separate embedding worker consumes, batches, and writes results.
- Activity can optionally wait (with timeout) or fire-and-forget for non-blocking semantic extraction.
5. Audit Event Buffering (internal/audit/)
- Current: Phase 9C writes audit events directly to Postgres per-event.
- Refactored: Publish to the `cruvero.audit.events` JetStream stream.
- A dedicated consumer batches writes to Postgres (configurable batch size and flush interval).
- Guarantees: JetStream acknowledgment ensures no audit event loss.
6. Supervisor Inter-Agent Events (internal/supervisor/)
- Current: Uses Temporal signals for all inter-agent communication.
- No change to signals — they remain the durable communication path.
- Addition: Supervisor publishes observability events (agent delegated, result received, saga step) to NATS for external monitoring.
- External consumers can watch multi-agent orchestration in real-time.
What's Future-Proofed (New Extension Points)
1. Webhook Relay
Any external system can subscribe to `cruvero.events.>` and relay to webhooks. No Cruvero code changes needed — just a NATS consumer.
2. Teams/Telegram Bot Event Subscriptions
Control plane bots subscribe to approval requests, run completions, cost alerts, and error events. No polling, no coupling to Cruvero internals.
3. Multi-Worker Event Coordination
As Cruvero scales to multiple worker instances, NATS provides cross-worker event visibility without shared state.
4. Custom Alerting Pipelines
Operators define NATS consumers that trigger PagerDuty, OpsGenie, or Slack based on event patterns (cost threshold exceeded, agent stuck, tool failure rate spike).
5. External Agent Frameworks
Third-party agents can publish events to NATS subjects and participate in Cruvero's event ecosystem without being Temporal workflows.
6. Replay Event Streams
JetStream persistence means events can be replayed from any point — enabling post-hoc debugging, compliance review, and training data generation.
Sub-Phase Breakdown
This phase is split into 4 sub-phases, each independently deployable and testable. NATS is optional at every stage — CRUVERO_EVENTS_BACKEND=none disables the entire layer.
📡 Phase 12A: Core Event Bus Infrastructure
Goal: Establish the EventBus interface, NATS implementation, and first event publishers.
Duration: 1.5–2 weeks
Deliverables
1. EventBus Interface (internal/events/bus.go)
type EventBus interface {
// Publish fires an event. Non-blocking, best-effort.
Publish(ctx context.Context, subject string, event Event) error
// Subscribe returns a channel of events for the subject pattern.
Subscribe(ctx context.Context, subject string) (<-chan Event, error)
// Request sends a request and waits for a reply (for discovery).
Request(ctx context.Context, subject string, data []byte, timeout time.Duration) ([]byte, error)
// Close tears down connections.
Close() error
}
type Event struct {
ID string `json:"id"`
Type string `json:"type"` // run.started, step.completed, tool.called, etc.
Subject string `json:"subject"` // NATS subject
Source string `json:"source"` // workflow_id or component
Timestamp time.Time `json:"timestamp"`
TenantID string `json:"tenant_id,omitempty"`
RunID string `json:"run_id,omitempty"`
StepIndex int `json:"step_index,omitempty"`
Data json.RawMessage `json:"data"`
}
2. Implementations
- `NATSEventBus` — full NATS + JetStream implementation.
- `NoopEventBus` — silent no-op for disabled mode.
- `LogEventBus` — logs events to the structured logger (useful for debugging without NATS).
3. Event Taxonomy (Subject Hierarchy)
cruvero.events.run.started {run_id, prompt, model}
cruvero.events.run.completed {run_id, steps, cost, duration}
cruvero.events.run.failed {run_id, error, step_index}
cruvero.events.step.decided {run_id, step, tool, decision}
cruvero.events.step.completed {run_id, step, tool, result_summary}
cruvero.events.tool.called {run_id, step, tool, args_hash}
cruvero.events.tool.failed {run_id, step, tool, error}
cruvero.events.tool.repaired {run_id, step, tool, attempt}
cruvero.events.approval.requested {run_id, step, tool, request_id}
cruvero.events.approval.resolved {run_id, step, approved, reason}
cruvero.events.question.asked {run_id, step, question}
cruvero.events.question.answered {run_id, step, answer}
cruvero.events.cost.recorded {run_id, step, cost}
cruvero.events.cost.threshold {run_id, threshold, current}
cruvero.events.memory.saved {run_id, type, key}
cruvero.events.trace.saved {run_id, step}
cruvero.events.control.pause {run_id}
cruvero.events.control.resume {run_id}
cruvero.events.supervisor.* {supervisor events mirror above}
4. Activity Integration
- Add `EventBus` to the activity context (injected via worker setup).
- Instrument `LLMDecideActivity`, `ToolExecuteActivity`, `ObserveActivity`, and `SaveTraceActivity`.
- Each publishes one event after successful completion. Fire-and-forget with error logging.
5. Configuration
CRUVERO_EVENTS_BACKEND=nats|log|none (default: none)
CRUVERO_NATS_URL=nats://localhost:4222
CRUVERO_NATS_CLUSTER_ID=cruvero
CRUVERO_NATS_CREDS_FILE= (optional NKey/JWT auth)
CRUVERO_NATS_TLS=auto|false
CRUVERO_NATS_CONNECT_TIMEOUT=5s
CRUVERO_NATS_RECONNECT_WAIT=2s
CRUVERO_NATS_MAX_RECONNECTS=60
CRUVERO_EVENTS_SUBJECT_PREFIX=cruvero (default: cruvero)
6. Docker Compose Addition
nats:
image: nats:2.10-alpine
command: ["--jetstream", "--store_dir=/data"]
ports:
- "4222:4222"
- "8222:8222" # monitoring
volumes:
- nats-data:/data
deploy:
resources:
limits:
memory: 128M
7. Tests
- Unit: EventBus interface contract tests (run against Noop, Log, and NATS via testcontainer).
- Integration: Publish event from activity mock, verify subscriber receives it.
- Resilience: NATS unavailable — verify activity completes without error.
- Subject validation: Verify taxonomy compliance.
8. Migration
- `migrations/0008_events_config.up.sql` — optional table for event routing rules per tenant.
Files to Add/Change
- `internal/events/bus.go` (interface)
- `internal/events/nats.go` (NATS impl)
- `internal/events/noop.go` (no-op impl)
- `internal/events/log.go` (log impl)
- `internal/events/types.go` (Event, subject constants)
- `internal/events/bus_test.go`
- `internal/agent/activities.go` (add publish calls)
- `internal/config/config.go` (NATS config)
- `cmd/worker/main.go` (wire EventBus)
- `docker-compose.yml` (add NATS service)
- `.env.example` (NATS vars)
- `migrations/0008_events_config.up.sql`
- `migrations/0008_events_config.down.sql`
Exit Criteria
- EventBus interface with 3 implementations
- Activities publish events on completion
- NATS subscriber receives events in real-time
- `CRUVERO_EVENTS_BACKEND=none` disables all publishing with zero overhead
- Worker survives NATS being unavailable
- Docker Compose includes NATS with JetStream
Dependencies
- All prior phases complete
- No breaking changes to any existing workflow or activity
📡 Phase 12B: SSE Refactor & Consumer Patterns
Goal: Refactor the Web UI SSE endpoint to consume from NATS, and establish reusable consumer patterns for external integrations.
Duration: 1–1.5 weeks
Deliverables
1. SSE-over-NATS Bridge (cmd/ui/sse_nats.go)
- Subscribe to `cruvero.events.run.<workflow_id>` (wildcard for all step events).
- Support multiple concurrent SSE clients per workflow (fan-out from single NATS subscription).
- Fallback: If NATS unavailable, degrade to current Temporal query polling.
- Configurable via `CRUVERO_UI_SSE_SOURCE=nats|temporal` (default: `nats` if the events backend is NATS, else `temporal`).
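The NATS-to-SSE conversion and the per-workflow fan-out are mechanical; the sketch below uses hypothetical names (`formatSSE`, `fanout`), not the real cmd/ui code, and assumes single-line JSON payloads.

```go
package main

import (
	"fmt"
	"sync"
)

// formatSSE renders one event in Server-Sent Events wire format: an
// "event:" line naming the type, a "data:" line with the payload, and a
// blank line terminating the event. Assumes data contains no newlines.
func formatSSE(eventType string, data []byte) string {
	return fmt.Sprintf("event: %s\ndata: %s\n\n", eventType, data)
}

// fanout delivers each event to every registered client. A slow client
// drops events rather than stalling the shared NATS subscription.
type fanout struct {
	mu      sync.Mutex
	clients map[chan string]struct{}
}

func newFanout() *fanout {
	return &fanout{clients: make(map[chan string]struct{})}
}

// add registers a new SSE client and returns its buffered channel.
func (f *fanout) add() chan string {
	ch := make(chan string, 16)
	f.mu.Lock()
	f.clients[ch] = struct{}{}
	f.mu.Unlock()
	return ch
}

// broadcast sends msg to every client without blocking on any of them.
func (f *fanout) broadcast(msg string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for ch := range f.clients {
		select {
		case ch <- msg:
		default: // client buffer full: drop rather than block the stream
		}
	}
}
```

Dropping on a full buffer is the observability trade-off: UI event streams are best-effort by design here, while durable consumers go through JetStream instead.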
2. Consumer SDK (internal/events/consumer.go)
type ConsumerConfig struct {
Subject string
Group string // queue group for load balancing
BatchSize int // for batch consumers
FlushInterval time.Duration // max time before flush
MaxPending int // backpressure limit
AckPolicy AckPolicy // explicit, none, all
}
type BatchConsumer interface {
HandleBatch(ctx context.Context, events []Event) error
}
type StreamConsumer interface {
HandleEvent(ctx context.Context, event Event) error
}
- Reusable consumer patterns: single-event handler, batch handler, filtered handler.
- Queue groups for load-balanced consumption (multiple bot instances, multiple audit writers).
- Built-in metrics: events consumed, processing latency, errors.
3. Event Replay (cmd/event-replay)
- CLI tool to replay JetStream events from a point in time.
- Useful for debugging, compliance review, and backfilling consumers that were offline.
go run ./cmd/event-replay --subject "cruvero.events.run.*" --since 2h --consumer my-debug
4. Web UI Enhancements
- Real-time event indicators in run detail view (events arrive instantly vs 1-5s poll).
- Event log panel showing raw events for debugging.
- Connection status indicator (NATS connected/degraded/polling).
5. Tests
- SSE receives events within 100ms of activity publishing.
- Multiple SSE clients receive same event.
- Graceful degradation when NATS disconnects mid-stream.
- Event replay CLI outputs correct events for time range.
Files to Add/Change
- `cmd/ui/sse_nats.go` (NATS-backed SSE)
- `cmd/ui/main.go` (wire NATS SSE source)
- `cmd/ui/ui/index.html` (connection status indicator)
- `internal/events/consumer.go` (consumer SDK)
- `internal/events/consumer_test.go`
- `cmd/event-replay/main.go`
Exit Criteria
- SSE streams events from NATS with < 100ms latency
- Multiple UI tabs receive events simultaneously
- Graceful fallback to Temporal polling when NATS is down
- Event replay CLI works for arbitrary time ranges
- Consumer SDK supports batch and stream patterns
Dependencies
- Phase 12A complete
📡 Phase 12C: MCP Dynamic Discovery & Embedding Pipeline
Goal: Use NATS for MCP server lifecycle management and async embedding generation.
Duration: 1.5–2 weeks
Deliverables
1. MCP Server Announce Protocol
cruvero.mcp.announce — MCP server publishes tool manifest on startup
cruvero.mcp.heartbeat — periodic health signal
cruvero.mcp.deregister — graceful shutdown notification
cruvero.mcp.health.req — request/reply health check from bridge
Announce payload:
{
"server_name": "mcp-github",
"transport": "stdio",
"tools": [
{"name": "github_list_repos", "description": "...", "schema": {...}}
],
"version": "1.2.0",
"capabilities": ["tools", "resources"],
"timestamp": "2026-02-08T10:00:00Z"
}
2. Dynamic MCP Registry Update (internal/tools/mcp_bridge.go)
- Bridge subscribes to `cruvero.mcp.announce`.
- On a new announcement: validate, diff against current tools, update the in-memory tool map.
- Periodic health check via `cruvero.mcp.health.req` (request/reply).
- Stale servers (no heartbeat for 2 intervals) are marked unhealthy; their tools return an error with an explanation.
- Env config (`CRUVERO_MCP_SERVERS`) remains as a static fallback — announce supplements, doesn't replace.
3. MCP Server Sidecar Helper (cmd/mcp-announce)
- Lightweight wrapper that any MCP server can use to announce to NATS.
- Can be used as a process wrapper: `mcp-announce --server mcp-github -- /path/to/mcp-github`.
- Publishes announce on start, a heartbeat every 30s, and deregister on SIGTERM.
4. Async Embedding Pipeline
JetStream stream: CRUVERO_EMBED
cruvero.embed.requests — embedding generation requests
cruvero.embed.results — completed embeddings (for consumers that need them)
cruvero.embed.dlq — dead letter queue for failed embeddings
Request payload:
{
"request_id": "emb-uuid",
"source": "semantic_extract",
"run_id": "run-123",
"text": "User prefers formal tone in reports",
"namespace": "user-prefs",
"metadata": {"fact_id": "fact-456"}
}
5. Embedding Worker (cmd/embed-worker)
- Consumes from the `cruvero.embed.requests` JetStream stream.
- Batches requests (configurable: `CRUVERO_EMBED_BATCH_SIZE=32`, `CRUVERO_EMBED_FLUSH_MS=500`).
- Calls the embedding provider (Phase 8D infrastructure).
- Writes results to vector store (Qdrant/pgvector).
- Publishes completion to `cruvero.embed.results`.
- DLQ for persistent failures after retries.
6. Activity Refactor (internal/memory/embedding.go)
- Current synchronous embedding call replaced with NATS publish.
- Two modes:
  - `async` (default): Publish and continue. The embedding becomes available eventually.
  - `sync`: Publish and wait for the result on `cruvero.embed.results` with a timeout. Falls back to a direct call on timeout.
- Mode configurable: `CRUVERO_EMBED_MODE=async|sync|direct` (`direct` = current behavior, no NATS).
7. Configuration
# MCP Discovery
CRUVERO_MCP_DISCOVERY=static|nats|both (default: static)
CRUVERO_MCP_HEARTBEAT_INTERVAL=30s
CRUVERO_MCP_STALE_THRESHOLD=90s
# Embedding Pipeline
CRUVERO_EMBED_MODE=async|sync|direct (default: direct)
CRUVERO_EMBED_BATCH_SIZE=32
CRUVERO_EMBED_FLUSH_MS=500
CRUVERO_EMBED_DLQ_MAX_RETRIES=3
CRUVERO_EMBED_WORKER_CONCURRENCY=4
8. Tests
- MCP announce → bridge picks up new tools within 5s.
- MCP deregister → tools marked unavailable.
- Stale MCP server → health check fails → tools return error.
- Embedding request → batched → written to vector store.
- Embedding DLQ → failed requests retried, then dead-lettered.
- Direct mode still works when NATS unavailable.
Files to Add/Change
- `internal/tools/mcp_bridge.go` (add NATS discovery)
- `internal/tools/mcp_announce.go` (announce types)
- `internal/memory/embedding.go` (async mode)
- `cmd/mcp-announce/main.go`
- `cmd/embed-worker/main.go`
- `internal/events/streams.go` (JetStream stream definitions)
- `internal/config/config.go` (new config vars)
Exit Criteria
- MCP server dynamically registers tools via NATS announce
- MCP health check detects and marks stale servers
- Embedding requests batched and processed asynchronously
- DLQ captures failed embeddings for retry/investigation
- All features degrade gracefully when NATS unavailable
- Static MCP config continues to work as fallback
Dependencies
- Phase 12A complete
- Phase 8D (embedding infrastructure) complete
📡 Phase 12D: Audit Buffering, Telemetry Pipeline & Ops Tooling
Goal: Move audit and telemetry writes to buffered NATS consumers, add operational tooling for the event bus.
Duration: 1–1.5 weeks
Deliverables
1. Audit Event Buffering
JetStream stream: CRUVERO_AUDIT
- Phase 9C audit events publish to `cruvero.audit.events` instead of direct Postgres writes.
- A dedicated audit consumer batches writes (configurable: `CRUVERO_AUDIT_BATCH_SIZE=50`, `CRUVERO_AUDIT_FLUSH_MS=1000`).
- JetStream `AckExplicit` ensures no audit event is lost — the consumer acks only after a successful Postgres write.
- Retention: configurable stream retention matching the audit retention policy.
- PII detection still runs inline before publishing (PII flags travel with the event).
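The ack-after-write contract can be sketched as follows. `AckableMsg`, `handleAuditBatch`, and `fakeMsg` are illustrative names (the real consumer would operate on JetStream message values, whose `Ack`/`Nak` methods have these shapes).

```go
package main

import "errors"

// errWriteFailed is a sample failure used in the usage example below.
var errWriteFailed = errors.New("postgres write failed")

// AckableMsg is the minimal slice of a JetStream message needed here.
type AckableMsg interface {
	Ack() error
	Nak() error
}

// handleAuditBatch acknowledges each message only after the whole Postgres
// batch insert succeeds. On failure everything is Nak'd so JetStream
// redelivers, and no audit event is lost.
func handleAuditBatch(msgs []AckableMsg, writeBatch func() error) error {
	if err := writeBatch(); err != nil {
		for _, m := range msgs {
			// best-effort Nak: redelivery also happens on ack-wait timeout,
			// so a Nak failure is deliberately not fatal here
			_ = m.Nak()
		}
		return err
	}
	for _, m := range msgs {
		if err := m.Ack(); err != nil {
			return err // already written; a redelivered duplicate is safe to dedupe
		}
	}
	return nil
}

// fakeMsg is a tiny in-memory stand-in for a JetStream message, useful
// for unit-testing the handler without a NATS server.
type fakeMsg struct{ acked, naked bool }

func (m *fakeMsg) Ack() error { m.acked = true; return nil }
func (m *fakeMsg) Nak() error { m.naked = true; return nil }
```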
2. Telemetry Pipeline
cruvero.telemetry.metrics — cost, latency, token usage per step
cruvero.telemetry.spans — lightweight span summaries for non-OTEL consumers
- Activities publish lightweight telemetry events alongside workflow events.
- Consumer writes to Prometheus push gateway, Grafana Loki, or custom sinks.
- Decouples telemetry collection from the OTEL exporter — operators can add custom sinks without modifying worker code.
3. Event Bus Health Dashboard
- New UI page: `/events.html`
- Shows: NATS connection status, stream stats (messages, consumers, lag), and a subject activity heatmap.
- Uses the NATS monitoring endpoint (`http://nats:8222/jsz`) for stream metadata.
4. Event Bus CLI (cmd/event-bus)
go run ./cmd/event-bus status # connection + stream health
go run ./cmd/event-bus streams # list JetStream streams
go run ./cmd/event-bus consumers # list active consumers
go run ./cmd/event-bus subscribe <subject> # live subscribe for debugging
go run ./cmd/event-bus publish <subject> <payload> # manual event injection
go run ./cmd/event-bus purge <stream> # purge stream (with confirmation)
5. Tenant Isolation
- Multi-tenant event isolation via subject prefix: `cruvero.<tenant_id>.events.*`
- JetStream streams can be per-tenant or shared with subject filtering.
- Respects Phase 9A tenant boundaries — consumers only see their tenant's events.
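The subject routing rule reduces to a small pure function; a sketch with an illustrative name (`subjectFor` is not the real internal/events/tenant.go API):

```go
package main

// subjectFor builds an event subject under the documented scheme:
// "cruvero.events.*" normally, "cruvero.<tenant_id>.events.*" when
// CRUVERO_EVENTS_TENANT_ISOLATION is enabled. An empty tenant ID falls
// back to the shared subject space.
func subjectFor(prefix, tenantID, suffix string, tenantIsolation bool) string {
	if tenantIsolation && tenantID != "" {
		return prefix + "." + tenantID + "." + suffix
	}
	return prefix + "." + suffix
}
```

Enforcement then lives in NATS permissions (subscribe grants scoped to `cruvero.<tenant_id>.>`), with this function keeping publishers on the right side of the boundary.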
6. Debian Deployment Update
- Update Debian deployment guide with NATS service configuration.
- NATS memory budget: ~64–128MB (well within 8GB constraint).
- JetStream file store for persistence (vs memory store).
- Systemd unit file for NATS.
7. Configuration
# Audit Buffering
CRUVERO_AUDIT_BUFFER=nats|direct (default: direct)
CRUVERO_AUDIT_BATCH_SIZE=50
CRUVERO_AUDIT_FLUSH_MS=1000
CRUVERO_AUDIT_STREAM_RETENTION=365d
# Telemetry
CRUVERO_TELEMETRY_NATS=true|false (default: false)
CRUVERO_TELEMETRY_SUBJECTS=metrics,spans
# Tenant Isolation
CRUVERO_EVENTS_TENANT_ISOLATION=true|false (default: false)
8. Tests
- Audit events buffered and batch-written (verify batch size and flush timing).
- No audit event lost during consumer restart (JetStream redelivery).
- Tenant A cannot subscribe to Tenant B's events.
- Event bus CLI reports accurate stream statistics.
- Debian deployment: NATS starts, worker connects, events flow.
Files to Add/Change
- `internal/audit/nats_writer.go` (buffered audit consumer)
- `internal/events/telemetry.go` (telemetry publisher)
- `cmd/event-bus/main.go` (ops CLI)
- `cmd/ui/ui/events.html` (event bus dashboard)
- `cmd/ui/main.go` (events API endpoints)
- `internal/events/tenant.go` (tenant subject routing)
- `docs/debian-deploy.md` (NATS section)
Exit Criteria
- Audit events buffered through JetStream with zero loss
- Batch writes reduce Postgres write amplification by 10x+
- Event bus CLI provides operational visibility
- Tenant isolation enforced at subject level
- Debian deployment guide updated with NATS
- All features disabled cleanly with `CRUVERO_EVENTS_BACKEND=none`
Dependencies
- Phase 12A and 12B complete
- Phase 9A (multi-tenancy) and 9C (audit) complete
Timeline
| Sub-Phase | Duration | Cumulative |
|---|---|---|
| 12A: Core Event Bus | 1.5–2 weeks | 2 weeks |
| 12B: SSE Refactor & Consumers | 1–1.5 weeks | 3.5 weeks |
| 12C: MCP Discovery & Embedding Pipeline | 1.5–2 weeks | 5.5 weeks |
| 12D: Audit Buffering & Ops Tooling | 1–1.5 weeks | 7 weeks |
Total estimated duration: 5–7 weeks
12A and 12B can be developed sequentially (B depends on A). 12C and 12D can begin once 12A is complete and run in parallel with 12B.
Resource Impact (8GB Debian Target)
| Component | Memory | Disk | Notes |
|---|---|---|---|
| NATS server | 64–128 MB | Minimal | JetStream file store |
| Embed worker | 32–64 MB | None | Separate process, optional |
| Event consumers | 16–32 MB each | None | Run in-process or separate |
| Total addition | ~128–256 MB | < 1 GB | Fits within 8GB budget |
NATS is one of the lightest messaging systems available. Alpine image is ~18MB. JetStream file store avoids memory pressure for persistence.
Risk Mitigation
| Risk | Mitigation |
|---|---|
| NATS becomes critical path | Strict fire-and-forget in activities. Health checks log warnings, never block. |
| Event ordering | JetStream provides per-subject ordering. Cross-subject ordering not guaranteed (and not needed). |
| Message loss | JetStream with AckExplicit for audit/embed. Best-effort for observability events. |
| Operational complexity | CRUVERO_EVENTS_BACKEND=none disables everything. Zero NATS dependency for core workflows. |
| 8GB memory pressure | NATS server capped at 128MB. JetStream uses file store. Embed worker optional. |
| Schema evolution | Events are JSON with versioned type field. Consumers handle unknown types gracefully. |
Environment Variables Summary
# Core
CRUVERO_EVENTS_BACKEND=nats|log|none
CRUVERO_NATS_URL=nats://localhost:4222
CRUVERO_NATS_CLUSTER_ID=cruvero
CRUVERO_NATS_CREDS_FILE=
CRUVERO_NATS_TLS=auto|false
CRUVERO_NATS_CONNECT_TIMEOUT=5s
CRUVERO_NATS_RECONNECT_WAIT=2s
CRUVERO_NATS_MAX_RECONNECTS=60
CRUVERO_EVENTS_SUBJECT_PREFIX=cruvero
# MCP Discovery
CRUVERO_MCP_DISCOVERY=static|nats|both
CRUVERO_MCP_HEARTBEAT_INTERVAL=30s
CRUVERO_MCP_STALE_THRESHOLD=90s
# Embedding Pipeline
CRUVERO_EMBED_MODE=async|sync|direct
CRUVERO_EMBED_BATCH_SIZE=32
CRUVERO_EMBED_FLUSH_MS=500
CRUVERO_EMBED_DLQ_MAX_RETRIES=3
CRUVERO_EMBED_WORKER_CONCURRENCY=4
# Audit Buffering
CRUVERO_AUDIT_BUFFER=nats|direct
CRUVERO_AUDIT_BATCH_SIZE=50
CRUVERO_AUDIT_FLUSH_MS=1000
# Telemetry
CRUVERO_TELEMETRY_NATS=true|false
# Tenant Events
CRUVERO_EVENTS_TENANT_ISOLATION=true|false
Success Metrics
| Metric | Target |
|---|---|
| Event publish latency (activity overhead) | < 1ms p99 |
| SSE delivery latency (event → browser) | < 100ms p99 |
| Audit event loss rate | 0% (JetStream guarantee) |
| MCP discovery latency (announce → available) | < 5s |
| Embedding pipeline throughput | 100+ embeddings/sec batched |
| Worker impact when NATS down | Zero (no errors, no latency change) |
| Memory overhead (NATS server) | < 128MB |
| Test coverage | >= 80% for internal/events/ (enforced by scripts/check-coverage.sh) |
Code Quality Requirements (SonarQube)
All Go code produced by Phase 12 prompts must pass SonarQube quality gates. Each PROMPT.md file includes these constraints in every prompt's Constraints section:
- Error handling: Every returned error must be handled explicitly — never ignore with `_`
- Cyclomatic complexity: Keep functions focused and small; extract complex logic into well-named helpers. Functions under 50 lines where practical
- No dead code: No unused variables, empty blocks, or duplicated logic
- Resource cleanup: Close all resources (DB rows, response bodies, NATS connections) with proper `defer` patterns
- Early returns: Avoid deeply nested conditionals — prefer guard clauses
- No magic values: Use named constants for strings and numbers
- No hardcoded secrets: All credentials via env vars
- Meaningful names: Descriptive variable and function names
- Linting gate: Run `go vet ./internal/events/...`, `staticcheck ./internal/events/...`, and `golangci-lint run ./internal/events/...` before considering the prompt complete — these catch most issues SonarQube flags
- Test coverage: 80%+ coverage target for `internal/events/`
Each sub-phase Exit Criteria section includes:
- [ ] `go vet ./internal/events/...` reports no issues
- [ ] `staticcheck ./internal/events/...` reports no issues
- [ ] No functions exceed 50 lines (extract helpers as needed)
- [ ] All returned errors are handled (no `_ = err` patterns)
Open Questions
- Core NATS everywhere, or JetStream everywhere? — JetStream adds persistence but also complexity. Recommendation: JetStream for audit and embedding (must not lose events), core NATS for observability events (best-effort is fine).
- Embedded NATS? — NATS can run embedded in the Go worker process. Simpler deployment but less isolation. Recommendation: External process for production, embedded option for single-node dev.
- Event schema registry? — Should events have a formal schema registry? Recommendation: Start with documented JSON conventions, add formal schemas if consumer ecosystem grows.
- NATS auth model? — NKey auth, JWT auth, or none? Recommendation: None for dev, NKey for production single-tenant, JWT for multi-tenant.
Relationship to Other Phases
| Phase | Relationship |
|---|---|
| Phase 8C (Observability) | 12D adds NATS-based telemetry pipeline as alternative sink |
| Phase 8D (Embeddings) | 12C decouples embedding generation into async pipeline |
| Phase 9A (Multi-tenancy) | 12D adds tenant-scoped event isolation |
| Phase 9C (Audit) | 12D buffers audit writes through JetStream |
| Phase 10 (Neuro) | Event bus enables real-time metacognitive monitoring |
| Phase 11C (Streaming) | NATS can carry LLM token streams to UI |
| Phase 11G (Reward Learning) | Reward signals can be published as events |
| Teams/Telegram Bots | Bots subscribe to events instead of polling |