Phase 24 — Enterprise Context Management
Closes the remaining high-ROI context optimization gaps: accurate tokenization, provider-level prompt caching, observation masking, tool schema compression, multi-turn conversation, rolling summaries, per-tenant budget policies, proactive compression, a composable context pipeline, OTel monitoring, waste detection, and serialization optimization. Twelve improvements organized into three sequential tiers — each tier's items are independent and parallelizable within the tier.
Status: Planned
Depends on: Phases 1-19 complete (tool quality scoring from Phase 19 feeds waste detection)
Migrations: None (all state in AgentState/TenantConfig.Metadata)
Branch: dev
Why Now
With Phases 1-19 complete, Cruvero has phase-aware budgets, 5-component salience scoring, semantic tool search, and multi-tenant isolation. But several high-ROI optimizations remain:
- Inaccurate tokenization — Heuristic chars-per-token ratios (±15-20% error) waste ~25K tokens on a 128K window.
- No prompt caching — All major LLM APIs support it; Pricing.InputCacheRead/InputCacheWrite fields exist but are never used.
- Full observation retention — Tool outputs stay in context long after the LLM has acted on them.
- Verbose tool schemas — 100-1,000 bytes per tool; 20+ tools consume 5-15% of context budget.
- Single-turn interaction — Fresh {system, user} pair every step; no prior reasoning available.
- One-shot summarization — Replaces the previous summary entirely, losing older information.
- Hardcoded budget percentages — No per-tenant customization.
- Reactive-only compression — Only triggers after overflow.
Architecture
Current Context Assembly Pipeline
```
buildDecisionPrompts() (activity_llm_prompt.go)
└─ DetectPhase(stepIndex, maxSteps) → planning|executing|reviewing
└─ AllocateBudget(totalTokens, systemTokens, phase)
│  └─ phasePercentages(phase) → hardcoded % per section per phase
│  └─ returns ContextBudget with per-section caps
└─ buildContextAssemblerInput() → gathers episodes, memories, tools
└─ AssembleContext(input, budget, model) → deterministic section ordering
│  └─ [SYSTEM] → [PROCEDURES] → [AVAILABLE_TOOLS] → [WORKING_MEMORY] → ...
│  └─ enforcePromptTokenCap() → reactive truncation on overflow
└─ returns []llm.Message{system, user}
```
Target Pipeline (After Phase 24)
```
ContextPipeline.Execute(state)
├─ stageDetectPhase → planning|executing|reviewing
├─ stageAllocateBudget → per-section token budgets (with tenant overrides)
├─ stageMaskObservations → replace old observations with one-line refs
├─ stageCompressSchemas → minify/truncate/aggressive schema compression
├─ stageBuildConversation → sliding window multi-turn (if enabled)
├─ stageAssembleContext → deterministic section assembly
└─ stageProactiveCompression → utilization check + escalating compression
```
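The masking stage above can be sketched as follows. The `maskObservations` helper name and the one-line reference format are illustrative assumptions; the real stage operates on assembled context sections rather than a plain string slice.

```go
package main

import "fmt"

// maskObservations keeps the newest `window` observations verbatim and
// replaces older ones with a one-line reference, as stageMaskObservations
// does conceptually. Full text remains in the episodic store (see Risk
// Mitigation), so masking is assembly-time only.
func maskObservations(obs []string, window int) []string {
	out := make([]string, len(obs))
	for i, o := range obs {
		if i < len(obs)-window {
			out[i] = fmt.Sprintf("[obs %d masked: %d chars; full text in episodic store]", i, len(o))
		} else {
			out[i] = o
		}
	}
	return out
}

func main() {
	obs := []string{"huge tool output ...", "grep results ...", "latest status"}
	for _, line := range maskObservations(obs, 2) {
		fmt.Println(line)
	}
}
```

With `CRUVERO_OBSERVATION_MASK_WINDOW=2`, only the two most recent observations survive verbatim.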
Three-Tier Implementation
Tier 1 — Highest ROI (Sub-phase A, independent/parallel):
1.1 Accurate Go-Native Tokenizer
1.2 Prompt Caching (All 5 Providers)
1.3 Observation Masking
1.4 Tool Schema Compression
Tier 2 — Significant Value (Sub-phase B, depends on Tier 1 tokenizer):
2.1 Multi-Turn Conversation Builder
2.2 Rolling Anchored Summaries
2.3 Per-Tenant Context Budget Policies
2.4 Proactive Compression Triggers
Tier 3 — Polish (Sub-phase C, depends on Tier 2):
3.1 Context Pipeline as Middleware
3.2 OTel Token Monitoring
3.3 Context Waste Detection
3.4 Serialization Optimization
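Item 2.4's escalation loop can be sketched as below. The `proactiveCompress` name, the callback shape, and the halving strategy are hypothetical; the real stage applies the masking/truncation strategies listed elsewhere in this plan and rechecks utilization after each one.

```go
package main

import "fmt"

const compressionThreshold = 0.85 // mirrors the CRUVERO_COMPRESSION_THRESHOLD default

// proactiveCompress escalates through compression strategies, rechecking
// utilization after each, and stops as soon as the context fits under the
// threshold. `strategies` are hypothetical callbacks returning the new
// token count after one strategy is applied.
func proactiveCompress(used, limit int, strategies []func(int) int) int {
	for _, apply := range strategies {
		if float64(used)/float64(limit) < compressionThreshold {
			break
		}
		used = apply(used)
	}
	return used
}

func main() {
	halve := func(n int) int { return n / 2 }
	// 120K of a 128K window is ~94% utilization, so one strategy fires.
	fmt.Println(proactiveCompress(120_000, 128_000, []func(int) int{halve, halve})) // 60000
}
```

Because utilization is rechecked before each step, a run already under the threshold pays nothing, which is the mitigation listed for over-aggressive compression.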
Competitive Comparison
| Capability | LangChain/LangGraph | Cruvero (Current) | Cruvero (After Phase 24) |
|---|---|---|---|
| Budget allocation | None | Phase-aware (plan/execute/review) | Phase-aware + per-tenant overrides |
| Memory ranking | FIFO or naive vector | 5-component salience scoring | Same + rolling anchored summaries |
| Multi-tenancy | None | Full (quotas, models, tools) | + per-tenant context policies |
| Prompt caching | Basic | None | Anthropic explicit + OpenAI auto |
| Tokenization | tiktoken (Python) | Heuristic (±15-20%) | BPE via tiktoken-go (MIT, pure Go) |
| Observation masking | None | None | JetBrains-validated masking |
| Tool selection | All tools every time | Semantic search (Phase 19) | + schema compression + waste tracking |
| Conversation state | Manual checkpointing | Temporal-native durability | + sliding window multi-turn |
| Proactive compression | None | None | Configurable utilization triggers |
Core Types and Interfaces
```go
// Tokenizer counts tokens for a given text.
type Tokenizer interface {
    CountTokens(text string) int
}

type BPETokenizer struct {
    enc *tiktoken.Tiktoken
}

type HeuristicTokenizer struct {
    charsPerToken float64
}
```
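A runnable sketch of the heuristic fallback, assuming a default ratio of 4.0 chars per token (a common English-text approximation): it ignores word boundaries and BPE merges entirely, which is where the ±15-20% error comes from. The BPE path delegates to tiktoken-go and is omitted here since it needs the vendored encodings.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// HeuristicTokenizer estimates tokens from a fixed chars-per-token ratio.
type HeuristicTokenizer struct {
	charsPerToken float64
}

// CountTokens rounds runes/ratio to the nearest integer. The 4.0 default
// is an assumption for this sketch, not the production constant.
func (t HeuristicTokenizer) CountTokens(text string) int {
	ratio := t.charsPerToken
	if ratio <= 0 {
		ratio = 4.0
	}
	return int(float64(utf8.RuneCountInString(text))/ratio + 0.5)
}

func main() {
	tok := HeuristicTokenizer{charsPerToken: 4.0}
	fmt.Println(tok.CountTokens("context engineering pays for itself")) // 9
}
```

Swapping CRUVERO_TOKENIZER_MODE to bpe replaces this estimate with an exact count from tiktoken-go.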
```go
type CompressionLevel int

const (
    CompressionNone CompressionLevel = iota
    CompressionMinify
    CompressionTruncate
    CompressionAggressive
)

func CompressSchema(raw json.RawMessage, level CompressionLevel) json.RawMessage
```
```go
type ConversationTurn struct {
    StepIndex int    `json:"step_index"`
    Assistant string `json:"assistant"`
    User      string `json:"user"`
}

type ConversationBuilder struct {
    Window int
    Model  string
}
```
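The sliding-window behavior of ConversationBuilder can be sketched as below. The `Append` method name is an assumption for illustration; the point is that history is bounded by Window before it ever reaches Temporal state (see Risk Mitigation on ContinueAsNew).

```go
package main

import "fmt"

type ConversationTurn struct {
	StepIndex int
	Assistant string
	User      string
}

type ConversationBuilder struct {
	Window int
}

// Append adds a turn, then trims to the most recent Window turns.
func (b ConversationBuilder) Append(history []ConversationTurn, t ConversationTurn) []ConversationTurn {
	history = append(history, t)
	if b.Window > 0 && len(history) > b.Window {
		history = history[len(history)-b.Window:]
	}
	return history
}

func main() {
	b := ConversationBuilder{Window: 2}
	var h []ConversationTurn
	for i := 0; i < 4; i++ {
		h = b.Append(h, ConversationTurn{StepIndex: i})
	}
	fmt.Println(len(h), h[0].StepIndex, h[1].StepIndex) // 2 2 3
}
```

With the default CRUVERO_CONVERSATION_WINDOW=5, only the last five turns are replayed into the prompt.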
```go
type BudgetOverride struct {
    Working    int `json:"working,omitempty"`
    Episodic   int `json:"episodic,omitempty"`
    Semantic   int `json:"semantic,omitempty"`
    Tools      int `json:"tools,omitempty"`
    Procedural int `json:"procedural,omitempty"`
    Reserved   int `json:"reserved,omitempty"`
}

type ContextPolicy struct {
    PhaseOverrides map[TaskPhase]BudgetOverride `json:"phase_overrides,omitempty"`
}
```
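A likely merge semantic, given the omitempty tags: only non-zero override fields replace the base budget, so a tenant can tune one section without restating the rest. ContextBudget is trimmed to three sections here for the sketch; `applyOverride` is a hypothetical helper name.

```go
package main

import "fmt"

// ContextBudget trimmed to three sections for illustration.
type ContextBudget struct {
	Working, Episodic, Tools int
}

type BudgetOverride struct {
	Working, Episodic, Tools int
}

// applyOverride copies only non-zero override fields onto the base budget.
func applyOverride(b ContextBudget, o *BudgetOverride) ContextBudget {
	if o == nil {
		return b
	}
	if o.Working > 0 {
		b.Working = o.Working
	}
	if o.Episodic > 0 {
		b.Episodic = o.Episodic
	}
	if o.Tools > 0 {
		b.Tools = o.Tools
	}
	return b
}

func main() {
	base := ContextBudget{Working: 8000, Episodic: 6000, Tools: 4000}
	fmt.Println(applyOverride(base, &BudgetOverride{Tools: 2000})) // {8000 6000 2000}
}
```

A nil override (no ContextPolicy entry for the current phase) leaves the phase-percentage defaults untouched.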
```go
type ContextStage func(ctx context.Context, input *ContextPipelineState) error

type ContextPipeline struct {
    stages []ContextStage
}
```
```go
type AssembledContext struct {
    // ... existing fields ...
    IncludedTools []string `json:"included_tools"`
    WastedTools   []string `json:"wasted_tools,omitempty"`
    WasteRatio    float64  `json:"waste_ratio,omitempty"`
}
```
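A natural reading of WasteRatio, assuming wasted tools are the subset of included tools the run never invoked (this interpretation is not spelled out above):

```go
package main

import "fmt"

// wasteRatio: share of tools included in context that the run never invoked.
// Assumes wasted is a subset of included, matching the fields above.
func wasteRatio(included, wasted []string) float64 {
	if len(included) == 0 {
		return 0
	}
	return float64(len(wasted)) / float64(len(included))
}

func main() {
	included := []string{"search", "read_file", "write_file", "shell"}
	wasted := []string{"write_file"}
	fmt.Println(wasteRatio(included, wasted)) // 0.25: 1 of 4 tools unused
}
```

Under this definition the < 30% success metric means fewer than a third of injected tool schemas go unused per run.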
Sub-Phases
| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 24A | Tier 1: Highest ROI | 4 | — |
| 24B | Tier 2: Significant Value | 4 | 24A |
| 24C | Tier 3: Polish | 4 | 24B |
Total: 3 sub-phases, 12 prompts, 7 documentation files
Dependency Graph
24A (Tier 1) → 24B (Tier 2) → 24C (Tier 3)
Strictly sequential: each tier builds on the previous. Items within a tier are independent and parallelizable.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| CRUVERO_TOKENIZER_MODE | bpe | Token counting mode: bpe or heuristic |
| CRUVERO_PROMPT_CACHE_ENABLED | true | Enable provider-level prompt caching |
| CRUVERO_OBSERVATION_MASK_ENABLED | true | Enable observation masking for consumed outputs |
| CRUVERO_OBSERVATION_MASK_WINDOW | 2 | Number of recent full observations to keep |
| CRUVERO_TOOL_SCHEMA_COMPRESSION | truncate | Schema compression level: none, minify, truncate, aggressive |
| CRUVERO_CONVERSATION_ENABLED | false | Enable multi-turn conversation builder |
| CRUVERO_CONVERSATION_WINDOW | 5 | Max conversation turns in sliding window |
| CRUVERO_SUMMARY_MODE | oneshot | Summary mode: rolling or oneshot |
| CRUVERO_SUMMARY_MAX_BULLETS | 5 | Max bullets in rolling summary |
| CRUVERO_COMPRESSION_THRESHOLD | 0.85 | Utilization ratio trigger for proactive compression |
| CRUVERO_CONTEXT_WASTE_TRACKING | false | Enable context waste detection metrics |
All settings are overridable per tenant via TenantConfig.Metadata, using the variable name lowercased and without the CRUVERO_ prefix (e.g., tokenizer_mode, observation_mask_window).
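The resolution order described above can be sketched as a single lookup helper; `resolveSetting` and its signature are illustrative, not the real config API.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveSetting applies the override order: tenant metadata (lowercased
// key without the CRUVERO_ prefix) wins, then the environment variable,
// then the built-in default.
func resolveSetting(meta map[string]string, envKey, def string) string {
	metaKey := strings.ToLower(strings.TrimPrefix(envKey, "CRUVERO_"))
	if v, ok := meta[metaKey]; ok && v != "" {
		return v
	}
	if v := os.Getenv(envKey); v != "" {
		return v
	}
	return def
}

func main() {
	meta := map[string]string{"tokenizer_mode": "heuristic"}
	fmt.Println(resolveSetting(meta, "CRUVERO_TOKENIZER_MODE", "bpe"))
	fmt.Println(resolveSetting(nil, "CRUVERO_OBSERVATION_MASK_WINDOW", "2"))
}
```

Tenant metadata taking precedence over the environment is what makes the env table per-tenant overridable without redeploying.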
Files Overview
New Files
| File | Sub-Phase | Description |
|---|---|---|
| internal/registry/schema_compressor.go | 24A | CompressSchema() with 3 compression levels (~80 lines) |
| internal/agent/conversation.go | 24B | ConversationBuilder with sliding window (~120 lines) |
| internal/agent/context_pipeline.go | 24C | ContextPipeline with ordered ContextStage functions (~100 lines) |
Modified Files
| File | Sub-Phase | Change |
|---|---|---|
| internal/agent/tokenizer.go | 24A | Add Tokenizer interface, BPETokenizer, keep HeuristicTokenizer |
| internal/llm/anthropic.go | 24A | Content blocks + cache_control markers |
| internal/llm/openai_chat.go | 24A | Parse cached_tokens from response |
| internal/llm/google.go | 24A | Add CachedContent + cache manager |
| internal/llm/openrouter.go | 24A | cache_control hints for Anthropic-backed models |
| internal/llm/types.go | 24A | Add CacheReadTokens, CacheWriteTokens to Usage |
| internal/agent/activity_llm.go | 24A, 24B | Cache metrics, conversation builder, budget overrides |
| internal/agent/activity_llm_prompt.go | 24A | Observation masking |
| internal/agent/context_assembler.go | 24A, 24B, 24C | Schema compression, proactive compression, waste tracking, serialization |
| internal/agent/state.go | 24B | Add ConversationHistory to AgentState |
| internal/agent/activity_observe.go | 24B | Rolling incremental summaries |
| internal/agent/activity_memory.go | 24B | Accept existing summary in rolling mode |
| internal/agent/context_budget.go | 24B | Accept optional *BudgetOverride |
| internal/tenant/config.go | 24B | Add ContextPolicy struct |
New Dependency
| Library | License | Purpose |
|---|---|---|
| github.com/pkoukk/tiktoken-go | MIT | BPE tokenization (cl100k_base, o200k_base) |
Success Metrics
| Metric | Target |
|---|---|
| Tokenizer accuracy | ±2% vs reference (OpenAI tokenizer playground) |
| Prompt cache hit rate | > 50% for multi-step runs |
| Observation masking token savings | 30-60% on steps with stale observations |
| Schema compression token savings | 20-50% reduction in [AVAILABLE_TOOLS] section |
| Conversation builder coherence | LLM references prior reasoning in 80%+ of multi-step runs |
| Rolling summary information retention | Critical facts preserved across 5+ summarization rounds |
| Proactive compression trigger rate | < 10% of assemblies need reactive overflow truncation |
| Context waste ratio | < 30% wasted tools across runs |
| Test coverage | >= 80% for internal/agent/ and internal/registry/ |
Code Quality Requirements (SonarQube)
All Go code produced by Phase 24 prompts must pass SonarQube quality gates:
- Error handling: Every returned error must be handled explicitly
- Cyclomatic complexity: Keep functions small and simple — under 50 lines where practical
- No dead code: No unused variables, empty blocks, or duplicated logic
- Resource cleanup: Close all resources with proper defer patterns
- Early returns: Prefer guard clauses over deeply nested conditionals
- No magic values: Use named constants for strings and numbers
- Meaningful names: Descriptive variable and function names
- Linting gate: Run go vet, staticcheck, and golangci-lint run before considering the prompt complete
Each sub-phase Exit Criteria section includes:
- [ ] go vet ./internal/agent/... reports no issues
- [ ] staticcheck ./internal/agent/... reports no issues
- [ ] No functions exceed 50 lines (extract helpers as needed)
- [ ] All returned errors are handled (no _ = err patterns)
Risk Mitigation
| Risk | Mitigation |
|---|---|
| tiktoken-go adds new dependency | MIT license, pure Go, no CGo. Fallback to heuristic mode via env var. |
| Prompt caching breaks provider APIs | Feature-flagged per provider. Existing request format preserved when disabled. |
| Observation masking loses critical info | Masking at assembly time only — full observations always preserved in episodic store. |
| Conversation history grows Temporal state | Sliding window bounded; trimmed before ContinueAsNew. |
| Schema compression breaks tool schemas | All compression levels preserve JSON Schema validity. Table-driven tests validate. |
| Proactive compression too aggressive | Escalation order is incremental; each strategy rechecks utilization before the next. |
Relationship to Other Phases
| Phase | Relationship |
|---|---|
| Phase 10B (Salience + Context Budget) | 24A builds on existing context_budget.go and context_assembler.go |
| Phase 19 (Tool Registry Restructure) | 24C waste detection feeds Phase 19's tool quality scoring |
| Phase 25 (MCP Enterprise Architecture) | Orthogonal — context management sits in agent/LLM layer, not MCP transport |
| Phase 17 (PII Guard) | PII detection applies to context content via existing boundaries; no Phase 24 interaction |
Progress Notes
(none yet)