
Phase 19 — Tool Registry Restructure

Upgrades the tool registry from keyword-based discovery to semantic vector search with quality tracking. Introduces per-tool quality metrics, LLM-based auto-rating after tool execution, degradation alerting with quarantine escalation, and a feedback CLI. Builds on existing infrastructure: tool_retry_stats (migration 0005/0013), tool_quarantine (migration 0020), internal/vectorstore/, and internal/embedding/.

Status: Not Started
Depends on: Phases 1-14 complete
Migrations: 0026_tool_metrics (Phase 19A)
Branch: dev


Why Now

With Phases 1-14 complete, Cruvero has a functional tool registry with keyword-based discovery and basic retry statistics — but tool selection and quality management have three structural problems:

  1. Keyword-only discovery: filterRegistryForPrompt in internal/agent/workflow.go uses token scoring and exact name matching to select tools. This misses semantic relationships: a prompt about "send notification" cannot discover tools named "email_dispatch" or "slack_post" unless those exact words appear.
  2. No quality signal: tool_retry_stats (migration 0005) records binary success/failure counters per tool, but there is no measure of output quality. A tool that succeeds on every call yet returns low-quality results appears identical to one that returns high-quality results.
  3. No degradation alerting: tool_quarantine (migration 0020) provides binary quarantine from the immune system, but there is no graduated degradation awareness. A tool trending toward failure has no pre-quarantine warning mechanism.

Phase 19 solves all three by introducing semantic vector search for tool discovery, quality scoring with LLM auto-rating, and degradation alerting that integrates with the existing quarantine path.


Architecture

Extended registry package: internal/registry/

All new tool quality and search functionality lives in the existing internal/registry/ package. No new package is created — this extends the existing Store, ToolDefinition, and supporting types.

┌──────────────────────────────────────────────────────────────────┐
│                      registry.ToolSearcher                       │
│                                                                  │
│  ┌───────────────────┐  ┌──────────────────┐  ┌──────────────┐   │
│  │ Vector Retrieval  │  │ Quality          │  │ Result       │   │
│  │ (embed + search   │  │ Re-Ranking       │  │ Assembly     │   │
│  │  tool desc)       │  │ (quality score   │  │ (merge,      │   │
│  │                   │  │  + recency +     │  │  format)     │   │
│  │                   │  │  success rate)   │  │              │   │
│  └─────────┬─────────┘  └────────┬─────────┘  └──────┬───────┘   │
│            │                     │                   │           │
│            └─────────┬───────────┘                   │           │
│                      │                               │           │
│              3-Stage Pipeline                        │           │
│                                                      │           │
│  External deps (reused, not owned):                  │           │
│  ├─ internal/embedding/Embedder                      │           │
│  ├─ internal/vectorstore/VectorStore (collection:    │           │
│  │    "tool_registry")                               │           │
│  └─ internal/tenant/ (multi-tenant isolation)        │           │
│                                                      │           │
│  Existing tables (extended, not replaced):           │           │
│  ├─ tool_retry_stats (migration 0005/0013)           │           │
│  └─ tool_quarantine (migration 0020)                 │           │
└──────────────────────────────────────────────────────────────────┘

Core API

// MetricsStore tracks mutable quality signals for tools.
type MetricsStore interface {
	RecordExecution(ctx context.Context, toolName string, outcome ExecutionOutcome) error
	RecordFeedback(ctx context.Context, toolName string, feedback ToolFeedback) error
	GetMetrics(ctx context.Context, toolName string) (ToolMetrics, error)
	ListDegraded(ctx context.Context, threshold float64) ([]ToolMetrics, error)
}

// ToolSearcher finds tools by semantic similarity + quality ranking.
type ToolSearcher interface {
	Search(ctx context.Context, query string, k int, filter *ToolSearchFilter) ([]ScoredTool, error)
}

// ToolIndexer manages vector embeddings for tool descriptions.
type ToolIndexer interface {
	IndexTool(ctx context.Context, tool ToolDefinition) error
	IndexRegistry(ctx context.Context, reg ToolRegistry) error
	RemoveTool(ctx context.Context, toolName string) error
}

Key Types

type ExecutionOutcome struct {
	ToolName   string  `json:"tool_name"`
	RunID      string  `json:"run_id"`
	StepIdx    int     `json:"step_idx"`
	Success    bool    `json:"success"`
	LatencyMs  int64   `json:"latency_ms"`
	LLMRating  float64 `json:"llm_rating"` // 0.0-1.0, from post-execution assessment
	ErrorClass string  `json:"error_class,omitempty"`
	TenantID   string  `json:"tenant_id"`
}

type ToolFeedback struct {
	ToolName string  `json:"tool_name"`
	UserID   string  `json:"user_id"`
	Rating   float64 `json:"rating"` // 0.0-1.0
	Comment  string  `json:"comment,omitempty"`
	TenantID string  `json:"tenant_id"`
}

type ToolMetrics struct {
	ToolName      string     `json:"tool_name"`
	TenantID      string     `json:"tenant_id"`
	TotalCalls    int        `json:"total_calls"`
	SuccessCount  int        `json:"success_count"`
	FailureCount  int        `json:"failure_count"`
	AvgLatencyMs  float64    `json:"avg_latency_ms"`
	AvgLLMRating  float64    `json:"avg_llm_rating"`
	QualityScore  float64    `json:"quality_score"` // composite: success_rate * avg_llm_rating
	LastCalledAt  time.Time  `json:"last_called_at"`
	DegradedSince *time.Time `json:"degraded_since,omitempty"`
}

type ScoredTool struct {
	Tool       ToolDefinition  `json:"tool"`
	Score      float64         `json:"score"`
	Components ScoreComponents `json:"components"`
}

type ScoreComponents struct {
	Similarity float64 `json:"similarity"`
	Quality    float64 `json:"quality"`
	Recency    float64 `json:"recency"`
}

type ToolSearchFilter struct {
	TenantID     string   `json:"tenant_id,omitempty"`
	ExcludeNames []string `json:"exclude_names,omitempty"`
	MinQuality   float64  `json:"min_quality,omitempty"`
}

Search Pipeline

Three-stage pipeline, same pattern as Phase 18's prompt search:

Stage 1: Vector Retrieval

  1. Embed query text using embedding.Embedder.Embed() (internal/embedding/embedder.go:22)
  2. Search tool_registry collection via vectorstore.VectorStore.Search() (internal/vectorstore/store.go:35)
  3. Apply tenant isolation filter
  4. Retrieve top-K candidates (default K=30)
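The retrieval steps above can be sketched with an in-memory stand-in for the vector store. This is a minimal illustration, not the actual internal/vectorstore API: the candidate and hit types, cosine, and topK are hypothetical names, and the query vector is assumed to have already been produced by the embedder.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// candidate is a hypothetical stand-in for one record in the tool_registry collection.
type candidate struct {
	Name     string
	TenantID string
	Vector   []float64
}

// hit pairs a tool name with its similarity to the query.
type hit struct {
	Name       string
	Similarity float64
}

// cosine returns the cosine similarity of a and b.
// It returns 0 for mismatched lengths, empty vectors, or zero-norm vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK keeps only the tenant's records, scores them against the query,
// and returns at most k hits ordered by descending similarity.
func topK(query []float64, tenantID string, items []candidate, k int) []hit {
	hits := make([]hit, 0, len(items))
	for _, it := range items {
		if it.TenantID != tenantID {
			continue // tenant isolation filter
		}
		hits = append(hits, hit{Name: it.Name, Similarity: cosine(query, it.Vector)})
	}
	sort.SliceStable(hits, func(i, j int) bool {
		return hits[i].Similarity > hits[j].Similarity
	})
	if k >= 0 && len(hits) > k {
		hits = hits[:k]
	}
	return hits
}

func main() {
	items := []candidate{
		{Name: "email_dispatch", TenantID: "acme", Vector: []float64{0.9, 0.1}},
		{Name: "slack_post", TenantID: "acme", Vector: []float64{0.8, 0.3}},
		{Name: "csv_export", TenantID: "acme", Vector: []float64{0.0, 1.0}},
		{Name: "email_dispatch", TenantID: "other", Vector: []float64{1.0, 0.0}},
	}
	for _, h := range topK([]float64{1, 0}, "acme", items, 2) {
		fmt.Printf("%s %.3f\n", h.Name, h.Similarity)
	}
}
```

In the real pipeline the vector store would perform the similarity search and tenant filtering itself; the sketch only shows what Stage 1 is expected to hand to Stage 2.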

Stage 2: Quality Re-Ranking

Score each candidate using a weighted formula:

score = W_sim * similarity + W_qual * quality + W_rec * recency

| Weight | Default | Source |
|---|---|---|
| W_sim (similarity) | 0.5 | Vector cosine similarity from Stage 1 |
| W_qual (quality) | 0.35 | success_rate * avg_llm_rating from tool_metrics |
| W_rec (recency) | 0.15 | Recency decay from last successful call |

Tools with active quarantine entries (tool_quarantine where released_at IS NULL AND (expires_at IS NULL OR expires_at > NOW())) are excluded from results.

Stage 3: Result Assembly

  1. Sort by composite score
  2. Truncate to requested limit (default 20)
  3. Return []ScoredTool with score components for transparency
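A minimal sketch of the assembly stage follows. It uses a simplified scored struct in place of ScoredTool (the fields of ToolDefinition are not shown in this document), and it assumes the ExcludeNames and MinQuality filters are applied here rather than in an earlier stage.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a simplified, hypothetical stand-in for ScoredTool.
type scored struct {
	Name    string
	Score   float64 // composite score from Stage 2
	Quality float64 // quality component, used for the MinQuality filter
}

// assemble drops excluded or low-quality tools, sorts by descending composite
// score (name as a deterministic tie-break), and truncates to limit.
// A limit of 0 or less means "no truncation" in this sketch.
func assemble(in []scored, exclude []string, minQuality float64, limit int) []scored {
	skip := make(map[string]struct{}, len(exclude))
	for _, name := range exclude {
		skip[name] = struct{}{}
	}

	out := make([]scored, 0, len(in))
	for _, s := range in {
		if _, excluded := skip[s.Name]; excluded || s.Quality < minQuality {
			continue
		}
		out = append(out, s)
	}

	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Name < out[j].Name
	})

	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	candidates := []scored{
		{Name: "email_dispatch", Score: 0.70, Quality: 0.8},
		{Name: "slack_post", Score: 0.80, Quality: 0.9},
		{Name: "legacy_mailer", Score: 0.90, Quality: 0.1},
	}
	for _, s := range assemble(candidates, nil, 0.5, 20) {
		fmt.Printf("%s %.2f\n", s.Name, s.Score)
	}
}
```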

Quality Tracking

LLM Auto-Rating

After each tool execution in ToolExecuteActivity, a non-blocking Temporal activity records an ExecutionOutcome including:

  • Binary success/failure (existing)
  • Execution latency
  • LLM quality rating (0.0-1.0) from a post-execution assessment prompt

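The LLM rating uses a lightweight prompt asking the model to rate tool output relevance and correctness on a 0-1 scale. This runs as a separate activity with a short timeout (5s by default, configurable via CRUVERO_TOOL_QUALITY_RATING_TIMEOUT) and fire-and-forget semantics, so a slow or failed rating never blocks tool execution.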

Degradation Detection

A periodic activity (or an inline check during ToolExecuteActivity) computes a rolling quality score. When the score drops below the configurable threshold (CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD, default 0.3), the response escalates in three steps:

  1. Warning: Log a structured warning and emit a NATS event (if Phase 12 is active), or fall back to a memory episode
  2. Alert: Set the degraded_since timestamp in tool_metrics
  3. Quarantine escalation: If quality stays below the threshold for N consecutive calls (CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER, default 5), insert a row into the existing tool_quarantine table (migration 0020) with a reason referencing quality degradation

Backward Compatibility

filterRegistryForPrompt in internal/agent/workflow.go is updated to use vector search when CRUVERO_TOOL_SEARCH_SEMANTIC=true, falling back to the existing keyword scoring when disabled or when the vector store is unavailable. The function signature remains unchanged.


Sub-Phases

| Sub-Phase | Name | Prompts | Depends On |
|---|---|---|---|
| 19A | Foundation: MetricsStore, Types, Migration | 4 | |
| 19B | Vector Indexing + Semantic Search | 4 | 19A |
| 19C | Quality Tracking + Degradation Alerting | 4 | 19B |
| 19D | CLI, Agent Discovery Integration, Testing | 4 | 19C |

Total: 4 sub-phases, 16 prompts, 9 documentation files

Dependency Graph

19A (Foundation) → 19B (Vector Search) → 19C (Quality Tracking) → 19D (CLI/Integration)

Strictly sequential: each sub-phase builds on the previous.


Environment Variables

| Variable | Default | Description |
|---|---|---|
| CRUVERO_TOOL_SEARCH_SEMANTIC | false | Enable semantic vector search for tool discovery |
| CRUVERO_TOOL_SEARCH_COLLECTION | tool_registry | Vector store collection name |
| CRUVERO_TOOL_SEARCH_K | 30 | Vector retrieval candidates (Stage 1) |
| CRUVERO_TOOL_SEARCH_RESULT_LIMIT | 20 | Max tools returned to agent |
| CRUVERO_TOOL_SEARCH_W_SIMILARITY | 0.5 | Ranking weight: vector similarity |
| CRUVERO_TOOL_SEARCH_W_QUALITY | 0.35 | Ranking weight: quality score |
| CRUVERO_TOOL_SEARCH_W_RECENCY | 0.15 | Ranking weight: recency decay |
| CRUVERO_TOOL_QUALITY_ENABLED | true | Enable quality tracking and LLM auto-rating |
| CRUVERO_TOOL_QUALITY_RATING_TIMEOUT | 5s | Timeout for LLM auto-rating activity |
| CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD | 0.3 | Quality score below which a tool is considered degraded |
| CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER | 5 | Consecutive degraded calls before quarantine escalation |
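One way to wire these variables into a config struct is sketched below. SearchConfig, LoadSearchConfig, and envOr are hypothetical names (the actual fields in internal/config/config.go are not shown here). The lookup function is injected so the loader can be tested without touching the process environment. Go 1.18 or later is assumed for generics.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// lookupFunc matches the signature of os.LookupEnv so tests can inject a map.
type lookupFunc func(key string) (string, bool)

// SearchConfig is a hypothetical holder for the Phase 19 settings.
type SearchConfig struct {
	Semantic         bool
	Collection       string
	K, ResultLimit   int
	WSimilarity      float64
	WQuality         float64
	WRecency         float64
	QualityEnabled   bool
	RatingTimeout    time.Duration
	DegradeThreshold float64
	QuarantineAfter  int
}

// envOr returns def when key is unset or empty, else the converted value.
func envOr[T any](lookup lookupFunc, key string, def T, conv func(string) (T, error)) (T, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	v, err := conv(raw)
	if err != nil {
		return def, fmt.Errorf("%s=%q: %w", key, raw, err)
	}
	return v, nil
}

func parseFloat(s string) (float64, error) { return strconv.ParseFloat(s, 64) }
func parseString(s string) (string, error) { return s, nil }

// LoadSearchConfig reads every variable, applying the documented defaults,
// and returns the first conversion error encountered (if any).
func LoadSearchConfig(lookup lookupFunc) (SearchConfig, error) {
	var c SearchConfig
	var firstErr, err error
	note := func(e error) {
		if e != nil && firstErr == nil {
			firstErr = e
		}
	}
	c.Semantic, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_SEMANTIC", false, strconv.ParseBool)
	note(err)
	c.Collection, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_COLLECTION", "tool_registry", parseString)
	note(err)
	c.K, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_K", 30, strconv.Atoi)
	note(err)
	c.ResultLimit, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_RESULT_LIMIT", 20, strconv.Atoi)
	note(err)
	c.WSimilarity, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_W_SIMILARITY", 0.5, parseFloat)
	note(err)
	c.WQuality, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_W_QUALITY", 0.35, parseFloat)
	note(err)
	c.WRecency, err = envOr(lookup, "CRUVERO_TOOL_SEARCH_W_RECENCY", 0.15, parseFloat)
	note(err)
	c.QualityEnabled, err = envOr(lookup, "CRUVERO_TOOL_QUALITY_ENABLED", true, strconv.ParseBool)
	note(err)
	c.RatingTimeout, err = envOr(lookup, "CRUVERO_TOOL_QUALITY_RATING_TIMEOUT", 5*time.Second, time.ParseDuration)
	note(err)
	c.DegradeThreshold, err = envOr(lookup, "CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD", 0.3, parseFloat)
	note(err)
	c.QuarantineAfter, err = envOr(lookup, "CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER", 5, strconv.Atoi)
	note(err)
	return c, firstErr
}

func main() {
	cfg, err := LoadSearchConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```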

Files Overview

New Files

| File | Sub-Phase | Description |
|---|---|---|
| internal/registry/metrics_types.go | 19A | ExecutionOutcome, ToolFeedback, ToolMetrics, ScoredTool, ScoreComponents |
| internal/registry/metrics_store.go | 19A | MetricsStore interface + PostgresMetricsStore |
| internal/registry/tool_indexer.go | 19B | ToolIndexer interface + DefaultToolIndexer |
| internal/registry/tool_searcher.go | 19B | ToolSearcher interface + DefaultToolSearcher (3-stage pipeline) |
| internal/registry/scorer.go | 19B | ToolScorer (ranking formula, weight config) |
| internal/registry/quality.go | 19C | QualityTracker, degradation detection, quarantine escalation |
| internal/registry/rating.go | 19C | LLM auto-rating prompt + activity |
| internal/registry/search_config.go | 19B | Search config wiring from env vars |
| cmd/tool-feedback/main.go | 19D | CLI to submit tool quality feedback |
| migrations/0026_tool_metrics.up.sql | 19A | Extend tool quality tracking tables |
| migrations/0026_tool_metrics.down.sql | 19A | Reverse migration |
| internal/registry/metrics_types_test.go | 19D | Type validation tests |
| internal/registry/metrics_store_test.go | 19D | PostgresMetricsStore tests (sqlmock) |
| internal/registry/tool_indexer_test.go | 19D | Indexer tests (mock embedder + vector store) |
| internal/registry/tool_searcher_test.go | 19D | Searcher pipeline tests |
| internal/registry/scorer_test.go | 19D | Scorer tests |
| internal/registry/quality_test.go | 19D | Quality tracking + degradation tests |
| internal/registry/rating_test.go | 19D | LLM rating tests |

Modified Files

| File | Sub-Phase | Change |
|---|---|---|
| internal/agent/activities.go | 19C | Wire quality recording in ToolExecuteActivity |
| internal/agent/workflow.go | 19D | Update filterRegistryForPrompt for semantic search fallback |
| internal/config/config.go | 19A | Add tool search/quality config fields |
| cmd/seed-registry/main.go | 19B | Add vector indexing after registry seed |

Migration: 0026_tool_metrics

-- 0026_tool_metrics.up.sql

-- Extend tool_retry_stats with quality tracking columns
ALTER TABLE tool_retry_stats
    ADD COLUMN IF NOT EXISTS total_calls    INTEGER          NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS avg_latency_ms DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS total_rating   DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS rating_count   INTEGER          NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS quality_score  DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS last_called_at TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS degraded_since TIMESTAMPTZ;

-- Backfill total_calls from existing success + failure counts
UPDATE tool_retry_stats
SET total_calls = successes + failures
WHERE total_calls = 0 AND (successes > 0 OR failures > 0);

-- Tool feedback table for user-submitted ratings
CREATE TABLE IF NOT EXISTS tool_feedback (
    id         BIGSERIAL        PRIMARY KEY,
    tenant_id  TEXT             NOT NULL DEFAULT '_global',
    tool_name  TEXT             NOT NULL,
    user_id    TEXT             NOT NULL DEFAULT '',
    rating     DOUBLE PRECISION NOT NULL,
    comment    TEXT             NOT NULL DEFAULT '',
    created_at TIMESTAMPTZ      NOT NULL DEFAULT NOW()
);

-- IF NOT EXISTS keeps index creation re-runnable, matching the statements above
CREATE INDEX IF NOT EXISTS idx_tool_feedback_tool ON tool_feedback (tenant_id, tool_name);
CREATE INDEX IF NOT EXISTS idx_tool_feedback_created ON tool_feedback (created_at);
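To show how the new columns relate to each other, the sketch below applies one execution to an in-memory copy of a tool_retry_stats row. statsRow and record are hypothetical names. Two choices here are assumptions: quality_score stays at 0 until the first rating arrives (matching the column default), and the incremental latency average assumes latency was recorded for every counted call, which will not hold for backfilled rows.

```go
package main

import "fmt"

// statsRow is a hypothetical in-memory mirror of the extended tool_retry_stats columns.
type statsRow struct {
	Successes    int
	Failures     int
	TotalCalls   int
	AvgLatencyMs float64
	TotalRating  float64
	RatingCount  int
	QualityScore float64
}

// record applies one execution outcome. rated is false when the LLM rating
// was skipped or timed out, in which case the rating columns are left alone.
func (r *statsRow) record(success bool, latencyMs int64, rating float64, rated bool) {
	if success {
		r.Successes++
	} else {
		r.Failures++
	}
	r.TotalCalls++

	// Incremental mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n.
	r.AvgLatencyMs += (float64(latencyMs) - r.AvgLatencyMs) / float64(r.TotalCalls)

	if rated {
		r.TotalRating += rating
		r.RatingCount++
	}
	r.QualityScore = r.qualityScore()
}

// qualityScore computes success_rate * avg_llm_rating.
// ASSUMPTION: returns 0 until at least one rating has been recorded.
func (r *statsRow) qualityScore() float64 {
	if r.TotalCalls == 0 || r.RatingCount == 0 {
		return 0
	}
	successRate := float64(r.Successes) / float64(r.TotalCalls)
	avgRating := r.TotalRating / float64(r.RatingCount)
	return successRate * avgRating
}

func main() {
	var row statsRow
	row.record(true, 100, 0.8, true)
	row.record(false, 300, 0.4, true)
	row.record(true, 200, 0, false) // rating timed out
	fmt.Printf("calls=%d avg_latency=%.1fms quality=%.3f\n",
		row.TotalCalls, row.AvgLatencyMs, row.QualityScore)
}
```

Storing total_rating and rating_count rather than a running average lets the average be recomputed exactly in SQL and avoids drift from repeated floating-point averaging.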

Success Metrics

| Metric | Target |
|---|---|
| Semantic search relevance | Top-5 results contain target tool >= 90% of test queries |
| Search latency (vector + re-rank) | < 50ms p99 |
| Quality score accuracy | LLM rating within 0.2 of manual assessment |
| Degradation detection | Alert within 3 calls of quality drop |
| Quarantine escalation | Automatic quarantine after N consecutive degraded calls |
| Backward compatibility | filterRegistryForPrompt unchanged when semantic disabled |
| Keyword fallback | Graceful degradation when vector store unavailable |
| Test coverage | >= 80% for internal/registry/ (enforced by scripts/check-coverage.sh) |

Code Quality Requirements (SonarQube)

All Go code produced by Phase 19 prompts must pass SonarQube quality gates:

  • Error handling: Every returned error must be handled explicitly
  • Cyclomatic complexity: Keep functions short (under 50 lines where practical) and branching shallow
  • No dead code: No unused variables, empty blocks, or duplicated logic
  • Resource cleanup: Close all resources with proper defer patterns
  • Early returns: Prefer guard clauses over deep nesting
  • No magic values: Use named constants
  • Linting gate: Run go vet ./internal/registry/..., staticcheck ./internal/registry/..., and golangci-lint run ./internal/registry/... before considering prompts complete
  • Test coverage: 80%+ for new registry files

Risk Mitigation

| Risk | Mitigation |
|---|---|
| Vector store unavailable | Semantic search is opt-in (CRUVERO_TOOL_SEARCH_SEMANTIC=false default). Falls back to keyword search. |
| LLM auto-rating latency | Fire-and-forget activity with 5s timeout. Tool execution is never blocked. |
| Cold start (no embeddings) | seed-registry CLI indexes tools on seed. Keyword fallback for un-indexed tools. |
| Quality score gaming | Composite score includes success rate, not just LLM rating. Manual feedback weighted separately. |
| Migration on existing data | ALTER TABLE ADD COLUMN with defaults. Backfill UPDATE is idempotent. |

Relationship to Other Phases

| Phase | Relationship |
|---|---|
| Phase 5 (Memory) | 19B may reuse salience scoring patterns for recency decay |
| Phase 6 (Tool Registry) | 19A extends existing registry Store + types |
| Phase 8 (Embeddings + Vector) | 19B reuses Embedder and VectorStore with new collection |
| Phase 10D (Immune System) | 19C integrates with existing tool_quarantine for escalation |
| Phase 12 (Events) | 19C emits degradation events via NATS if available |
| Phase 14 (API) | API endpoints can expose tool metrics via existing route patterns |
| Phase 18 (Prompt Library) | 19B mirrors the 3-stage search pipeline pattern from Phase 18 docs |

Progress Notes

(none yet)