Phase 19 — Tool Registry Restructure

Upgrades the tool registry from keyword-based discovery to semantic vector search with quality tracking. Introduces per-tool quality metrics, LLM-based auto-rating after tool execution, degradation alerting with quarantine escalation, and a feedback CLI. Builds on existing infrastructure: tool_retry_stats (migration 0005/0013), tool_quarantine (migration 0020), internal/vectorstore/, and internal/embedding/.

Status: Not Started
Depends on: Phases 1-14 complete
Migrations: 0026_tool_metrics (Phase 19A)
Branch: dev

Why Now

With Phases 1-14 complete, Cruvero has a functional tool registry with keyword-based discovery and basic retry statistics — but tool selection and quality management have three structural problems:

Keyword-only discovery — filterRegistryForPrompt in internal/agent/workflow.go uses token scoring and exact name matching to select tools. This misses semantic relationships: a prompt about "send notification" cannot discover tools named "email_dispatch" or "slack_post" unless those exact words appear.
No quality signal — tool_retry_stats (migration 0005) records binary success/failure counters per tool, but there is no measure of output quality. A tool that returns low-quality results 100% of the time appears identical to one that returns high-quality results.
No degradation alerting — tool_quarantine (migration 0020) provides binary quarantine from the immune system, but there is no graduated degradation awareness. A tool trending toward failure has no pre-quarantine warning mechanism.

Phase 19 solves all three by introducing semantic vector search for tool discovery, quality scoring with LLM auto-rating, and degradation alerting that integrates with the existing quarantine path.

Architecture

Extended registry package: `internal/registry/`

All new tool quality and search functionality lives in the existing internal/registry/ package. No new package is created — this extends the existing Store, ToolDefinition, and supporting types.

┌──────────────────────────────────────────────────────────────────┐
│                    registry.ToolSearcher                          │
│                                                                  │
│  ┌───────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │  Vector Retrieval │  │  Quality         │  │  Result      │  │
│  │  (embed + search  │  │  Re-Ranking      │  │  Assembly    │  │
│  │   tool desc)      │  │  (quality score  │  │  (merge,     │  │
│  │                   │  │   + recency +    │  │   format)    │  │
│  │                   │  │   success rate)  │  │              │  │
│  └─────────┬─────────┘  └────────┬─────────┘  └──────┬───────┘  │
│            │                     │                    │          │
│            └─────────┬───────────┘                    │          │
│                      │                                │          │
│               3-Stage Pipeline                        │          │
│                                                       │          │
│  External deps (reused, not owned):                   │          │
│  ├─ internal/embedding/Embedder                       │          │
│  ├─ internal/vectorstore/VectorStore (collection:     │          │
│  │    "tool_registry")                                │          │
│  └─ internal/tenant/ (multi-tenant isolation)         │          │
│                                                       │          │
│  Existing tables (extended, not replaced):            │          │
│  ├─ tool_retry_stats (migration 0005/0013)            │          │
│  └─ tool_quarantine (migration 0020)                  │          │
└──────────────────────────────────────────────────────────────────┘

Core API

// MetricsStore tracks mutable quality signals for tools.
type MetricsStore interface {
    RecordExecution(ctx context.Context, toolName string, outcome ExecutionOutcome) error
    RecordFeedback(ctx context.Context, toolName string, feedback ToolFeedback) error
    GetMetrics(ctx context.Context, toolName string) (ToolMetrics, error)
    ListDegraded(ctx context.Context, threshold float64) ([]ToolMetrics, error)
}

// ToolSearcher finds tools by semantic similarity + quality ranking.
type ToolSearcher interface {
    Search(ctx context.Context, query string, k int, filter *ToolSearchFilter) ([]ScoredTool, error)
}

// ToolIndexer manages vector embeddings for tool descriptions.
type ToolIndexer interface {
    IndexTool(ctx context.Context, tool ToolDefinition) error
    IndexRegistry(ctx context.Context, reg ToolRegistry) error
    RemoveTool(ctx context.Context, toolName string) error
}

Key Types

type ExecutionOutcome struct {
    ToolName   string  `json:"tool_name"`
    RunID      string  `json:"run_id"`
    StepIdx    int     `json:"step_idx"`
    Success    bool    `json:"success"`
    LatencyMs  int64   `json:"latency_ms"`
    LLMRating  float64 `json:"llm_rating"`  // 0.0-1.0, from post-execution assessment
    ErrorClass string  `json:"error_class,omitempty"`
    TenantID   string  `json:"tenant_id"`
}

type ToolFeedback struct {
    ToolName string  `json:"tool_name"`
    UserID   string  `json:"user_id"`
    Rating   float64 `json:"rating"` // 0.0-1.0
    Comment  string  `json:"comment,omitempty"`
    TenantID string  `json:"tenant_id"`
}

type ToolMetrics struct {
    ToolName      string    `json:"tool_name"`
    TenantID      string    `json:"tenant_id"`
    TotalCalls    int       `json:"total_calls"`
    SuccessCount  int       `json:"success_count"`
    FailureCount  int       `json:"failure_count"`
    AvgLatencyMs  float64   `json:"avg_latency_ms"`
    AvgLLMRating  float64   `json:"avg_llm_rating"`
    QualityScore  float64   `json:"quality_score"` // composite: success_rate * avg_llm_rating
    LastCalledAt  time.Time `json:"last_called_at"`
    DegradedSince *time.Time `json:"degraded_since,omitempty"`
}

type ScoredTool struct {
    Tool       ToolDefinition  `json:"tool"`
    Score      float64         `json:"score"`
    Components ScoreComponents `json:"components"`
}

type ScoreComponents struct {
    Similarity float64 `json:"similarity"`
    Quality    float64 `json:"quality"`
    Recency    float64 `json:"recency"`
}

type ToolSearchFilter struct {
    TenantID     string   `json:"tenant_id,omitempty"`
    ExcludeNames []string `json:"exclude_names,omitempty"`
    MinQuality   float64  `json:"min_quality,omitempty"`
}
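As a rough illustration of how the ExcludeNames and MinQuality fields of ToolSearchFilter might be applied to ranked candidates, here is a minimal sketch. The rankedTool type and the applyFilter helper are hypothetical stand-ins, not part of the planned API, and the sketch assumes TenantID is enforced earlier, during vector retrieval.

```go
package main

// toolSearchFilter mirrors the documented ToolSearchFilter fields.
type toolSearchFilter struct {
	TenantID     string
	ExcludeNames []string
	MinQuality   float64
}

// rankedTool is a hypothetical, trimmed-down stand-in for ScoredTool.
type rankedTool struct {
	Name    string
	Quality float64
}

// applyFilter drops excluded names and tools below MinQuality.
// A nil filter leaves the input untouched. TenantID is assumed to be
// applied during vector retrieval, so it is ignored here.
func applyFilter(tools []rankedTool, f *toolSearchFilter) []rankedTool {
	if f == nil {
		return tools
	}
	excluded := make(map[string]struct{}, len(f.ExcludeNames))
	for _, name := range f.ExcludeNames {
		excluded[name] = struct{}{}
	}
	out := make([]rankedTool, 0, len(tools))
	for _, t := range tools {
		if _, skip := excluded[t.Name]; skip {
			continue
		}
		if t.Quality < f.MinQuality {
			continue
		}
		out = append(out, t)
	}
	return out
}
```

Because MinQuality has a zero default, an empty filter passes every tool through, which matches the omitempty tags on the documented struct.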

Search Pipeline

Three-stage pipeline, same pattern as Phase 18's prompt search:

Stage 1: Vector Retrieval

Embed query text using embedding.Embedder.Embed() (internal/embedding/embedder.go:22)
Search tool_registry collection via vectorstore.VectorStore.Search() (internal/vectorstore/store.go:35)
Apply tenant isolation filter
Retrieve top-K candidates (default K=30)

Stage 2: Quality Re-Ranking

Score each candidate using a weighted formula:

score = W_sim * similarity + W_qual * quality + W_rec * recency

Weight	Default	Source
`W_sim` (similarity)	0.5	Vector cosine similarity from Stage 1
`W_qual` (quality)	0.35	`success_rate * avg_llm_rating` from `tool_metrics`
`W_rec` (recency)	0.15	Recency decay from last successful call

Tools with active quarantine entries (tool_quarantine where released_at IS NULL AND (expires_at IS NULL OR expires_at > NOW())) are excluded from results.

Stage 3: Result Assembly

Sort by composite score
Truncate to requested limit (default 20)
Return []ScoredTool with score components for transparency
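Result assembly amounts to a sort followed by a truncation. A minimal sketch, assuming a trimmed-down scoredTool type and a name-based tie-break that the document does not specify:

```go
package main

import "sort"

// scoredTool is a hypothetical, reduced form of ScoredTool.
type scoredTool struct {
	Name  string
	Score float64
}

// assemble returns a copy of candidates sorted by descending score and
// truncated to limit. Ties are broken by name so output is deterministic
// (an assumption; the source does not define tie-breaking).
// A non-positive limit disables truncation.
func assemble(candidates []scoredTool, limit int) []scoredTool {
	out := make([]scoredTool, len(candidates))
	copy(out, candidates)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].Name < out[j].Name
	})
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}
```

Copying before sorting keeps the caller's slice intact, which helps when the Stage 2 candidates are also logged or inspected for score transparency.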

Quality Tracking

LLM Auto-Rating

After each tool execution in ToolExecuteActivity, a non-blocking Temporal activity records an ExecutionOutcome including:

Binary success/failure (existing)
Execution latency
LLM quality rating (0.0-1.0) from a post-execution assessment prompt

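The LLM rating uses a lightweight prompt asking the model to rate the relevance and correctness of the tool output on a 0-1 scale. It runs as a separate activity with a short timeout (5s by default, set by CRUVERO_TOOL_QUALITY_RATING_TIMEOUT) and fire-and-forget semantics, so tool execution never waits on it.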

Degradation Detection

A rolling quality score is computed either by a periodic activity or inline during ToolExecuteActivity. When the score drops below a configurable threshold (CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD, default 0.3), the response escalates in three steps:

Warning — Log structured warning + emit NATS event (if Phase 12 active) or memory episode fallback
Alert — Set degraded_since timestamp in tool_metrics
Quarantine escalation — If quality stays below threshold for N consecutive calls, insert into existing tool_quarantine table (migration 0020) with reason referencing quality degradation
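The escalation path can be pictured as a small counter-based state machine. The sketch below is one possible reading, not the planned QualityTracker: it assumes the warning and the degraded_since update both fire on the first degraded call, and that any healthy call resets the streak. The action and degradationTracker names are hypothetical.

```go
package main

// action tells the caller what to do after an observation.
type action int

const (
	actionNone       action = iota
	actionWarn              // first degraded call: log, emit event, set degraded_since
	actionQuarantine        // streak reached the limit: insert into tool_quarantine
)

// degradationTracker counts consecutive degraded calls for one tool.
type degradationTracker struct {
	threshold       float64 // e.g. CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD (0.3)
	quarantineAfter int     // e.g. CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER (5)
	consecutive     int
}

// observe records one rolling quality score and returns the action to take.
// ASSUMPTION: a score at or above the threshold resets the streak.
func (d *degradationTracker) observe(quality float64) action {
	if quality >= d.threshold {
		d.consecutive = 0
		return actionNone
	}
	d.consecutive++
	switch {
	case d.consecutive >= d.quarantineAfter:
		return actionQuarantine
	case d.consecutive == 1:
		return actionWarn
	default:
		return actionNone
	}
}
```

Since observe keeps returning actionQuarantine while the streak continues, the quarantine insert on the caller side would need to be idempotent.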

Backward Compatibility

filterRegistryForPrompt in internal/agent/workflow.go is updated to use vector search when CRUVERO_TOOL_SEARCH_SEMANTIC=true, falling back to the existing keyword scoring when disabled or when the vector store is unavailable. The function signature remains unchanged.
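The fallback behaviour can be illustrated independently of the real filterRegistryForPrompt signature, which is not reproduced in this document. In the sketch below, selector, selectTools, and errVectorStoreUnavailable are hypothetical, and falling back only on error (not on an empty result) is an assumption.

```go
package main

import "errors"

// errVectorStoreUnavailable is a hypothetical sentinel for a failed vector lookup.
var errVectorStoreUnavailable = errors.New("vector store unavailable")

// selector returns tool names for a prompt.
type selector func(prompt string) ([]string, error)

// selectTools tries semantic search when enabled and falls back to the
// keyword path when it is disabled, not configured, or returns an error.
func selectTools(prompt string, semanticEnabled bool, semantic, keyword selector) ([]string, error) {
	if semanticEnabled && semantic != nil {
		if names, err := semantic(prompt); err == nil {
			return names, nil
		}
		// Semantic path failed: fall through to the keyword scorer.
	}
	return keyword(prompt)
}
```

Keeping the keyword scorer as the final return means the existing behaviour is the default path, so disabling CRUVERO_TOOL_SEARCH_SEMANTIC leaves tool selection exactly as it was.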

Sub-Phases

Sub-Phase	Name	Prompts	Depends On
19A	Foundation: MetricsStore, Types, Migration	4	—
19B	Vector Indexing + Semantic Search	4	19A
19C	Quality Tracking + Degradation Alerting	4	19B
19D	CLI, Agent Discovery Integration, Testing	4	19C

Total: 4 sub-phases, 16 prompts, 9 documentation files

Dependency Graph

19A (Foundation) → 19B (Vector Search) → 19C (Quality Tracking) → 19D (CLI/Integration)

Strictly sequential: each sub-phase builds on the previous.

Environment Variables

Variable	Default	Description
`CRUVERO_TOOL_SEARCH_SEMANTIC`	`false`	Enable semantic vector search for tool discovery
`CRUVERO_TOOL_SEARCH_COLLECTION`	`tool_registry`	Vector store collection name
`CRUVERO_TOOL_SEARCH_K`	`30`	Vector retrieval candidates (Stage 1)
`CRUVERO_TOOL_SEARCH_RESULT_LIMIT`	`20`	Max tools returned to agent
`CRUVERO_TOOL_SEARCH_W_SIMILARITY`	`0.5`	Ranking weight: vector similarity
`CRUVERO_TOOL_SEARCH_W_QUALITY`	`0.35`	Ranking weight: quality score
`CRUVERO_TOOL_SEARCH_W_RECENCY`	`0.15`	Ranking weight: recency decay
`CRUVERO_TOOL_QUALITY_ENABLED`	`true`	Enable quality tracking and LLM auto-rating
`CRUVERO_TOOL_QUALITY_RATING_TIMEOUT`	`5s`	Timeout for LLM auto-rating activity
`CRUVERO_TOOL_QUALITY_DEGRADE_THRESHOLD`	`0.3`	Quality score below which a tool is considered degraded
`CRUVERO_TOOL_QUALITY_QUARANTINE_AFTER`	`5`	Consecutive degraded calls before quarantine escalation

Files Overview

New Files

File	Sub-Phase	Description
`internal/registry/metrics_types.go`	19A	ExecutionOutcome, ToolFeedback, ToolMetrics, ScoredTool, ScoreComponents
`internal/registry/metrics_store.go`	19A	MetricsStore interface + PostgresMetricsStore
`internal/registry/tool_indexer.go`	19B	ToolIndexer interface + DefaultToolIndexer
`internal/registry/tool_searcher.go`	19B	ToolSearcher interface + DefaultToolSearcher (3-stage pipeline)
`internal/registry/scorer.go`	19B	ToolScorer (ranking formula, weight config)
`internal/registry/quality.go`	19C	QualityTracker, degradation detection, quarantine escalation
`internal/registry/rating.go`	19C	LLM auto-rating prompt + activity
`internal/registry/search_config.go`	19B	Search config wiring from env vars
`cmd/tool-feedback/main.go`	19D	CLI to submit tool quality feedback
`migrations/0026_tool_metrics.up.sql`	19A	Extend tool quality tracking tables
`migrations/0026_tool_metrics.down.sql`	19A	Reverse migration
`internal/registry/metrics_types_test.go`	19D	Type validation tests
`internal/registry/metrics_store_test.go`	19D	PostgresMetricsStore tests (sqlmock)
`internal/registry/tool_indexer_test.go`	19D	Indexer tests (mock embedder + vector store)
`internal/registry/tool_searcher_test.go`	19D	Searcher pipeline tests
`internal/registry/scorer_test.go`	19D	Scorer tests
`internal/registry/quality_test.go`	19D	Quality tracking + degradation tests
`internal/registry/rating_test.go`	19D	LLM rating tests

Modified Files

File	Sub-Phase	Change
`internal/agent/activities.go`	19C	Wire quality recording in ToolExecuteActivity
`internal/agent/workflow.go`	19D	Update filterRegistryForPrompt for semantic search fallback
`internal/config/config.go`	19A	Add tool search/quality config fields
`cmd/seed-registry/main.go`	19B	Add vector indexing after registry seed

Migration: `0026_tool_metrics`

-- 0026_tool_metrics.up.sql

-- Extend tool_retry_stats with quality tracking columns
ALTER TABLE tool_retry_stats
    ADD COLUMN IF NOT EXISTS total_calls INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS avg_latency_ms DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS total_rating DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS rating_count INTEGER NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS quality_score DOUBLE PRECISION NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS last_called_at TIMESTAMPTZ,
    ADD COLUMN IF NOT EXISTS degraded_since TIMESTAMPTZ;

-- Backfill total_calls from existing success + failure counts
UPDATE tool_retry_stats
SET total_calls = successes + failures
WHERE total_calls = 0 AND (successes > 0 OR failures > 0);

-- Tool feedback table for user-submitted ratings
CREATE TABLE IF NOT EXISTS tool_feedback (
    id          BIGSERIAL PRIMARY KEY,
    tenant_id   TEXT NOT NULL DEFAULT '_global',
    tool_name   TEXT NOT NULL,
    user_id     TEXT NOT NULL DEFAULT '',
    rating      DOUBLE PRECISION NOT NULL,
    comment     TEXT NOT NULL DEFAULT '',
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

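-- IF NOT EXISTS keeps the index creation idempotent, like the statements above
CREATE INDEX IF NOT EXISTS idx_tool_feedback_tool ON tool_feedback (tenant_id, tool_name);
CREATE INDEX IF NOT EXISTS idx_tool_feedback_created ON tool_feedback (created_at);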
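The new columns suggest how the aggregates could be maintained per call. The sketch below is an interpretation, not the planned PostgresMetricsStore: it assumes avg_llm_rating is derived as total_rating / rating_count, that avg_latency_ms is updated incrementally, and that no quality score is reported until at least one rating exists. The statsRow type and its methods are hypothetical.

```go
package main

// statsRow is a hypothetical in-memory view of one tool_retry_stats row
// after migration 0026.
type statsRow struct {
	Successes    int
	Failures     int
	TotalCalls   int
	AvgLatencyMs float64
	TotalRating  float64
	RatingCount  int
}

// record folds one execution into the aggregates. rated is false when
// the LLM rating timed out or was unusable.
func (r *statsRow) record(success bool, latencyMs int64, rating float64, rated bool) {
	r.TotalCalls++
	if success {
		r.Successes++
	} else {
		r.Failures++
	}
	// Incremental mean: avoids storing every latency sample.
	r.AvgLatencyMs += (float64(latencyMs) - r.AvgLatencyMs) / float64(r.TotalCalls)
	if rated {
		r.TotalRating += rating
		r.RatingCount++
	}
}

// qualityScore returns success_rate * avg_llm_rating. The boolean is
// false when there is not enough data (an assumption; the source does
// not define the score for unrated tools).
func (r statsRow) qualityScore() (float64, bool) {
	if r.TotalCalls == 0 || r.RatingCount == 0 {
		return 0, false
	}
	successRate := float64(r.Successes) / float64(r.TotalCalls)
	avgRating := r.TotalRating / float64(r.RatingCount)
	return successRate * avgRating, true
}
```

For instance, one successful call rated 0.8 followed by one unrated failure gives a success rate of 0.5 and an average rating of 0.8, so the quality score is 0.4.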

Success Metrics

Metric	Target
Semantic search relevance	Top-5 results contain target tool >= 90% of test queries
Search latency (vector + re-rank)	< 50ms p99
Quality score accuracy	LLM rating within 0.2 of manual assessment
Degradation detection	Alert within 3 calls of quality drop
Quarantine escalation	Automatic quarantine after N consecutive degraded calls
Backward compatibility	filterRegistryForPrompt unchanged when semantic disabled
Keyword fallback	Graceful degradation when vector store unavailable
Test coverage	>= 80% for internal/registry/ (enforced by scripts/check-coverage.sh)

Code Quality Requirements (SonarQube)

All Go code produced by Phase 19 prompts must pass SonarQube quality gates:

Error handling: Every returned error must be handled explicitly
Function size and complexity: Keep functions under 50 lines where practical, with low cyclomatic complexity
No dead code: No unused variables, empty blocks, or duplicated logic
Resource cleanup: Close all resources with proper defer patterns
Early returns: Prefer guard clauses over deep nesting
No magic values: Use named constants
Linting gate: Run go vet ./internal/registry/..., staticcheck ./internal/registry/..., and golangci-lint run ./internal/registry/... before considering prompts complete
Test coverage: 80%+ for new registry files

Risk Mitigation

Risk	Mitigation
Vector store unavailable	Semantic search is opt-in (`CRUVERO_TOOL_SEARCH_SEMANTIC=false` default). Falls back to keyword search.
LLM auto-rating latency	Fire-and-forget activity with 5s timeout. Tool execution is never blocked.
Cold start (no embeddings)	`seed-registry` CLI indexes tools on seed. Keyword fallback for un-indexed tools.
Quality score gaming	Composite score includes success rate, not just LLM rating. Manual feedback weighted separately.
Migration on existing data	ALTER TABLE ADD COLUMN with defaults. Backfill UPDATE is idempotent.

Relationship to Other Phases

Phase	Relationship
Phase 5 (Memory)	19B may reuse salience scoring patterns for recency decay
Phase 6 (Tool Registry)	19A extends existing registry Store + types
Phase 8 (Embeddings + Vector)	19B reuses Embedder and VectorStore with new collection
Phase 10D (Immune System)	19C integrates with existing tool_quarantine for escalation
Phase 12 (Events)	19C emits degradation events via NATS if available
Phase 14 (API)	API endpoints can expose tool metrics via existing route patterns
Phase 18 (Prompt Library)	19B mirrors the 3-stage search pipeline pattern from Phase 18 docs

Progress Notes

(none yet)
