Agent Analysis Phase Compaction

Version: 0.4.0+

Why

Before this change, the analysis phase of discovery built a per-area LLM prompt by:

  1. Selecting exploration steps via case-insensitive substring keyword match against the area's keyword list, and
  2. JSON-marshaling every selected step's full ExplorationStep — including every row of query_result — into the prompt as {{QUERY_RESULTS}}.

On an ERP-scale run (~80 exploration steps, several with thousand-row result sets) one analysis area assembled a single prompt of 2,993,965 characters / ~1.4M tokens. Bedrock Claude rejected it:

```
prompt is too long: 1401958 tokens > 1000000 maximum
```

The discovery completed, but every area that hit the cap produced 0 insights.

The on-demand schema retrieval architecture (see agent-on-demand-schema.md) fixed the parallel problem in the exploration phase. This change extends the same idea — bounded prompts via deterministic compaction + ranked retrieval — to the analysis phase.

Two orthogonal axes

| Axis | Before | After |
|---|---|---|
| **Selection** — which steps feed the prompt | Case-insensitive substring match on `Query` + `QueryPurpose` + `Thinking` | Vector ranking against the area's identity (`Name` + `Description` + `Keywords`) |
| **Representation** — how each step is rendered | Full `ExplorationStep` JSON, every row inlined | `CompactResult` digest (per-column statistics + head/tail rows) |

Either alone is insufficient: vector ranking still produces an unbounded prompt if every selected step inlines a thousand rows; compaction alone keeps a step's rendered size small but still selects every keyword-matched step regardless of relevance. Concretely, using the sizes measured below: 24 selected steps of raw JSON at ~80 KB each is still ~1.9 MB, while 24 compact digests at ~1.5 KB each is ~36 KB.

CompactResult — per-step deterministic digest

`CompactResult` (in `libs/go-common/models`) replaces the raw `query_result` blob in the prompt. It is computed once at exploration time and attached to the step:

```go
type CompactResult struct {
	RowCount int
	Columns  []ColumnSummary
	HeadRows []map[string]any // first 5 rows, always present
	TailRows []map[string]any // last 5 rows, when RowCount > 2*HeadTailRowCount
	AllRows  []map[string]any // every row when RowCount <= CompactInlineThreshold
}

type ColumnSummary struct {
	Name      string
	Kind      ColumnKind // number / string / boolean / timestamp / null / mixed
	NullCount int
	Distinct  int

	Min, P25, Median, P75, Max *float64 // numeric percentiles
	MinTime, MaxTime           string   // timestamp range (ISO-8601)

	Top []ValueCount // top-3 values for low-cardinality strings + booleans
}
```
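The shape of `ValueCount` is not spelled out in this doc; a minimal sketch of what `Top` plausibly carries (the field names here are assumptions, not the real definition):

```go
// Hypothetical sketch only; see compact_result.go for the real type.
// ValueCount pairs one distinct column value with its occurrence count.
type ValueCount struct {
	Value string // rendered value, e.g. "shipped"
	Count int    // number of rows carrying that value
}
```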

Constants live in the same file:

| Constant | Default | Role |
|---|---|---|
| `CompactInlineThreshold` | 20 | Row-count cap below which `AllRows` keeps everything |
| `TopValueCardinality` | 20 | Distinct-value cap above which a string column emits no `Top` (PII guard) |
| `HeadTailRowCount` | 5 | Number of rows in `HeadRows` / `TailRows` |

The builder is pure: same input twice → byte-identical output. NaN / +Inf / -Inf are excluded from numeric percentiles and counted toward NullCount so a single bad row can't poison the statistics.
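A sketch of that guard, assuming a helper that pre-cleans one numeric column before percentiles are taken (the function name is hypothetical; the real logic lives in `BuildCompactResult`):

```go
package models

import (
	"math"
	"sort"
)

// cleanNumeric is a hypothetical sketch of the NaN/Inf guard: non-finite
// values are excluded from the percentile inputs and counted as nulls,
// so one bad row cannot poison Min/P25/Median/P75/Max.
func cleanNumeric(values []float64) (clean []float64, nulls int) {
	for _, v := range values {
		if math.IsNaN(v) || math.IsInf(v, 0) {
			nulls++ // counted toward NullCount instead
			continue
		}
		clean = append(clean, v)
	}
	sort.Float64s(clean) // sorted input keeps percentile output deterministic
	return clean, nulls
}
```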

A 2,000-row result that previously rendered as ~80 KB JSON now renders as ~1.5 KB.

Vector ranking — per-run step index

Each completed exploration step is upserted into a per-run Qdrant collection named `decisionbox_run_{run_id}` immediately after the step's analysis is set. The embedding text is the step's `QueryPurpose + "\n[SQL]: " + Query`. `Thinking` is intentionally excluded because it tends to be exploratory chain-of-thought that hurts ranking more than it helps. The payload carries `{step, purpose, row_count, has_error}` so the picker can log budget-trimming decisions without re-fetching the step document.
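Schematically, the per-step indexing path looks like this (the `Embedder` interface and the step fields `Number` / `Error` are assumptions for illustration; `Upsert` is the real wrapper method listed under Key files):

```go
// Sketch only: Embedder and the step fields Number/Error are illustrative.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

func indexStep(ctx context.Context, idx *RunStepIndex, emb Embedder, step ExplorationStep) error {
	text := step.QueryPurpose + "\n[SQL]: " + step.Query // Thinking deliberately omitted
	vec, err := emb.Embed(ctx, text)
	if err != nil {
		return err
	}
	return idx.Upsert(ctx, vec, map[string]any{
		"step":      step.Number,
		"purpose":   step.QueryPurpose,
		"row_count": step.CompactResult.RowCount,
		"has_error": step.Error != "",
	})
}
```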

For each analysis area:

```
area_query = area.Name + " — " + area.Description + ". Keywords: " + keywords
v          = embedding.Embed(area_query)
hits       = run_step_index.Search(v, TopK=24, MinScore=0.30)
```

After vector retrieval, an exact-match boost promotes any step whose `Query` / `QueryPurpose` / `Thinking` contains a verbatim area keyword (case-insensitive substring) to a score of at least `ExactMatchFloor = 0.55`. This guarantees that a step explicitly written for an area's keyword can never be ranked out by a semantically close but different step. The picker emits `(picked, dropped)`; the caller logs the dropped list to telemetry.
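The boost itself is a small pass over the candidates; a sketch assuming the picker tracks scores keyed by step number (the `Number` field and the `scores` map are illustrative):

```go
// Promote any keyword-matched step to at least ExactMatchFloor = 0.55.
// Steps that vector search never returned enter the map at the floor,
// which is what makes the guarantee above hold.
func boostExactMatches(scores map[int]float64, steps []ExplorationStep, keywords []string) {
	for _, s := range steps {
		text := strings.ToLower(s.Query + "\n" + s.QueryPurpose + "\n" + s.Thinking)
		for _, kw := range keywords {
			if strings.Contains(text, strings.ToLower(kw)) {
				if scores[s.Number] < ExactMatchFloor {
					scores[s.Number] = ExactMatchFloor // boosted, never reduced
				}
				break
			}
		}
	}
}
```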

The collection is dropped at run completion (success or failure) by a defer in the orchestrator. On agent boot, an orphan sweep drops any per-run collection whose run id is no longer in the active discovery_runs set within a 24h window.
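One plausible shape for the boot-time sweep, reading the 24h window as a grace period so freshly created collections are never raced (`listCollections`, `collectionAge`, and `dropCollection` are stand-ins, not real client calls):

```go
func sweepOrphans(ctx context.Context, active map[string]bool) {
	for _, name := range listCollections(ctx) {
		runID, ok := strings.CutPrefix(name, "decisionbox_run_")
		if !ok {
			continue // not a per-run collection
		}
		if active[runID] {
			continue // run still present in discovery_runs
		}
		if collectionAge(ctx, name) < 24*time.Hour {
			continue // inside the grace window; leave it alone
		}
		dropCollection(ctx, name) // orphan: reclaim
	}
}
```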

Per-area token budget

After ranking + compaction, the picker estimates the rendered `{{QUERY_RESULTS}}` JSON byte size and trims the lowest-scored steps until the prompt fits under `AnalysisQueryResultsBudgetTokens = 200_000`. The estimator runs the same renderer the orchestrator runs, so `len(RenderCompactedSteps(s)) == EstimateCompactedRenderedSize(s)` holds for any input.
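A sketch of the trim loop (`Candidate`, `DroppedStep`, and the byte-to-token conversion `tokensOf` are illustrative; `EstimateCompactedRenderedSize` is the real estimator named above):

```go
// trimToBudget drops the lowest-scored candidates until the rendered
// {{QUERY_RESULTS}} block fits the token budget.
func trimToBudget(picked []Candidate) ([]Candidate, []DroppedStep) {
	var dropped []DroppedStep
	// picked arrives sorted by score desc, step asc on ties
	for len(picked) > 0 && tokensOf(EstimateCompactedRenderedSize(picked)) > AnalysisQueryResultsBudgetTokens {
		last := picked[len(picked)-1] // lowest-scored survivor goes first
		dropped = append(dropped, DroppedStep{Step: last.Step, Score: last.Score, Reason: "over_budget"})
		picked = picked[:len(picked)-1]
	}
	return picked, dropped
}
```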

Dropped steps are reported with their step number, score, and the drop reason (below_min_score or over_budget). Telemetry exposes the totals on the run document.

Flow

```
Orchestrator
│
├─ Phase 3: Exploration
│   └─ For each completed step:
│       ├─ engine builds CompactResult and attaches it to the step
│       └─ engine upserts (vector, payload) into the per-run Qdrant collection
│           └─ counting decorator bumps analysis_step_index_upserts
│
├─ Phase 4: Analysis (one iteration per area)
│   │
│   ├─ picker.Pick(area, steps) →
│   │   ├─ run_step_index.Search(area_query, TopK=24, MinScore=0.30)
│   │   │   └─ bumps analysis_step_index_search_calls
│   │   ├─ exact-match boost (verbatim keyword in step text)
│   │   ├─ sort by score desc, step asc on ties
│   │   └─ budget trim until rendered size ≤ AnalysisQueryResultsBudgetTokens
│   │       └─ bumps analysis_steps_dropped (per dropped step)
│   │
│   ├─ RenderCompactedSteps(picked) → fixed-size JSON
│   ├─ substitute into the prompt's {{QUERY_RESULTS}}
│   ├─ run LLM, parse insights, validate
│   └─ stamp telemetry on the analysis_log entry: selected_steps,
│       dropped_steps, query_results_chars
│
└─ defer run_step_index.Drop()
```

Key files

| File | Role |
|---|---|
| `libs/go-common/models/compact_result.go` | `CompactResult`, `ColumnSummary`, `ValueCount`, `ColumnKind` + tunable constants |
| `libs/go-common/models/compact_builder.go` | `BuildCompactResult` — pure deterministic digest |
| `services/agent/internal/discovery/run_step_index.go` | Per-run Qdrant collection wrapper: `Upsert`, `Search`, `Drop`, `SweepOrphanRunStepIndexes` |
| `services/agent/internal/discovery/analysis_step_picker.go` | Vector hits + exact-match boost + budget trimming. Pure logic, no IO. |
| `services/agent/internal/discovery/render_query_results.go` | Single source of truth for the `{{QUERY_RESULTS}}` JSON shape |
| `services/agent/internal/ai/exploration.go` | Computes `CompactResult` per step + upserts via the `StepIndexer` interface |
| `services/agent/internal/discovery/orchestrator.go` | Wires picker + render into the analysis loop; defers `Drop` |
| `services/agent/agentserver/agentserver.go` | Constructs `RunStepIndex`, runs the boot-time orphan sweep |

Telemetry

New fields on discovery_runs:

| Field | What it counts |
|---|---|
| `analysis_step_index_upserts` | How many exploration steps were indexed for this run |
| `analysis_step_index_search_calls` | How many area-level vector searches the picker issued |
| `analysis_steps_dropped` | How many steps the picker excluded (sum across areas, all reasons) |

New fields on each analysis_log entry:

| Field | Shape |
|---|---|
| `selected_steps` | `[{step:int, score:float, source:"vector"\|"exact_match"}]` |
| `dropped_steps` | `[{step:int, score:float, reason:"below_min_score"\|"over_budget"}]` |
| `query_results_chars` | Final rendered byte size of the `{{QUERY_RESULTS}}` block |

The dashboard's debug view (Project Settings → Advanced → Show debug logs during discovery) renders these as a per-area "what fed the LLM" breakdown so a human reviewer can see which steps were sacrificed.

Tunable constants

All are defined in code and documented in docs/reference/configuration.md. Tune by editing the constant and redeploying the agent; there is deliberately no runtime override surface, because these are algorithm parameters, not user knobs.

| Constant | Default | When to tune |
|---|---|---|
| `models.CompactInlineThreshold` | 20 rows | Raise if your domain produces lots of 50-row aggregates and the head+tail summary is noticeably worse than full inlining |
| `models.TopValueCardinality` | 20 | Raise for datasets where >20 distinct values is still meaningful (rare); higher values weaken the PII guard |
| `models.HeadTailRowCount` | 5 | Higher if the LLM consistently misses end-of-result patterns |
| `discovery.AnalysisAreaTopK` | 24 | Lower if you regularly hit budget trimming; higher if the picker is missing steps |
| `discovery.AnalysisAreaMinScore` | 0.30 | Lower for highly multilingual runs where cosine similarity is naturally smaller |
| `discovery.ExactMatchFloor` | 0.55 | Raise to make the exact-match path more aggressive; lower to defer to vector ranking |
| `discovery.AnalysisQueryResultsBudgetTokens` | 200_000 | Lower if the surrounding prompt grows; raise only on models with well over 1M-token windows |

Tests

Unit tests cover every branch in the digest builder (12 numeric / string / type-inference / boundary cases including determinism and NaN handling), every picker path (vector-only, exact-match boost, budget trim, ties, error propagation), and the renderer's golden JSON shape. The picker uses a fake Search function and the renderer runs deterministically end-to-end.
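As a flavor of the determinism case, roughly (the test name, fixture helper, and marshaling are illustrative, not the real test):

```go
func TestBuildCompactResult_Deterministic(t *testing.T) {
	rows := sampleRows(t) // illustrative fixture, not the real one

	a, _ := json.Marshal(models.BuildCompactResult(rows))
	b, _ := json.Marshal(models.BuildCompactResult(rows))

	if !bytes.Equal(a, b) { // pure builder: same input twice → byte-identical output
		t.Fatalf("BuildCompactResult is not deterministic")
	}
}
```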

Integration tests run against a real Qdrant testcontainer:

  • TestIntegration_RunStepIndex_FullCycle — upsert 30 steps, search, drop, second drop is a no-op.
  • TestIntegration_RunStepIndex_ConcurrentUpserts — 30 goroutines upsert concurrently; collection-create races are resolved.
  • TestIntegration_RunStepIndex_MultilingualQuery — a Turkish area query retrieves the matching English-described step, gated on INTEGRATION_TEST_OPENAI_API_KEY (uses text-embedding-3-large).

Verification

After deploying:

  1. Run discovery on a project that previously hit the prompt cap (e.g. an ERP-scale run with thousand-row result sets per area).
  2. Confirm in the dashboard's debug view that `analysis_steps_dropped` stays small: a well-tuned run drops zero to a handful of steps per area.
  3. Check that no `prompt is too long` errors appear in the agent logs.
  4. Spot-check insights against the last successful run for the same area — the same patterns should surface, possibly with slight ordering differences from the new ranker.

Known limitations

  • Embedding cost per run: one embed call per indexed step + one per area search. With text-embedding-3-large and a 100-step run with 8 areas, this is ~108 embed calls or ~$0.001. Negligible.
  • Vector ranking is approximate: a step that would have been selected by keyword match may not survive the min-score floor. The exact-match boost is the safety net — but it's still possible to produce a different mix of steps than the legacy keyword-only selector. Score logging in selected_steps makes the difference auditable.
  • Per-run collections in Qdrant aren't free: each occupies a few kilobytes plus index overhead. The deferred drop + boot-time sweep keeps Qdrant clean across thousands of runs.
  • The renderer never strips columns: a result with 100 columns still emits 100 ColumnSummary entries. Most warehouses return ≤30 columns; for genuinely wide results, future work could trim to the columns the LLM actually needs.