Version: 0.10.0

Data Models Reference

This page documents the core data structures used across DecisionBox.

DiscoveryResult

The complete output of a discovery run. Stored in the discoveries MongoDB collection.

Field	Type	Description
`id`	string	MongoDB ObjectID
`project_id`	string	Project that owns this discovery
`domain`	string	Domain (e.g., `gaming`, `social`)
`category`	string	Category (e.g., `match3`, `idle`, `casual`, `content_sharing`)
`run_type`	string	`full` (all areas), `partial` (some areas or some failed), `failed` (all areas failed)
`areas_requested`	string[]	Area IDs requested (empty for full run)
`discovery_date`	timestamp	When the discovery ran
`total_steps`	int	Number of exploration steps executed
`duration`	int64	Duration in nanoseconds
`insights`	Insight[]	Discovered patterns
`recommendations`	Recommendation[]	Actionable advice
`summary`	Summary	Aggregate stats
`created_at`	timestamp	When result was saved
`updated_at`	timestamp	Last write timestamp

Discovery logs are stored in split collections rather than embedded on the document: discovery_exploration_steps, discovery_analysis_steps, discovery_validation_results, and discovery_recommendation_log. See Discovery log collections for the rationale and the document shapes.
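
The `run_type` values and the nanosecond `duration` field above can be interpreted on the client side roughly as follows. This is a minimal sketch, not the service's own logic: the function names and the `failed_areas` and `all_areas` inputs are hypothetical and exist only for illustration.

```python
from datetime import timedelta


def classify_run_type(areas_requested: list, failed_areas: set, all_areas: list) -> str:
    """Map a run outcome onto the documented run_type values.

    `failed_areas` and `all_areas` are hypothetical inputs for this sketch.
    """
    # An empty areas_requested list means a full run over every area.
    targeted = areas_requested or all_areas
    failed = [area for area in targeted if area in failed_areas]
    if targeted and len(failed) == len(targeted):
        return "failed"  # all areas failed
    if areas_requested or failed:
        return "partial"  # only some areas requested, or some failed
    return "full"  # every area ran and none failed


def duration_to_timedelta(duration_ns: int) -> timedelta:
    """Convert the nanosecond `duration` field to a timedelta.

    timedelta has microsecond resolution, so sub-microsecond detail is dropped.
    """
    return timedelta(microseconds=duration_ns // 1_000)
```

For example, a `duration` of `90000000000` corresponds to 90 seconds.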

Insight

A discovered pattern or finding. Generated by the analysis phase.

Field	Type	Description
`id`	string	Deterministic ID: `{area}-{index}` (e.g., `churn-1`, `monetization-3`). Auto-generated if LLM omits it.
`analysis_area`	string	Which area found this (e.g., `churn`, `levels`)
`name`	string	Specific descriptive name (e.g., "Day 0-to-Day 1 Drop: 67% Never Return")
`description`	string	Detailed description with exact numbers and percentages
`severity`	string	`critical`, `high`, `medium`, or `low`
`affected_count`	int	Number of affected users (COUNT DISTINCT user_id)
`risk_score`	float64	0.0 to 1.0 risk assessment
`confidence`	float64	0.0 to 1.0 confidence level
`metrics`	map	Flexible key-value metrics (e.g., `{"churn_rate": 0.67, "avg_sessions": 3.2}`)
`indicators`	string[]	Specific metric indicators (e.g., "Session drop: 12min → 4min")
`target_segment`	string	Description of affected user segment
`source_steps`	int[]	Exploration step numbers that support this insight
`validation`	InsightValidation	Warehouse verification result (if validated)
`discovered_at`	timestamp	When this insight was generated
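
A consumer of these documents will often want to triage insights. The sketch below shows one possible ordering built from the `severity`, `risk_score`, and `confidence` fields. The ordering rule and the `rank_insights` name are assumptions made for illustration, not something DecisionBox prescribes.

```python
# Lower rank sorts first; the order follows the documented severity values.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def rank_insights(insights: list) -> list:
    """Sort by severity, then by descending risk_score, then by descending confidence."""
    return sorted(
        insights,
        key=lambda insight: (
            SEVERITY_RANK[insight["severity"]],
            -insight["risk_score"],
            -insight["confidence"],
        ),
    )
```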

InsightValidation

Attached to an insight after validation by the verifier + refuter pair.

Field	Type	Description
`verifier`	StructuredVerdict	Defender-frame agent verdict with per-claim evidence rows
`refuter`	StructuredVerdict	Skeptic-frame agent verdict (omitted when refuter is disabled or the refuter run failed)
`combined`	string	Combined verdict: `confirmed`, `supported`, `rejected`, `partial`, `unverifiable`, `validation_disabled`, or `skipped_budget_cap`
`refuter_disabled`	bool	True when the refuter was intentionally skipped for this document
`validated_at`	timestamp	When validation was performed
`input_tokens`	int	LLM input tokens consumed during validation
`output_tokens`	int	LLM output tokens produced during validation

Legacy fields (status, original_count, verified_count, query, reasoning) remain populated on documents written before the verifier+refuter pipeline shipped; the dashboard renders the new fields when present and falls back to the legacy fields otherwise.
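
The fallback described above (prefer the new fields, otherwise use the legacy ones) could look something like this on the reading side. This is a sketch under assumptions: `display_verdict` and the `not_validated` and `unknown` labels are hypothetical, and the dashboard's real rendering code may differ.

```python
from typing import Optional


def display_verdict(validation: Optional[dict]) -> str:
    """Pick the verdict to show for an insight's `validation` sub-document."""
    if not validation:
        return "not_validated"  # hypothetical label: the insight was never validated
    if validation.get("combined"):
        return validation["combined"]  # verifier + refuter pipeline document
    # Older documents only carry the legacy `status` field.
    return validation.get("status") or "unknown"  # "unknown" is a hypothetical label
```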

Recommendation

An actionable suggestion based on discovered insights.

Field	Type	Description
`id`	string	Recommendation ID
`category`	string	Category: `churn`, `engagement`, `monetization`, `difficulty`
`title`	string	Specific action title
`description`	string	Detailed explanation with numbers
`priority`	int	1 (critical) to 5 (optional). P1 = highest priority.
`target_segment`	string	Exact segment criteria
`segment_size`	int	Number of users in the segment
`expected_impact`	Impact	Expected improvement
`actions`	string[]	Numbered implementation steps
`related_insight_ids`	string[]	UUIDs of insights this recommendation addresses, verbatim from the same discovery's `insights[].id` (e.g., `["6e9261f5-c4ec-404b-bdf0-760a4644f384"]`). Recommendations whose `related_insight_ids` cannot be resolved to an eligible insight are dropped server-side before persistence; see the `RecommendationStep` `recommendations_dropped*` counters
`confidence`	float64	0.0 to 1.0 confidence

Impact

Expected impact of a recommendation.

Field	Type	Description
`metric`	string	What metric improves (e.g., `retention_rate`, `revenue`)
`estimated_improvement`	string	Expected improvement (e.g., "+15-20%", "+$4,975/month")
`reasoning`	string	Why this improvement is expected

Summary

Aggregate stats for a discovery run.

Field	Type	Description
`total_insights`	int	Number of insights generated
`total_recommendations`	int	Number of recommendations generated
`queries_executed`	int	Number of SQL queries executed
`errors`	string[]	Error messages from failed analysis areas (if any)

ExplorationStep

One step in the autonomous exploration phase. Each step corresponds to one user-visible exploration turn. A turn is usually a single LLM call, but a turn whose response cannot be parsed is retried (up to a small retry budget), and the token usage of every retry is summed onto the same step.

Field	Type	Description
`step`	int	Step number (1-based)
`timestamp`	timestamp	When this step ran
`action`	string	One of `query_data`, `lookup_schema`, `search_tables`, `complete`, `complete_rejected`
`thinking`	string	AI's reasoning for this query
`query_purpose`	string	Short description of query intent
`query`	string	The SQL query executed
`row_count`	int	Number of rows returned
`execution_time_ms`	int64	Query execution time in milliseconds
`error`	string	Error message if query failed
`fix_attempts`	int	Number of applied LLM repairs for this step's query (0 if it ran cleanly on the first try or if every fix call failed). `fix_attempts <= len(fix_history)`.
`fixed`	bool	True if the query was auto-fixed after a SQL error
`fix_history`	FixAttempt[]	Per-attempt fix log; omitted (or empty) when no fix was needed. Includes failed attempts (LLM error, unparseable response, filter rejection) — those carry `fixer_error` set. See FixAttempt.
`tokens_in`	int	Input tokens consumed (sum across any parse-retry rounds on this step)
`tokens_out`	int	Output tokens generated (sum across any parse-retry rounds on this step)

FixAttempt

One entry per LLM call the self-healing SQL fix loop made for a single exploration step. Stored as elements of ExplorationStep.fix_history in chronological order. Every entry — including failed attempts whose proposed SQL was never applied — is recorded so downstream tooling has visibility into the full repair trajectory, not just the last call.

Field	Type	Description
`step`	int	The parent `ExplorationStep.step` number, duplicated for self-contained export.
`attempt`	int	Zero-based retry index inside the step. Matches the `attempt` argument the executor passes to the SQL fixer.
`prompt_in`	string	Fully rendered prompt the fixer sent to the LLM (system instruction + user message).
`response_out`	string	Raw LLM response text — before SQL extraction. Populated even when extraction failed.
`sql_before`	string	The broken SQL handed to the fixer.
`sql_after`	string	The SQL the fixer proposed. Empty when the fixer failed to extract any parseable SQL (in that case `fixer_error` is set).
`error_in`	string	The warehouse error message that triggered this fix call.
`fixer_error`	string	When non-empty, the reason the proposal was NOT applied: LLM transport error, unparseable response, or post-fix security-filter rejection. When empty, the proposal was applied and the warehouse was retried with `sql_after`.
`input_tokens`	int	Input tokens consumed by this fix call.
`output_tokens`	int	Output tokens generated by this fix call.
`duration_ms`	int64	LLM call duration in milliseconds.
`timestamp`	timestamp	When this fix attempt was recorded.

The parent ExplorationStep.fix_attempts counter is the number of applied repairs (fixer_error empty), so fix_attempts <= len(fix_history) with the gap being failed-fixer rows.
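
The relationship between `fix_attempts` and `fix_history` can be checked from an exported step with a few lines. This is a minimal sketch that treats each `FixAttempt` as a plain dictionary; the helper names are hypothetical.

```python
def count_applied_fixes(fix_history: list) -> int:
    """Count entries whose proposal was applied, i.e. `fixer_error` is empty or absent."""
    return sum(1 for attempt in fix_history if not attempt.get("fixer_error"))


def fix_tokens(fix_history: list) -> tuple:
    """Total (input, output) tokens spent on fix calls, failed attempts included."""
    tokens_in = sum(attempt.get("input_tokens", 0) for attempt in fix_history)
    tokens_out = sum(attempt.get("output_tokens", 0) for attempt in fix_history)
    return tokens_in, tokens_out
```

A step whose history holds one failed attempt followed by one applied attempt would therefore report `fix_attempts` of 1 with two `fix_history` entries.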

AnalysisStep

Full LLM dialog for one analysis area. Captures the complete prompt and response.

Field	Type	Description
`area_id`	string	Analysis area ID (e.g., `churn`)
`area_name`	string	Display name (e.g., `Churn Risks`)
`run_at`	timestamp	When this analysis ran
`relevant_queries`	int	Number of exploration queries used as context
`tokens_in`	int	Input tokens consumed
`tokens_out`	int	Output tokens generated
`duration_ms`	int64	LLM call duration in milliseconds
`insight_count`	int	Number of insights extracted
`error`	string	Error message if analysis failed

RecommendationStep

Full LLM dialog for the recommendation-generation phase, stored in the discovery_recommendation_log collection. One row per discovery run.

Field	Type	Description
`run_at`	timestamp	When the recommendation phase ran
`prompt`	string	Fully rendered recommendation prompt sent to the LLM
`insight_count`	int	Number of eligible insights handed to the recommender
`response`	string	Raw LLM response text — before JSON cleanup
`tokens_in`	int	Input tokens consumed by the LLM call
`tokens_out`	int	Output tokens generated by the LLM call
`duration_ms`	int64	LLM call duration in milliseconds
`recommendations`	Recommendation[]	The parsed recommendations actually persisted to the discovery (after the `related_insight_ids` validity check below has dropped any unresolvable ones)
`status`	string	Optional observability marker. Empty on the happy path; set to `skipped_no_eligible_insights` when every upstream insight was rejected by the validator so the recommender was never invoked
`recommendations_dropped`	int	Number of recommendations the LLM emitted but the orchestrator discarded before persistence because their `related_insight_ids` could not be resolved to an eligible insight (sum of the two reason-specific counters). Omitted on a clean run.
`recommendations_dropped_missing_ids`	int	Subset of `recommendations_dropped` that arrived with an empty or absent `related_insight_ids` array. Omitted on a clean run.
`recommendations_dropped_unknown_id`	int	Subset of `recommendations_dropped` that cited at least one id not present in the eligible insight set — typically slug-style identifiers (`category:severity:theme`) hallucinated by the LLM in place of a real UUID. Use this counter to measure regression rates per LLM provider. Omitted on a clean run.
`error`	string	Error message if recommendation generation failed (LLM transport error or unparseable response)
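
The drop rule and its counters can be illustrated as follows. This is a sketch of the behaviour described in the table, not the orchestrator's actual code: `filter_recommendations` and `eligible_ids` are hypothetical names, and leaving out zero-valued counters is an assumption that mirrors "Omitted on a clean run".

```python
def filter_recommendations(recommendations: list, eligible_ids: set) -> tuple:
    """Keep recommendations whose related_insight_ids all resolve to eligible insights."""
    kept = []
    missing_ids = 0  # empty or absent related_insight_ids
    unknown_id = 0  # at least one id outside the eligible insight set
    for rec in recommendations:
        related = rec.get("related_insight_ids") or []
        if not related:
            missing_ids += 1
        elif any(insight_id not in eligible_ids for insight_id in related):
            unknown_id += 1
        else:
            kept.append(rec)
    counters = {
        "recommendations_dropped": missing_ids + unknown_id,
        "recommendations_dropped_missing_ids": missing_ids,
        "recommendations_dropped_unknown_id": unknown_id,
    }
    # Assumption: zero-valued counters are left out of the stored row.
    return kept, {name: value for name, value in counters.items() if value}
```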

ValidationResult

Warehouse verification of an insight's claims.

Field	Type	Description
`insight_id`	string	ID of the validated insight
`analysis_area`	string	Area this insight belongs to
`claimed_count`	int	Count claimed by the AI
`verified_count`	int	Count verified from the warehouse
`status`	string	`confirmed`, `adjusted`, `rejected`, `error`
`reasoning`	string	Explanation of the result
`query`	string	The verification SQL query
`validated_at`	timestamp	When validation was performed
`input_tokens`	int	Input tokens summed across every verifier LLM call for this insight (initial verification, lookup-loop rounds, forced final round). Omitted on legacy rows.
`output_tokens`	int	Output tokens summed across the same set. Omitted on legacy rows.

DiscoveryRun

Live status of a running discovery. Stored in the discovery_runs collection and updated in real time.

Field	Type	Description
`id`	string	Run ID
`project_id`	string	Project being discovered
`status`	string	`pending`, `running`, `completed`, `failed`, `cancelled`
`phase`	string	Current phase: `init`, `schema_discovery`, `exploration`, `analysis`, `validation`, `recommendations`, `saving`, `complete`
`phase_detail`	string	Human-readable phase description
`progress`	int	0 to 100 percentage
`started_at`	timestamp	When the run started
`updated_at`	timestamp	Last status update
`completed_at`	timestamp	When the run finished (null if running)
`error`	string	Error message (if failed)
`steps`	RunStep[]	Live step feed
`total_queries`	int	Total SQL queries executed
`successful_queries`	int	Queries that returned results
`failed_queries`	int	Queries that errored
`insights_found`	int	Insights generated so far
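
A client polling this document might condense it into a single status line. The sketch below assumes that `completed`, `failed`, and `cancelled` are terminal statuses, which the table suggests but does not state. The `progress_line` helper is hypothetical.

```python
# Assumption: these statuses mean the run will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def progress_line(run: dict) -> str:
    """Render a one-line summary of a DiscoveryRun document."""
    if run["status"] in TERMINAL_STATUSES:
        return f"{run['status']} ({run['insights_found']} insights)"
    return f"{run['status']} {run['progress']}% [{run['phase']}]"
```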

RunStep

One step in the live progress feed.

Field	Type	Description
`phase`	string	Which phase this step belongs to
`step_num`	int	Step number
`timestamp`	timestamp	When this step occurred
`type`	string	`query`, `insight`, `analysis`, `validation`, `recommendation`, `error`
`message`	string	Step description
`llm_thinking`	string	AI's reasoning text
`query`	string	SQL query (if type=query)
`query_result`	string	Query result summary
`row_count`	int	Rows returned
`query_time_ms`	int	Query execution time in milliseconds
`query_fixed`	bool	Whether query was auto-fixed
`insight_name`	string	Insight name (if type=insight)
`insight_severity`	string	Insight severity (if type=insight)
`error`	string	Error message (if type=error)

Feedback

User feedback on insights, recommendations, or exploration steps.

Field	Type	Description
`id`	string	Feedback ID
`project_id`	string	Project ID
`discovery_id`	string	Discovery run ID
`target_type`	string	`insight`, `recommendation`, `exploration_step`
`target_id`	string	ID of the rated item
`rating`	string	`like` or `dislike`
`comment`	string	Optional comment (typically with dislikes)
`created_at`	timestamp	When feedback was submitted
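
Before submitting feedback, a client could check the enumerated fields against the values listed above. This is a minimal sketch; `validate_feedback` is a hypothetical helper, and the server may enforce additional rules.

```python
TARGET_TYPES = {"insight", "recommendation", "exploration_step"}
RATINGS = {"like", "dislike"}


def validate_feedback(feedback: dict) -> list:
    """Return a list of problems found; an empty list means the payload looks valid."""
    problems = []
    if feedback.get("target_type") not in TARGET_TYPES:
        problems.append("target_type must be one of: " + ", ".join(sorted(TARGET_TYPES)))
    if feedback.get("rating") not in RATINGS:
        problems.append("rating must be 'like' or 'dislike'")
    if not feedback.get("target_id"):
        problems.append("target_id is required")
    return problems
```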

Project

Project configuration. Stored in the projects collection.

Field	Type	Description
`id`	string	MongoDB ObjectID
`name`	string	Project name
`description`	string	Project description
`domain`	string	Domain (e.g., `gaming`, `social`)
`category`	string	Category (e.g., `match3`, `idle`, `casual`, `content_sharing`)
`warehouse`	WarehouseConfig	Data warehouse configuration
`llm`	LLMConfig	LLM provider configuration
`schedule`	ScheduleConfig	Discovery schedule
`profile`	map	Domain-specific profile (from JSON Schema form)
`prompts`	ProjectPrompts	Per-project prompt overrides
`status`	string	Project status
`last_run_at`	timestamp	When the last discovery ran
`last_run_status`	string	Last run result
`created_at`	timestamp	When the project was created
`updated_at`	timestamp	Last update

WarehouseConfig

Field	Type	Description
`provider`	string	Provider ID: `bigquery`, `redshift`
`project_id`	string	GCP project ID (BigQuery)
`datasets`	string[]	Dataset/schema names
`location`	string	Data location
`filter_field`	string	Multi-tenant filter column
`filter_value`	string	Multi-tenant filter value
`config`	map	Provider-specific key-value config

LLMConfig

Field	Type	Description
`provider`	string	Provider ID: `claude`, `openai`, `ollama`, `vertex-ai`, `bedrock`
`model`	string	Model identifier (free text)
`config`	map	Provider-specific key-value config (e.g., `project_id`, `location` for Vertex AI)

ScheduleConfig

Field	Type	Description
`enabled`	bool	Whether automatic discovery is enabled
`cron_expr`	string	Cron expression (e.g., `0 2 * * *` = daily at 2 AM)
`max_steps`	int	Max exploration steps for scheduled runs
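
To make the `cron_expr` example concrete, the sketch below splits an expression into named fields. It assumes the standard five-field cron layout, which the `0 2 * * *` example matches. Whether DecisionBox accepts other layouts is not stated here, and `split_cron` is a hypothetical helper.

```python
# Assumption: standard five-field cron (no seconds or year field).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(cron_expr: str) -> dict:
    """Split a five-field cron expression into a field-name to value mapping."""
    parts = cron_expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))
```

For `0 2 * * *` this yields minute `0` and hour `2` with wildcards elsewhere, which reads as daily at 2 AM.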

Next Steps

API Reference — Endpoints that return these models
Configuration Reference — Environment variables
