
Configuring Data Warehouses

Version: 0.3.0

DecisionBox connects to your existing data warehouse in read-only mode. This guide covers setup for each supported warehouse.

Google BigQuery

Prerequisites

  • A GCP project with BigQuery datasets
  • One of:
    • On GCP: Application Default Credentials (Workload Identity, gcloud auth)
    • Outside GCP: A service account JSON key

Dashboard Setup

  1. Select Google BigQuery as warehouse provider
  2. Fill in:
    • Project ID: Your GCP project ID (e.g., my-gcp-project)
    • Location: Dataset location (e.g., US, us-central1, us-east5)
  3. Enter Datasets: Comma-separated (e.g., analytics, features_prod)
  4. Optionally set Filter: field + value for multi-tenant data

Authentication

When creating a project, select one of the available auth methods:

Application Default Credentials (ADC) — No credentials needed. Works automatically on GKE (Workload Identity), Cloud Run, Compute Engine, or after gcloud auth application-default login.

Service Account Key — For cross-cloud, local, or federated access.

  1. Create a service account in GCP Console with BigQuery Data Viewer and BigQuery Job User roles
  2. Download the JSON key
  3. In the project creation form, select Service Account Key as auth method and paste the JSON key

This method also supports Workload Identity Federation (WIF) — keyless access from AWS, Azure, or any OIDC identity provider. Instead of a service account key, paste the WIF credential config JSON generated by gcloud iam workload-identity-pools create-cred-config. The GCP SDK auto-detects whether the JSON is a service account key or a WIF credential config.

Multi-Dataset Support

BigQuery projects can have multiple datasets. List all datasets you want the agent to explore:

Datasets: events_prod, features_prod, analytics

The agent discovers table schemas from all listed datasets and can query across them.
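
For example, once those datasets are listed, the agent can join across them in a single query. A sketch, assuming a project named my-gcp-project and hypothetical tables exposure_events and feature_flags:

-- Hypothetical cross-dataset join; table and column names are illustrative
SELECT f.feature_name, COUNT(*) AS exposures
FROM `my-gcp-project.events_prod.exposure_events` AS e
JOIN `my-gcp-project.features_prod.feature_flags` AS f
  ON e.flag_id = f.flag_id
GROUP BY f.feature_name;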

Filtering

For shared datasets with data from multiple apps/tenants:

Filter Field: app_id
Filter Value: 68a42f378e3b227c8e41b0e5

The agent adds WHERE app_id = '68a42f378e3b227c8e41b0e5' to all queries.
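
For example, a count the agent would otherwise run as SELECT COUNT(*) FROM analytics.events ends up roughly as follows (table name illustrative):

SELECT COUNT(*)
FROM `analytics.events`
WHERE app_id = '68a42f378e3b227c8e41b0e5';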

Cost

BigQuery charges per TB scanned (default: $7.50/TB for on-demand pricing). The cost estimation feature uses BigQuery's dry-run API to preview costs before running.

Amazon Redshift

Prerequisites

  • A Redshift cluster (provisioned) or Redshift Serverless workgroup
  • AWS credentials with Redshift Data API access

Dashboard Setup — Serverless

  1. Select Amazon Redshift as warehouse provider
  2. Fill in:
    • Workgroup Name: Your Serverless workgroup (e.g., default-workgroup)
    • Database: Database name (e.g., dev)
    • Region: AWS region (e.g., us-east-1)
  3. Enter Datasets: Schema names (e.g., public)

Dashboard Setup — Provisioned

  1. Select Amazon Redshift as warehouse provider
  2. Fill in:
    • Cluster Identifier: Your cluster ID (e.g., my-redshift-cluster)
    • Database: Database name
    • Region: AWS region
  3. Enter Datasets: Schema names

Authentication

When creating a project, select one of the available auth methods:

IAM Role — No credentials needed. Works automatically on EKS (pod IAM role), EC2 (instance profile), or with environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).

Access Keys — For cross-cloud or local access. Enter your AWS access key pair in the format ACCESS_KEY_ID:SECRET_ACCESS_KEY. The IAM user needs these permissions:

  • redshift-data:ExecuteStatement, redshift-data:DescribeStatement, redshift-data:GetStatementResult
  • redshift-serverless:GetCredentials (Serverless) or redshift:GetClusterCredentials (Provisioned)

Assume Role — For cross-account access. Enter the Role ARN of the target role (e.g., arn:aws:iam::123456789012:role/RedshiftRole). Optionally provide an External ID if the role's trust policy requires one. The agent assumes this role via STS using its base credentials (IAM role or environment).

How Redshift Queries Work

DecisionBox uses the Redshift Data API (not JDBC), which works asynchronously:

  1. ExecuteStatement — Submit SQL
  2. DescribeStatement — Poll until complete
  3. GetStatementResult — Fetch results

This means no JDBC driver is needed, and it works with both Serverless and Provisioned clusters.

Data Type Handling

Redshift types are automatically normalized:

  • INTEGER, BIGINT, SMALLINT → INT64
  • VARCHAR, TEXT, CHAR → STRING
  • DECIMAL, NUMERIC → FLOAT64 (parsed from column metadata, not string guessing)
  • BOOLEAN → BOOL
  • TIMESTAMP, TIMESTAMPTZ → TIMESTAMP

System Table Filtering

The agent automatically excludes Redshift system tables from discovery:

  • pg_* tables
  • stl_* tables (system log)
  • svv_* tables (system views)
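
Conceptually, discovery behaves like the following filter, though the agent's exact query may differ:

-- Illustrative only: list user tables while skipping Redshift system prefixes
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_name NOT LIKE 'pg_%'
  AND table_name NOT LIKE 'stl_%'
  AND table_name NOT LIKE 'svv_%';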

Snowflake

Prerequisites

  • A Snowflake account (trial or production)
  • Username with access to the target database and schema
  • A virtual warehouse (e.g., COMPUTE_WH)

Dashboard Setup

  1. Select Snowflake as warehouse provider
  2. Fill in:
    • Account Identifier: Your Snowflake account (e.g., ORGNAME-ACCOUNTNAME)
    • Username: Snowflake user
    • Warehouse: Virtual warehouse name (e.g., COMPUTE_WH)
    • Database: Database name (e.g., ANALYTICS_DB)
  3. Optionally set Schema (default: PUBLIC) and Role
  4. Enter Datasets: Schema names (e.g., PUBLIC)

Authentication

When creating a project, select one of the available auth methods:

Username / Password — Enter your Snowflake password.

Key Pair (JWT) — Recommended for production.

  1. Generate an RSA key pair:
    openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
    openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
  2. Assign the public key to your Snowflake user:
    ALTER USER my_user SET RSA_PUBLIC_KEY='MIIBIjANBg...';
  3. In the project creation form, select Key Pair (JWT) and paste the PEM private key content.
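
To verify the key took effect, you can describe the user in a Snowflake worksheet; the RSA_PUBLIC_KEY_FP property should show a fingerprint:

-- Check that the public key from step 2 is registered
DESC USER my_user;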

Data Type Handling

Snowflake types are automatically normalized:

  • NUMBER, INT, BIGINT, SMALLINT, TINYINT, BYTEINT → INT64 (in schema metadata)
  • FLOAT, DOUBLE, REAL, DECIMAL(p,s), NUMERIC(p,s) → FLOAT64
  • In query results, NUMBER values with decimals are returned as FLOAT64 (the driver reports actual precision)
  • VARCHAR, STRING, CHAR, TEXT → STRING
  • BOOLEAN → BOOL
  • DATE → DATE
  • TIMESTAMP_NTZ, TIMESTAMP_LTZ, TIMESTAMP_TZ → TIMESTAMP
  • VARIANT, OBJECT, ARRAY → RECORD (JSON string in query results)
  • BINARY, VARBINARY → BYTES

Schema Metadata

The provider uses INFORMATION_SCHEMA for table listing and column metadata. Row counts come from INFORMATION_SCHEMA.TABLES.ROW_COUNT — no full-table scans needed.
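
A hand-run equivalent of that lookup, assuming the ANALYTICS_DB database and PUBLIC schema from the setup above:

-- Illustrative metadata query; not necessarily the provider's exact SQL
SELECT table_name, row_count
FROM ANALYTICS_DB.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PUBLIC'
  AND table_type = 'BASE TABLE';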

Cost

Snowflake charges per second of warehouse runtime, at a credit rate determined by warehouse size (credits per hour). There is no dry-run API, so the cost estimation feature is not available for Snowflake.

PostgreSQL

Prerequisites

  • A PostgreSQL 12+ server accessible from the DecisionBox deployment
  • A database user with read access to the target schema
  • SSL configured (recommended for remote connections)

Dashboard Setup

  1. Select PostgreSQL as warehouse provider
  2. Fill in:
    • Host: Database hostname (e.g., db.example.com)
    • Port: Database port (default: 5432)
    • Database: Database name
    • Username: Database user
  3. Optionally set Schema (default: public) and SSL Mode (default: require)
  4. Enter Datasets: Schema names (e.g., public)

Authentication

When creating a project, select one of the available auth methods:

Username / Password — Enter host, port, database, username, and password. The connection uses the lib/pq driver with the sslmode you configure.

Connection String — For advanced configurations (Heroku, RDS, Cloud SQL, Supabase). Enter a full PostgreSQL connection string:

postgres://user:password@host:5432/dbname?sslmode=require

This method supports all lib/pq DSN parameters including sslmode, connect_timeout, search_path, and application_name.

SSL Mode

| Mode | Description |
| --- | --- |
| disable | No SSL (only for localhost or trusted networks) |
| allow | Try non-SSL first, fall back to SSL |
| prefer | Try SSL first, fall back to non-SSL (common RDS default) |
| require | SSL required, no certificate verification (default) |
| verify-ca | SSL with CA certificate verification |
| verify-full | SSL with CA + hostname verification (most secure) |

Data Type Handling

PostgreSQL types are automatically normalized:

  • INTEGER, BIGINT, SMALLINT, SERIAL, BIGSERIAL → INT64
  • REAL, DOUBLE PRECISION → FLOAT64
  • NUMERIC, DECIMAL → FLOAT64 (parsed from driver's []byte representation)
  • VARCHAR, TEXT, CHAR, UUID, INET, CIDR, INTERVAL, MONEY → STRING
  • BOOLEAN → BOOL
  • DATE → DATE
  • TIMESTAMP, TIMESTAMPTZ → TIMESTAMP
  • JSON, JSONB → RECORD (JSON string in query results)
  • BYTEA → BYTES
  • ARRAY types → STRING (PostgreSQL text representation, e.g., {1,2,3})

Schema Metadata

The provider uses information_schema.tables and information_schema.columns for table listing and column metadata. Row counts come from pg_class.reltuples — an estimate maintained by PostgreSQL's autovacuum, no full-table scans needed.
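
Roughly equivalent to the following, though not necessarily the provider's exact SQL:

-- Estimated row counts for tables in the public schema, from planner statistics
SELECT c.relname AS table_name,
       c.reltuples::bigint AS estimated_rows
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r';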

Cost

PostgreSQL is self-hosted or managed (RDS, Cloud SQL, AlloyDB, Supabase). There is no per-query cost model, so the cost estimation feature is not available.

Databricks

Prerequisites

  • A Databricks workspace (AWS, Azure, or GCP)
  • A SQL warehouse (serverless or classic)
  • Unity Catalog enabled with access to the target catalog and schema

Dashboard Setup

  1. Select Databricks as warehouse provider
  2. Fill in:
    • Server Hostname: Your workspace hostname (e.g., xxx.cloud.databricks.com)
    • HTTP Path: SQL warehouse endpoint (e.g., /sql/1.0/warehouses/xxx)
    • Catalog: Unity Catalog catalog name (e.g., main)
  3. Optionally set Schema (default: default)
  4. Enter Datasets: Schema names (e.g., default)

Authentication

When creating a project, select one of the available auth methods:

Personal Access Token (PAT) — Simplest setup.

  1. In Databricks, go to Settings > Developer > Access tokens
  2. Generate a new token
  3. In the project creation form, select Personal Access Token and paste the token

OAuth M2M (Service Principal) — Recommended for production.

  1. Create a service principal in your Databricks workspace
  2. Create an OAuth secret for the service principal
  3. Grant the service principal access to your SQL warehouse and catalog
  4. In the project creation form, select OAuth M2M and enter client_id:client_secret

Unity Catalog Namespace

Databricks uses a 3-level namespace: catalog.schema.table. The agent qualifies table references as catalog.information_schema.tables for metadata queries. SQL queries use the schema set in the project's datasets configuration.
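
For example, with Catalog set to main and Schema set to default, references look like this (the orders table is illustrative):

-- Metadata lookup scoped to the catalog's information schema
SELECT table_schema, table_name
FROM main.information_schema.tables
WHERE table_schema = 'default';

-- Fully qualified 3-level table reference in a query
SELECT * FROM main.default.orders LIMIT 10;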

Data Type Handling

Databricks types are automatically normalized:

  • TINYINT, SMALLINT, INT, BIGINT → INT64
  • FLOAT, DOUBLE → FLOAT64
  • DECIMAL(p,s) → FLOAT64 (parsed from driver's string representation)
  • STRING, CHAR, VARCHAR → STRING
  • BOOLEAN → BOOL
  • DATE → DATE
  • TIMESTAMP, TIMESTAMP_NTZ → TIMESTAMP
  • BINARY → BYTES
  • STRUCT, ARRAY, MAP, VARIANT → RECORD (JSON string in query results)
  • INTERVAL → STRING

SQL Dialect

Databricks SQL extends ANSI SQL with:

  • QUALIFY — Filter window function results directly (like Snowflake)
  • PIVOT / UNPIVOT — Rotate rows to columns and vice versa
  • explode / explode_outer — Expand arrays and maps into rows
  • Delta time travel — TIMESTAMP AS OF and VERSION AS OF
  • Java SimpleDateFormat date patterns — Use yyyy-MM-dd (not YYYY-MM-DD)
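
A hypothetical query combining two of these, reading an earlier snapshot of a table and keeping only the latest row per user (table and column names illustrative):

SELECT user_id, event_time, event_name
FROM main.default.events TIMESTAMP AS OF '2024-06-01'
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) = 1;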

Cost

Databricks SQL warehouses charge per DBU (Databricks Unit) based on cluster size and runtime. There is no per-query cost estimation API, so the cost estimation feature is not available.

Microsoft SQL Server

Prerequisites

  • A SQL Server instance (SQL Server 2016 or later) reachable from the agent, or an Azure SQL Database
  • A SQL login (or Azure SQL user) with read access to the target database and schema
  • A firewall rule or VPC path that allows the agent's egress IP to reach port 1433

Dashboard Setup

  1. Select Microsoft SQL Server as warehouse provider
  2. Fill in:
    • Host: Server hostname (e.g., mssql.example.com or my-server.database.windows.net)
    • Port: Usually 1433; leave blank for the default
    • Database: Target database name (hyphens allowed — Azure SQL DB names commonly contain them)
    • Username: SQL login name
    • Schema: SQL Server schema to explore (default: dbo)
  3. Optionally set:
    • Encrypt — TDS encryption mode: true (default, TLS required), false (no TLS), strict (TDS 8.0 strict TLS), or disable
    • Trust Server Certificate — true skips TLS certificate validation; leave false in production

Authentication

SQL Login (Username / Password) — Most common.

  1. In SQL Server: CREATE LOGIN decisionbox WITH PASSWORD = '...' then CREATE USER and GRANT SELECT on the target schema (see the example below)
  2. Grant VIEW DATABASE STATE if you want GetTableSchema to return accurate row counts (sys.dm_db_partition_stats)
  3. In the project creation form, select SQL Login and enter the password
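
A sketch of the SQL for steps 1 and 2, assuming a database named mydb and the default dbo schema:

-- Run in the master database
CREATE LOGIN decisionbox WITH PASSWORD = 'use-a-strong-password';

-- Run in the target database (mydb)
CREATE USER decisionbox FOR LOGIN decisionbox;
GRANT SELECT ON SCHEMA::dbo TO decisionbox;

-- Optional: lets schema discovery read row counts from sys.dm_db_partition_stats
GRANT VIEW DATABASE STATE TO decisionbox;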

Connection String — For advanced configurations (Kerberos, Azure AD access token, custom TLS). Paste a full sqlserver:// URL DSN (see the go-mssqldb connection string format):

sqlserver://user:password@host:1433?database=mydb&encrypt=true&TrustServerCertificate=false

Special characters in the password (@, :, ?, &, #, /, space) must be URL-encoded.

Azure SQL Database Notes

  • Use the fully qualified server name: myserver.database.windows.net
  • encrypt=true is required; Azure SQL rejects unencrypted TDS
  • Managed Identity / Azure AD token auth is reachable via the Connection String auth method by passing fedauth=ActiveDirectoryManagedIdentity (or similar) in the DSN query string

Data Type Handling

SQL Server types are automatically normalized:

  • TINYINT, SMALLINT, INT, BIGINT → INT64
  • REAL, FLOAT → FLOAT64
  • DECIMAL, NUMERIC, MONEY, SMALLMONEY → FLOAT64 (parsed from driver's string representation)
  • BIT → BOOL
  • DATE → DATE
  • DATETIME, DATETIME2, SMALLDATETIME, DATETIMEOFFSET → TIMESTAMP (RFC3339 string)
  • TIME → STRING
  • CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT → STRING
  • BINARY, VARBINARY, IMAGE → BYTES (in schema) / STRING (in query results)
  • UNIQUEIDENTIFIER → STRING (canonical 8-4-4-4-12 hex form)
  • XML → STRING

SQL Server has no dedicated JSON type; JSON is stored in NVARCHAR(MAX) and returned as STRING.

SQL Dialect

T-SQL supports most ANSI SQL plus SQL-Server-specific syntax:

  • TOP n / TOP n PERCENT — Row limiting without ORDER BY (when a natural order isn't needed)
  • OFFSET/FETCH — Standard paging; requires ORDER BY. Cannot be combined with TOP
  • CROSS APPLY / OUTER APPLY — SQL Server's equivalent of LATERAL joins
  • PIVOT / UNPIVOT — Rotate rows to columns and vice versa
  • STRING_AGG / STRING_SPLIT — String aggregation and splitting (2017+)
  • JSON_VALUE / JSON_QUERY / OPENJSON / ISJSON — JSON parsing (2016+)
  • TRY_CAST / TRY_CONVERT / TRY_PARSE — Safe type conversions that return NULL on failure
  • SYSUTCDATETIME() / DATEADD / DATEDIFF — Date arithmetic

No QUALIFY clause — filter window-function results via CTE or subquery.
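
For example, a latest-row-per-user query written with a CTE instead of QUALIFY, paged with OFFSET/FETCH (table and column names illustrative):

WITH ranked AS (
    SELECT user_id, event_time, event_name,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
    FROM dbo.events
)
SELECT user_id, event_time, event_name
FROM ranked
WHERE rn = 1
ORDER BY user_id
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY;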

Cost

SQL Server and Azure SQL Database have multiple pricing models (DTU, vCore, per-second serverless). There is no per-query cost estimation API, so the cost estimation feature is not available.

Cross-Cloud Authentication

DecisionBox supports accessing warehouses from a different cloud:

| Scenario | Auth Method | How |
| --- | --- | --- |
| BigQuery from AWS/Azure | Service Account Key | Paste SA key JSON or WIF credential config |
| BigQuery via WIF (keyless) | Service Account Key | Paste WIF credential config from gcloud iam workload-identity-pools create-cred-config |
| Redshift from GCP/Azure | Access Keys | Paste ACCESS_KEY_ID:SECRET_ACCESS_KEY |
| Redshift cross-account | Assume Role | Enter Role ARN + External ID |
| Snowflake from any cloud | Password or Key Pair | Paste password or PEM private key |
| PostgreSQL from any cloud | Password or Connection String | Enter credentials or full DSN |
| Databricks from any cloud | PAT or OAuth M2M | Paste token or client_id:client_secret |
| Microsoft SQL Server / Azure SQL | SQL Login or Connection String | Enter password or full sqlserver:// DSN |
| Any from local dev | ADC / IAM Role | Configure cloud CLI (gcloud auth, aws configure) |

The key concept: warehouse credentials are stored encrypted via the secret provider. When creating a project, select the appropriate auth method and enter credentials inline. The agent reads credentials from the secret provider before initializing the warehouse provider.

Next Steps