Configuring Data Warehouses
Version: 0.1.0
DecisionBox connects to your existing data warehouse in read-only mode. This guide covers setup for each supported warehouse.
Google BigQuery
Prerequisites
- A GCP project with BigQuery datasets
- One of:
  - On GCP: Application Default Credentials (Workload Identity, gcloud auth)
  - Outside GCP: A service account JSON key
Dashboard Setup
- Select Google BigQuery as warehouse provider
- Fill in:
  - Project ID: Your GCP project ID (e.g., `my-gcp-project`)
  - Location: Dataset location (e.g., `US`, `us-central1`, `us-east5`)
- Enter Datasets: Comma-separated (e.g., `analytics, features_prod`)
- Optionally set Filter: field + value for multi-tenant data
Authentication
When creating a project, select one of the available auth methods:
Application Default Credentials (ADC) — No credentials needed.
Works automatically on GKE (Workload Identity), Cloud Run, Compute Engine, or after gcloud auth application-default login.
Service Account Key — For cross-cloud, local, or federated access.
- Create a service account in GCP Console with the `BigQuery Data Viewer` and `BigQuery Job User` roles
- Download the JSON key
- In the project creation form, select Service Account Key as auth method and paste the JSON key
This method also supports Workload Identity Federation (WIF) — keyless access from AWS, Azure, or any OIDC identity provider.
Instead of a service account key, paste the WIF credential config JSON generated by gcloud iam workload-identity-pools create-cred-config.
The GCP SDK auto-detects whether the JSON is a service account key or a WIF credential config.
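For illustration, here is a minimal Python sketch of that detection using the google-auth library; the `load_bigquery_credentials` helper name is hypothetical:

```python
# Both credential types are JSON; google-auth dispatches on the "type" field:
# "service_account" -> SA key, "external_account" -> WIF credential config.
import json

import google.auth


def load_bigquery_credentials(raw_json: str):  # hypothetical helper
    info = json.loads(raw_json)
    credentials, project_id = google.auth.load_credentials_from_dict(
        info,
        scopes=["https://www.googleapis.com/auth/bigquery.readonly"],
    )
    return credentials, project_id
```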
Multi-Dataset Support
BigQuery projects can have multiple datasets. List all datasets you want the agent to explore:
Datasets: events_prod, features_prod, analytics
The agent discovers table schemas from all listed datasets and can query across them.
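For example, a single query can join tables from two listed datasets. The sketch below assumes the google-cloud-bigquery client; the table and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Any dataset listed in the project config can be referenced by name.
sql = """
    SELECT e.user_id, f.feature_value
    FROM events_prod.events AS e
    JOIN features_prod.user_features AS f USING (user_id)
    LIMIT 10
"""
for row in client.query(sql).result():
    print(dict(row))
```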
Filtering
For shared datasets with data from multiple apps/tenants:
Filter Field: app_id
Filter Value: 68a42f378e3b227c8e41b0e5
The agent adds WHERE app_id = '68a42f378e3b227c8e41b0e5' to all queries.
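One way to implement this is to wrap the original statement in a filtered subquery. The sketch below is a naive illustration, not DecisionBox's actual rewrite logic:

```python
# Naive sketch: wrapping in a subquery avoids clashing with an existing
# WHERE clause. A production rewrite would work on the parsed query AST.
def with_tenant_filter(sql: str, field: str, value: str) -> str:
    return f"SELECT * FROM ({sql}) AS t WHERE t.{field} = '{value}'"


print(with_tenant_filter("SELECT user_id FROM analytics.events",
                         "app_id", "68a42f378e3b227c8e41b0e5"))
```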
Cost
BigQuery charges per TB scanned (default: $7.50/TB for on-demand pricing). The cost estimation feature uses BigQuery's dry-run API to preview the cost of a query before running it.
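As a sketch of the mechanism, a dry-run job in the Python BigQuery client reports the bytes a query would scan without executing or billing it (the table name is hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Dry-run jobs never execute; they only report total_bytes_processed.
job = client.query("SELECT * FROM analytics.events", job_config=job_config)
tb = job.total_bytes_processed / 1e12
print(f"~{tb:.4f} TB scanned, estimated ${tb * 7.50:.4f}")
```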
Amazon Redshift
Prerequisites
- A Redshift cluster (provisioned) or Redshift Serverless workgroup
- AWS credentials with Redshift Data API access
Dashboard Setup — Serverless
- Select Amazon Redshift as warehouse provider
- Fill in:
  - Workgroup Name: Your Serverless workgroup (e.g., `default-workgroup`)
  - Database: Database name (e.g., `dev`)
  - Region: AWS region (e.g., `us-east-1`)
- Enter Datasets: Schema names (e.g., `public`)
Dashboard Setup — Provisioned
- Select Amazon Redshift as warehouse provider
- Fill in:
  - Cluster Identifier: Your cluster ID (e.g., `my-redshift-cluster`)
  - Database: Database name
  - Region: AWS region
- Enter Datasets: Schema names
Authentication
When creating a project, select one of the available auth methods:
IAM Role — No credentials needed.
Works automatically on EKS (pod IAM role), EC2 (instance profile), or with environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY).
Access Keys — For cross-cloud or local access.
Enter your AWS access key pair in the format ACCESS_KEY_ID:SECRET_ACCESS_KEY.
The IAM user needs these permissions:
- `redshift-data:ExecuteStatement`, `redshift-data:DescribeStatement`, `redshift-data:GetStatementResult`
- `redshift-serverless:GetCredentials` (Serverless) or `redshift:GetClusterCredentials` (Provisioned)
Assume Role — For cross-account access.
Enter the Role ARN of the target role (e.g., arn:aws:iam::123456789012:role/RedshiftRole).
Optionally provide an External ID if the role's trust policy requires one.
The agent assumes this role via STS using its base credentials (IAM role or environment).
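A minimal boto3 sketch of that STS hop, reusing the example role ARN (the session name is arbitrary):

```python
import boto3

# The base credentials (IAM role or environment) are used to call STS.
sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/RedshiftRole",
    RoleSessionName="decisionbox-redshift",
    ExternalId="my-external-id",  # omit unless the trust policy requires it
)
creds = resp["Credentials"]

# The temporary credentials then back the Redshift Data API client.
redshift = boto3.client(
    "redshift-data",
    region_name="us-east-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```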
How Redshift Queries Work
DecisionBox uses the Redshift Data API (not JDBC), which works asynchronously:
1. `ExecuteStatement`: submit SQL
2. `DescribeStatement`: poll until complete
3. `GetStatementResult`: fetch results
This means no JDBC driver is needed, and it works with both Serverless and Provisioned clusters.
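A minimal boto3 sketch of that three-call flow, using the Serverless example values from above (the queried table is hypothetical):

```python
import time

import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# 1. Submit SQL. For Provisioned clusters, pass ClusterIdentifier=... instead.
stmt = client.execute_statement(
    WorkgroupName="default-workgroup",
    Database="dev",
    Sql="SELECT COUNT(*) FROM public.events",
)

# 2. Poll until the statement reaches a terminal state.
while True:
    desc = client.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(0.5)

# 3. Fetch results once finished.
if desc["Status"] == "FINISHED" and desc.get("HasResultSet"):
    records = client.get_statement_result(Id=stmt["Id"])["Records"]
```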
Data Type Handling
Redshift types are automatically normalized:
- `INTEGER`, `BIGINT`, `SMALLINT` → `INT64`
- `VARCHAR`, `TEXT`, `CHAR` → `STRING`
- `DECIMAL`, `NUMERIC` → `FLOAT64` (parsed from column metadata, not string guessing)
- `BOOLEAN` → `BOOL`
- `TIMESTAMP`, `TIMESTAMPTZ` → `TIMESTAMP`
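A sketch of what such a normalization map can look like; the keys restate the list above, though the exact type strings the Data API column metadata returns may differ:

```python
# Illustrative restatement of the mapping above, keyed on lowercase type names.
REDSHIFT_TYPE_MAP = {
    "integer": "INT64", "bigint": "INT64", "smallint": "INT64",
    "varchar": "STRING", "text": "STRING", "char": "STRING",
    "decimal": "FLOAT64", "numeric": "FLOAT64",  # precision from metadata
    "boolean": "BOOL",
    "timestamp": "TIMESTAMP", "timestamptz": "TIMESTAMP",
}


def normalize_type(redshift_type: str) -> str:
    # Unknown types fall back to STRING in this sketch.
    return REDSHIFT_TYPE_MAP.get(redshift_type.lower(), "STRING")
```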
System Table Filtering
The agent automatically excludes Redshift system tables from discovery:
- `pg_*` tables
- `stl_*` tables (system log)
- `svv_*` tables (system views)
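A name-prefix check is enough to reproduce this behavior; a one-line sketch:

```python
# Sketch: drop Redshift system tables by name prefix during discovery.
SYSTEM_PREFIXES = ("pg_", "stl_", "svv_")


def is_user_table(table_name: str) -> bool:
    return not table_name.lower().startswith(SYSTEM_PREFIXES)
```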
Snowflake
Prerequisites
- A Snowflake account (trial or production)
- Username with access to the target database and schema
- A virtual warehouse (e.g., `COMPUTE_WH`)
Dashboard Setup
- Select Snowflake as warehouse provider
- Fill in:
  - Account Identifier: Your Snowflake account (e.g., `ORGNAME-ACCOUNTNAME`)
  - Username: Snowflake user
  - Warehouse: Virtual warehouse name (e.g., `COMPUTE_WH`)
  - Database: Database name (e.g., `ANALYTICS_DB`)
- Optionally set Schema (default: `PUBLIC`) and Role
- Enter Datasets: Schema names (e.g., `PUBLIC`)
Authentication
When creating a project, select one of the available auth methods:
Username / Password — Enter your Snowflake password.
Key Pair (JWT) — Recommended for production.
- Generate an RSA key pair:

  ```bash
  openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
  openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
  ```

- Assign the public key to your Snowflake user:

  ```sql
  ALTER USER my_user SET RSA_PUBLIC_KEY='MIIBIjANBg...';
  ```

- In the project creation form, select Key Pair (JWT) and paste the PEM private key content.
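For reference, this is how the Python Snowflake connector consumes such a key; the values reuse the examples above:

```python
import snowflake.connector
from cryptography.hazmat.primitives import serialization

# Load the PKCS#8 private key generated by the openssl commands above.
with open("rsa_key.p8", "rb") as f:
    private_key = serialization.load_pem_private_key(f.read(), password=None)

conn = snowflake.connector.connect(
    account="ORGNAME-ACCOUNTNAME",
    user="my_user",
    # The connector expects the key as DER-encoded bytes.
    private_key=private_key.private_bytes(
        encoding=serialization.Encoding.DER,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.NoEncryption(),
    ),
    warehouse="COMPUTE_WH",
    database="ANALYTICS_DB",
    schema="PUBLIC",
)
```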
Data Type Handling
Snowflake types are automatically normalized:
- `NUMBER`, `INT`, `BIGINT`, `SMALLINT`, `TINYINT`, `BYTEINT` → `INT64` (in schema metadata)
- `FLOAT`, `DOUBLE`, `REAL`, `DECIMAL(p,s)`, `NUMERIC(p,s)` → `FLOAT64`
- In query results, `NUMBER` values with decimals are returned as `FLOAT64` (the driver reports actual precision)
- `VARCHAR`, `STRING`, `CHAR`, `TEXT` → `STRING`
- `BOOLEAN` → `BOOL`
- `DATE` → `DATE`
- `TIMESTAMP_NTZ`, `TIMESTAMP_LTZ`, `TIMESTAMP_TZ` → `TIMESTAMP`
- `VARIANT`, `OBJECT`, `ARRAY` → `RECORD` (JSON string in query results)
- `BINARY`, `VARBINARY` → `BYTES`
Schema Metadata
The provider uses INFORMATION_SCHEMA for table listing and column metadata.
Row counts come from INFORMATION_SCHEMA.TABLES.ROW_COUNT — no full-table scans needed.
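For example, the row-count lookup reduces to a metadata query (reusing `conn` from the key pair sketch above; `PUBLIC` is the default schema):

```python
# Row counts come from metadata, so no table data is scanned.
rows = conn.cursor().execute(
    """
    SELECT TABLE_NAME, ROW_COUNT
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'PUBLIC' AND TABLE_TYPE = 'BASE TABLE'
    """
).fetchall()
for table_name, row_count in rows:
    print(table_name, row_count)
```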
Cost
Snowflake bills compute per second, at a credits-per-hour rate determined by warehouse size. There is no dry-run API, so the cost estimation feature is not available for Snowflake.
Cross-Cloud Authentication
DecisionBox supports accessing warehouses from a different cloud:
| Scenario | Auth Method | How |
|---|---|---|
| BigQuery from AWS/Azure | Service Account Key | Paste SA key JSON or WIF credential config |
| BigQuery via WIF (keyless) | Service Account Key | Paste WIF credential config from gcloud iam workload-identity-pools create-cred-config |
| Redshift from GCP/Azure | Access Keys | Paste ACCESS_KEY_ID:SECRET_ACCESS_KEY |
| Redshift cross-account | Assume Role | Enter Role ARN + External ID |
| Snowflake from any cloud | Password or Key Pair | Paste password or PEM private key |
| Any from local dev | ADC / IAM Role | Configure cloud CLI (gcloud auth, aws configure) |
The key concept: warehouse credentials are stored encrypted via the secret provider. When creating a project, select the appropriate auth method and enter credentials inline. The agent reads credentials from the secret provider before initializing the warehouse provider.
Next Steps
- Configuration Reference — All environment variables
- Configuring Secrets — Secret provider setup
- Adding Warehouse Providers — Support a new warehouse