User Guide/Data Management

Dataset Configuration

Customize schemas, manage metadata, and optimize your datasets

After uploading or connecting data, GoPie provides powerful tools to configure and optimize your datasets. This includes schema editing, metadata management, relationships, and performance tuning.

Schema Management

Viewing Schema Information

Access dataset schema from the dataset page:

-- Schema includes:
- Column names and data types
- Nullable constraints
- Primary/foreign keys
- Indexes
- Row count and size

GoPie automatically infers schema during import, but you can customize it to better match your needs.

Editing Column Properties

Data Type Changes

Modify column data types when needed:

Using the Schema Editor

  1. Navigate to your dataset
  2. Click "Schema" tab
  3. Click the edit icon next to a column
  4. Select new data type
  5. Preview impact and confirm

Available type conversions:

  • String → Number (if valid)
  • Number → String (always works)
  • String → Date (with format)
  • Date → String (with format)
  • Boolean → Number (0/1)

Using SQL Commands

-- Change column type
ALTER TABLE sales_data 
ALTER COLUMN price TYPE DECIMAL(10,2);

-- Cast with conversion
ALTER TABLE users
ALTER COLUMN age TYPE INTEGER 
USING age::integer;

-- Change with default for nulls
ALTER TABLE orders
ALTER COLUMN status TYPE VARCHAR(50)
USING COALESCE(status, 'pending');

Type changes may fail if data cannot be converted. Always preview changes first.

Column Descriptions

Add descriptions to help users understand your data:

// Via API
await gopie.datasets.updateColumn({
  dataset: 'sales_data',
  column: 'arr',
  description: 'Annual Recurring Revenue in USD',
  display_name: 'Annual Revenue',
  format: 'currency'
});

Column description best practices:

  • Include units of measurement
  • Explain abbreviations
  • Note data sources
  • Document calculations
  • Specify time zones for dates

Dataset Aliases

Create user-friendly names:

# Original table name
database_table: "ord_txn_2024_v2_final"

# User-friendly alias
alias: "Orders 2024"

# Column aliases
column_aliases:
  cust_id: "Customer ID"
  txn_amt: "Transaction Amount"
  ord_dt: "Order Date"

Column Metadata

Semantic Types

Assign semantic meaning to columns:

Semantic TypeDescriptionBenefits
EmailEmail addressesValidation, privacy masking
PhonePhone numbersFormatting, validation
CurrencyMonetary valuesFormatting, conversion
PercentagePercentage values0-100 validation, formatting
URLWeb addressesValidation, link rendering
Latitude/LongitudeGeographic coordinatesMap visualization

Format Settings

Configure display formatting:

{
  "columns": {
    "price": {
      "format": "currency",
      "currency": "USD",
      "decimal_places": 2
    },
    "date": {
      "format": "date",
      "date_format": "YYYY-MM-DD"
    },
    "percentage": {
      "format": "percentage",
      "decimal_places": 1,
      "multiply_by_100": false
    }
  }
}

Validation Rules

Set up data quality rules:

validations:
  email:
    - type: "format"
      pattern: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    
  age:
    - type: "range"
      min: 0
      max: 150
    
  status:
    - type: "enum"
      values: ["active", "pending", "cancelled"]
    
  amount:
    - type: "not_null"
    - type: "positive"

Relationships

Defining Foreign Keys

Link tables together for better querying:

Identify Relationship

Find columns that reference other tables

-- orders.customer_id → customers.id
-- products.category_id → categories.id

Create Foreign Key

Define the relationship in GoPie

ALTER TABLE orders
ADD CONSTRAINT fk_customer
FOREIGN KEY (customer_id)
REFERENCES customers(id);

Enable Smart Joins

GoPie now suggests joins automatically

Relationship Types

Configure different relationship patterns:

// One-to-Many
{
  "from": "customers.id",
  "to": "orders.customer_id",
  "type": "one_to_many",
  "name": "customer_orders"
}

// Many-to-Many
{
  "from": "products.id",
  "through": "product_categories",
  "to": "categories.id",
  "type": "many_to_many",
  "name": "product_categorization"
}

// One-to-One
{
  "from": "users.id",
  "to": "profiles.user_id",
  "type": "one_to_one",
  "name": "user_profile"
}

Relationship Benefits

  • Auto-complete in natural language queries
  • Join suggestions in SQL editor
  • Referential integrity validation
  • Cascade options for updates/deletes
  • Visual relationship diagrams

Performance Optimization

Creating Indexes

Speed up queries with strategic indexes:

-- Single column index
CREATE INDEX idx_orders_date ON orders(order_date);

-- Composite index
CREATE INDEX idx_orders_customer_date 
ON orders(customer_id, order_date);

-- Partial index
CREATE INDEX idx_active_users 
ON users(email) 
WHERE status = 'active';

-- Expression index
CREATE INDEX idx_orders_month 
ON orders(DATE_TRUNC('month', order_date));

Index Recommendations

GoPie analyzes query patterns and suggests indexes:

Check the "Performance" tab for index recommendations based on your query history.

Partitioning

For large datasets, implement partitioning:

-- Range partitioning by date
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- List partitioning by region
CREATE TABLE orders_us PARTITION OF orders
FOR VALUES IN ('US', 'CA', 'MX');

-- Hash partitioning for distribution
CREATE TABLE orders_p0 PARTITION OF orders
FOR VALUES WITH (modulus 4, remainder 0);

Advanced Configuration

Computed Columns

Create calculated fields:

-- Add computed column
ALTER TABLE orders ADD COLUMN 
total_with_tax DECIMAL(10,2) 
GENERATED ALWAYS AS (subtotal * 1.08) STORED;

-- Virtual column (calculated on read)
CREATE VIEW orders_enriched AS
SELECT *,
  EXTRACT(YEAR FROM order_date) as order_year,
  CASE 
    WHEN total > 1000 THEN 'high'
    WHEN total > 100 THEN 'medium'
    ELSE 'low'
  END as order_tier
FROM orders;

Default Values

Set intelligent defaults:

-- Static defaults
ALTER TABLE users 
ALTER COLUMN created_at 
SET DEFAULT CURRENT_TIMESTAMP;

-- Dynamic defaults
ALTER TABLE orders
ALTER COLUMN order_number
SET DEFAULT 'ORD-' || LPAD(NEXTVAL('order_seq')::text, 6, '0');

-- Conditional defaults
ALTER TABLE products
ALTER COLUMN status
SET DEFAULT CASE 
  WHEN discontinued = true THEN 'inactive'
  ELSE 'active'
END;

Constraints

Ensure data quality with constraints:

-- Unique constraints
ALTER TABLE users 
ADD CONSTRAINT unique_email UNIQUE (email);

-- Check constraints
ALTER TABLE products
ADD CONSTRAINT positive_price 
CHECK (price > 0);

-- Complex constraints
ALTER TABLE orders
ADD CONSTRAINT valid_dates
CHECK (ship_date >= order_date);

Dataset Settings

Access Control

Configure who can access your dataset:

access:
  visibility: "organization" # or "public", "private"
  
  permissions:
    viewers:
      - "team:analytics"
      - "user:[email protected]"
    
    editors:
      - "team:data-engineering"
    
    restricted_columns:
      - column: "ssn"
        access: ["role:hr"]
      - column: "salary"
        access: ["role:management"]

Refresh Settings

For connected data sources:

{
  "refresh": {
    "schedule": "0 2 * * *", // 2 AM daily
    "timezone": "America/New_York",
    "incremental": {
      "enabled": true,
      "cursor_column": "updated_at",
      "lookback_window": "1 hour"
    },
    "notifications": {
      "on_success": ["[email protected]"],
      "on_failure": ["[email protected]"]
    }
  }
}

Retention Policies

Manage data lifecycle:

retention:
  # Keep detailed data for 90 days
  detailed_retention: "90 days"
  
  # Aggregate older data
  aggregation:
    after: "90 days"
    group_by: ["date", "category"]
    metrics: ["sum(amount)", "count(*)"]
  
  # Archive after 1 year
  archive:
    after: "365 days"
    destination: "s3://backups/gopie/archives/"
  
  # Delete after 2 years
  deletion:
    after: "730 days"
    require_confirmation: true

Schema Evolution

Handling Schema Changes

When source schemas change:

Detection

GoPie detects schema differences during sync

Review Changes

  • New columns to add
  • Removed columns to handle
  • Type changes to resolve

Apply Updates

Choose how to handle each change:

  • Auto-add new columns
  • Keep removed columns with nulls
  • Map changed types

Version Control

Track schema versions:

{
  "schema_version": "2.1.0",
  "changes": [
    {
      "version": "2.1.0",
      "date": "2024-01-15",
      "changes": [
        "Added 'category' column",
        "Renamed 'prod_id' to 'product_id'",
        "Changed 'price' type to DECIMAL(10,2)"
      ]
    }
  ],
  "compatibility": "backward"
}

Best Practices

Naming Conventions

  1. Use descriptive names

    • t1, col1, data
    • customers, order_date, total_revenue
  2. Be consistent

    • Choose customer_id or customerId (not both)
    • Use same date format throughout
  3. Avoid reserved words

    • Don't use SQL keywords as column names
    • Add prefixes if needed: user_order not order

Documentation

Document everything for team success:

  1. Dataset description - What it contains and why
  2. Column descriptions - What each field means
  3. Business rules - How data is calculated
  4. Update frequency - When data refreshes
  5. Contact info - Who maintains the dataset

Performance Tips

  1. Index frequently filtered columns
  2. Partition large tables by date
  3. Denormalize for read performance
  4. Archive old data regularly
  5. Monitor query patterns

What's Next?