External Data Sources

Connect GoPie to databases, cloud storage, and external APIs

While uploading files is great for static data, connecting to external data sources enables real-time analysis and automatic updates. GoPie supports various external data sources including databases, cloud storage, and APIs.

Supported Data Sources

Cloud Storage

Google Drive Integration

Connect to spreadsheets and CSV files in Google Drive:

Authorize Access

Click "Connect Google Drive" and authorize GoPie to access your files.

Select Files

Browse your Drive and select files to sync. You can choose:

  • Individual files
  • Entire folders
  • Shared drives

Configure Sync

Set up automatic synchronization:

  • Manual: Update on demand
  • Hourly: Check for changes every hour
  • Daily: Sync once per day
  • Real-time: Update when file changes (Pro plan)

Google Sheets are converted to CSV format during import. Formulas are evaluated and only values are imported.

Microsoft OneDrive Integration

Access Excel and CSV files from OneDrive:

Connect Account

Authenticate with your Microsoft account (personal or business).

Browse Files

Navigate through your OneDrive folders and SharePoint sites.

Select Data

Choose files or folders to monitor for changes.

Features:

  • Support for OneDrive Personal and Business
  • SharePoint document libraries
  • Automatic Excel to table conversion
  • Preserves data types from Excel

S3 and Compatible Object Storage

Connect to AWS S3, MinIO, or S3-compatible storage:

# Connection parameters
Endpoint: https://s3.amazonaws.com
Access Key: YOUR_ACCESS_KEY
Secret Key: YOUR_SECRET_KEY
Bucket: my-data-bucket
Path: /data/analytics/

Supported formats in S3:

  • CSV, TSV files
  • Parquet files
  • JSON/JSONL files
  • Compressed files (.gz, .zip)

Use IAM roles for secure access without embedding long-lived credentials. GoPie can assume a role that has the appropriate permissions.
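
For example, a minimal read-only IAM policy for the bucket above might look like the following (the bucket name is illustrative):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-data-bucket",
        "arn:aws:s3:::my-data-bucket/*"
      ]
    }
  ]
}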

Databases

PostgreSQL Connection

-- Connection string format
postgresql://username:password@host:port/database

-- With SSL
postgresql://username:password@host:port/database?sslmode=require

Features:

  • Full schema introspection
  • Support for views and materialized views
  • JSON/JSONB column handling
  • Array type support
  • Custom type mapping

Add Connection

Navigate to Data Sources → Add Database → PostgreSQL

Enter Credentials

Provide connection details:

  • Host and port
  • Database name
  • Username and password
  • SSL settings

Select Tables

Choose which tables/views to import or sync

Configure Sync

Set up refresh schedule and incremental options

MySQL/MariaDB Connection

-- Connection string format
mysql://username:password@host:port/database

-- With parameters
mysql://username:password@host:port/database?charset=utf8mb4

Supported versions:

  • MySQL 5.7+
  • MariaDB 10.2+
  • Amazon Aurora MySQL
  • PlanetScale

For cloud databases, make sure the database accepts connections from GoPie: allowlist GoPie's static IP addresses in your database firewall.

Microsoft SQL Server

-- Connection string format
sqlserver://username:password@host:port?database=dbname

-- With Windows Authentication
sqlserver://host?database=dbname&trustedConnection=true

Features:

  • Support for schemas
  • Temporal tables
  • Computed columns
  • Azure SQL Database compatible

Google BigQuery

Connect to BigQuery datasets:

Service Account

Create a service account with BigQuery Data Viewer permissions
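
If you use the gcloud CLI, the account, role binding, and key can be created in three commands. The project and account names below are illustrative:

# Create the service account
gcloud iam service-accounts create gopie-reader --display-name="GoPie reader"

# Grant read access to BigQuery data
gcloud projects add-iam-policy-binding your-project \
  --member="serviceAccount:gopie-reader@your-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

# Generate a JSON key to upload in the next step
gcloud iam service-accounts keys create gopie-key.json \
  --iam-account="gopie-reader@your-project.iam.gserviceaccount.com"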

Upload Credentials

Upload the JSON key file or paste credentials

Select Dataset

Choose project and dataset to connect

{
  "type": "service_account",
  "project_id": "your-project",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...",
  "client_email": "[email protected]"
}

APIs and Web Data

REST API Integration

Connect to any REST API that returns JSON data:

# API Configuration
Base URL: https://api.example.com/v1
Authentication: Bearer token
Headers:
  - Authorization: Bearer YOUR_TOKEN
  - Accept: application/json

Endpoints:
  - /users
  - /orders
  - /products

GoPie automatically paginates through API results and handles rate limiting.
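
The exact behavior depends on the API, but a pagination setup typically captures the paging scheme, page size, and backoff policy. The field names below are illustrative, not GoPie's actual configuration keys:

# Pagination settings (illustrative — field names are hypothetical)
Pagination Style: page   # page, offset, or cursor
Page Parameter: page
Page Size: 100
Rate Limit Handling: retry with backoff on HTTP 429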

URL-Based Import

Import data directly from public URLs:

  • CSV/TSV files
  • JSON APIs
  • RSS/Atom feeds
  • Public Google Sheets
  • GitHub raw files

# Example URLs
https://example.com/data/sales.csv
https://api.github.com/repos/owner/repo/issues
https://sheets.googleapis.com/v4/spreadsheets/SHEET_ID/values/Sheet1

Connection Security

Encryption

All connections use encryption:

  • In Transit: TLS 1.2+ for all connections
  • At Rest: AES-256 encryption for stored credentials
  • Key Management: Rotating encryption keys

Access Control

  • IP Whitelisting: Restrict access to known IPs
  • VPN/Tunnel: Connect through secure tunnels
  • Read-Only Access: Recommended for data sources
  • Audit Logging: Track all access and queries

Security Best Practices

  1. Use Read-Only Credentials

    -- Create a read-only user in PostgreSQL
    CREATE USER gopie_reader WITH PASSWORD 'secure_password';
    GRANT CONNECT ON DATABASE mydb TO gopie_reader;
    GRANT USAGE ON SCHEMA public TO gopie_reader;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO gopie_reader;
    -- Include tables created after this grant
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO gopie_reader;
  2. Limit Access Scope

    • Grant access only to needed tables
    • Use views to restrict row-level access
    • Apply column-level permissions where possible
  3. Rotate Credentials

    • Change passwords regularly
    • Use expiring tokens when available
    • Monitor access logs
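
For example, in PostgreSQL a password can be rotated and given an expiry in a single statement (the date is illustrative):

-- Rotate the password and have it expire automatically
ALTER USER gopie_reader WITH PASSWORD 'new_secure_password' VALID UNTIL '2025-07-01';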

Sync Configuration

Sync Modes

Mode               Description                      Use Case
Full Refresh       Replace entire dataset           Small datasets, complete accuracy needed
Incremental        Add new/modified records only    Large datasets with timestamps
Append Only        Add new records, never update    Event logs, audit trails
CDC (Enterprise)   Real-time change capture         Mission-critical data

Incremental Sync Setup

For incremental syncs, specify:

  1. Cursor Column: Timestamp or auto-incrementing ID
  2. Update Strategy: How to handle modified records
  3. Delete Detection: Track removed records (optional)

# Example: Configure incremental sync
Cursor Column: updated_at
Initial Value: 2024-01-01
Update Strategy: Merge by primary key (id)
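
Conceptually, each incremental run issues a query like the one below against the source, then merges the results by primary key. The orders table is illustrative:

-- Fetch only rows changed since the last successful sync
SELECT *
FROM orders
WHERE updated_at > '2024-01-01 00:00:00'  -- last cursor value
ORDER BY updated_at;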

Scheduling Options

  • Manual: Trigger sync on demand
  • Scheduled: Fixed intervals (5 min to monthly)
  • Event-Based: Webhook triggers (Enterprise)
  • Real-Time: Continuous sync (selected sources)

Managing Connected Sources

Monitoring

Track your data source health:

  • Last Sync: When data was last updated
  • Records Synced: Number of rows processed
  • Sync Duration: Time taken to complete
  • Error Count: Failed sync attempts
  • Next Sync: Scheduled update time

Troubleshooting

Common issues and solutions:

Connection Timeout

# Increase timeout settings
Connection Timeout: 30s → 60s
Query Timeout: 120s → 300s

Authentication Failed

  • Verify credentials are correct
  • Check IP whitelist settings
  • Ensure user has required permissions
  • Test connection from GoPie's IPs
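
If permissions are the likely culprit, you can often verify them directly on the source. In PostgreSQL, for example (user and table names are illustrative):

-- Returns true if the sync user can read the table
SELECT has_table_privilege('gopie_reader', 'public.orders', 'SELECT');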

Schema Changes

When source schema changes, you may need to:

  1. Refresh schema metadata
  2. Update column mappings
  3. Adjust sync configuration

Performance Issues

  • Add indexes on cursor columns (see the example after this list)
  • Limit sync to specific tables/columns
  • Use incremental sync for large tables
  • Schedule syncs during off-peak hours
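
An index on the cursor column keeps incremental syncs fast, since every run filters on it. A minimal PostgreSQL example, reusing the illustrative orders table from above:

-- Index the cursor column used by incremental syncs
CREATE INDEX idx_orders_updated_at ON orders (updated_at);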

Advanced Features

Data Transformations

Apply transformations during sync:

  • Column Mapping: Rename or reorder columns
  • Type Casting: Convert data types
  • Filtering: Exclude rows based on conditions
  • Aggregations: Pre-aggregate data

-- Example transformation
SELECT 
  id,
  UPPER(name) as name,
  CAST(price AS DECIMAL(10,2)) as price,
  created_at::date as order_date
FROM orders
WHERE status = 'completed'

Multiple Environments

Connect to different environments:

  • Development: Test data and schemas
  • Staging: Pre-production validation
  • Production: Live data access

Webhooks and Triggers

Configure webhooks for:

  • Sync completion notifications
  • Error alerts
  • Data quality checks
  • Downstream processing

{
  "event": "sync.completed",
  "dataset": "sales_data",
  "records": 15234,
  "duration": "2m 34s",
  "status": "success"
}

Best Practices

  1. Start Small

    • Test with a subset of data first
    • Verify sync behavior before full import
    • Monitor initial syncs closely
  2. Document Connections

    • Note source system details
    • Document any transformations
    • Track sync schedules and dependencies
  3. Plan for Growth

    • Design for increasing data volumes
    • Consider sync frequency vs. freshness needs
    • Monitor resource usage
  4. Maintain Quality

    • Set up data validation rules
    • Monitor for schema drift
    • Alert on anomalies

What's Next?