Data Pipeline

Understanding how data flows through GoPie from upload to query-ready state

The data pipeline in GoPie transforms raw data files into query-ready datasets through a series of automated processing steps. This document explains how data flows through the system and the optimizations applied at each stage.

Pipeline Overview

At a high level, every upload moves through the same stages: file reception and validation, schema inference, data processing, storage (Parquet in object storage plus metadata in PostgreSQL), indexing, and finally query serving. The sections below walk through each stage.

Upload Stage

File Reception

When you upload a file to GoPie:

  1. Multipart Upload: Large files are chunked for reliable transfer
  2. Format Detection: Automatic detection of CSV, Excel, Parquet, JSON
  3. Encoding Detection: UTF-8, UTF-16, and other encodings handled
  4. Compression Support: Automatic decompression of .gz, .zip files
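
A minimal sketch of the format and compression detection steps, based on magic bytes with a fallback to the file extension. The helper names and the mapping are illustrative, not GoPie's internal API.

  import gzip
  import zipfile
  from pathlib import Path

  # Magic bytes for the container formats handled above.
  MAGIC = {b"PK\x03\x04": "zip", b"\x1f\x8b": "gzip", b"PAR1": "parquet"}

  def detect_format(path: str) -> str:
      # Prefer magic bytes; fall back to the file extension.
      with open(path, "rb") as f:
          head = f.read(4)
      for magic, fmt in MAGIC.items():
          if head.startswith(magic):
              return fmt
      return Path(path).suffix.lstrip(".").lower() or "unknown"

  def open_decompressed(path: str):
      # Transparently decompress .gz and .zip before format-specific parsing.
      fmt = detect_format(path)
      if fmt == "gzip":
          return gzip.open(path, "rb")
      if fmt == "zip":
          archive = zipfile.ZipFile(path)
          return archive.open(archive.namelist()[0])
      return open(path, "rb")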

Initial Validation

  • File size limits checking
  • Format verification
  • Malware scanning
  • Basic structural validation

Schema Inference

Automatic Type Detection

GoPie analyzes your data to determine:

  • Column Types: Integer, float, string, date, boolean
  • Null Patterns: Nullable columns identified
  • Date Formats: Various date/time formats recognized
  • Nested Structures: JSON and complex types supported
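
The idea behind the detector, reduced to a sketch: inspect a column's values and pick the narrowest type that fits. The null tokens, accepted boolean spellings, and ordering of checks here are simplified assumptions.

  from datetime import datetime

  NULLS = {"", "null", "NULL", "NA", "N/A"}
  BOOLS = {"true", "false", "yes", "no", "0", "1"}

  def infer_type(values: list[str]) -> str:
      sample = [v.strip() for v in values if v.strip() not in NULLS]
      if not sample:
          return "string"                       # all nulls: keep as string
      if all(v.lower() in BOOLS for v in sample):
          return "boolean"
      try:
          floats = [float(v) for v in sample]
          return "integer" if all(f.is_integer() for f in floats) else "float"
      except ValueError:
          pass
      try:
          for v in sample:
              datetime.fromisoformat(v)
          return "datetime"
      except ValueError:
          return "string"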

Smart Sampling

For large files:

  • Samples first 10,000 rows for quick inference
  • Validates against full dataset
  • Adjusts types based on edge cases
  • Maintains performance without sacrificing accuracy
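
How the sample-then-validate approach can be expressed with DuckDB (which GoPie uses for large files): infer types from the first 10,000 rows, then re-check a suspect column against the full file. The file and column names are hypothetical.

  import duckdb

  con = duckdb.connect()

  # Quick pass: infer column types from a 10,000-row sample.
  inferred = con.sql(
      "DESCRIBE SELECT * FROM read_csv('orders.csv', sample_size=10000)"
  ).fetchall()

  # Validation pass: count values in the full file that would not survive
  # the inferred integer type, and widen the type if any exist.
  failures = con.sql(
      "SELECT count(*) FROM read_csv('orders.csv', all_varchar=true) "
      "WHERE quantity IS NOT NULL AND TRY_CAST(quantity AS INTEGER) IS NULL"
  ).fetchone()[0]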

Schema Refinement

Original Detection:
  order_date: string
  amount: string
  quantity: string

After Refinement:
  order_date: datetime
  amount: decimal(10,2)
  quantity: integer
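
In DuckDB terms, the refinement above amounts to casting the string columns to their detected types; TRY_CAST keeps rows with unparseable values from aborting the load. The file and table names are illustrative.

  import duckdb

  con = duckdb.connect()
  con.sql("""
      CREATE TABLE orders_typed AS
      SELECT
          TRY_CAST(order_date AS TIMESTAMP)      AS order_date,
          TRY_CAST(amount     AS DECIMAL(10, 2)) AS amount,
          TRY_CAST(quantity   AS INTEGER)        AS quantity
      FROM read_csv('orders.csv', all_varchar=true)
  """)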

Data Processing

Transformation Pipeline

1. Data Cleaning

  • Whitespace Trimming: Remove leading/trailing spaces
  • Null Standardization: Convert various null representations
  • Encoding Fixes: Handle special characters correctly
  • Deduplication: Optional duplicate row removal
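
A condensed sketch of these cleaning steps using pandas; the set of null tokens is an assumption, and deduplication stays opt-in as described above.

  import pandas as pd

  NULL_TOKENS = {"", "NULL", "null", "N/A", "NA", "-"}

  def clean(df: pd.DataFrame, deduplicate: bool = False) -> pd.DataFrame:
      for col in df.select_dtypes(include="object").columns:
          df[col] = df[col].str.strip()                         # trim whitespace
          df[col] = df[col].where(~df[col].isin(NULL_TOKENS))   # standardize nulls to NaN
      if deduplicate:
          df = df.drop_duplicates()                             # optional dedup
      return df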

2. Type Conversion

  • String to numeric conversion
  • Date parsing with multiple format support
  • Boolean standardization (true/false, 1/0, yes/no)
  • JSON string parsing for nested data
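
Boolean standardization and multi-format date parsing, sketched as plain helpers. The accepted spellings and date formats are examples, and unrecognized values become None rather than guesses.

  from datetime import datetime

  BOOL_MAP = {"true": True, "1": True, "yes": True,
              "false": False, "0": False, "no": False}

  def to_bool(value: str):
      # True/False for recognized spellings, otherwise None.
      return BOOL_MAP.get(value.strip().lower())

  DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y", "%m/%d/%Y")

  def parse_date(value: str):
      # Try each known format in turn; None if nothing matches.
      for fmt in DATE_FORMATS:
          try:
              return datetime.strptime(value.strip(), fmt)
          except ValueError:
              continue
      return None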

3. Validation

  • Constraint checking
  • Referential integrity (if defined)
  • Business rule validation
  • Data quality scoring
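
One simple way to express a data quality score, here the share of checked cells that are present and pass their column's rule. The rules and weighting are placeholders, not GoPie's scoring model.

  def quality_score(rows: list[dict], rules: dict) -> float:
      # rules maps a column name to a predicate the value must satisfy.
      checked = passed = 0
      for row in rows:
          for col, rule in rules.items():
              checked += 1
              value = row.get(col)
              if value is not None and rule(value):
                  passed += 1
      return passed / checked if checked else 1.0

  score = quality_score(
      [{"amount": 10.5, "quantity": 2}, {"amount": None, "quantity": -1}],
      {"amount": lambda v: v >= 0, "quantity": lambda v: v > 0},
  )   # 2 of 4 checks pass -> 0.5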

Performance Optimization

Parallel Processing

  • Multi-threaded file reading
  • Concurrent type conversion
  • Batch processing for large files
  • Stream processing to reduce memory usage

Memory Management

Small Files (<100MB): In-memory processing
Medium Files (100MB-1GB): Chunked processing
Large Files (>1GB): Stream processing with DuckDB
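
A sketch of routing by file size, with the medium and large tiers folded into a single streaming path through DuckDB so memory stays bounded. The thresholds mirror the tiers above; the handlers are simplified stand-ins.

  import os
  import duckdb
  import pandas as pd

  MB = 1024 * 1024

  def convert_to_parquet(csv_path: str, parquet_path: str) -> None:
      size = os.path.getsize(csv_path)
      if size < 100 * MB:
          # Small file: load fully in memory, write once.
          pd.read_csv(csv_path).to_parquet(parquet_path)
      else:
          # Medium/large file: let DuckDB stream the conversion so memory
          # use stays bounded regardless of file size.
          duckdb.sql(
              f"COPY (SELECT * FROM read_csv('{csv_path}')) "
              f"TO '{parquet_path}' (FORMAT PARQUET)"
          )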

Storage Architecture

Dual Storage System

PostgreSQL (Metadata)

Stores:

  • Dataset metadata
  • Schema definitions
  • User permissions
  • Processing history

Object Storage (Data)

  • MinIO/S3 for raw files
  • Parquet format for processed data
  • Compression for space efficiency
  • Versioning support
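
What a dual write might look like from Python: the processed Parquet file goes to MinIO/S3 via boto3, and only metadata is recorded in PostgreSQL. The endpoint, bucket, table, and column names are placeholders, not GoPie's actual schema.

  import boto3
  import psycopg2

  # Processed data goes to object storage as Parquet.
  s3 = boto3.client("s3", endpoint_url="http://minio:9000")
  s3.upload_file("orders.parquet", "datasets", "org-1/orders/v1.parquet")

  # Only metadata (name, location, stats) goes to PostgreSQL.
  pg = psycopg2.connect("dbname=gopie user=gopie")
  with pg, pg.cursor() as cur:
      cur.execute(
          "INSERT INTO dataset_metadata (name, storage_path, row_count) "
          "VALUES (%s, %s, %s)",
          ("orders", "s3://datasets/org-1/orders/v1.parquet", 12543),
      )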

Storage Optimization

  • Columnar Storage: Parquet format for analytics
  • Partitioning: Time-based and categorical partitioning
  • Compression: Snappy/Zstd compression
  • Statistics: Min/max values, null counts stored
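
The listed optimizations expressed as a single DuckDB COPY: columnar Parquet output, time-based partitioning, and Zstd compression (Parquet files also carry per-column min/max and null-count statistics). Table and column names are illustrative.

  import duckdb

  con = duckdb.connect()
  con.sql("CREATE TABLE orders AS "
          "SELECT TIMESTAMP '2024-05-01 10:00' AS order_date, 19.99 AS amount")
  con.sql("""
      COPY (
          SELECT *, year(order_date) AS year, month(order_date) AS month
          FROM orders
      )
      TO 'warehouse/orders'
      (FORMAT PARQUET, PARTITION_BY (year, month), COMPRESSION ZSTD)
  """)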

Vector Indexing

Schema information is vectorized for semantic search:

  1. Column names and descriptions embedded
  2. Sample data analyzed for context
  3. Stored in Qdrant vector database
  4. Enables natural language dataset discovery
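
A sketch of this flow with the Qdrant client running in local mode; embed() is a stand-in for whatever embedding model is configured, and the collection and payload names are hypothetical.

  from qdrant_client import QdrantClient
  from qdrant_client.models import Distance, PointStruct, VectorParams

  def embed(text: str) -> list[float]:
      # Placeholder: a real pipeline calls an embedding model; this just
      # builds a toy fixed-size vector.
      return [float((hash(text) >> (8 * i)) & 0xFF) for i in range(8)]

  client = QdrantClient(":memory:")
  client.create_collection(
      collection_name="dataset_schemas",
      vectors_config=VectorParams(size=8, distance=Distance.COSINE),
  )
  client.upsert(
      collection_name="dataset_schemas",
      points=[PointStruct(
          id=1,
          vector=embed("orders.order_date: timestamp when the order was placed"),
          payload={"dataset": "orders", "column": "order_date"},
      )],
  )
  # Natural-language discovery: search by the embedded question.
  hits = client.search(
      collection_name="dataset_schemas",
      query_vector=embed("when did customers place orders"),
      limit=3,
  )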

Traditional Indexing

  • B-tree indexes on frequently queried columns
  • Full-text search on text columns
  • Bitmap indexes for low-cardinality columns
  • Statistics for query optimization

Query Optimization

Query Planning

When a query is executed:

  1. Cost Estimation: Analyze query complexity
  2. Engine Selection: Choose PostgreSQL or DuckDB
  3. Pushdown Optimization: Filter early in pipeline
  4. Caching: Reuse recent results

Execution Strategies

Simple Filters: PostgreSQL
Aggregations: DuckDB
Joins: Depends on data size
Complex Analytics: Always DuckDB
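
A deliberately naive version of this routing: classify the query by keywords and table size, then pick an engine. The real planner relies on cost estimation rather than string matching, and the row threshold below is an arbitrary example.

  AGGREGATE_HINTS = ("GROUP BY", "SUM(", "AVG(", "COUNT(", "OVER (", "WINDOW")

  def choose_engine(sql: str, table_rows: int) -> str:
      q = sql.upper()
      if any(hint in q for hint in AGGREGATE_HINTS):
          return "duckdb"                               # aggregations / analytics
      if "JOIN" in q:
          return "duckdb" if table_rows > 1_000_000 else "postgresql"
      return "postgresql"                               # simple filters

  choose_engine("SELECT sum(amount) FROM orders GROUP BY region", 5_000_000)  # 'duckdb'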

Data Freshness

Update Mechanisms

Full Refresh

  • Complete dataset replacement
  • Maintains version history
  • Atomic operation
  • Zero downtime

Incremental Updates

  • Append new records
  • Update changed records
  • Delete removed records
  • Efficient for large datasets
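
In SQL terms, an incremental update can be an upsert from a staging table plus a cleanup of rows that disappeared upstream. The sketch below assumes a unique key on order_id; all table names are illustrative.

  import psycopg2

  pg = psycopg2.connect("dbname=gopie user=gopie")
  with pg, pg.cursor() as cur:
      # Append new records and update changed ones in one statement.
      cur.execute("""
          INSERT INTO orders (order_id, amount, updated_at)
          SELECT order_id, amount, updated_at FROM orders_staging
          ON CONFLICT (order_id) DO UPDATE
          SET amount = EXCLUDED.amount, updated_at = EXCLUDED.updated_at
      """)
      # Delete records that no longer exist in the source.
      cur.execute("""
          DELETE FROM orders
          WHERE order_id NOT IN (SELECT order_id FROM orders_staging)
      """)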

Change Detection

  • Timestamp tracking
  • Checksum comparison
  • Row version management
  • Audit trail maintenance
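
Checksum comparison reduced to its core: hash a canonical form of each row and compare it against the stored hash to decide whether the row changed. The row contents are made-up examples.

  import hashlib
  import json

  def row_checksum(row: dict) -> str:
      canonical = json.dumps(row, sort_keys=True, default=str)
      return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

  stored = {"A-1001": row_checksum({"order_id": "A-1001", "amount": 19.99})}
  incoming = {"order_id": "A-1001", "amount": 24.99}
  changed = row_checksum(incoming) != stored["A-1001"]   # True: the amount changed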

Error Handling

Graceful Degradation

  • Partial file processing
  • Error row isolation
  • Warning accumulation
  • Success with caveats

Recovery Mechanisms

  • Automatic retry for transient failures
  • Checkpoint resumption
  • Transaction rollback
  • Manual intervention options
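
A minimal retry-with-backoff wrapper for transient failures; which exceptions count as transient, and the backoff schedule, are assumptions.

  import time

  def with_retry(operation, attempts: int = 3, base_delay: float = 1.0):
      for attempt in range(1, attempts + 1):
          try:
              return operation()
          except (ConnectionError, TimeoutError):          # transient errors only
              if attempt == attempts:
                  raise                                    # escalate after the last attempt
              time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...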

Monitoring and Observability

Pipeline Metrics

  • Processing time per stage
  • Data volume metrics
  • Error rates and types
  • Resource utilization
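
One way to expose stage-level metrics with prometheus_client; the metric names and labels are illustrative, not GoPie's actual metric set.

  from prometheus_client import Counter, Histogram

  ROWS_PROCESSED = Counter(
      "pipeline_rows_processed_total", "Rows processed per stage", ["stage"])
  STAGE_DURATION = Histogram(
      "pipeline_stage_duration_seconds", "Time spent per stage", ["stage"])

  with STAGE_DURATION.labels(stage="type_conversion").time():
      rows_converted = 10_000                              # placeholder for real work
      ROWS_PROCESSED.labels(stage="type_conversion").inc(rows_converted)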

Health Checks

  • Queue depth monitoring
  • Processing lag alerts
  • Storage capacity warnings
  • Performance degradation detection

Best Practices

For Optimal Performance

  1. Use Parquet: For large analytical datasets
  2. Consistent Schemas: Avoid schema changes mid-file
  3. Reasonable File Sizes: 100MB-1GB optimal
  4. Clean Data: Pre-clean when possible

For Data Quality

  1. Include Headers: Always include column headers
  2. Use Standard Formats: ISO dates, consistent decimals
  3. Document Schemas: Add column descriptions
  4. Test Samples: Validate with small samples first

Advanced Features

Custom Processors

  • User-defined transformation functions
  • External API enrichment
  • Machine learning inference
  • Custom validation rules
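
A sketch of how a user-defined transformation could be registered and applied per row; the decorator, registry, and the mask_email example are hypothetical, not a GoPie API.

  from typing import Callable, Dict

  PROCESSORS: Dict[str, Callable[[dict], dict]] = {}

  def processor(name: str):
      # Register a row-level transformation under a name.
      def register(fn: Callable[[dict], dict]):
          PROCESSORS[name] = fn
          return fn
      return register

  @processor("mask_email")
  def mask_email(row: dict) -> dict:
      # Example rule: keep only the first character of the local part.
      if "@" in str(row.get("email", "")):
          local, domain = row["email"].split("@", 1)
          row["email"] = local[:1] + "***@" + domain
      return row

  rows = [{"email": "jane@example.com"}]
  masked = [PROCESSORS["mask_email"](r) for r in rows]   # j***@example.com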

Pipeline Templates

  • Reusable processing configurations
  • Industry-specific templates
  • Organization standards
  • Shareable workflows

Next Steps