Data Pipeline
Understanding how data flows through GoPie from upload to query-ready state
The data pipeline in GoPie transforms raw data files into query-ready datasets through a series of automated processing steps. This document explains how data flows through the system and the optimizations applied at each stage.
Pipeline Overview
Upload Stage
File Reception
When you upload a file to GoPie:
- Multipart Upload: Large files are chunked for reliable transfer
- Format Detection: Automatic detection of CSV, Excel, Parquet, JSON
- Encoding Detection: UTF-8, UTF-16, and other encodings handled
- Compression Support: Automatic decompression of .gz, .zip files
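As a rough illustration (not GoPie's actual implementation), format and compression detection can combine magic-byte signatures with file extensions; the `detect_format` and `decompress_if_needed` helpers and the signature table below are assumptions for the sketch:

```python
import gzip
import zipfile
from pathlib import Path

# Magic-byte signatures for formats that carry one; CSV/JSON fall back to the extension.
MAGIC_BYTES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"PAR1": "parquet",
}

def detect_format(path: str) -> str:
    """Check magic bytes first, then fall back to the file extension."""
    with open(path, "rb") as f:
        head = f.read(4)
    for signature, fmt in MAGIC_BYTES.items():
        if head.startswith(signature):
            return fmt
    return Path(path).suffix.lstrip(".").lower() or "unknown"

def decompress_if_needed(path: str, workdir: str) -> str:
    """Transparently unpack .gz/.zip uploads before further processing."""
    fmt = detect_format(path)
    if fmt == "gzip":
        target = Path(workdir) / Path(path).stem   # data.csv.gz -> data.csv
        target.write_bytes(gzip.open(path, "rb").read())
        return str(target)
    if fmt == "zip":
        with zipfile.ZipFile(path) as archive:
            member = archive.namelist()[0]         # assume one data file per archive
            return archive.extract(member, workdir)
    return path
```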
Initial Validation
- File size limit checks
- Format verification
- Malware scanning
- Basic structural validation
Schema Inference
Automatic Type Detection
GoPie analyzes your data to determine:
- Column Types: Integer, float, string, date, boolean
- Null Patterns: Nullable columns identified
- Date Formats: Various date/time formats recognized
- Nested Structures: JSON and complex types supported
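As an illustration of this kind of inference, DuckDB (which GoPie uses for processing) can sniff column names and types directly from a file; the `orders.csv` file name is hypothetical and this is a sketch, not the pipeline's internal code:

```python
import duckdb

# read_csv_auto infers column names and types from the file contents;
# DESCRIBE exposes the inferred schema without materializing the data.
con = duckdb.connect()
schema = con.sql("DESCRIBE SELECT * FROM read_csv_auto('orders.csv')").fetchall()

for column_name, column_type, *_ in schema:
    print(column_name, column_type)   # e.g. order_date TIMESTAMP, amount DOUBLE
```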
Smart Sampling
For large files:
- Samples first 10,000 rows for quick inference
- Validates against full dataset
- Adjusts types based on edge cases
- Maintains performance without sacrificing accuracy
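A sketch of the sample-then-validate idea using DuckDB's `sample_size` option; the file name and the 10,000-row figure simply mirror the description above rather than a guaranteed setting:

```python
import duckdb

con = duckdb.connect()

# Quick pass: infer types from a bounded sample of rows.
quick = con.sql(
    "DESCRIBE SELECT * FROM read_csv_auto('orders.csv', sample_size=10000)"
).fetchall()

# Full pass: sample_size=-1 scans the whole file, catching edge cases
# (e.g. a late row that forces an integer column to become a string).
full = con.sql(
    "DESCRIBE SELECT * FROM read_csv_auto('orders.csv', sample_size=-1)"
).fetchall()

# Keep the stricter, full-scan type wherever the two passes disagree.
quick_types = {name: col_type for name, col_type, *_ in quick}
adjusted = {name: col_type for name, col_type, *_ in full
            if col_type != quick_types.get(name)}
```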
Schema Refinement
Original Detection:
- order_date: string
- amount: string
- quantity: string
After Refinement:
- order_date: datetime
- amount: decimal(10,2)
- quantity: integer
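A hedged sketch of what such a refinement step can look like in SQL, run through DuckDB's Python API; the column names follow the example above, and `TRY_CAST` turns values that do not fit the target type into NULL instead of failing the load:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE raw AS SELECT * FROM read_csv_auto('orders.csv', all_varchar=true)")

# Cast string columns to the refined types; TRY_CAST yields NULL for
# values that cannot be parsed instead of aborting the whole load.
con.sql("""
    CREATE TABLE orders AS
    SELECT
        TRY_CAST(order_date AS TIMESTAMP)      AS order_date,
        TRY_CAST(amount     AS DECIMAL(10, 2)) AS amount,
        TRY_CAST(quantity   AS INTEGER)        AS quantity
    FROM raw
""")
```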
Data Processing
Transformation Pipeline
1. Data Cleaning
- Whitespace Trimming: Remove leading/trailing spaces
- Null Standardization: Convert various null representations
- Encoding Fixes: Handle special characters correctly
- Deduplication: Optional duplicate row removal
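A minimal sketch of these cleaning rules as plain Python over row dictionaries; the set of null tokens is an illustrative assumption, not GoPie's exact list:

```python
# Common textual stand-ins for missing values; the exact list is illustrative.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "-"}

def clean_value(value):
    """Trim whitespace and map null-like tokens to None."""
    if value is None:
        return None
    value = str(value).strip()
    return None if value.lower() in NULL_TOKENS else value

def clean_rows(rows, deduplicate=False):
    """Apply per-cell cleaning; optionally drop exact duplicate rows."""
    cleaned = [{k: clean_value(v) for k, v in row.items()} for row in rows]
    if not deduplicate:
        return cleaned
    seen, unique = set(), []
    for row in cleaned:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```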
2. Type Conversion
- String to numeric conversion
- Date parsing with multiple format support
- Boolean standardization (true/false, 1/0, yes/no)
- JSON string parsing for nested data
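For illustration, boolean and date standardization might look like the following; the accepted token sets and date formats are assumptions, not an exhaustive list of what GoPie accepts:

```python
from datetime import datetime

TRUE_TOKENS = {"true", "1", "yes", "y", "t"}
FALSE_TOKENS = {"false", "0", "no", "n", "f"}
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%dT%H:%M:%S"]

def to_bool(value: str):
    """Map common boolean spellings onto True/False; None if unrecognized."""
    token = value.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None

def to_datetime(value: str):
    """Try a list of known formats in order; None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None
```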
3. Validation
- Constraint checking
- Referential integrity (if defined)
- Business rule validation
- Data quality scoring
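One way a simple quality score could be derived from such checks is sketched below; the `rules` structure and the equal weighting of cells are illustrative assumptions:

```python
def quality_score(rows, rules):
    """
    rules maps a column name to a predicate that returns True for a valid value.
    The score is the fraction of checked cells that are non-null and pass their rule.
    Values are assumed to be type-converted already (see the previous step).
    """
    checked = passed = 0
    for row in rows:
        for column, predicate in rules.items():
            checked += 1
            value = row.get(column)
            if value is not None and predicate(value):
                passed += 1
    return passed / checked if checked else 1.0

# Example: amounts must be non-negative, quantities strictly positive.
rules = {
    "amount": lambda v: v >= 0,
    "quantity": lambda v: v > 0,
}
score = quality_score([{"amount": 19.99, "quantity": 2}, {"amount": None, "quantity": 0}], rules)
```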
Performance Optimization
Parallel Processing
- Multi-threaded file reading
- Concurrent type conversion
- Batch processing for large files
- Stream processing to reduce memory usage
Memory Management
- Small Files (<100MB): In-memory processing
- Medium Files (100MB-1GB): Chunked processing
- Large Files (>1GB): Stream processing with DuckDB
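A sketch of choosing a strategy from file size; the thresholds mirror the tiers above, while the strategy names are illustrative:

```python
import os

MB = 1024 * 1024

def choose_strategy(path: str) -> str:
    """Pick a processing mode based on file size, mirroring the tiers above."""
    size = os.path.getsize(path)
    if size < 100 * MB:
        return "in_memory"      # load the whole file at once
    if size < 1024 * MB:
        return "chunked"        # process fixed-size batches
    return "streaming"          # stream through DuckDB without materializing
```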
Storage Architecture
Dual Storage System
PostgreSQL (Metadata)
Stores:
- Dataset metadata
- Schema definitions
- User permissions
- Processing history
Object Storage (Data)
- MinIO/S3 for raw files
- Parquet format for processed data
- Compression for space efficiency
- Versioning support
Storage Optimization
- Columnar Storage: Parquet format for analytics
- Partitioning: Time-based and categorical partitioning
- Compression: Snappy/Zstd compression
- Statistics: Min/max values, null counts stored
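For illustration, writing partitioned, Zstd-compressed Parquet with DuckDB's `COPY` statement looks roughly like this; the table, source file, and partition column are hypothetical:

```python
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE orders AS SELECT * FROM read_parquet('orders_clean.parquet')")

# Write a Parquet file tree, Hive-partitioned by order month and
# compressed with Zstandard for space efficiency.
con.sql("""
    COPY (SELECT *, strftime(order_date, '%Y-%m') AS order_month FROM orders)
    TO 'warehouse/orders'
    (FORMAT PARQUET, PARTITION_BY (order_month), COMPRESSION ZSTD)
""")
```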
Indexing and Search
Vector Indexing
Schema information is vectorized for semantic search:
- Column names and descriptions embedded
- Sample data analyzed for context
- Stored in Qdrant vector database
- Enables natural language dataset discovery
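A hedged sketch of embedding schema text into Qdrant with the `qdrant-client` library; the `embed()` stand-in, the collection name, and the 384-dimension vector size are assumptions in place of whatever embedding model GoPie actually uses:

```python
import hashlib
from qdrant_client import QdrantClient, models

def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model; returns a deterministic 384-dim vector."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest] * 12          # 32 bytes * 12 = 384 dimensions

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="dataset_schemas",             # hypothetical collection name
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

# One point per column: the vector captures name + description,
# the payload keeps raw metadata for display after a semantic match.
columns = [
    {"dataset": "orders", "column": "order_date", "description": "Date the order was placed"},
    {"dataset": "orders", "column": "amount", "description": "Order total in USD"},
]
client.upsert(
    collection_name="dataset_schemas",
    points=[
        models.PointStruct(id=i, vector=embed(f"{c['column']}: {c['description']}"), payload=c)
        for i, c in enumerate(columns)
    ],
)
```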
Traditional Indexing
- B-tree indexes on frequently queried columns
- Full-text search on text columns
- Bitmap indexes for low-cardinality columns
- Statistics for query optimization
Query Optimization
Query Planning
When a query is executed:
- Cost Estimation: Analyze query complexity
- Engine Selection: Choose PostgreSQL or DuckDB
- Pushdown Optimization: Filter early in pipeline
- Caching: Reuse recent results
Execution Strategies
- Simple Filters: PostgreSQL
- Aggregations: DuckDB
- Joins: Depends on data size
- Complex Analytics: Always DuckDB
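The routing above can be expressed as a simple rule set; this sketch illustrates the idea rather than the engine's actual cost model, and the 1,000,000-row join threshold is an assumption:

```python
def choose_engine(query: dict) -> str:
    """
    query is a simplified plan summary, e.g.
    {"has_aggregation": True, "has_join": False, "estimated_rows": 2_000_000}.
    """
    if query.get("is_complex_analytics"):
        return "duckdb"                      # window functions, pivots, etc.
    if query.get("has_aggregation"):
        return "duckdb"                      # columnar engine wins on large scans
    if query.get("has_join"):
        # Small joins stay on PostgreSQL; large ones move to DuckDB.
        return "postgresql" if query.get("estimated_rows", 0) < 1_000_000 else "duckdb"
    return "postgresql"                      # simple filters on indexed data
```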
Data Freshness
Update Mechanisms
Full Refresh
- Complete dataset replacement
- Maintains version history
- Atomic operation
- Zero downtime
Incremental Updates
- Append new records
- Update changed records
- Delete removed records
- Efficient for large datasets
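A sketch of an incremental update expressed in SQL via DuckDB; the `orders` table (assumed to already exist), the `order_id` key, and the staging files are hypothetical:

```python
import duckdb

con = duckdb.connect("pipeline.db")

# Stage the incoming batch, then merge it into the target table.
con.sql("CREATE OR REPLACE TABLE staging AS SELECT * FROM read_parquet('batch.parquet')")

# Upsert: rows with a matching key are replaced, new keys are appended.
con.sql("DELETE FROM orders WHERE order_id IN (SELECT order_id FROM staging)")
con.sql("INSERT INTO orders SELECT * FROM staging")

# Deletes arrive as a separate list of removed keys (illustrative convention).
con.sql("DELETE FROM orders WHERE order_id IN (SELECT order_id FROM read_csv_auto('deleted_ids.csv'))")
```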
Change Detection
- Timestamp tracking
- Checksum comparison
- Row version management
- Audit trail maintenance
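A minimal sketch of checksum-based change detection; hashing the whole file is the simplest variant, and per-row hashing follows the same idea:

```python
import hashlib

def file_checksum(path: str) -> str:
    """Stream the file through SHA-256 so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

def has_changed(path: str, previous_checksum: str | None) -> bool:
    """Compare against the checksum stored with the previous version."""
    return previous_checksum is None or file_checksum(path) != previous_checksum
```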
Error Handling
Graceful Degradation
- Partial file processing
- Error row isolation
- Warning accumulation
- Success with caveats
Recovery Mechanisms
- Automatic retry for transient failures
- Checkpoint resumption
- Transaction rollback
- Manual intervention options
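A sketch of retrying a pipeline step with exponential backoff on transient failures; the attempt count, delays, and exception types are illustrative defaults:

```python
import time

def with_retries(operation, max_attempts=3, base_delay=1.0,
                 retryable=(TimeoutError, ConnectionError)):
    """Retry a pipeline step with exponential backoff on transient errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise                      # give up; surface for manual intervention
            time.sleep(base_delay * 2 ** (attempt - 1))
```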
Monitoring and Observability
Pipeline Metrics
- Processing time per stage
- Data volume metrics
- Error rates and types
- Resource utilization
Health Checks
- Queue depth monitoring
- Processing lag alerts
- Storage capacity warnings
- Performance degradation detection
Best Practices
For Optimal Performance
- Use Parquet: For large analytical datasets
- Consistent Schemas: Avoid schema changes mid-file
- Reasonable File Sizes: 100MB-1GB optimal
- Clean Data: Pre-clean when possible
For Data Quality
- Include Headers: Always include column headers
- Use Standard Formats: ISO dates, consistent decimals
- Document Schemas: Add column descriptions
- Test Samples: Validate with small samples first
Advanced Features
Custom Processors
- User-defined transformation functions
- External API enrichment
- Machine learning inference
- Custom validation rules
Pipeline Templates
- Reusable processing configurations
- Industry-specific templates
- Organization standards
- Shareable workflows
Next Steps
- Learn about AI Agents that query your processed data
- Understand Database Architecture for storage details
- Explore Vector Search for semantic capabilities