Core Concepts

Vector Search

How GoPie uses vector embeddings and semantic search to understand your data

GoPie uses vector embeddings and the Qdrant vector database to provide intelligent schema discovery and semantic search. This enables natural language queries to find relevant tables and columns even when the exact names don't match.

Overview

Vector search converts text (table names, column names, descriptions) into high-dimensional numerical vectors that capture semantic meaning. Similar concepts end up close together in vector space, enabling:

  • Finding "revenue" when users ask for "sales"
  • Matching "customer" to "client" or "user" tables
  • Understanding domain-specific terminology
  • Cross-language search capabilities
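
To make "close together in vector space" concrete, here is a minimal sketch using cosine similarity; the three-dimensional vectors are toy placeholders (real embeddings have hundreds or thousands of dimensions), chosen only to illustrate the idea.

import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (closely related), near 0.0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings
revenue = np.array([0.9, 0.1, 0.2])
sales   = np.array([0.8, 0.2, 0.3])
patient = np.array([0.1, 0.9, 0.4])

print(cosine_similarity(revenue, sales))    # high: related business concepts
print(cosine_similarity(revenue, patient))  # lower: unrelated concepts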

Why Vector Search Matters

Traditional keyword search fails when:

  • Users don't know exact table/column names
  • Different naming conventions exist
  • Business terms differ from technical names
  • Multiple similar concepts exist

Architecture

Qdrant Integration

Vector Database Features

  • High Performance: Searches millions of vectors with millisecond latency
  • Filtering: Combine vector search with metadata filters
  • Scalability: Horizontal scaling for large deployments
  • Persistence: Durable storage backed by a write-ahead log (WAL)

Collections Structure

{
  "collection": "dataset_schemas",
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "payload": {
    "dataset_id": "uuid",
    "table_name": "string",
    "column_name": "string",
    "data_type": "string",
    "description": "string"
  }
}
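
A minimal sketch of creating a matching collection with the qdrant-client library; the URL is a placeholder for your Qdrant deployment.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

client.create_collection(
    collection_name="dataset_schemas",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)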

Embedding Process

What Gets Embedded

1. Table Information

table_text = f"""
Table: {table_name}
Description: {table_description}
Columns: {', '.join(column_names)}
Sample queries: {common_queries}
"""

2. Column Information

column_text = f"""
Column: {column_name}
Table: {table_name}
Type: {data_type}
Description: {description}
Sample values: {sample_values}
Statistics: {min_value}, {max_value}, {avg_value}, {distinct_count}
"""

3. Relationships

relationship_text = f"""
{table1}.{column1} relates to {table2}.{column2}
Type: {relationship_type}
Description: {relationship_description}
"""
# relationship_type is one of: foreign_key, one_to_many, many_to_many

Embedding Models

Current Model

  • Model: OpenAI text-embedding-3-small
  • Dimensions: 1536
  • Context Window: 8191 tokens
  • Advantages: High quality, fast, cost-effective
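
A minimal sketch of requesting an embedding from this model with the OpenAI Python SDK (assumes OPENAI_API_KEY is set in the environment; the input text is illustrative):

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Table: orders. Columns: order_id, customer_id, total_amount",
)

vector = response.data[0].embedding  # list of 1536 floats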

Alternative Models

Production:
  - OpenAI text-embedding-3-large
  - Cohere embed-english-v3.0
  
Open Source:
  - sentence-transformers/all-MiniLM-L6-v2
  - BAAI/bge-large-en-v1.5
  
Multilingual:
  - OpenAI text-embedding-3-small
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
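
Swapping in one of the open-source alternatives is a small change with the sentence-transformers library; note that all-MiniLM-L6-v2 produces 384-dimensional vectors, so the collection's vector size must match whichever model you choose. A sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vectors = model.encode(["customer revenue last month", "orders.total_amount"])
print(vectors.shape)  # (2, 384)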

Search Algorithm

Query Processing

1. Query Expansion

def expand_query(user_query):
    # Each helper returns a list of extra search terms
    expanded = [user_query]

    # Add synonyms from the domain vocabulary (e.g. "revenue" -> "income", "sales")
    expanded += add_synonyms(user_query)

    # Add related business terms
    expanded += get_related_terms(user_query)

    # Add common variations (plurals, abbreviations, snake_case forms)
    expanded += generate_variations(user_query)

    return expanded

2. Multi-Vector Search

Search across different embedding types:

  • Exact table/column names
  • Descriptions and documentation
  • Historical queries
  • User-provided aliases

Ranking Algorithm

Similarity Scoring

from math import log

def calculate_score(query_text, query_vector, doc_vector, metadata):
    # Cosine similarity between the query and document embeddings
    vector_score = cosine_similarity(query_vector, doc_vector)

    # Boost exact table/column name matches
    exact_match_boost = 2.0 if is_exact_match(query_text, metadata) else 1.0

    # Recently accessed schemas rank slightly higher
    recency_boost = calculate_recency_boost(metadata.last_accessed)

    # Frequently queried schemas rank higher (the 1 + keeps unseen schemas from scoring zero)
    usage_boost = 1 + log(metadata.query_count + 1)

    return vector_score * exact_match_boost * recency_boost * usage_boost
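
One possible shape for the calculate_recency_boost() helper referenced above is an exponential decay, so recently accessed schemas get a mild boost; the decay constant is illustrative and last_accessed is assumed to be a timezone-aware datetime.

from datetime import datetime, timezone
from math import exp

def calculate_recency_boost(last_accessed, decay_days=30.0):
    age_days = (datetime.now(timezone.utc) - last_accessed).days
    # 2.0 for something accessed just now, approaching 1.0 as it ages
    return 1.0 + exp(-age_days / decay_days)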

Re-ranking Strategy

  1. Initial vector search (top 50)
  2. Apply metadata filters
  3. Re-rank with additional signals
  4. Return top 10 results (see the sketch below)
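
A sketch of the full retrieve-then-re-rank flow, assuming the scoring helpers above; vector_search() and the candidate objects (with .vector and .metadata attributes) are placeholders for the actual retrieval layer.

def search_schema(query, query_vector, dataset_id=None):
    # 1. Wide, fast approximate vector search
    candidates = vector_search(query_vector, limit=50)

    # 2. Metadata filters (e.g. restrict results to one dataset)
    if dataset_id is not None:
        candidates = [c for c in candidates if c.metadata.dataset_id == dataset_id]

    # 3. Re-rank with the additional signals from calculate_score()
    ranked = sorted(
        candidates,
        key=lambda c: calculate_score(query, query_vector, c.vector, c.metadata),
        reverse=True,
    )

    # 4. Return only the strongest matches
    return ranked[:10]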

Semantic Understanding

Domain Adaptation

Industry Vocabularies

Finance:
  revenue: [income, sales, turnover, receipts]
  customer: [client, account, counterparty]
  transaction: [trade, deal, order, payment]

Healthcare:
  patient: [member, beneficiary, client]
  provider: [doctor, physician, practitioner]
  claim: [bill, invoice, submission]
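
A sketch of how a vocabulary like the one above could back the add_synonyms() step used during query expansion; the dictionary contents are illustrative.

SYNONYMS = {
    "revenue": ["income", "sales", "turnover", "receipts"],
    "customer": ["client", "account", "counterparty"],
    "transaction": ["trade", "deal", "order", "payment"],
}

def add_synonyms(user_query):
    terms = []
    for word in user_query.lower().split():
        terms.extend(SYNONYMS.get(word, []))
    return terms

print(add_synonyms("show revenue by customer"))
# ['income', 'sales', 'turnover', 'receipts', 'client', 'account', 'counterparty']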

Custom Dictionaries

Organizations can define:

  • Business-specific terms
  • Acronym expansions
  • Department terminology
  • Legacy system mappings

Session Context

# Previous queries influence current search
context_embedding = combine_embeddings([
    current_query_embedding,
    weighted_previous_queries,
    active_dataset_context
])
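
One possible implementation of combine_embeddings() is a normalized weighted average; the weighting scheme here is illustrative.

import numpy as np

def combine_embeddings(embeddings, weights=None):
    vectors = np.asarray(embeddings, dtype=float)
    if weights is None:
        weights = np.ones(len(vectors))

    combined = np.average(vectors, axis=0, weights=weights)
    # Re-normalize so cosine similarities stay comparable
    return combined / np.linalg.norm(combined)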

Temporal Context

  • Recent queries weighted higher
  • Seasonal term adjustments
  • Time-based relevance

Performance Optimization

Indexing Strategies

HNSW Index

  • Hierarchical Navigable Small World
  • Fast approximate search
  • Tunable accuracy/speed tradeoff
  • Memory efficient

Index Parameters

hnsw_config:
  m: 16  # Number of connections
  ef_construct: 100  # Construction accuracy
  ef: 50  # Search accuracy
  max_m: 16
  seed: 42
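
These settings can be applied through qdrant-client as sketched below: m and ef_construct are set on the collection, while the search-time ef corresponds to SearchParams(hnsw_ef=...); the query vector is a placeholder.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="dataset_schemas",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=100),
)

hits = client.search(
    collection_name="dataset_schemas",
    query_vector=[0.0] * 1536,              # placeholder query embedding
    search_params=SearchParams(hnsw_ef=50),
    limit=10,
)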

Caching Layer

Query Cache

from functools import lru_cache

@lru_cache(maxsize=10000)
def get_cached_embedding(text):
    # Identical texts reuse the cached vector instead of calling the model again
    return embedding_model.encode(text)

Result Cache

  • Cache frequent searches
  • Invalidate on schema changes
  • TTL-based expiration
  • User-specific caching

Batch Processing

Bulk Embeddings

# Process in batches for efficiency
def embed_batch(texts, batch_size=100):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch)
        embeddings.extend(batch_embeddings)
    return embeddings

Quality Improvements

Feedback Loop

Implicit Feedback

  • Click-through rates
  • Query refinements
  • Time spent on results
  • Successful query completion

Explicit Feedback

  • User corrections
  • Result ratings
  • Alternative suggestions
  • Training examples

Continuous Learning

Model Fine-tuning

# Collect training pairs
training_data = [
    ("show revenue", "sales_transactions.total_amount"),
    ("customer list", "customers.customer_name"),
    ("monthly sales", "orders.order_date, orders.amount")
]

# Fine-tune embeddings
fine_tuned_model = train_model(base_model, training_data)

A/B Testing

  • Test different models
  • Compare ranking algorithms
  • Measure user satisfaction
  • Gradual rollout

Advanced Features

Vector + Keyword

def hybrid_search(query, alpha=0.7):
    # Each search returns {doc_id: score}
    vector_scores = vector_search(query)
    keyword_scores = keyword_search(query)

    # Weighted combination of the two score sets (deduplicated by doc_id)
    combined = {}
    for doc_id in set(vector_scores) | set(keyword_scores):
        combined[doc_id] = (alpha * vector_scores.get(doc_id, 0.0)
                            + (1 - alpha) * keyword_scores.get(doc_id, 0.0))

    # Highest combined score first
    return sorted(combined, key=combined.get, reverse=True)

Filtered Search

Vector similarity can be combined with metadata constraints (a sketch follows the list):

  • Filter by data type
  • Restrict to specific datasets
  • Department-level access
  • Time-based filtering
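
A sketch of a filtered vector search with qdrant-client; the filter keys match the payload fields shown earlier, and the query vector and filter values are placeholders.

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="dataset_schemas",
    query_vector=[0.0] * 1536,  # placeholder for the embedded user query
    query_filter=Filter(
        must=[
            FieldCondition(key="dataset_id", match=MatchValue(value="placeholder-dataset-uuid")),
            FieldCondition(key="data_type", match=MatchValue(value="numeric")),
        ]
    ),
    limit=10,
)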

Fuzzy Matching

  • Handle typos (see the sketch after this list)
  • Phonetic matching
  • Edit distance
  • N-gram similarity
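
A stdlib-only sketch of typo handling against known schema names using difflib; the candidate names are illustrative.

from difflib import get_close_matches

schema_names = ["customers", "sales_transactions", "order_items", "suppliers"]

# A typo like "custmers" still resolves to the intended table
print(get_close_matches("custmers", schema_names, n=3, cutoff=0.6))
# ['customers']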

Monitoring and Debugging

Search Analytics

Key Metrics

  • Query latency (p50, p90, p99)
  • Result relevance scores
  • Cache hit rates
  • Empty result rates

Quality Metrics

from statistics import mean

# Mean Reciprocal Rank: how high the correct result appears, averaged over queries
def calculate_mrr(queries, correct_results):
    reciprocal_ranks = []
    for query, correct in zip(queries, correct_results):
        results = search(query)
        rank = get_rank(correct, results)  # 1-based rank, or None if not found
        reciprocal_ranks.append(1 / rank if rank else 0)
    return mean(reciprocal_ranks)

Debug Tools

Embedding Visualization

  • t-SNE/UMAP projections
  • Cluster analysis
  • Similarity matrices
  • Interactive exploration

Query Analysis

{
  "query": "customer revenue last month",
  "tokens": ["customer", "revenue", "last", "month"],
  "expanded": ["client", "income", "previous", "30 days"],
  "embedding": [0.123, -0.456, ...],
  "similar_queries": ["sales by customer", "monthly income"]
}

Best Practices

Schema Design

  1. Rich Descriptions: Add meaningful column descriptions
  2. Business Terms: Include common business names
  3. Examples: Provide sample queries
  4. Relationships: Document foreign keys

Query Optimization

  1. Be Specific: Include context when possible
  2. Use Business Terms: Natural language works best
  3. Iterative Search: Refine based on results
  4. Feedback: Correct misunderstandings

Future Enhancements

Planned Features

  • Multi-modal embeddings (text + data samples)
  • Cross-lingual search
  • Query suggestion engine
  • Automated schema documentation

Research Areas

  • Few-shot learning for new domains
  • Federated vector search
  • Privacy-preserving embeddings
  • Real-time index updates

Next Steps