Core Concepts
Vector Search
How GoPie uses vector embeddings and semantic search to understand your data
GoPie leverages vector embeddings and the Qdrant vector database to provide intelligent schema discovery and semantic search. This lets natural language queries find relevant tables and columns even when the exact names don't match.
Overview
What is Vector Search?
Vector search converts text (table names, column names, descriptions) into high-dimensional numerical vectors that capture semantic meaning. Similar concepts end up close together in vector space, enabling:
- Finding "revenue" when users ask for "sales"
- Matching "customer" to "client" or "user" tables
- Understanding domain-specific terminology
- Cross-language search capabilities
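The intuition behind "similar concepts end up close together" is cosine similarity between embedding vectors. As a toy illustration, with hand-made 3-dimensional vectors standing in for real embeddings (which typically have hundreds or thousands of dimensions):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-d vectors standing in for real embeddings (e.g. 1536-d)
revenue = [0.9, 0.8, 0.1]
sales = [0.85, 0.75, 0.15]  # semantically close to "revenue"
patient = [0.1, 0.2, 0.9]   # unrelated concept

close = cosine_similarity(revenue, sales)    # close to 1.0
far = cosine_similarity(revenue, patient)    # much lower
```

A keyword search would score "revenue" vs "sales" as zero overlap; in vector space they are near neighbors.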
Why Vector Search Matters
Traditional keyword search fails when:
- Users don't know exact table/column names
- Different naming conventions exist
- Business terms differ from technical names
- Multiple similar concepts exist
Architecture
Component Overview
Qdrant Integration
Vector Database Features
- High Performance: Searches millions of vectors with millisecond latency
- Filtering: Combine vector search with metadata filters
- Scalability: Horizontal scaling for large deployments
- Persistence: Durable storage with WAL
Collections Structure
{
  "collection": "dataset_schemas",
  "vectors": {
    "size": 1536,
    "distance": "Cosine"
  },
  "payload": {
    "dataset_id": "uuid",
    "table_name": "string",
    "column_name": "string",
    "data_type": "string",
    "description": "string"
  }
}
Embedding Process
What Gets Embedded
1. Table Information
table_text = f"""
Table: {table_name}
Description: {table_description}
Columns: {', '.join(column_names)}
Sample queries: {common_queries}
"""
2. Column Information
column_text = f"""
Column: {column_name}
Table: {table_name}
Type: {data_type}
Description: {description}
Sample values: {sample_values}
Statistics: {min}, {max}, {avg}, {distinct_count}
"""
3. Relationships
# relationship_type is one of: foreign_key, one_to_many, many_to_many
relationship_text = f"""
{table1}.{column1} relates to {table2}.{column2}
Type: {relationship_type}
Description: {relationship_description}
"""
Embedding Models
Current Model
- Model: OpenAI text-embedding-3-small
- Dimensions: 1536
- Context Window: 8191 tokens
- Advantages: High quality, fast, cost-effective
Alternative Models
Production:
- OpenAI text-embedding-3-large
- Cohere embed-english-v3.0
Open Source:
- sentence-transformers/all-MiniLM-L6-v2
- BAAI/bge-large-en-v1.5
Multilingual:
- OpenAI text-embedding-3-small
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Search Algorithm
Query Processing
1. Query Expansion
def expand_query(user_query):
    # Add synonyms
    expanded = add_synonyms(user_query)
    # Add related terms
    expanded += get_related_terms(user_query)
    # Add common variations
    expanded += generate_variations(user_query)
    return expanded
2. Multi-Vector Search
Search across different embedding types:
- Exact table/column names
- Descriptions and documentation
- Historical queries
- User-provided aliases
Ranking Algorithm
Similarity Scoring
from math import log

def calculate_score(query, query_vector, doc_vector, metadata):
    # Cosine similarity between query and document embeddings
    vector_score = cosine_similarity(query_vector, doc_vector)
    # Boost for exact matches on table/column names
    exact_match_boost = 2.0 if is_exact_match(query, metadata) else 1.0
    # Boost recently accessed schemas
    recency_boost = calculate_recency_boost(metadata.last_accessed)
    # Boost frequently queried schemas (log-scaled)
    usage_boost = log(metadata.query_count + 1)
    return vector_score * exact_match_boost * recency_boost * usage_boost
Re-ranking Strategy
- Initial vector search (top 50)
- Apply metadata filters
- Re-rank with additional signals
- Return top 10 results
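The four re-ranking steps can be sketched as a two-stage pipeline. This is a toy in-memory version; the index, filter, and scoring function are stand-ins for whatever a real deployment uses:

```python
def rerank_search(query_vector, index, metadata_filter, rerank_score,
                  top_k=10, candidates=50):
    """Two-stage retrieval: broad vector search, then filter and re-rank."""
    hits = index.search(query_vector, limit=candidates)           # initial top 50
    hits = [(doc, s) for doc, s in hits if metadata_filter(doc)]  # metadata filters
    hits.sort(key=lambda pair: rerank_score(*pair), reverse=True) # richer signals
    return [doc for doc, _ in hits[:top_k]]                       # final top 10

# --- toy demonstration with a fake in-memory index ---
class ToyIndex:
    def __init__(self, scored_docs):
        self.scored_docs = scored_docs  # list of (doc, vector_score)
    def search(self, query_vector, limit):
        return sorted(self.scored_docs, key=lambda p: p[1], reverse=True)[:limit]

index = ToyIndex([
    ({"table": "sales", "query_count": 500}, 0.90),
    ({"table": "sales_archive", "query_count": 2}, 0.92),
    ({"table": "patients", "query_count": 100}, 0.30),
])

results = rerank_search(
    query_vector=None,  # unused by the toy index
    index=index,
    metadata_filter=lambda doc: doc["table"] != "patients",
    rerank_score=lambda doc, vec_score: vec_score * (1 + doc["query_count"] / 1000),
)
```

Note how re-ranking promotes the heavily used `sales` table above `sales_archive` even though the archive scored marginally higher on pure vector similarity.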
Semantic Understanding
Domain Adaptation
Industry Vocabularies
Finance:
  revenue: [income, sales, turnover, receipts]
  customer: [client, account, counterparty]
  transaction: [trade, deal, order, payment]

Healthcare:
  patient: [member, beneficiary, client]
  provider: [doctor, physician, practitioner]
  claim: [bill, invoice, submission]
Custom Dictionaries
Organizations can define:
- Business-specific terms
- Acronym expansions
- Department terminology
- Legacy system mappings
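A custom dictionary might be applied at query time roughly like this (the dictionary contents and function name are illustrative):

```python
# Organization-defined vocabulary: acronyms, synonyms, legacy names
CUSTOM_TERMS = {
    "arr": ["annual recurring revenue"],         # acronym expansion
    "revenue": ["income", "sales", "turnover"],  # business synonyms
    "legacy_cust": ["customers"],                # legacy system mapping
}

def expand_with_dictionary(query, dictionary=CUSTOM_TERMS):
    """Append dictionary expansions for any term that appears in the query."""
    terms = [query]
    for word in query.lower().split():
        terms.extend(dictionary.get(word, []))
    return terms

expanded = expand_with_dictionary("arr by region")
```

Each expanded term can then be embedded and searched alongside the original query.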
Contextual Search
Session Context
# Previous queries influence the current search
context_embedding = combine_embeddings([
    current_query_embedding,
    weighted_previous_queries,
    active_dataset_context,
])
Temporal Context
- Recent queries weighted higher
- Seasonal term adjustments
- Time-based relevance
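One plausible way to weight recent activity is an exponential decay with a configurable half-life (the 30-day default here is an arbitrary illustration, not a GoPie setting):

```python
from math import exp, log

def recency_boost(days_since_last_access, half_life_days=30):
    """Exponential decay: 1.0 for just-accessed items, 0.5 after one half-life."""
    decay_rate = log(2) / half_life_days
    return exp(-decay_rate * days_since_last_access)

today = recency_boost(0)    # 1.0
month = recency_boost(30)   # 0.5
```

This slots directly into the ranking formula shown earlier as `recency_boost`.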
Performance Optimization
Indexing Strategies
HNSW Index
- Hierarchical Navigable Small World
- Fast approximate search
- Tunable accuracy/speed tradeoff
- Memory efficient
Index Parameters
hnsw_config:
  m: 16              # Number of connections
  ef_construct: 100  # Construction accuracy
  ef: 50             # Search accuracy
  max_m: 16
  seed: 42
Caching Layer
Query Cache
from functools import lru_cache

@lru_cache(maxsize=10000)
def get_cached_embedding(text):
    return embedding_model.encode(text)
Result Cache
- Cache frequent searches
- Invalidate on schema changes
- TTL-based expiration
- User-specific caching
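A minimal sketch of a result cache combining TTL expiry, schema-change invalidation, and user-scoped keys (an in-process dict for illustration; a production deployment would more likely use Redis or similar):

```python
import time

class ResultCache:
    """Search-result cache with TTL expiry and explicit invalidation."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, results)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, results = entry
        if time.monotonic() > expires_at:  # TTL-based expiration
            del self._store[key]
            return None
        return results

    def put(self, key, results):
        self._store[key] = (time.monotonic() + self.ttl, results)

    def invalidate(self):
        """Call on schema changes so stale results are never served."""
        self._store.clear()

# Keying on (user, query) gives user-specific caching
cache = ResultCache(ttl_seconds=60)
cache.put(("user-42", "monthly revenue"), ["sales.total_amount"])
```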
Batch Processing
Bulk Embeddings
# Process in batches for efficiency
# Process in batches for efficiency
def embed_batch(texts, batch_size=100):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch)
        embeddings.extend(batch_embeddings)
    return embeddings
Quality Improvements
Feedback Loop
Implicit Feedback
- Click-through rates
- Query refinements
- Time spent on results
- Successful query completion
Explicit Feedback
- User corrections
- Result ratings
- Alternative suggestions
- Training examples
Continuous Learning
Model Fine-tuning
# Collect training pairs
# Collect training pairs
training_data = [
    ("show revenue", "sales_transactions.total_amount"),
    ("customer list", "customers.customer_name"),
    ("monthly sales", "orders.order_date, orders.amount"),
]

# Fine-tune embeddings
fine_tuned_model = train_model(base_model, training_data)
A/B Testing
- Test different models
- Compare ranking algorithms
- Measure user satisfaction
- Gradual rollout
Advanced Features
Hybrid Search
Vector + Keyword
def hybrid_search(query, alpha=0.7):
    vector_results = vector_search(query)
    keyword_results = keyword_search(query)
    # Weighted combination of per-document scores
    combined = alpha * vector_results + (1 - alpha) * keyword_results
    return deduplicate_and_sort(combined)
Faceted Search
- Filter by data type
- Restrict to specific datasets
- Department-level access
- Time-based filtering
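Facets can be expressed as metadata predicates evaluated alongside the vector score. A sketch against payloads shaped like the collection structure shown earlier (the predicate function itself is illustrative):

```python
def facet_filter(hit, dataset_ids=None, data_types=None):
    """Return True if a search hit passes all requested facets."""
    payload = hit["payload"]
    if dataset_ids is not None and payload["dataset_id"] not in dataset_ids:
        return False
    if data_types is not None and payload["data_type"] not in data_types:
        return False
    return True

hits = [
    {"payload": {"dataset_id": "ds-1", "data_type": "numeric"}, "score": 0.9},
    {"payload": {"dataset_id": "ds-2", "data_type": "text"}, "score": 0.8},
]
filtered = [h for h in hits if facet_filter(h, dataset_ids={"ds-1"})]
```

In practice Qdrant evaluates such payload filters inside the index rather than post-filtering in application code.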
Fuzzy Matching
- Handle typos
- Phonetic matching
- Edit distance
- N-gram similarity
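Two of the techniques listed, edit distance and n-gram similarity, in minimal pure-Python form (production systems usually rely on specialized libraries):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ngram_similarity(a, b, n=2):
    """Jaccard overlap of character n-grams, from 0.0 (disjoint) to 1.0."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0
```

A typo like "custmer" sits one edit away from "customer", so thresholding on edit distance catches it even when embeddings of the misspelling are noisy.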
Monitoring and Debugging
Search Analytics
Key Metrics
- Query latency (p50, p90, p99)
- Result relevance scores
- Cache hit rates
- Empty result rates
Quality Metrics
# Mean Reciprocal Rank
def calculate_mrr(queries, correct_results):
    reciprocal_ranks = []
    for query, correct in zip(queries, correct_results):
        results = search(query)
        rank = get_rank(correct, results)
        reciprocal_ranks.append(1 / rank if rank else 0)
    return mean(reciprocal_ranks)
Debug Tools
Embedding Visualization
- t-SNE/UMAP projections
- Cluster analysis
- Similarity matrices
- Interactive exploration
Query Analysis
{
  "query": "customer revenue last month",
  "tokens": ["customer", "revenue", "last", "month"],
  "expanded": ["client", "income", "previous", "30 days"],
  "embedding": [0.123, -0.456, ...],
  "similar_queries": ["sales by customer", "monthly income"]
}
Best Practices
Schema Design
- Rich Descriptions: Add meaningful column descriptions
- Business Terms: Include common business names
- Examples: Provide sample queries
- Relationships: Document foreign keys
Query Optimization
- Be Specific: Include context when possible
- Use Business Terms: Natural language works best
- Iterative Search: Refine based on results
- Feedback: Correct misunderstandings
Future Enhancements
Planned Features
- Multi-modal embeddings (text + data samples)
- Cross-lingual search
- Query suggestion engine
- Automated schema documentation
Research Areas
- Few-shot learning for new domains
- Federated vector search
- Privacy-preserving embeddings
- Real-time index updates
Next Steps
- Understand Multi-tenancy for team isolation
- Explore Database Architecture
- Learn about MCP Servers for AI integration