Tutorial: Vector Search with SochDB

🔧 Skill Level: Intermediate
⏱️ Time Required: 20 minutes
📦 Requirements: Python 3.9+, numpy, sentence-transformers

Learn how to build a semantic search system using SochDB's HNSW vector index.


🎯 What You'll Build

A document search system that:

  • ✅ Stores documents with vector embeddings
  • ✅ Performs semantic similarity search
  • ✅ Returns relevant results based on meaning, not keywords

Step 1: Setup

# Create project
mkdir semantic-search && cd semantic-search
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install sochdb numpy sentence-transformers

Note: sentence-transformers provides real embedding models. For production, you might also use OpenAI, Cohere, or other embedding APIs.

Step 2: Understand Vector Search


What are Embeddings?

Embeddings convert text into numerical vectors that capture semantic meaning:

"The cat sat on the mat"  →  [0.12, -0.34, 0.56, ...]  (384 dimensions)
"A feline rested on a rug" → [0.11, -0.32, 0.55, ...] (similar vector!)
"Python programming" → [-0.45, 0.78, -0.23, ...] (different vector)

Similar meanings = similar vectors = small distance
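
To make "small distance" concrete, the sketch below compares the three illustrative vectors above using cosine similarity. It uses only the three components shown, so the numbers are a toy illustration rather than real model output, and `cosine_similarity` is a helper defined here for the example, not part of SochDB.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Only the first three components shown above (toy data, not real embeddings)
cat = np.array([0.12, -0.34, 0.56])
feline = np.array([0.11, -0.32, 0.55])
python = np.array([-0.45, 0.78, -0.23])

sim_close = cosine_similarity(cat, feline)
sim_far = cosine_similarity(cat, python)

# Cosine distance is 1 - similarity: smaller means closer in meaning
print(f"cat vs feline: similarity={sim_close:.3f}, distance={1 - sim_close:.3f}")
print(f"cat vs python: similarity={sim_far:.3f}, distance={1 - sim_far:.3f}")
```

The paraphrased sentence ends up with a distance near zero, while the unrelated sentence lands much farther away.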

How HNSW Works

HNSW (Hierarchical Navigable Small World) is a graph-based index:

Layer 2:  ●───────────────────────●
          │                       │
Layer 1:  ●───●───────●───────────●───────●
          │   │       │           │       │
Layer 0:  ●───●───●───●───●───●───●───●───●───●───●

  • Search: Start at top layer, greedily descend
  • Complexity: O(log N) average
  • Recall: 95%+ at practical settings
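
The "start at the top layer, greedily descend" idea can be sketched with a toy graph that mirrors the diagram above. This is a simplified illustration, not SochDB's implementation: the helpers `chain`, `greedy_walk`, and `layered_search` are hypothetical names, the points are 1-D, and the search tracks a single candidate, whereas a real HNSW search keeps a candidate list whose size is controlled by `ef`.

```python
import numpy as np

def chain(nodes):
    """Link consecutive nodes in both directions (toy layer layout)."""
    neighbors = {n: [] for n in nodes}
    for a, b in zip(nodes, nodes[1:]):
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors

def greedy_walk(vectors, neighbors, start, query):
    """Move to a closer neighbor until no neighbor improves the distance."""
    current = start
    current_dist = np.linalg.norm(vectors[current] - query)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors.get(current, []):
            dist = np.linalg.norm(vectors[candidate] - query)
            if dist < current_dist:
                current, current_dist, improved = candidate, dist, True
    return current

def layered_search(vectors, layers, entry_point, query):
    """Run a greedy walk on each layer, from the sparsest to the densest."""
    current = entry_point
    for neighbors in layers:
        current = greedy_walk(vectors, neighbors, current, query)
    return current

# 11 points on a line, laid out like the diagram
vectors = np.arange(11, dtype=np.float32).reshape(-1, 1)
layers = [
    chain([0, 6]),            # Layer 2: few nodes, long jumps
    chain([0, 1, 3, 6, 8]),   # Layer 1
    chain(list(range(11))),   # Layer 0: every point
]

query = np.array([7.2], dtype=np.float32)
nearest = layered_search(vectors, layers, entry_point=0, query=query)
print(f"Nearest node to {query[0]:.1f}: {nearest}")
```

The upper layers cover most of the distance in a couple of hops, and the bottom layer only refines the final position, which is where the logarithmic average cost comes from.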

Step 3: Create the Embedding Service

Create embeddings.py:

"""Embedding service using sentence-transformers."""

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingService:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """Initialize with a sentence-transformer model.

        Models:
        - all-MiniLM-L6-v2: Fast, good quality, 384 dims
        - all-mpnet-base-v2: Better quality, 768 dims, slower
        - all-distilroberta-v1: Balanced, 768 dims
        """
        print(f"Loading model: {model_name}...")
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        print(f"Model loaded. Embedding dimension: {self.dimension}")

    def embed(self, texts: list[str]) -> np.ndarray:
        """Convert texts to embeddings."""
        embeddings = self.model.encode(
            texts,
            convert_to_numpy=True,
            normalize_embeddings=True  # L2 normalize for cosine similarity
        )
        return embeddings.astype(np.float32)

    def embed_single(self, text: str) -> np.ndarray:
        """Convert a single text to embedding."""
        return self.embed([text])[0]


# Test it
if __name__ == "__main__":
    service = EmbeddingService()

    texts = [
        "The quick brown fox jumps over the lazy dog",
        "A fast auburn canine leaps above a sleepy hound",
        "Python programming is fun",
    ]

    embeddings = service.embed(texts)

    # Check similarity (dot product of normalized vectors = cosine similarity)
    sim_01 = np.dot(embeddings[0], embeddings[1])
    sim_02 = np.dot(embeddings[0], embeddings[2])

    print(f"Similarity (fox/canine): {sim_01:.3f}")  # Should be high (~0.7+)
    print(f"Similarity (fox/python): {sim_02:.3f}")  # Should be low (~0.2)

Run it:

python embeddings.py
# Loading model: all-MiniLM-L6-v2...
# Model loaded. Embedding dimension: 384
# Similarity (fox/canine): 0.763
# Similarity (fox/python): 0.186
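
The test above relies on the fact that the dot product of L2-normalized vectors equals their cosine similarity. The following sketch checks that identity with NumPy alone, using random vectors as stand-ins for real embeddings (an assumption made so the example runs without downloading a model). It also shows the related identity for unit vectors, squared Euclidean distance = 2 - 2 × cosine similarity, which explains why ranking by either metric gives the same order once vectors are normalized.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)  # stand-ins for two 384-dimensional embeddings
b = rng.normal(size=384)

# Cosine similarity computed the long way, on the raw vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, a plain dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = np.dot(a_unit, b_unit)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
squared_l2 = np.sum((a_unit - b_unit) ** 2)

print("dot product matches cosine:", np.isclose(cosine, dot_of_units))
print("squared L2 matches 2 - 2*cos:", np.isclose(squared_l2, 2 - 2 * cosine))
```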

Step 4: Build the Search System

Create search.py:

"""Semantic search system with SochDB."""

from sochdb import Database
from sochdb.bulk import bulk_build_index, bulk_query_index
from embeddings import EmbeddingService
import numpy as np
import json
import os

class SemanticSearch:
    def __init__(self, db_path: str = "./search_db"):
        self.db = Database.open(db_path)
        self.embeddings = EmbeddingService()
        self.index_path = os.path.join(db_path, "vectors.hnsw")
        self._documents = []
        self._load_documents()

    def _load_documents(self):
        """Load existing documents from database."""
        for key, value in self.db.scan(b"docs/"):
            doc = json.loads(value.decode())
            self._documents.append(doc)
        print(f"Loaded {len(self._documents)} existing documents")

    def add_documents(self, documents: list[dict]):
        """Add documents with automatic embedding generation.

        Each document should have at least 'content' field.
        Additional fields (title, metadata) are preserved.
        """
        print(f"Adding {len(documents)} documents...")

        # Extract content for embedding
        contents = [doc["content"] for doc in documents]

        # Generate embeddings
        embeddings = self.embeddings.embed(contents)

        # Store documents
        start_id = len(self._documents)
        with self.db.transaction() as txn:
            for i, doc in enumerate(documents):
                doc_id = start_id + i
                doc["id"] = doc_id
                txn.put(f"docs/{doc_id}".encode(), json.dumps(doc).encode())
                self._documents.append(doc)

        # Rebuild index with all embeddings
        self._rebuild_index()

        print(f"✅ Added {len(documents)} documents (total: {len(self._documents)})")

    def _rebuild_index(self):
        """Rebuild HNSW index from all documents."""
        if not self._documents:
            return

        # Get all embeddings
        contents = [doc["content"] for doc in self._documents]
        embeddings = self.embeddings.embed(contents)

        # Build HNSW index
        stats = bulk_build_index(
            embeddings,
            output=self.index_path,
            m=16,                # Graph connectivity
            ef_construction=200  # Build quality
        )

        print(f"Index built: {stats.get('vectors_indexed', len(embeddings))} vectors")

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Search for documents similar to query.

        Returns documents with similarity scores.
        """
        if not self._documents:
            return []

        # Embed query
        query_embedding = self.embeddings.embed_single(query)

        # Search index
        try:
            results = bulk_query_index(
                index_path=self.index_path,
                query=query_embedding,
                k=k,
                ef=50
            )
        except Exception as e:
            print(f"Index search failed: {e}")
            # Fallback to brute force
            return self._brute_force_search(query_embedding, k)

        # Map results to documents
        search_results = []
        for doc_id, distance in results:
            if doc_id < len(self._documents):
                doc = self._documents[doc_id].copy()
                doc["score"] = 1.0 - distance  # Convert distance to similarity
                doc["distance"] = distance
                search_results.append(doc)

        return search_results

    def _brute_force_search(self, query: np.ndarray, k: int) -> list[dict]:
        """Fallback brute-force search."""
        contents = [doc["content"] for doc in self._documents]
        embeddings = self.embeddings.embed(contents)

        # Compute similarities
        similarities = np.dot(embeddings, query)
        top_k = np.argsort(similarities)[-k:][::-1]

        results = []
        for idx in top_k:
            doc = self._documents[idx].copy()
            doc["score"] = float(similarities[idx])
            results.append(doc)

        return results


def main():
    # Initialize
    search = SemanticSearch()

    # Add sample documents
    documents = [
        {
            "title": "SochDB Overview",
            "content": "SochDB is an LLM-native database designed for AI applications. It provides token-efficient storage and vector search capabilities."
        },
        {
            "title": "Vector Search Basics",
            "content": "Vector search uses embeddings to find semantically similar documents. HNSW is a popular algorithm for approximate nearest neighbor search."
        },
        {
            "title": "Python Development",
            "content": "Python is a versatile programming language popular for data science, web development, and AI applications."
        },
        {
            "title": "Database Transactions",
            "content": "ACID transactions ensure data integrity. SochDB supports MVCC with serializable snapshot isolation for concurrent access."
        },
        {
            "title": "Machine Learning Models",
            "content": "Embedding models convert text to vectors. Popular models include sentence-transformers, OpenAI embeddings, and Cohere embeddings."
        },
    ]

    search.add_documents(documents)

    # Search examples
    print("\n" + "="*60)

    queries = [
        "How does SochDB handle AI workloads?",
        "What is HNSW algorithm?",
        "How to ensure data consistency?",
    ]

    for query in queries:
        print(f"\n🔍 Query: {query}")
        print("-" * 40)

        results = search.search(query, k=3)

        for i, result in enumerate(results, 1):
            print(f"{i}. [{result['score']:.3f}] {result['title']}")
            print(f" {result['content'][:80]}...")


if __name__ == "__main__":
    main()

Run it:

python search.py

Expected output:

Loading model: all-MiniLM-L6-v2...
Model loaded. Embedding dimension: 384
Loaded 0 existing documents
Adding 5 documents...
Index built: 5 vectors
✅ Added 5 documents (total: 5)

============================================================

🔍 Query: How does SochDB handle AI workloads?
----------------------------------------
1. [0.842] SochDB Overview
SochDB is an LLM-native database designed for AI applications...
2. [0.534] Machine Learning Models
Embedding models convert text to vectors...
3. [0.423] Vector Search Basics
Vector search uses embeddings to find semantically similar documents...

🔍 Query: What is HNSW algorithm?
----------------------------------------
1. [0.756] Vector Search Basics
Vector search uses embeddings to find semantically similar documents...
2. [0.412] SochDB Overview
SochDB is an LLM-native database designed for AI applications...
3. [0.389] Machine Learning Models
Embedding models convert text to vectors...

🔍 Query: How to ensure data consistency?
----------------------------------------
1. [0.698] Database Transactions
ACID transactions ensure data integrity...
2. [0.312] SochDB Overview
SochDB is an LLM-native database designed for AI applications...
3. [0.287] Vector Search Basics
Vector search uses embeddings to find semantically similar documents...

Step 5: Tune Performance

Index Parameters

| Parameter | Effect | Trade-off |
| --- | --- | --- |
| m=8 | Faster build, less memory | Lower recall |
| m=16 | Balanced (default) | Good recall |
| m=32 | Better recall | More memory, slower build |
| ef_construction=100 | Faster build | Slightly lower quality |
| ef_construction=200 | Good quality | Balanced |
| ef_construction=400 | Best quality | Slow build |
| ef=20 | Fast search | Lower recall |
| ef=50 | Balanced | Good recall |
| ef=100 | High recall | Slower search |
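
When tuning these parameters, it helps to measure recall rather than guess. One possible approach, sketched below, is to compare the IDs returned by the index against an exact brute-force search over the same vectors. The helpers `exact_top_k` and `recall_at_k` are hypothetical names for this example, the vectors are random stand-ins, and `approx_ids` is a hand-made substitute for whatever your index query returns.

```python
import numpy as np

def exact_top_k(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact nearest neighbors by dot product (vectors assumed L2-normalized)."""
    scores = vectors @ query
    return np.argsort(scores)[::-1][:k]

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the true top-k neighbors that the approximate search found."""
    exact = {int(i) for i in exact_ids}
    hits = sum(1 for i in approx_ids if int(i) in exact)
    return hits / len(exact)

rng = np.random.default_rng(42)
vectors = rng.normal(size=(1000, 384)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

exact_ids = exact_top_k(vectors, query, k=10)

# Stand-in for an index result: 8 true neighbors plus 2 wrong IDs
exact_set = {int(i) for i in exact_ids}
wrong_ids = [i for i in range(len(vectors)) if i not in exact_set][:2]
approx_ids = [int(i) for i in exact_ids[:8]] + wrong_ids

print(f"recall@10 = {recall_at_k(approx_ids, exact_ids):.2f}")
```

Averaging this value over a sample of real queries gives a number you can track while changing `m`, `ef_construction`, or `ef`.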

Memory Estimation

Memory ≈ vectors × dimensions × 4 bytes × 1.5 overhead

Example: 100,000 vectors × 384 dims × 4 × 1.5 ≈ 230 MB
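
The rule of thumb above is easy to turn into a small helper for capacity planning. The function name `estimate_index_memory_mb` is made up for this example, and the 1.5 overhead factor is simply the tutorial's estimate, not a measured value.

```python
def estimate_index_memory_mb(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,   # float32
    overhead: float = 1.5,      # rough allowance for graph links and metadata
) -> float:
    """Approximate index memory in megabytes (1 MB = 1,000,000 bytes)."""
    total_bytes = num_vectors * dimensions * bytes_per_value * overhead
    return total_bytes / 1_000_000

small = estimate_index_memory_mb(100_000, 384)
large = estimate_index_memory_mb(1_000_000, 768)

print(f"100k vectors, 384 dims: {small:.1f} MB")
print(f"1M vectors, 768 dims: {large:.1f} MB")
```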

Step 6: Production Considerations

Use Persistent Embeddings

Don't regenerate embeddings every time the index is rebuilt. Store each embedding once, alongside its document:

def add_documents_with_cache(self, documents):
    """Store embeddings alongside documents."""
    contents = [doc["content"] for doc in documents]
    embeddings = self.embeddings.embed(contents)

    start_id = len(self._documents)  # Fix the ID base before the list grows
    with self.db.transaction() as txn:
        for i, (doc, emb) in enumerate(zip(documents, embeddings)):
            doc_id = start_id + i
            doc["id"] = doc_id

            # Store document (keys are bytes, as in add_documents)
            txn.put(f"docs/{doc_id}".encode(), json.dumps(doc).encode())

            # Store embedding as raw float32 bytes
            txn.put(f"embeddings/{doc_id}".encode(), emb.tobytes())

            self._documents.append(doc)
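
Reading a cached embedding back is the mirror image of `tobytes()`. The sketch below shows the round trip with NumPy only. It assumes the bytes were written as float32 with the 384-dimension model used in this tutorial, and it uses a random vector as a stand-in for a stored value rather than an actual database read.

```python
import numpy as np

DIMENSION = 384  # all-MiniLM-L6-v2; must match the model used when storing

# Stand-in for an embedding produced by the embedding service
original = np.random.default_rng(0).normal(size=DIMENSION).astype(np.float32)

# What gets written to the database
raw = original.tobytes()

# What you would do with the bytes read back from the database
restored = np.frombuffer(raw, dtype=np.float32)
if restored.shape != (DIMENSION,):
    raise ValueError("Stored embedding does not match the expected dimension")

# Stack restored vectors into the (n, dimension) matrix an index build expects
matrix = np.vstack([restored, restored])

print("bytes stored per embedding:", len(raw))
print("round trip is lossless:", np.array_equal(original, restored))
```

Note that `np.frombuffer` returns a read-only view of the bytes; call `.copy()` on the result if the vector needs to be modified in place.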

Batch Processing

For large datasets:

def add_documents_batched(self, documents, batch_size=1000):
    """Add documents in batches to manage memory."""
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        self.add_documents(batch)
        print(f"Processed {min(i + batch_size, len(documents))}/{len(documents)}")
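
Because `add_documents` in this tutorial rebuilds the whole index on every call, batching through it repeats that work once per batch. One option, offered as a suggestion rather than a SochDB recommendation, is to store and embed in batches and rebuild the index a single time at the end. The slicing logic can be separated into a small generator, as sketched here; `iter_batches` is a hypothetical helper, and the documents are placeholder data.

```python
from typing import Iterator, Sequence

def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder documents standing in for a large dataset
docs = [{"content": f"document {i}"} for i in range(2500)]

sizes = [len(batch) for batch in iter_batches(docs, batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```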

What You Learned

| Concept | What You Did |
| --- | --- |
| Embeddings | Converted text to vectors using sentence-transformers |
| HNSW indexing | Built fast approximate nearest neighbor index |
| Semantic search | Found similar documents by meaning, not keywords |
| SochDB integration | Stored documents and vectors together |

Next Steps

| Goal | Resource |
| --- | --- |
| Build RAG system | MCP Integration |
| Use with MCP | MCP Integration |
| Optimize performance | Performance Guide |
| Production deployment | Deployment Guide |

Tutorial completed! You've built a working semantic search system with SochDB. 🎉