How to Build and Query Vector Indexes

Name: SochDB
Author: SochDB

Create HNSW indexes for semantic search and similarity queries.

Problem

You have vector embeddings (from OpenAI, sentence-transformers, etc.) and need to perform fast approximate nearest-neighbor (ANN) similarity search.

Two Python packages, both named sochdb

The recipes below use two different PyPI packages that share the name sochdb:

The native engine (sochdb 2.0.3, a Rust/PyO3 extension) exposes the low-level HnswIndex, build_index_from_numpy, and recommended_hnsw_params. Install with pip install sochdb.
The pure-Python SDK (sochdb 0.5.9, ctypes/gRPC) exposes the higher-level Database, Namespace, and Collection. Also pip install sochdb.

They have largely disjoint APIs. Each code block below states which package it targets. See the Python SDK guide for the full picture.

Solution

1. Build an HNSW index from NumPy (native engine)

The fastest path is build_index_from_numpy: vectors go straight to Rust with zero copies and no disk round-trip.

import numpy as np
from sochdb import build_index_from_numpy  # native engine (sochdb 2.0.3)

# Generate or load embeddings (float32, shape [N, D])
embeddings = np.random.randn(10000, 384).astype(np.float32)
ids = np.arange(10000, dtype=np.uint64)

# Build an HNSW index. With m / ef_construction omitted, the index auto-tunes
# them from the dimension via recommended_hnsw_params(D).
index = build_index_from_numpy(
    embeddings,
    ids=ids,            # optional; must be uint64. Omit to auto-assign sequential IDs.
    metric="cosine",    # "cosine" | "euclidean" | "dot"
)

# Persist to disk (compressed)
index.save("./my_index.hnsw")

print(f"Built index: {len(index)} vectors, dim={index.dimension}")

Signature: build_index_from_numpy(embeddings, *, m=None, ef_construction=None, metric="cosine", ids=None). When m or ef_construction is None, the value comes from recommended_hnsw_params(D) (see HNSW Parameter Tuning).

2. Query the index (native engine)

import numpy as np
from sochdb import HnswIndex  # native engine (sochdb 2.0.3)

# Reload a saved index (load is a staticmethod)
index = HnswIndex.load("./my_index.hnsw")

# Query vector from your embedding model (1D float32, length == dim)
query = np.random.randn(384).astype(np.float32)

# Find the k nearest neighbours. search returns parallel arrays.
ids, distances = index.search(query, k=10, ef_search=200)

for vec_id, distance in zip(ids, distances):
    print(f"ID: {vec_id}, Distance: {distance:.4f}")

HnswIndex.search(query, k, ef_search=None) returns a (uint64[], float32[]) tuple of IDs and distances. Omitting ef_search lets the index pick an adaptive depth. To search many queries at once, use index.search_batch(queries, k, ef_search=None).

3. Construct an index directly (native engine)

If you want explicit control or are streaming vectors in, build the HnswIndex yourself:

import numpy as np
from sochdb import HnswIndex  # native engine (sochdb 2.0.3)

# HnswIndex(dimension, m=32, ef_construction=200, metric="cosine", precision="f32")
index = HnswIndex(dimension=384, m=32, ef_construction=200, metric="cosine")

embeddings = np.random.randn(10000, 384).astype(np.float32)
ids = np.arange(10000, dtype=np.uint64)

# Zero-copy batch insert (arrays must be C-contiguous float32 / uint64)
index.insert_batch_with_ids(ids, embeddings)

# Optional: small-index recall booster (exact layer-0 rebuild). See note below.
# index.optimize()

ids, distances = index.search(embeddings[0], k=5)
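
The batch insert above expects C-contiguous float32 vectors and uint64 IDs. As a rough sketch, a NumPy-only helper can normalize arbitrary input before it reaches the index. The name prepare_for_insert is hypothetical and not part of sochdb; it only fixes dtype and memory layout and checks shapes.

```python
import numpy as np


def prepare_for_insert(vectors, ids, dimension):
    """Return (ids, vectors) as C-contiguous uint64 / float32 arrays.

    Hypothetical helper, not part of sochdb: it only normalizes dtype and
    memory layout and validates shapes.
    """
    # ascontiguousarray copies only when the dtype or layout actually differs
    vecs = np.ascontiguousarray(vectors, dtype=np.float32)
    if vecs.ndim != 2 or vecs.shape[1] != dimension:
        raise ValueError(f"expected shape (N, {dimension}), got {vecs.shape}")

    id_arr = np.ascontiguousarray(ids, dtype=np.uint64)
    if id_arr.shape != (vecs.shape[0],):
        raise ValueError("ids must be 1-D with one entry per vector")
    return id_arr, vecs


# float64, Fortran-ordered input is converted to C-contiguous float32
raw = np.asfortranarray(np.random.randn(100, 384))
ids, vecs = prepare_for_insert(raw, np.arange(100), dimension=384)
```

The returned arrays could then be passed to index.insert_batch_with_ids(ids, vecs) as in the recipe above.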

optimize() is a small-index booster

HnswIndex.optimize() rebuilds layer 0 with an exact brute-force kNN pass, which lifts recall on small indexes (roughly 0.3s per 10K vectors). On very large indexes the equivalent core operation is skipped to avoid pathological memory/time blow-ups, so do not rely on it as a million-vector tuning step.

4. High-level Collection (pure-Python SDK)

For application code, the pure-Python SDK (sochdb 0.5.9) wraps storage, metadata, and search behind Collection, so you do not manage index files yourself.

from sochdb import Client  # pure-Python SDK (sochdb 0.5.9)

# :memory: uses a temp dir; pass a real path to persist
client = Client(":memory:")
docs = client.open_collection("documents", dimension=384)

docs.insert(id="doc-1", vector=[0.1] * 384, metadata={"title": "intro"})
docs.insert(id="doc-2", vector=[0.2] * 384, metadata={"title": "guide"})

results = docs.vector_search([0.15] * 384, k=5)
for r in results:
    print(r)

Collection also offers insert_batch(...), upsert(...), keyword_search(...), hybrid_search(...), and set_ef_search(n). See the Python SDK guide for the full surface.

5. Integrated with the database (Rust)

The published sochdb Rust crate (currently 2.0.3) exposes vector collections through VectorCollection.

use std::sync::Arc;
use sochdb::{SochConnection, VectorCollection};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let conn = Arc::new(SochConnection::open("./vector_db")?);

    // Create a collection (dimension is fixed at creation)
    let mut docs = VectorCollection::create(&conn, "documents", 384)?;

    // Insert vectors keyed by string IDs
    let embedding: Vec<f32> = get_embedding("Hello world");
    docs.add(&["doc-1"], &[embedding])?;

    // Vector search returns Vec<SearchResult { id, distance, metadata }>
    let query = get_embedding("Hi there");
    for hit in docs.search(&query, 10)? {
        println!("{} -> distance {:.4}", hit.id, hit.distance);
    }

    Ok(())
}

SQL vector search

SochDB's SQL surface also supports a VECTOR_SEARCH(column, query_vector, k, metric) function and the VECTOR(dims) / EMBEDDING(dims) column types. Metric keywords are COSINE, EUCLIDEAN, and DOT_PRODUCT. Note that CREATE INDEX for vectors is handled by the storage-backed SQL path, not by the in-memory reference executor.

HNSW Parameter Tuning

Build parameters

Parameter	`HnswIndex` default	Core engine default	Effect
`m`	32	32 (`M0` = 64)	Higher = better recall, more memory
`ef_construction`	200	256	Higher = better index quality, slower build

The native HnswIndex constructor defaults to m=32, ef_construction=200. The core Rust engine's HnswConfig defaults are M=32, max_connections_layer0 (M0)=64, ef_construction=256, ef_search=500, metric=Cosine, precision=F32.

Auto-tuning with `recommended_hnsw_params`

Rather than guessing, ask the engine. recommended_hnsw_params(dimension, n_vectors=None, target_recall=0.95) returns a dict with m, ef_construction, ef_search, and a note:

import numpy as np
from sochdb import recommended_hnsw_params, HnswIndex  # native engine (sochdb 2.0.3)

params = recommended_hnsw_params(768, target_recall=0.95)
print(params)
# {'m': 32, 'ef_construction': 256, 'ef_search': 640, 'note': '...'}

index = HnswIndex(
    dimension=768,
    m=params["m"],
    ef_construction=params["ef_construction"],
)
# Insert vectors (e.g. index.insert_batch_with_ids(ids, embeddings)) before searching.

query = np.random.randn(768).astype(np.float32)  # placeholder query vector
ids, dists = index.search(query, k=10, ef_search=params["ef_search"])

How the recommendation is derived:

m by dimension: dim <= 128 -> M=16; 129-512 -> M=24; 513+ -> M=32.
ef_construction = max(200, m * 8).
ef_search scales with target_recall: >= 0.99 -> 40 * M; >= 0.95 -> 20 * M; >= 0.90 -> 10 * M; otherwise 6 * M.
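
The three rules above can be written out as a few lines of plain Python. This is a sketch of the stated rules only, not the engine's implementation: sketch_recommended_params is a hypothetical name, and it ignores the n_vectors argument and the note field that the real function provides.

```python
def sketch_recommended_params(dimension: int, target_recall: float = 0.95) -> dict:
    """Re-derive m, ef_construction and ef_search from the documented rules."""
    # m is chosen by dimension band
    if dimension <= 128:
        m = 16
    elif dimension <= 512:
        m = 24
    else:
        m = 32

    ef_construction = max(200, m * 8)

    # ef_search is a multiple of m that grows with the recall target
    if target_recall >= 0.99:
        multiplier = 40
    elif target_recall >= 0.95:
        multiplier = 20
    elif target_recall >= 0.90:
        multiplier = 10
    else:
        multiplier = 6

    return {"m": m, "ef_construction": ef_construction, "ef_search": multiplier * m}


print(sketch_recommended_params(768, target_recall=0.95))
print(sketch_recommended_params(384, target_recall=0.99))
```

For 768 dimensions at 0.95 recall the sketch reproduces the m=32, ef_construction=256, ef_search=640 values shown in the example output above.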

Synthetic vs. real data

The recommendations are tuned for real embedding distributions. Synthetic uniform random vectors are harder to index and typically need 2-3x higher ef_search for the same recall. The code examples on this page use random data and are illustrative only.

Search parameters

Parameter	Typical	Range	Effect
`ef_search`	200	10-500+	Higher = better recall, slower query
`k`	10	1-1000	Number of results

Guidelines (real embeddings):

Fast search: lower ef_search (e.g. 10 * M).
Balanced: 20 * M (the target_recall=0.95 recommendation).
High recall (99%+): 40 * M.
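
To check which ef_search setting reaches a given recall on your own data, one option is to compare the index's results against an exact brute-force answer. The sketch below uses NumPy only; exact_top_k and recall_at_k are hypothetical helper names, the approximate result is simulated rather than taken from an index, and IDs are assumed to equal row positions.

```python
import numpy as np


def exact_top_k(vectors, query, k):
    """Exact cosine top-k by brute force; returns row indices, nearest first."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    distances = 1.0 - v @ q
    return np.argsort(distances)[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search returned."""
    exact = {int(i) for i in exact_ids}
    found = {int(i) for i in approx_ids}
    return len(exact & found) / len(exact)


rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

truth = exact_top_k(vectors, query, k=10)
# Stand-in for an ANN result that missed two of the ten true neighbours
approx = np.concatenate([truth[:8], [1000, 1001]])
print(f"recall@10 = {recall_at_k(approx, truth):.2f}")
```

In practice the approx array would come from index.search(query, k=10, ef_search=...), averaged over a sample of queries.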

Quantization trade-offs

The engine supports three precisions, selected at index creation via the precision argument ("f32", "f16", "bf16"):

Precision	Bytes/element	Memory	Use case
`f32`	4	100%	Default, best accuracy
`f16`	2	50%	Large indexes, memory-constrained
`bf16`	2	50%	Models trained in bfloat16
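
The memory column can be illustrated with NumPy alone. This is only a sketch of the f32 versus f16 trade-off on random data: NumPy has no native bfloat16 type, so bf16 is not shown, and nothing here reflects how the engine stores vectors internally.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.standard_normal((1000, 384)).astype(np.float32)
half = full.astype(np.float16)  # 2 bytes per element instead of 4

memory_ratio = half.nbytes / full.nbytes


def cosine_distance(a, b):
    """Cosine distance, accumulated in float32 regardless of storage dtype."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


d_full = cosine_distance(full[0], full[1])
d_half = cosine_distance(half[0], half[1])
drift = abs(d_full - d_half)

print(f"memory ratio: {memory_ratio:.2f}")
print(f"distance drift from f16 rounding: {drift:.6f}")
```

The drift is the accuracy cost of the smaller storage format for one pair of vectors; measuring recall on real embeddings is the more reliable way to decide whether it matters.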

Example: Semantic Search System

This example combines the pure-Python SDK KV store (for document text) with the native engine HNSW index (for ANN). It rebuilds the index in-process with build_index_from_numpy.

#!/usr/bin/env python3
"""Semantic search over documents using SochDB."""

import zlib

import numpy as np
from sochdb import Database                 # pure-Python SDK (sochdb 0.5.9)
from sochdb import build_index_from_numpy, HnswIndex  # native engine (sochdb 2.0.3)


# Simulated embedding function (replace with a real model)
def get_embedding(text: str, dim: int = 384) -> np.ndarray:
    # In production: use sentence-transformers, OpenAI, etc.
    # crc32 gives a seed that is stable across processes; the built-in hash()
    # is salted per process, so a reloaded index would not match new queries.
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.standard_normal(dim).astype(np.float32)


class SemanticSearch:
    def __init__(self, db_path: str, index_path: str):
        self.db = Database.open(db_path)
        self.index_path = index_path
        self.index: HnswIndex | None = None
        self.documents: list[dict] = []
        self.embeddings: list[np.ndarray] = []  # one embedding per stored document

    def add_documents(self, docs: list[dict]) -> None:
        """Add documents with their embeddings, then rebuild the index."""
        for doc in docs:
            # The next free ID is the current document count (read before appending)
            doc_id = len(self.documents)
            # Store the document text in the KV store
            self.db.put(f"docs/{doc_id}/content".encode(), doc["content"].encode())
            self.embeddings.append(get_embedding(doc["content"]))
            self.documents.append(doc)

        # Rebuild the HNSW index with every embedding (IDs match the KV doc IDs)
        all_embeddings = np.asarray(self.embeddings, dtype=np.float32)
        ids = np.arange(len(self.documents), dtype=np.uint64)
        self.index = build_index_from_numpy(all_embeddings, ids=ids, metric="cosine")
        self.index.save(self.index_path)

    def search(self, query: str, k: int = 5) -> list[dict]:
        """Search for similar documents."""
        if self.index is None:
            self.index = HnswIndex.load(self.index_path)

        query_embedding = get_embedding(query)
        ids, distances = self.index.search(query_embedding, k=k, ef_search=200)

        results = []
        for doc_id, distance in zip(ids, distances):
            content = self.db.get(f"docs/{int(doc_id)}/content".encode())
            if content:
                results.append({
                    "id": int(doc_id),
                    "content": content.decode(),
                    # cosine distance -> rough similarity
                    "score": 1.0 - float(distance),
                })
        return results


# Usage
search = SemanticSearch("./search_db", "./search.hnsw")

search.add_documents([
    {"content": "SochDB is an LLM-native database"},
    {"content": "Vector search enables semantic queries"},
    {"content": "HNSW provides fast approximate nearest neighbor search"},
    {"content": "Python SDK makes integration easy"},
])

for r in search.search("database for AI applications", k=3):
    print(f"[{r['score']:.3f}] {r['content']}")

Discussion

When to use vector search

Good for:

Semantic similarity (find similar documents)
Recommendation systems
RAG (Retrieval Augmented Generation)
Image/audio similarity

Not for:

Exact matching (use regular indexes)
Structured queries (use SQL)
Tiny datasets: below a few thousand vectors an ANN graph adds nothing, and the engine automatically falls back to an exact parallel SIMD flat scan (see below)

Automatic flat-scan fallback

For small indexes the engine skips the HNSW graph and runs an exact brute-force scan, which is both faster and exact at that scale. The crossover is dimension-aware:

Dimension	Flat-scan threshold (vectors)
`<= 128`	10,000
`<= 384`	4,000
`> 384` (e.g. 768D)	1,000

Below the threshold, search uses the exact scan; above it, it uses the HNSW graph.
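
The threshold table can be expressed as a small lookup. This is a sketch of the documented crossover only: flat_scan_threshold and uses_flat_scan are hypothetical names, and treating an index that sits exactly at the threshold as HNSW is an assumption, since the text only describes "below" and "above".

```python
def flat_scan_threshold(dimension: int) -> int:
    """Vector count below which the exact flat scan is used, per the table."""
    if dimension <= 128:
        return 10_000
    if dimension <= 384:
        return 4_000
    return 1_000


def uses_flat_scan(dimension: int, n_vectors: int) -> bool:
    # Assumption: strictly below the threshold means flat scan;
    # the behaviour exactly at the threshold is not documented.
    return n_vectors < flat_scan_threshold(dimension)


for dim, n in [(128, 5_000), (384, 3_000), (384, 10_000), (768, 5_000)]:
    path = "flat scan" if uses_flat_scan(dim, n) else "HNSW graph"
    print(f"dim={dim}, n={n}: {path}")
```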

Memory estimation

Memory = vectors x dimensions x bytes_per_element x overhead

Example: 1M vectors x 384 dims x 4 bytes x 1.5 overhead
       = 1,000,000 x 384 x 4 x 1.5
       ~= 2.3 GB

With F16 or BF16 precision (2 bytes/element): roughly half that, ~1.15 GB.
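
The same arithmetic as a reusable function, for plugging in your own counts. The name estimate_index_memory_bytes is hypothetical, and the 1.5 overhead factor is simply the figure used in the example above, not a measured constant.

```python
def estimate_index_memory_bytes(
    n_vectors: int,
    dimension: int,
    bytes_per_element: int = 4,
    overhead: float = 1.5,
) -> int:
    """Memory = vectors x dimensions x bytes_per_element x overhead."""
    return int(n_vectors * dimension * bytes_per_element * overhead)


f32_bytes = estimate_index_memory_bytes(1_000_000, 384)
f16_bytes = estimate_index_memory_bytes(1_000_000, 384, bytes_per_element=2)

print(f"f32: {f32_bytes / 1e9:.2f} GB")  # f32: 2.30 GB
print(f"f16: {f16_bytes / 1e9:.2f} GB")  # f16: 1.15 GB
```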
