# SochDB Bulk Operations
High-performance bulk vector index operations that bypass Python FFI overhead.
Deep Dive: See Bulk Operations Reference for tool internals and advanced usage.
## Why Use Bulk Operations?
Python FFI has inherent overhead for vector operations:
| Method | 768D Throughput | Overhead |
|---|---|---|
| Python FFI | ~130 vec/s | 12× slower |
| Bulk CLI | ~1,600 vec/s | 1.0× baseline |
The overhead comes from:
- O(N·d) memcpy per batch crossing the Python/Rust boundary
- Python allocation tax (reference counting, GC pressure)
- GIL contention in multi-threaded scenarios
The Bulk API eliminates this by:
- Writing vectors to a memory-mapped file (raw f32 or npy)
- Spawning the `sochdb-bulk` CLI as a subprocess
- Performing zero FFI marshalling during the actual index build
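The division of labor can be sketched in a few lines of standard-library Python. This is an illustration of the approach, not the SDK's actual implementation; the file name and CLI flags mirror the examples later in this document:

```python
import struct

def bulk_build_sketch(vectors, output="my_index.hnsw"):
    """Write vectors to a raw f32 file, then hand the heavy work to the CLI."""
    dim = len(vectors[0])
    # One O(N*d) write to disk instead of O(N*d) FFI marshalling per batch
    with open("vectors.f32", "wb") as f:
        for vec in vectors:  # row-major, little-endian float32
            f.write(struct.pack("<%df" % dim, *vec))
    argv = [
        "sochdb-bulk", "build-index",
        "--input", "vectors.f32",
        "--output", output,
        "--dimension", str(dim),
    ]
    # The real wrapper would now run: subprocess.run(argv, check=True)
    return argv

argv = bulk_build_sketch([[0.1, 0.2], [0.3, 0.4]])
```

The subprocess spawn is a one-time cost, so the per-vector FFI tax disappears entirely.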
## Quick Start

### Python API

```python
from sochdb.bulk import bulk_build_index
import numpy as np

# Your embeddings (10K × 768D)
embeddings = np.random.randn(10000, 768).astype(np.float32)

# Build HNSW index (bypasses FFI)
stats = bulk_build_index(
    embeddings,
    output="my_index.hnsw",
    m=16,
    ef_construction=100,
)
print(f"Built {stats.vectors} vectors at {stats.rate:.0f} vec/s")
```
### Command-Line Interface

```bash
# Build from raw f32 file
sochdb-bulk build-index \
  --input embeddings.bin \
  --output index.hnsw \
  --dimension 768

# Build from NumPy .npy file
sochdb-bulk build-index \
  --input embeddings.npy \
  --output index.hnsw

# With custom HNSW parameters
sochdb-bulk build-index \
  --input data.f32 \
  --output index.hnsw \
  --dimension 768 \
  --max-connections 32 \
  --ef-construction 200 \
  --threads 8

# Query an index
sochdb-bulk query \
  --index index.hnsw \
  --query query.f32 \
  --k 10

# Get index info
sochdb-bulk info --index index.hnsw
```
## Input Formats

### Raw float32 (Recommended)

The simplest and fastest format: just raw bytes.

File layout:

- `vectors.f32`: N × D × 4 bytes of row-major float32 data
- `vectors.json` (optional): metadata, e.g. `{"n": 10000, "dim": 768, "metric": "cosine"}`
- `ids.u64` (optional): N × 8 bytes of uint64 IDs
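As a sanity check of the byte layout, all three files can be produced with nothing but the standard library (assuming little-endian encoding, as used in the examples throughout this document):

```python
import json
import struct

vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # N=2, D=3
ids = [100, 101]

# vectors.f32: N x D x 4 bytes, row-major little-endian float32
with open("vectors.f32", "wb") as f:
    for vec in vectors:
        f.write(struct.pack("<3f", *vec))

# vectors.json (optional): metadata describing the blob
with open("vectors.json", "w") as f:
    json.dump({"n": 2, "dim": 3, "metric": "cosine"}, f)

# ids.u64 (optional): N x 8 bytes of little-endian uint64 IDs
with open("ids.u64", "wb") as f:
    f.write(struct.pack("<2Q", *ids))
```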
Creating raw f32 from Python:
```python
from sochdb.bulk import convert_embeddings_to_raw
import numpy as np

embeddings = np.load("embeddings.npy")
convert_embeddings_to_raw(embeddings, "embeddings.f32")
```
### NumPy .npy

Standard NumPy format. Auto-detected from the `.npy` extension.

Requirements:

- dtype: float32 (`<f4`)
- order: C-order (`fortran_order: False`)
- shape: 2D (N, D)
Creating from Python:
```python
import numpy as np

embeddings = np.random.randn(10000, 768).astype(np.float32)
np.save("embeddings.npy", embeddings)
```
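To see exactly where those requirements live in the file, the sketch below writes and re-reads a minimal `.npy` v1.0 header by hand, using only the standard library. In practice `np.save` produces this for you; this is purely illustrative:

```python
import ast
import struct

def write_npy_f32(path, rows):
    """Minimal NumPy .npy v1.0 writer for a 2D, C-order, float32 array."""
    n, d = len(rows), len(rows[0])
    header = "{'descr': '<f4', 'fortran_order': False, 'shape': (%d, %d), }" % (n, d)
    # Pad with spaces so magic + version + length + header is a multiple of 64
    pad = (64 - (10 + len(header) + 1) % 64) % 64
    header += " " * pad + "\n"
    with open(path, "wb") as f:
        f.write(b"\x93NUMPY\x01\x00")            # magic bytes + format version 1.0
        f.write(struct.pack("<H", len(header)))  # header length as uint16
        f.write(header.encode("latin-1"))
        for row in rows:                         # C-order data follows immediately
            f.write(struct.pack("<%df" % d, *row))

def read_npy_header(path):
    """Parse the header dict back out of a v1.0 .npy file."""
    with open(path, "rb") as f:
        assert f.read(8) == b"\x93NUMPY\x01\x00"
        (hlen,) = struct.unpack("<H", f.read(2))
        return ast.literal_eval(f.read(hlen).decode("latin-1"))

write_npy_f32("tiny.npy", [[0.0, 1.0], [2.0, 3.0]])
header = read_npy_header("tiny.npy")
```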
## Python API Reference

### bulk_build_index()

```python
def bulk_build_index(
    embeddings: NDArray[np.float32],
    output: str | Path,
    *,
    ids: NDArray[np.uint64] | None = None,
    m: int = 16,
    ef_construction: int = 100,
    batch_size: int = 1000,
    threads: int = 0,
    quiet: bool = False,
    cleanup_temp: bool = True,
) -> BulkBuildStats:
    ...
```
Parameters:
- `embeddings`: 2D float32 array of shape (N, D)
- `output`: path to save the HNSW index
- `ids`: optional uint64 array of IDs (defaults to sequential)
- `m`: HNSW max connections per node
- `ef_construction`: HNSW construction search depth
- `batch_size`: vectors per insertion batch
- `threads`: number of threads (0 = auto)
- `quiet`: suppress progress output
- `cleanup_temp`: remove temporary files after build
Returns: BulkBuildStats with performance metrics
### BulkBuildStats

```python
@dataclass
class BulkBuildStats:
    vectors: int           # Number of vectors inserted
    dimension: int         # Vector dimension
    elapsed_secs: float    # Total build time
    rate: float            # Vectors per second
    output_size_mb: float  # Output file size
    command: list[str]     # CLI command used
```
### convert_embeddings_to_raw()

```python
def convert_embeddings_to_raw(
    embeddings: NDArray[np.float32],
    output: str | Path,
    *,
    metric: str | None = None,
) -> Path:
    ...
```
Convert embeddings to SochDB's raw f32 format for optimal bulk loading.
### read_raw_embeddings()

```python
def read_raw_embeddings(
    path: str | Path,
    dimension: int | None = None,
) -> NDArray[np.float32]:
    ...
```
Read embeddings from raw f32 format using memory mapping.
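A stdlib-only sketch of what a memory-mapped read looks like (assuming a native little-endian platform where a C `float` is 4 bytes; the real function returns a NumPy array instead of lists):

```python
import mmap
import struct

def read_raw_f32_sketch(path, dimension):
    """Memory-map a raw f32 file and expose it as rows of floats."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    floats = memoryview(mm).cast("f")  # zero-copy float32 view over the mapping
    n = len(floats) // dimension
    return [floats[i * dimension:(i + 1) * dimension].tolist() for i in range(n)]

# Round-trip: write two 3D vectors, read them back through the mapping
with open("demo.f32", "wb") as f:
    f.write(struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
rows = read_raw_f32_sketch("demo.f32", dimension=3)
```

Because the file is mapped rather than read, the OS pages data in lazily; for large files, nothing is copied until a row is actually touched.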
## CLI Reference

### build-index

```bash
sochdb-bulk build-index [OPTIONS] --input <FILE> --output <FILE>
```

Options:

```
-i, --input <FILE>           Input vector file (raw f32 or .npy)
-o, --output <FILE>          Output index file
-d, --dimension <DIM>        Vector dimension (auto-detected for .npy)
-f, --format <FORMAT>        Input format: raw_f32, npy (auto-detected)
    --ids <FILE>             Optional ID file (raw u64)
-m, --max-connections <N>    HNSW M parameter [default: 16]
-e, --ef-construction <N>    HNSW ef_construction [default: 100]
    --batch-size <N>         Batch size for insertion [default: 1000]
-t, --threads <N>            Number of threads (0 = auto) [default: 0]
    --quiet                  Suppress progress bar
-v, --verbose                Enable verbose logging
```
### query

```bash
sochdb-bulk query [OPTIONS] --index <FILE> --query <FILE>
```

Options:

```
-i, --index <FILE>   Index file
-q, --query <FILE>   Query vector file (single vector, raw f32)
-k, --k <N>          Number of neighbors [default: 10]
-e, --ef <N>         Search ef parameter
```
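The query file is just the D float32 values of a single vector. Assuming it uses the same little-endian raw layout as the input format (and the same dimension as the index), it can be produced like this:

```python
import struct

query = [0.1, 0.2, 0.3, 0.4]  # a single 4-dimensional query vector

# Pack as D little-endian float32 values, nothing else
with open("query.f32", "wb") as f:
    f.write(struct.pack("<%df" % len(query), *query))
```

Then: `sochdb-bulk query --index index.hnsw --query query.f32 --k 10`.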
### info

```bash
sochdb-bulk info --index <FILE>
```
### convert

```bash
sochdb-bulk convert [OPTIONS] --input <FILE> --output <FILE> --to-format <FMT>
```

Options:

```
-i, --input <FILE>       Input file
-o, --output <FILE>      Output file
    --from-format <FMT>  Input format (auto-detected)
    --to-format <FMT>    Output format: raw_f32
-d, --dimension <DIM>    Dimension (required for some formats)
```
## Building from Source

```bash
# Build release binary
cargo build --release -p sochdb-tools

# Binary location
./target/release/sochdb-bulk --help

# Run benchmarks
cargo bench -p sochdb-tools

# Install to PATH
cargo install --path sochdb-tools
```
## Bundling with Python Package

The Python package can bundle the native binary:

```bash
cd sochdb-python-sdk

# Build and install binary for current platform
python build_native.py

# Then build wheel
pip wheel .
```

The binary is installed to `src/sochdb/_bin/<platform>/sochdb-bulk`.
## Performance Tips

- **Use raw f32 format**: fastest to parse, memory-mappable
- **Batch size ~1000**: optimal for HNSW insertion
- **Use all CPU cores**: set `threads=0` for auto-detection
- **Pre-normalize vectors**: if using cosine similarity
- **SSD storage**: for large indices, use NVMe storage
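Pre-normalization is a one-liner per vector; a stdlib sketch:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

unit = l2_normalize([3.0, 4.0])  # norm is 5.0
```

Normalizing once at ingestion time is cheaper than letting the index recompute norms on every distance evaluation.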
## Benchmarks

Run the performance benchmark:

```bash
# Python benchmark
python benchmarks/bulk_benchmark.py --size medium

# Rust microbenchmarks
cargo bench -p sochdb-tools --bench bulk_ingest
```
Expected results (Apple M1 Pro, 768D vectors):
| Test | Throughput |
|---|---|
| bulk_build_10K | ~1,600 vec/s |
| bulk_build_100K | ~1,400 vec/s |
| ffi_insert_10K | ~130 vec/s |
## Troubleshooting

### "Could not find sochdb-bulk binary"

The Python Bulk API requires the `sochdb-bulk` binary. The SDK automatically
searches in this order:

1. **Bundled in wheel** (recommended):

   ```bash
   pip install sochdb
   # Binary is at: site-packages/sochdb/_bin/<platform>/sochdb-bulk
   ```

2. **System PATH**:

   ```bash
   cargo install --path sochdb-tools
   # Or: export PATH="$PATH:/path/to/target/release"
   ```

3. **Cargo target directory** (development):

   ```bash
   cargo build --release -p sochdb-tools
   # Auto-detected if running from workspace
   ```
To debug resolution:
```python
from sochdb.bulk import get_sochdb_bulk_path

print(get_sochdb_bulk_path())  # Shows resolved path
```
### Platform Support

The bundled binary supports:

| Platform | Wheel Tag | Notes |
|---|---|---|
| Linux x86_64 | `manylinux_2_17_x86_64` | glibc ≥ 2.17 |
| Linux aarch64 | `manylinux_2_17_aarch64` | ARM servers |
| macOS | `macosx_11_0_universal2` | Intel + Apple Silicon |
| Windows | `win_amd64` | Windows 10+ x64 |
### "Dimension mismatch"

Ensure your dimension parameter matches the data:

- For raw f32: pass `-d 768` or provide `meta.json`
- For npy: the dimension is auto-detected from the header
### "Out of memory"
For large datasets (10M+ vectors):
- Use streaming ingestion with smaller batch sizes
- Consider PQ compression before indexing
- Use 64-bit system with sufficient RAM
### "GLIBC_2.xx not found" (Linux)

Your system glibc is older than the wheel requires:

```bash
ldd --version  # Check glibc version
# Needs: 2.17 or higher
```
Solutions:
- Use a newer distro (Ubuntu 14.04+, CentOS 7+)
- Use a container with newer glibc
- Build from source with your system's glibc
## Architecture

```
Python Application
          │
          ▼
┌─────────────────────┐
│  sochdb.bulk.py     │   Python Bulk API
│  - Write vectors    │
│  - Spawn subprocess │
└─────────┬───────────┘
          │  subprocess.run()
          ▼
┌─────────────────────┐
│  sochdb-bulk CLI    │   Rust binary (bundled in wheel)
│  - mmap vector file │
│  - HNSW insertion   │
│  - Save index       │
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│  my_index.hnsw      │   Output index
└─────────────────────┘
```
The key insight is that subprocess overhead (process spawn) is O(1), while FFI overhead is O(N·d) per batch. For bulk operations, the subprocess approach wins decisively.
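To put rough numbers on that, here is some illustrative arithmetic using the throughput figures from the benchmark table above (the byte counts are exact; the timings are back-of-envelope, not measurements):

```python
# One million 768D float32 vectors
n, d, bytes_per_f32 = 1_000_000, 768, 4

ffi_bytes = n * d * bytes_per_f32  # O(N*d): every byte crosses the FFI boundary
gb_copied = ffi_bytes / 1e9        # ~3 GB of marshalling for this dataset

build_secs_ffi = n / 130           # at ~130 vec/s via FFI
build_secs_bulk = n / 1600         # at ~1,600 vec/s via the bulk CLI
```

The one-time subprocess spawn (a few milliseconds) is noise next to either build time, which is why amortizing it over a large batch pays off.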
See Python SDK Guide for full wheel packaging and distribution architecture.