Observability & Metrics

Name: SochDB
Author: SochDB

This guide covers how to observe a running sochdb-grpc-server: the Prometheus metrics endpoint and its exported series, scraping with Prometheus, dashboards with Grafana, and the gRPC health service used by Kubernetes probes.

Versions

This page targets core engine 2.0.3 (the sochdb-grpc-server binary). The language SDKs version independently (Python 0.5.9, Node.js 0.5.3, Go 0.4.5).

For setting log levels (RUST_LOG), the --debug flag, and audit logging, see How to Configure Logging. For running the server in production (Docker images, Helm, sizing), see Deploying to Production.

The metrics endpoint

The server exposes a Prometheus-format metrics endpoint over plain HTTP. It runs on a dedicated OS thread (sochdb-metrics-http), separate from the gRPC service, so scraping never contends with the tokio runtime that serves queries.

Control it with --metrics-port:

# Metrics on the default port 9090
sochdb-grpc-server --host 0.0.0.0 --port 50051

# Move metrics to a custom port
sochdb-grpc-server --metrics-port 9100

# Disable the metrics endpoint entirely
sochdb-grpc-server --metrics-port 0

Flag	Default	Purpose
`--metrics-port`	`9090`	Prometheus HTTP port. `0` disables the endpoint.

The HTTP listener binds 0.0.0.0:<metrics-port> and serves exactly two routes (anything else returns 404):

Route	Response
`GET /metrics`	Prometheus text exposition (format `version=0.0.4`)
`GET /health`	`200 OK` with body `OK`

Quick check once the server is up:

curl -s http://127.0.0.1:9090/health
# -> OK

curl -s http://127.0.0.1:9090/metrics | head -n 30

The metrics endpoint has no authentication

/metrics and /health are served by a plain HTTP listener that is not behind --auth. This is intentional, so in-cluster Prometheus and probes can reach it without credentials. Keep the metrics port on a trusted network or behind your ingress — do not expose it to the public internet.

Exported metrics

The series below come from the server's metrics registry. HNSW index metrics are also auto-included via the default Prometheus registry, so you will see additional hnsw_* series depending on workload.

gRPC

Metric	Type	Labels	Meaning
`sochdb_grpc_requests_total`	counter	`service`, `method`	Total gRPC requests handled
`sochdb_grpc_errors_total`	counter	`service`, `method`, `code`	Requests that returned an error, by status `code`
`sochdb_grpc_request_duration_seconds`	histogram	—	Request latency distribution
`sochdb_grpc_active_connections`	gauge	—	Currently open gRPC connections

SQL and transactions

Metric	Type	Labels	Meaning
`sochdb_sql_queries_total`	counter	`statement_type`	SQL statements executed, by kind (e.g. `select`, `insert`)
`sochdb_sql_query_duration_seconds`	histogram	—	SQL query latency distribution
`sochdb_transactions_total`	counter	`outcome`	Transactions, by `outcome` (e.g. commit / abort)

Storage and WAL

Metric	Type	Labels	Meaning
`sochdb_tables_count`	gauge	—	Number of tables
`sochdb_storage_bytes`	gauge	—	On-disk storage size in bytes
`sochdb_wal_bytes`	gauge	—	Write-ahead-log size in bytes
`sochdb_wal_writes_total`	counter	—	WAL append operations
`sochdb_wal_fsync_total`	counter	—	WAL `fsync` calls
`sochdb_cache_operations_total`	counter	`result`	Cache operations, by `result` (e.g. hit / miss)

Process

Metric	Type	Labels	Meaning
`sochdb_uptime_seconds`	gauge	—	Seconds since the server started
`sochdb_build_info`	gauge	`version`, `rustc`	Build metadata; value is always `1`

Here is what a slice of /metrics looks like:

# HELP sochdb_grpc_requests_total Total gRPC requests handled
# TYPE sochdb_grpc_requests_total counter
sochdb_grpc_requests_total{service="KvService",method="Get"} 1842
sochdb_grpc_requests_total{service="VectorIndexService",method="Search"} 537

# HELP sochdb_build_info Build information
# TYPE sochdb_build_info gauge
sochdb_build_info{version="2.0.3",rustc="1.91.0"} 1

# HELP sochdb_uptime_seconds Seconds since server start
# TYPE sochdb_uptime_seconds gauge
sochdb_uptime_seconds 3641

Confirm the running version

sochdb_build_info carries a version label (e.g. 2.0.3). Use it to verify exactly which engine build is serving traffic — handy after a rolling upgrade.
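To run this check from a script rather than by eye, the sketch below reads the version label out of the text exposition. The helper names (parse_sample_line, build_version) are illustrative and not part of any SochDB SDK, and the parser is deliberately minimal: it assumes label values contain no escaped quotes, which holds for the series shown on this page but not for arbitrary exporters.

```python
import re

# Minimal matcher for one sample line of the Prometheus text format.
# Assumption: label values contain no escaped quotes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample_line(line):
    """Return (name, labels, value), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


def build_version(exposition_text):
    """Return the `version` label of sochdb_build_info, or None if absent."""
    for line in exposition_text.splitlines():
        sample = parse_sample_line(line)
        if sample and sample[0] == "sochdb_build_info":
            return sample[1].get("version")
    return None


# Lines copied from the /metrics slice shown above.
SAMPLE = """\
# HELP sochdb_build_info Build information
# TYPE sochdb_build_info gauge
sochdb_build_info{version="2.0.3",rustc="1.91.0"} 1
sochdb_uptime_seconds 3641
"""

print(build_version(SAMPLE))
```

In practice the text would come from an HTTP GET against the metrics port instead of the inline sample; comparing the result across pods is one way to confirm a rolling upgrade has finished.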

Scraping with Prometheus

Add a scrape job pointing at the metrics port:

scrape_configs:
  - job_name: "sochdb"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["sochdb:9090"]

In Kubernetes, target the pod's metrics port (the Helm chart names this container port metrics, default 9090, path /metrics). The chart also ships a servicemonitor.yaml template for the Prometheus Operator, disabled by default — enable it under observability.metrics if you run the operator.

Scrape cost

Serving /metrics is cheap, but very high scrape frequencies still cost CPU to encode the registry. A 10–15s interval is typical and more than enough for alerting and dashboards.

Useful PromQL

# gRPC request rate per method
sum by (method) (rate(sochdb_grpc_requests_total[5m]))

# gRPC error ratio (errors / requests)
sum(rate(sochdb_grpc_errors_total[5m]))
  / sum(rate(sochdb_grpc_requests_total[5m]))

# p99 gRPC latency
histogram_quantile(
  0.99,
  sum by (le) (rate(sochdb_grpc_request_duration_seconds_bucket[5m]))
)

# WAL fsync rate (durability pressure)
rate(sochdb_wal_fsync_total[5m])

# Cache hit ratio
sum(rate(sochdb_cache_operations_total{result="hit"}[5m]))
  / sum(rate(sochdb_cache_operations_total[5m]))

# Detect a recent restart (uptime below 5 minutes)
sochdb_uptime_seconds < 300
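The p99 query above relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that idea, useful for sanity-checking a dashboard value by hand. The bucket boundaries and counts are invented for illustration (they are not SochDB's actual bucket layout), and edge cases are handled more loosely than in Prometheus itself.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` mirrors the `le` series of a classic Prometheus histogram and
    must include the +Inf bucket. Simplified: out-of-range q raises instead
    of returning +/-Inf as Prometheus does.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total  # how many observations lie at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Rank falls in the overflow bucket: the best available
                # answer is the highest finite boundary.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for a latency histogram, in seconds.
example = [
    (0.005, 50), (0.01, 80), (0.025, 95),
    (0.05, 99), (0.1, 100), (math.inf, 100),
]
print(histogram_quantile(0.99, example))
print(histogram_quantile(0.90, example))
```

Because the result is interpolated, it can never be more precise than the bucket boundaries allow. The PromQL version feeds per-second rates rather than raw counts into the same calculation, which leaves the estimate unchanged because only the proportions between buckets matter.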

Grafana via the monitoring profile

The repository's docker-compose.yml includes a monitoring profile that starts Prometheus and Grafana alongside the server. Bring it up with:

docker compose --profile monitoring up

This runs three services:

Service	Port	Notes
`sochdb`	`50051`	The gRPC server (metrics on `9090`)
Prometheus	`9090`	Scrapes the `sochdb` service
Grafana	`3000`	Default credentials `admin` / `sochdb`

Open Grafana at http://localhost:3000, log in, and add Prometheus (http://prometheus:9090 inside the compose network) as a data source if it is not already provisioned. Build panels from the metrics above — request rate, error ratio, p99 latency, WAL fsync rate, and storage growth are good starting panels.

Other compose profiles

The same docker-compose.yml defines a web profile (Envoy gRPC-Web on 8080, admin 9901) and a dev profile (debug build on 50052). Profiles are additive; combine them as needed. See Deploying to Production for the full profile list.

For a standalone Kubernetes monitoring stack, the repo also provides deploy/k8s-monitoring/ with prometheus.yaml, grafana.yaml, and namespace.yaml.

Health checks and Kubernetes probes

SochDB exposes health in two distinct places. Pick the one that matches the layer you are probing.

gRPC health service (for K8s and load balancers)

The server mounts the standard gRPC health checking protocol on the gRPC port (50051 by default). The empty service name "" is set to SERVING, and this health service is not behind --auth — so liveness/readiness probes work even when authentication is enabled.

The official grpc_health_probe tool is the canonical client. It is baked into the published Docker image, and the container HEALTHCHECK uses it:

grpc_health_probe -addr=:50051
# -> status: SERVING

In Kubernetes you can use a native gRPC probe (Kubernetes 1.24+):

livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 30
readinessProbe:
  grpc:
    port: 50051
  periodSeconds: 5

Helm chart default probes use a TCP socket

The bundled Helm chart's default startup/readiness/liveness probes use a TCP socket on 50051, not the gRPC health protocol. The startup probe is intentionally long (failureThreshold 60 × 10s = up to 10 minutes) to tolerate WAL replay and migrations during the Boot FSM. The minikube preset switches probes to an HTTP GET /metrics on port 9090 to avoid HTTP/2 probe noise. See Kubernetes health probes in the deployment guide.
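The startup budget quoted above is simple arithmetic, and the same calculation helps when sizing the probe for your own WAL replay times. A minimal sketch, assuming the usual approximation that a probe gives up after failureThreshold consecutive failures spaced periodSeconds apart (per-attempt timeouts are ignored); the function name is illustrative:

```python
def probe_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate upper bound on how long a failing probe is tolerated."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values from the Helm chart's default startup probe described above.
budget = probe_budget_seconds(failure_threshold=60, period_seconds=10)
print(f"{budget}s = {budget / 60:.0f} min")
```

If replay on your data set regularly approaches this budget, raising failureThreshold extends the allowance without slowing down detection once the server is up.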

HTTP `/health` (for the metrics layer / simple checks)

The metrics listener's GET /health (default port 9090) returns 200 OK with body OK. It is the simplest liveness signal for tooling that only speaks HTTP, and it is served by the same listener that the minikube Helm preset probes. It reflects that the process is up and the metrics thread is serving; it does not run a deep storage check.

curl -fsS http://127.0.0.1:9090/health && echo " healthy"
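The same check can be made from Python tooling with only the standard library. This is a sketch, not an official client: the function name is_healthy and the 2-second default timeout are choices made here, and the expected response (200 with body OK) is taken from the route table earlier on this page.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers 200 with body OK."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and resp.read().decode().strip() == "OK"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


if __name__ == "__main__":
    # Assumes a server running locally with the default metrics port.
    print(is_healthy("http://127.0.0.1:9090"))
```

Like the curl command, a True result only says that the process is up and the metrics thread is answering.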

What is and isn't covered by metrics

A few server capabilities are not yet exposed as runtime switches or as metrics. Set expectations accordingly when building dashboards and alerts.

At-rest encryption is a library API, not a server flag

The engine ships EncryptionEngine (AES-256-GCM-SIV), but there is no CLI flag on sochdb-grpc-server to enable at-rest encryption, and main.rs does not construct one. There is correspondingly no "encryption enabled" metric. Treat at-rest encryption as an available library capability with server wiring still planned.

CDC where_predicate is accepted but not enforced

The SubscribeRequest.where_predicate field is accepted by the subscription service, but the streaming loop does not yet apply it. Table and operation-type filtering are enforced. If you alert on subscription delivery, remember that SQL WHERE filtering is not active server-side yet.

The PostgreSQL wire protocol has no auth and no metrics

The pg-wire gateway (--pg-port, default 5433) is simple-query only, cleartext, and trust-auth (no password). It is safe only on loopback, and it needs --pg-data-dir to execute real SQL. It is not represented in the Prometheus metrics and should not be exposed on a non-loopback --host.