Observability & Metrics

This guide covers how to observe a running sochdb-grpc-server: the Prometheus metrics endpoint and its exported series, scraping with Prometheus, dashboards with Grafana, and the gRPC health service used by Kubernetes probes.

Versions

This page targets core engine 2.0.3 (the sochdb-grpc-server binary). The language SDKs version independently (Python 0.5.9, Node.js 0.5.3, Go 0.4.5).

For setting log levels (RUST_LOG), the --debug flag, and audit logging, see How to Configure Logging. For running the server in production (Docker images, Helm, sizing), see Deploying to Production.


The metrics endpoint

The server exposes a Prometheus-format metrics endpoint over plain HTTP. It runs on a dedicated OS thread (sochdb-metrics-http), separate from the gRPC service, so scraping never contends with the tokio runtime that serves queries.

Control it with --metrics-port:

# Metrics on the default port 9090
sochdb-grpc-server --host 0.0.0.0 --port 50051

# Move metrics to a custom port
sochdb-grpc-server --metrics-port 9100

# Disable the metrics endpoint entirely
sochdb-grpc-server --metrics-port 0

| Flag | Default | Purpose |
| --- | --- | --- |
| --metrics-port | 9090 | Prometheus HTTP port. 0 disables the endpoint. |

The HTTP listener binds 0.0.0.0:<metrics-port> and serves exactly two routes (anything else returns 404):

| Route | Response |
| --- | --- |
| GET /metrics | Prometheus text exposition (format version=0.0.4) |
| GET /health | 200 OK with body OK |

Quick check once the server is up:

curl -s http://127.0.0.1:9090/health
# -> OK

curl -s http://127.0.0.1:9090/metrics | head -n 30

The metrics endpoint has no authentication

/metrics and /health are served by a plain HTTP listener that is not behind --auth. This is intentional, so in-cluster Prometheus and probes can reach it without credentials. Keep the metrics port on a trusted network or behind your ingress — do not expose it to the public internet.
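
For scripts that cannot shell out to curl, the same check can be sketched in Python using only the standard library. The helper names (fetch_metrics, metric_types) and the default URL are illustrative assumptions, not part of SochDB. The parser relies only on the "# TYPE <name> <type>" comment lines of the Prometheus text format.

```python
from urllib.request import urlopen


def fetch_metrics(base_url="http://127.0.0.1:9090", timeout=5.0):
    """Return the raw text served at <base_url>/metrics (default port assumed)."""
    with urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_types(exposition_text):
    """Map each metric family name to its declared type, read from '# TYPE' lines."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # A type declaration looks like: # TYPE <metric_name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types
```

Calling metric_types(fetch_metrics()) against a running server gives a quick inventory of the families it exports, which is handy for confirming that the endpoint is enabled before wiring up Prometheus.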


Exported metrics

The series below come from the server's metrics registry. HNSW index metrics are also auto-included via the default Prometheus registry, so you will see additional hnsw_* series depending on workload.

gRPC

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| sochdb_grpc_requests_total | counter | service, method | Total gRPC requests handled |
| sochdb_grpc_errors_total | counter | service, method, code | Requests that returned an error, by status code |
| sochdb_grpc_request_duration_seconds | histogram | | Request latency distribution |
| sochdb_grpc_active_connections | gauge | | Currently open gRPC connections |

SQL and transactions

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| sochdb_sql_queries_total | counter | statement_type | SQL statements executed, by kind (e.g. select, insert) |
| sochdb_sql_query_duration_seconds | histogram | | SQL query latency distribution |
| sochdb_transactions_total | counter | outcome | Transactions, by outcome (e.g. commit / abort) |

Storage and WAL

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| sochdb_tables_count | gauge | | Number of tables |
| sochdb_storage_bytes | gauge | | On-disk storage size in bytes |
| sochdb_wal_bytes | gauge | | Write-ahead-log size in bytes |
| sochdb_wal_writes_total | counter | | WAL append operations |
| sochdb_wal_fsync_total | counter | | WAL fsync calls |
| sochdb_cache_operations_total | counter | result | Cache operations, by result (e.g. hit / miss) |

Process

| Metric | Type | Labels | Meaning |
| --- | --- | --- | --- |
| sochdb_uptime_seconds | gauge | | Seconds since the server started |
| sochdb_build_info | gauge | version, rustc | Build metadata; value is always 1 |

Here is what a slice of /metrics looks like:

# HELP sochdb_grpc_requests_total Total gRPC requests handled
# TYPE sochdb_grpc_requests_total counter
sochdb_grpc_requests_total{service="KvService",method="Get"} 1842
sochdb_grpc_requests_total{service="VectorIndexService",method="Search"} 537

# HELP sochdb_build_info Build information
# TYPE sochdb_build_info gauge
sochdb_build_info{version="2.0.3",rustc="1.91.0"} 1

# HELP sochdb_uptime_seconds Seconds since server start
# TYPE sochdb_uptime_seconds gauge
sochdb_uptime_seconds 3641

Confirm the running version

sochdb_build_info carries a version label (e.g. 2.0.3). Use it to verify exactly which engine build is serving traffic — handy after a rolling upgrade.
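
As a rough sketch of that check, the snippet below pulls the version label out of scraped /metrics text. The function name build_versions is hypothetical, and the regular expression handles only the simple key="value" label syntax shown in the sample above.

```python
import re

# Matches one key="value" pair inside the braces of a sample line.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def build_versions(exposition_text, metric="sochdb_build_info"):
    """Return the set of 'version' label values found on the given metric."""
    versions = set()
    for line in exposition_text.splitlines():
        # Only sample lines of the form metric{...} value are of interest.
        if not line.startswith(metric + "{"):
            continue
        label_block = line[len(metric):line.rindex("}") + 1]
        labels = dict(_LABEL_RE.findall(label_block))
        if "version" in labels:
            versions.add(labels["version"])
    return versions
```

Feeding it the concatenated /metrics output of every replica makes a simple rollout check: more than one element in the returned set means at least two engine builds are still serving.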


Scraping with Prometheus

Add a scrape job pointing at the metrics port:

scrape_configs:
  - job_name: "sochdb"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["sochdb:9090"]

In Kubernetes, target the pod's metrics port (the Helm chart names this container port metrics, default 9090, path /metrics). The chart also ships a servicemonitor.yaml template for the Prometheus Operator, disabled by default — enable it under observability.metrics if you run the operator.

Scrape cost

Serving /metrics is cheap, but very high scrape frequencies still cost CPU to encode the registry. A 10–15s interval is typical and more than enough for alerting and dashboards.

Useful PromQL

# gRPC request rate per method
sum by (method) (rate(sochdb_grpc_requests_total[5m]))

# gRPC error ratio (errors / requests)
sum(rate(sochdb_grpc_errors_total[5m]))
/ sum(rate(sochdb_grpc_requests_total[5m]))

# p99 gRPC latency
histogram_quantile(
  0.99,
  sum by (le) (rate(sochdb_grpc_request_duration_seconds_bucket[5m]))
)

# WAL fsync rate (durability pressure)
rate(sochdb_wal_fsync_total[5m])

# Cache hit ratio
sum(rate(sochdb_cache_operations_total{result="hit"}[5m]))
/ sum(rate(sochdb_cache_operations_total[5m]))

# Detect a restart (uptime resets toward zero)
sochdb_uptime_seconds
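
To make the p99 query less of a black box, here is a simplified Python sketch of the linear interpolation that histogram_quantile performs over cumulative le buckets. It is an approximation for illustration and skips several of PromQL's edge cases. The bucket bounds in the usage note are made up, since this page does not list the server's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    `buckets` must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the answer
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return prev_le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan
```

With the illustrative buckets [(0.005, 50), (0.01, 80), (0.025, 95), (0.05, 99), (0.1, 100), (math.inf, 100)], the 0.9-quantile lands two thirds of the way through the 0.01 to 0.025 bucket, at about 0.02 seconds. In the PromQL query the counts are per-second rates instead of raw totals, but the interpolation is the same, which is why the result can never be more precise than the bucket boundaries.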

Grafana via the monitoring profile

The repository's docker-compose.yml includes a monitoring profile that starts Prometheus and Grafana alongside the server. Bring it up with:

docker compose --profile monitoring up

This runs three services:

| Service | Port | Notes |
| --- | --- | --- |
| sochdb | 50051 | The gRPC server (metrics on 9090) |
| Prometheus | 9090 | Scrapes the sochdb service |
| Grafana | 3000 | Default credentials admin / sochdb |

Open Grafana at http://localhost:3000, log in, and add Prometheus (http://prometheus:9090 inside the compose network) as a data source if it is not already provisioned. Build panels from the metrics above — request rate, error ratio, p99 latency, WAL fsync rate, and storage growth are good starting panels.

Other compose profiles

The same docker-compose.yml defines a web profile (Envoy gRPC-Web on 8080, admin 9901) and a dev profile (debug build on 50052). Profiles are additive; combine them as needed. See Deploying to Production for the full profile list.

For a standalone Kubernetes monitoring stack, the repo also provides deploy/k8s-monitoring/ with prometheus.yaml, grafana.yaml, and namespace.yaml.


Health checks and Kubernetes probes

SochDB exposes health in two distinct places. Pick the one that matches the layer you are probing.

gRPC health service (for K8s and load balancers)

The server mounts the standard gRPC health checking protocol on the gRPC port (50051 by default). The empty service name "" is set to SERVING, and this health service is not behind --auth — so liveness/readiness probes work even when authentication is enabled.

The official grpc_health_probe tool is the canonical client. It is baked into the published Docker image, and the container HEALTHCHECK uses it:

grpc_health_probe -addr=:50051
# -> status: SERVING

In Kubernetes you can use a native gRPC probe (Kubernetes 1.24+):

livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 30
readinessProbe:
  grpc:
    port: 50051
  periodSeconds: 5

Helm chart default probes use a TCP socket

The bundled Helm chart's default startup/readiness/liveness probes use a TCP socket on 50051, not the gRPC health protocol. The startup probe is intentionally long (failureThreshold 60 × 10s = up to 10 minutes) to tolerate WAL replay and migrations during the Boot FSM. The minikube preset switches probes to an HTTP GET /metrics on port 9090 to avoid HTTP/2 probe noise. See Kubernetes health probes in the deployment guide.
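
The startup budget above is simple arithmetic, and the sketch below spells it out so you can test other threshold and period combinations before editing chart values. The function names are hypothetical. The formula assumes the usual Kubernetes behavior, where a container gets roughly initialDelaySeconds + failureThreshold × periodSeconds before a failing startup probe causes a restart.

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate worst-case time a startup probe allows before a restart."""
    return initial_delay_seconds + failure_threshold * period_seconds


def fits_budget(expected_startup_seconds, failure_threshold, period_seconds):
    """True if an expected startup time (e.g. WAL replay) fits inside the budget."""
    return expected_startup_seconds <= startup_budget_seconds(failure_threshold, period_seconds)
```

With the chart defaults quoted above, startup_budget_seconds(60, 10) gives 600 seconds, which is the 10 minute ceiling.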

HTTP /health (for the metrics layer / simple checks)

The metrics listener's GET /health (default port 9090) returns 200 OK with body OK. It is the simplest liveness signal for tooling that only speaks HTTP, and the minikube Helm preset probes this same listener. A 200 here means only that the process is up and the metrics thread is serving; it does not run a deep storage check.

curl -fsS http://127.0.0.1:9090/health && echo " healthy"
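
Deployment scripts often need to block until the server is reachable. The following is a minimal sketch of such a wait loop in standard-library Python. The names http_health_ok and wait_until_healthy, the retry counts, and the default URL are assumptions for illustration. Like the endpoint itself, it confirms only that the metrics thread is answering.

```python
import time
from urllib.request import urlopen


def http_health_ok(url="http://127.0.0.1:9090/health", timeout=2.0):
    """True if GET <url> answers 200 with body 'OK' (surrounding whitespace ignored)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().decode().strip() == "OK"
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_healthy(probe=http_health_ok, attempts=30, delay=1.0):
    """Call probe() up to `attempts` times; return the passing attempt number, or 0."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return 0
```

Passing the probe as a parameter keeps the loop easy to test and lets you swap in a different check later, for example one that calls the gRPC health service.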

What is and isn't covered by metrics

A few server capabilities do not yet surface as runtime-toggled, metered features. Set expectations accordingly when building dashboards and alerts.

At-rest encryption is a library API, not a server flag

The engine ships EncryptionEngine (AES-256-GCM-SIV), but there is no CLI flag on sochdb-grpc-server to enable at-rest encryption, and main.rs does not construct one. There is correspondingly no "encryption enabled" metric. Treat at-rest encryption as an available library capability with server wiring still planned.

CDC where_predicate is accepted but not enforced

The SubscribeRequest.where_predicate field is accepted by the subscription service, but the streaming loop does not yet apply it. Table and operation-type filtering are enforced. If you alert on subscription delivery, remember that SQL WHERE filtering is not active server-side yet.

The PostgreSQL wire protocol has no auth and no metrics

The pg-wire gateway (--pg-port, default 5433) supports simple queries only, runs in cleartext, and uses trust auth (no password). It is safe only on loopback, and it needs --pg-data-dir to execute real SQL. It is not represented in the Prometheus metrics and should not be exposed on a non-loopback --host.


See also