Observability & Metrics
This guide covers how to observe a running sochdb-grpc-server: the
Prometheus metrics endpoint and its exported series, scraping with Prometheus,
dashboards with Grafana, and the gRPC health service used by Kubernetes probes.
This page targets core engine 2.0.3 (the sochdb-grpc-server binary).
The language SDKs version independently (Python 0.5.9, Node.js 0.5.3, Go 0.4.5).
For setting log levels (RUST_LOG), the --debug flag, and audit logging, see
How to Configure Logging. For running the server in
production (Docker images, Helm, sizing), see
Deploying to Production.
The metrics endpoint
The server exposes a Prometheus-format metrics endpoint over plain HTTP. It runs
on a dedicated OS thread (sochdb-metrics-http), separate from the gRPC
service, so scraping never contends with the tokio runtime that serves queries.
Control it with --metrics-port:
# Metrics on the default port 9090
sochdb-grpc-server --host 0.0.0.0 --port 50051
# Move metrics to a custom port
sochdb-grpc-server --metrics-port 9100
# Disable the metrics endpoint entirely
sochdb-grpc-server --metrics-port 0
| Flag | Default | Purpose |
|---|---|---|
| --metrics-port | 9090 | Prometheus HTTP port. 0 disables the endpoint. |
The HTTP listener binds 0.0.0.0:<metrics-port> and serves exactly two routes
(anything else returns 404):
| Route | Response |
|---|---|
| GET /metrics | Prometheus text exposition (format version=0.0.4) |
| GET /health | 200 OK with body OK |
Quick check once the server is up:
curl -s http://127.0.0.1:9090/health
# -> OK
curl -s http://127.0.0.1:9090/metrics | head -n 30
/metrics and /health are served by a plain HTTP listener that is not
behind --auth. This is intentional, so in-cluster Prometheus and probes can
reach it without credentials. Keep the metrics port on a trusted network or
behind your ingress — do not expose it to the public internet.
Exported metrics
The series below come from the server's metrics registry. HNSW index metrics
are also auto-included via the default Prometheus registry, so you will see
additional hnsw_* series depending on workload.
gRPC
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| sochdb_grpc_requests_total | counter | service, method | Total gRPC requests handled |
| sochdb_grpc_errors_total | counter | service, method, code | Requests that returned an error, by status code |
| sochdb_grpc_request_duration_seconds | histogram | — | Request latency distribution |
| sochdb_grpc_active_connections | gauge | — | Currently open gRPC connections |
SQL and transactions
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| sochdb_sql_queries_total | counter | statement_type | SQL statements executed, by kind (e.g. select, insert) |
| sochdb_sql_query_duration_seconds | histogram | — | SQL query latency distribution |
| sochdb_transactions_total | counter | outcome | Transactions, by outcome (e.g. commit / abort) |
Storage and WAL
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| sochdb_tables_count | gauge | — | Number of tables |
| sochdb_storage_bytes | gauge | — | On-disk storage size in bytes |
| sochdb_wal_bytes | gauge | — | Write-ahead-log size in bytes |
| sochdb_wal_writes_total | counter | — | WAL append operations |
| sochdb_wal_fsync_total | counter | — | WAL fsync calls |
| sochdb_cache_operations_total | counter | result | Cache operations, by result (e.g. hit / miss) |
Process
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| sochdb_uptime_seconds | gauge | — | Seconds since the server started |
| sochdb_build_info | gauge | version, rustc | Build metadata; value is always 1 |
Here is what a slice of /metrics looks like:
# HELP sochdb_grpc_requests_total Total gRPC requests handled
# TYPE sochdb_grpc_requests_total counter
sochdb_grpc_requests_total{service="KvService",method="Get"} 1842
sochdb_grpc_requests_total{service="VectorIndexService",method="Search"} 537
# HELP sochdb_build_info Build information
# TYPE sochdb_build_info gauge
sochdb_build_info{version="2.0.3",rustc="1.91.0"} 1
# HELP sochdb_uptime_seconds Seconds since server start
# TYPE sochdb_uptime_seconds gauge
sochdb_uptime_seconds 3641
sochdb_build_info carries a version label (e.g. 2.0.3). Use it to verify
exactly which engine build is serving traffic — handy after a rolling upgrade.
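The version check described above can be scripted. The following is a minimal sketch, not the project's tooling: `parse_samples` and `engine_versions` are hypothetical helper names, and the regex-based parser is deliberately simplified (it does not unescape label values or handle every corner of the exposition format). For anything beyond a quick check, a maintained parser such as the one in the `prometheus_client` package is the safer choice.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escapes are kept as-is).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line in exposition text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        yield name, labels, float(value)


def engine_versions(text):
    """Return the set of version labels seen on sochdb_build_info."""
    return {
        labels["version"]
        for name, labels, _ in parse_samples(text)
        if name == "sochdb_build_info" and "version" in labels
    }


# Sample text copied from the /metrics slice shown in this guide.
SAMPLE = '''\
# HELP sochdb_build_info Build information
# TYPE sochdb_build_info gauge
sochdb_build_info{version="2.0.3",rustc="1.91.0"} 1
sochdb_uptime_seconds 3641
'''

print(engine_versions(SAMPLE))  # -> {'2.0.3'}
```

During a rolling upgrade, running `engine_versions` against each pod's `/metrics` body and comparing the resulting sets is one way to confirm that every replica reports the expected build.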
Scraping with Prometheus
Add a scrape job pointing at the metrics port:
scrape_configs:
  - job_name: "sochdb"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["sochdb:9090"]
In Kubernetes, target the pod's metrics port (the Helm chart names this
container port metrics, default 9090, path /metrics). The chart also ships
a servicemonitor.yaml template for the Prometheus Operator, disabled by
default — enable it under observability.metrics if you run the operator.
Serving /metrics is cheap, but very high scrape frequencies still cost CPU to
encode the registry. A 10–15s interval is typical and more than enough for
alerting and dashboards.
Useful PromQL
# gRPC request rate per method
sum by (method) (rate(sochdb_grpc_requests_total[5m]))
# gRPC error ratio (errors / requests)
sum(rate(sochdb_grpc_errors_total[5m]))
/ sum(rate(sochdb_grpc_requests_total[5m]))
# p99 gRPC latency
histogram_quantile(
  0.99,
  sum by (le) (rate(sochdb_grpc_request_duration_seconds_bucket[5m]))
)
# WAL fsync rate (durability pressure)
rate(sochdb_wal_fsync_total[5m])
# Cache hit ratio
sum(rate(sochdb_cache_operations_total{result="hit"}[5m]))
/ sum(rate(sochdb_cache_operations_total[5m]))
# Detect a restart (uptime resets toward zero)
sochdb_uptime_seconds
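To make the p99 query above less of a black box, here is a simplified Python sketch of the linear interpolation that Prometheus documents for `histogram_quantile`. It is an illustration under stated assumptions, not Prometheus's implementation: the function is a hypothetical re-implementation, it assumes positive bucket bounds (as with latencies), and the bucket counts are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket; counts are cumulative, as in
    Prometheus `_bucket` series.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no meaningful quantile
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # Rank falls in +Inf: report the highest finite upper bound.
        return buckets[-2][0]
    upper, count = buckets[index]
    # Assumption: the first bucket starts at 0 (true for latency histograms).
    lower, below = (0.0, 0) if index == 0 else buckets[index - 1]
    if count == below:
        return upper
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Invented cumulative counts for illustration (bounds in seconds).
BUCKETS = [(0.005, 10), (0.01, 50), (0.1, 90), (1.0, 99), (math.inf, 100)]

for q in (0.5, 0.95, 0.99):
    print(q, round(histogram_quantile(q, BUCKETS), 3))
```

The practical takeaway is that the estimate can never be more precise than the bucket boundaries: a p99 that lands in a wide bucket is interpolated across that whole bucket, so read dashboard values as approximate.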
Grafana via the monitoring profile
The repository's docker-compose.yml includes a monitoring profile that
starts Prometheus and Grafana alongside the server. Bring it up with:
docker compose --profile monitoring up
This runs three services:
| Service | Port | Notes |
|---|---|---|
| sochdb | 50051 | The gRPC server (metrics on 9090) |
| Prometheus | 9090 | Scrapes the sochdb service |
| Grafana | 3000 | Default credentials admin / sochdb |
Open Grafana at http://localhost:3000, log in, and add Prometheus
(http://prometheus:9090 inside the compose network) as a data source if it is
not already provisioned. Build panels from the metrics above — request rate,
error ratio, p99 latency, WAL fsync rate, and storage growth are good starting
panels.
The same docker-compose.yml defines a web profile (Envoy gRPC-Web on 8080,
admin 9901) and a dev profile (debug build on 50052). Profiles are
additive; combine them as needed. See
Deploying to Production for the full profile list.
For a standalone Kubernetes monitoring stack, the repo also provides
deploy/k8s-monitoring/ with prometheus.yaml, grafana.yaml, and
namespace.yaml.
Health checks and Kubernetes probes
SochDB exposes health in two distinct places. Pick the one that matches the layer you are probing.
gRPC health service (for K8s and load balancers)
The server mounts the standard gRPC health checking protocol on the gRPC
port (50051 by default). The empty service name "" is set to SERVING, and
this health service is not behind --auth — so liveness/readiness probes
work even when authentication is enabled.
The official grpc_health_probe tool is the canonical client. It is baked into
the published Docker image, and the container HEALTHCHECK uses it:
grpc_health_probe -addr=:50051
# -> status: SERVING
In Kubernetes you can use a native gRPC probe (Kubernetes 1.24+):
livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 30
readinessProbe:
  grpc:
    port: 50051
  periodSeconds: 5
The bundled Helm chart's default startup/readiness/liveness probes use a
TCP socket on 50051, not the gRPC health protocol. The startup probe is
intentionally long (failureThreshold 60 × 10s = up to 10 minutes) to tolerate
WAL replay and migrations during the Boot FSM. The minikube preset switches
probes to an HTTP GET /metrics on port 9090 to avoid HTTP/2 probe noise.
See Kubernetes health probes
in the deployment guide.
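The startup budget quoted above follows from how Kubernetes evaluates a startup probe: roughly `initialDelaySeconds + failureThreshold × periodSeconds` before the container is restarted. A small sketch of that arithmetic, using a hypothetical helper name and assuming no initial delay:

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a startup probe tolerates before the container restarts."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values from the chart defaults described above: 60 failures, 10 s apart.
budget = startup_budget_seconds(failure_threshold=60, period_seconds=10)
print(f"{budget} s = {budget / 60:.0f} min")  # -> 600 s = 10 min
```

If WAL replay on your data set regularly approaches this budget, raising `failureThreshold` is usually a gentler change than lengthening the period, because it keeps probe granularity the same.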
HTTP /health (for the metrics layer / simple checks)
The metrics listener's GET /health (default port 9090) returns 200 OK.
It is the simplest liveness signal for tooling that only speaks HTTP, and it is
what the minikube Helm preset probes. It reflects that the process is up and
the metrics thread is serving — it does not run a deep storage check.
curl -fsS http://127.0.0.1:9090/health && echo " healthy"
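For tooling that is written in Python rather than shell, the same check can be sketched with the standard library. This is a minimal example, not an official client: `is_healthy` is a hypothetical helper name, and the base URL in the usage note assumes the default metrics port.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers 200 within the timeout."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False
```

Calling `is_healthy("http://127.0.0.1:9090")` mirrors the `curl` command above. As with that command, a `True` result only shows that the process is up and the metrics thread is serving; it is not a deep storage check.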
What is and isn't covered by metrics
A few server capabilities do not yet surface as runtime-toggled, metered features. Set expectations accordingly when building dashboards and alerts.
The engine ships EncryptionEngine (AES-256-GCM-SIV), but there is no CLI
flag on sochdb-grpc-server to enable at-rest encryption, and main.rs
does not construct one. There is correspondingly no "encryption enabled" metric.
Treat at-rest encryption as an available library capability with server wiring
still planned.
where_predicate is accepted but not enforced
The SubscribeRequest.where_predicate field is accepted by the subscription
service, but the streaming loop does not yet apply it. Table and
operation-type filtering are enforced. If you alert on subscription
delivery, remember that SQL WHERE filtering is not active server-side yet.
The pg-wire gateway (--pg-port, default 5433) supports simple queries only,
runs in cleartext, and uses trust authentication (no password). It is safe
only on loopback, and it needs --pg-data-dir to execute real SQL. It is not
represented in the Prometheus metrics and should not be exposed on a
non-loopback --host.
See also
- How to Configure Logging: RUST_LOG, --debug, audit logging, and the metrics endpoint basics
- Deploying to Production: server binary, ports, Docker, Helm, and Kubernetes health probes
- Performance Guide: tuning and capacity planning