Monitoring¶
FoundationDB provides comprehensive built-in monitoring capabilities. This guide covers the status command, machine-readable status JSON, key metrics, and integration with external monitoring systems.
Status Command¶
The fdbcli status command provides human-readable cluster health information.
Basic Status¶
Run `status` in fdbcli. The output includes cluster configuration, health state, and key performance metrics.
Detailed Status¶
Run `status details` to see per-process information and detailed role assignments.
Minimal Status¶
Run `status minimal` to get only the cluster health state—useful for scripting.
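For scripting, the check can be wrapped by shelling out to fdbcli. A minimal sketch, assuming `fdbcli` is on the PATH and that a healthy cluster prints the usual "The database is available." line:

```python
import subprocess

def parse_minimal(output):
    """True when `status minimal` output reports the database as available."""
    return "The database is available." in output

def database_available(cluster_file="/etc/foundationdb/fdb.cluster"):
    """Run `fdbcli --exec 'status minimal'` and parse the result."""
    result = subprocess.run(
        ["fdbcli", "-C", cluster_file, "--exec", "status minimal"],
        capture_output=True, text=True, timeout=10,
    )
    return parse_minimal(result.stdout)
```

Keeping the parsing in its own function makes the availability check easy to unit-test without a running cluster.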
JSON Status¶
Run `status json` to get the machine-readable status in JSON format.
Example Status Output¶
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 5
Usable Regions - 1
Cluster:
FoundationDB processes - 15
Zones - 5
Machines - 5
Memory availability - 6.1 GB per process on machine with least available
Fault Tolerance - 2 machines
Server time - 02/03/25 09:32:01
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 234.5 GB
Disk space used - 456.2 GB
Operating space:
Storage server - 1.2 TB free on most full server
Log server - 967.3 GB free on most full server
Workload:
Read rate - 12543 Hz
Write rate - 3421 Hz
Transactions started - 8234 Hz
Transactions committed - 2156 Hz
Conflict rate - 12 Hz
Backup and DR:
Running backups - 1
Running DRs - 0
Machine-Readable Status¶
Access the complete cluster status programmatically using the special key \xFF\xFF/status/json.
Accessing via Client API¶
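A minimal sketch using the Python bindings (assumes the `foundationdb` package is installed and `fdb.api_version()` has been selected before `fdb.open()`):

```python
import json

def parse_status(raw):
    """Decode the JSON blob stored at the special status key."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)

def get_status(db):
    """Read cluster status through the client API."""
    import fdb  # pip install foundationdb

    @fdb.transactional
    def _read(tr):
        # \xff\xff/status/json lives outside the normal keyspace
        return parse_status(bytes(tr[b'\xff\xff/status/json']))

    return _read(db)
```

The helper returns a plain dict, so the health-check snippets below can be applied directly to its result.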
Key Status Fields¶
The JSON status contains these major sections:
| Section | Description |
|---|---|
| client | Client configuration and database status |
| cluster | Cluster-wide configuration and state |
| cluster.processes | Per-process details and roles |
| cluster.data | Data distribution and replication state |
| cluster.workload | Real-time performance metrics |
| cluster.qos | Quality of service metrics |
| cluster.latency_probe | Latency measurements |
| cluster.layers | Layer-specific status (backup, etc.) |
Checking Cluster Health¶
def is_cluster_healthy(status):
    """Check if cluster is operating normally."""
    return (
        status.get('client', {}).get('database_status', {}).get('healthy', False) and
        status.get('cluster', {}).get('data', {}).get('state', {}).get('healthy', False)
    )
Database Available States¶
| State | Description |
|---|---|
| available | Database is accepting reads and writes |
| read_only | Database only accepts reads (recovery mode) |
| unavailable | Database is not accepting connections |
Data State Values¶
| State | Meaning |
|---|---|
| healthy | All data replicated to desired level |
| healing | Recovering lost replicas |
| healthy_repartitioning | Healthy, redistributing data |
| healthy_removing_server | Healthy, removing excluded server |
| healthy_rebalancing | Healthy, balancing data across servers |
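Because all of the still-healthy variants in the table share the `healthy` prefix, a status check can treat them uniformly. A small sketch:

```python
def data_state_ok(status):
    """True when the data state is 'healthy' or one of its healthy_* variants."""
    name = (status.get("cluster", {})
                  .get("data", {})
                  .get("state", {})
                  .get("name", ""))
    return name.startswith("healthy")
```

States such as `healing` fail this check and should trigger closer inspection, even though the cluster is recovering automatically.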
Key Metrics¶
Performance Metrics¶
| Metric | Path in JSON | Target | Alert Threshold |
|---|---|---|---|
| Read rate | cluster.workload.operations.reads.hz | Varies | Baseline ±50% |
| Write rate | cluster.workload.operations.writes.hz | Varies | Baseline ±50% |
| Commit rate | cluster.workload.transactions.committed.hz | Varies | - |
| Conflict rate | cluster.workload.transactions.conflicted.hz | < 1% of commits | > 5% |
Latency Metrics¶
| Metric | Path in JSON | Target | Alert Threshold |
|---|---|---|---|
| Commit latency (p50) | cluster.latency_probe.commit_seconds | < 25ms | > 100ms |
| Read latency | cluster.latency_probe.read_seconds | < 5ms | > 50ms |
| Transaction start | cluster.latency_probe.transaction_start_seconds | < 5ms | > 25ms |
Capacity Metrics¶
| Metric | Path in JSON | Alert Threshold |
|---|---|---|
| Storage space free | cluster.data.total_disk_used_bytes vs capacity | < 20% free |
| Memory available | cluster.processes.*.memory.available_bytes | < 1GB per process |
| Moving data | cluster.data.moving_data.in_flight_bytes | Sustained > 10GB |
Server-Side Latency Bands¶
FoundationDB tracks latency distributions for read and commit operations. Access via:
cluster.latency_probe.batch_priority_transaction_start_seconds
cluster.latency_probe.immediate_priority_transaction_start_seconds
cluster.latency_probe.commit_seconds
cluster.latency_probe.read_seconds
The latency_statistics feature provides percentile breakdowns when enabled.
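The alert thresholds from the latency table above can be checked directly against the `cluster.latency_probe` fields; a minimal sketch:

```python
# Alert thresholds (seconds) mirroring the latency metrics table.
LATENCY_ALERTS = {
    "commit_seconds": 0.100,
    "read_seconds": 0.050,
    "transaction_start_seconds": 0.025,
}

def latency_warnings(status):
    """Return the names of latency probes exceeding their alert thresholds."""
    probe = status.get("cluster", {}).get("latency_probe", {})
    return [name for name, limit in LATENCY_ALERTS.items()
            if probe.get(name, 0.0) > limit]
```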
Process Monitoring¶
Process Roles¶
Each process reports its assigned roles:
| Role | Description |
|---|---|
| storage | Stores key-value data |
| log | Transaction log |
| commit_proxy | Handles commit requests |
| grv_proxy | Handles read version requests |
| resolver | Conflict resolution |
| master | Cluster coordination |
| cluster_controller | Manages role assignments |
| coordinator | Coordination service |
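To see how roles are spread across the fleet, the per-process `roles` lists in `cluster.processes` can be tallied. A sketch, assuming each role entry carries a `role` field as in the status JSON:

```python
from collections import Counter

def count_roles(status):
    """Tally role assignments across all processes in the cluster."""
    counts = Counter()
    for proc in status.get("cluster", {}).get("processes", {}).values():
        for role in proc.get("roles", []):
            counts[role.get("role", "unknown")] += 1
    return counts
```

An unexpected count (e.g. fewer storage servers than configured) is a quick signal that recruitment or exclusion is in progress.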
Process Health Indicators¶
def check_process_health(process):
    """Check individual process health."""
    issues = []
    # Check memory
    mem = process.get('memory', {})
    if mem.get('available_bytes', 0) < 1_000_000_000:  # 1GB
        issues.append('low_memory')
    # Check disk
    disk = process.get('disk', {})
    if disk.get('free_bytes', 0) < 10_000_000_000:  # 10GB
        issues.append('low_disk')
    # Check CPU (each fdbserver process is single-threaded,
    # so usage_cores is a fraction of a single core)
    cpu = process.get('cpu', {})
    if cpu.get('usage_cores', 0) > 0.95:
        issues.append('high_cpu')
    return issues
Excluded Servers¶
Monitor excluded servers that are being drained:
excluded = status.get('cluster', {}).get('excluded_servers', [])
for server in excluded:
    print(f"Excluded: {server['address']}")
Fault Tolerance¶
Current Fault Tolerance¶
Check how many failures the cluster can survive:
fault_tolerance = status['cluster']['fault_tolerance']
print(f"Can lose {fault_tolerance['max_zone_failures_without_losing_data']} zones")
print(f"Can lose {fault_tolerance['max_zone_failures_without_losing_availability']} zones and stay available")
Recovery State¶
| State | Description |
|---|---|
| fully_recovered | Normal operation |
| waiting_for_new_tlogs | Waiting for transaction log servers |
| accepting_commits | Recovery accepting new commits |
| all_logs_recruited | Logs assigned, finalizing recovery |
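The current state name can be read from `cluster.recovery_state.name`; a small sketch:

```python
def recovery_state(status):
    """Return the current recovery state name, e.g. 'fully_recovered'."""
    return (status.get("cluster", {})
                  .get("recovery_state", {})
                  .get("name", "unknown"))
```

Alerting on any value other than `fully_recovered` that persists for more than a few minutes is a reasonable starting point.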
Monitoring Backup Status¶
Access backup status through the layers section:
backup_status = status.get('cluster', {}).get('layers', {}).get('backup', {})
if backup_status:
    for tag, info in backup_status.get('tags', {}).items():
        print(f"Backup {tag}: {info.get('current_status')}")
        print(f"  Last restorable: {info.get('last_restorable_version')}")
Alerting Recommendations¶
Critical Alerts¶
| Condition | Check | Action |
|---|---|---|
| Database unavailable | !client.database_status.available | Page on-call |
| Data not fully replicated | !cluster.data.state.healthy | Investigate immediately |
| No fault tolerance | fault_tolerance.max_zone_failures_without_losing_data == 0 | Add capacity |
Warning Alerts¶
| Condition | Check | Action |
|---|---|---|
| High conflict rate | conflicted.hz / committed.hz > 0.05 | Review application logic |
| Low disk space | < 20% free on any server | Add storage or clean up |
| High latency | commit_seconds > 0.1 | Investigate workload |
| Sustained data movement | moving_data.in_flight_bytes > 10GB for 30min | Check for excluded servers |
Example Alerting Script¶
#!/usr/bin/env python3
import fdb
import json
import sys

fdb.api_version(730)
db = fdb.open()

@fdb.transactional
def get_status(tr):
    return json.loads(tr[b'\xff\xff/status/json'])

status = get_status(db)

# Check critical conditions
if not status['client']['database_status']['available']:
    print("CRITICAL: Database unavailable")
    sys.exit(2)

if not status['cluster']['data']['state']['healthy']:
    print("CRITICAL: Data not healthy - " + status['cluster']['data']['state']['name'])
    sys.exit(2)

# Check warnings
fault_tolerance = status['cluster']['fault_tolerance']['max_zone_failures_without_losing_data']
if fault_tolerance == 0:
    print("WARNING: No fault tolerance")
    sys.exit(1)

workload = status['cluster']['workload']['transactions']
if workload['committed']['hz'] > 0:
    conflict_rate = workload['conflicted']['hz'] / workload['committed']['hz']
    if conflict_rate > 0.05:
        print(f"WARNING: High conflict rate {conflict_rate:.1%}")
        sys.exit(1)

print("OK: Cluster healthy")
sys.exit(0)
Stable Cluster Health Metric¶
Status: Upcoming Feature
The Stable Cluster Health Metric is under active development with an open pull request. It is expected to land in a near-future release. The details below reflect the current design.
Overview¶
FoundationDB will provide a single scalar cluster health score ranging from 0 to 100 that summarizes overall cluster state. This metric is designed to make fleet-wide monitoring straightforward—operators can alert on a single value instead of encoding complex FoundationDB-specific logic in PromQL or custom scripts.
Health Score Scale¶
| Value | Level | Description |
|---|---|---|
| 100 | HEALTHY | Cluster is fully operational with no issues |
| 75 | SELF_HEALING | Cluster has detected an issue and is automatically recovering |
| 50 | INTERVENTION_REQUIRED | Operator action is needed to restore full health |
| 25 | CRITICAL_INTERVENTION_REQUIRED | Urgent operator action is needed; risk of data loss or extended outage |
| 0 | OUTAGE | Cluster is unavailable |
Contributing Factors¶
The health score is derived from multiple signals:
- Sev40 events — Severity 40 trace events indicate critical internal errors
- Recovery state — Whether the cluster is in or has recently completed recovery
- Data Distributor replication — Whether data is fully replicated to the desired redundancy level
- Disk space — Available storage across cluster processes
- Ratekeeper throttling — Whether the cluster is actively throttling transactions due to load
- Coordinator availability — Reachability of coordination servers
The score reflects the worst contributing factor, so a single critical issue will drive the overall value down even if all other signals are healthy.
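Since the feature has not yet shipped, the sketch below only illustrates the worst-factor rule described above; the factor names are invented for illustration and do not reflect a released API:

```python
# Hypothetical sketch: the level names mirror the proposed scale above.
LEVELS = [
    (100, "HEALTHY"),
    (75, "SELF_HEALING"),
    (50, "INTERVENTION_REQUIRED"),
    (25, "CRITICAL_INTERVENTION_REQUIRED"),
    (0, "OUTAGE"),
]

def overall_score(factor_scores):
    """The overall value is the minimum (worst) contributing factor."""
    return min(factor_scores.values(), default=100)

def level_name(score):
    """Map a score to its level name, e.g. 75 -> SELF_HEALING."""
    for threshold, name in LEVELS:
        if score >= threshold:
            return name
    return "OUTAGE"
```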
Accessing the Health Score¶
The cluster health metric will be available through two interfaces:
- Status JSON — Accessible via `\xFF\xFF/status/json` in the client API or `status json` in fdbcli, under a new field in the cluster status object
- Trace events — Emitted periodically in server trace files, allowing integration with log-based monitoring pipelines
Why This Matters¶
Today, determining cluster health requires combining multiple status fields and encoding FoundationDB-specific knowledge into monitoring queries. The health score eliminates this complexity:
# Before: complex multi-signal alerting
fdb_cluster_available == 0 or fdb_data_state_healthy == 0
or fdb_fault_tolerance_zones == 0 or ...
# After: single metric
fdb_cluster_health_score < 50
This is especially valuable for teams managing many FoundationDB clusters, where a single dashboard can show the health of every cluster at a glance.
Prometheus Integration¶
Available Exporters¶
Several community-maintained Prometheus exporters are available for FoundationDB:
| Exporter | Language | Maintainer | Status | Link |
|---|---|---|---|---|
| fdbexporter | Rust | Clever Cloud | Actively maintained, production use, v2.3.1 (Jan 2026), Docker images, supports FDB 7.1 & 7.3 | CleverCloud/fdbexporter |
| fdb-exporter | Go | Tigris Data | Community | tigrisdata/fdb-exporter |
| foundationdb-exporter | TypeScript | @aikoven | Community | aikoven/foundationdb-exporter |
| fdb-prometheus-exporter | Go | @PierreZ | Legacy / unmaintained | PierreZ/fdb-prometheus-exporter |
Using fdbexporter (Recommended)¶
The CleverCloud/fdbexporter is actively maintained and recommended for production use.
Docker Usage:
docker run -d \
--name fdbexporter \
-p 9090:9090 \
-e FDB_CLUSTER_FILE=/etc/foundationdb/fdb.cluster \
-v /etc/foundationdb:/etc/foundationdb:ro \
clevercloud/fdbexporter:2.3.1-7.3.69
Prometheus Configuration:
# prometheus.yml
scrape_configs:
- job_name: 'foundationdb'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 15s
Key Prometheus Metrics¶
| Metric | Type | Description |
|---|---|---|
| fdb_cluster_available | Gauge | 1 if cluster available |
| fdb_cluster_healthy | Gauge | 1 if cluster healthy |
| fdb_workload_reads_hz | Gauge | Read operations per second |
| fdb_workload_writes_hz | Gauge | Write operations per second |
| fdb_workload_commits_hz | Gauge | Commits per second |
| fdb_latency_commit_seconds | Gauge | Commit latency |
| fdb_storage_used_bytes | Gauge | Total storage used |
Native OpenTelemetry Metrics¶
Status: In Development
Native OpenTelemetry (OTel) metrics support is under active development in the FoundationDB open-source codebase but is not yet complete. The information below describes the current state and recommended workarounds.
Current State¶
FoundationDB's codebase contains preliminary support for emitting metrics via the OpenTelemetry protocol, but this functionality is not yet fully implemented or production-ready in the open-source builds. Key gaps include incomplete metric coverage and limited configuration options.
Recommended Workaround¶
Most teams currently obtain metrics by scraping FoundationDB's trace events through external exporters — such as the Prometheus exporters listed in the Prometheus Integration section above. These exporters parse the JSON status output or trace files and expose the data in a format compatible with standard monitoring stacks.
This approach is well-proven in production and remains the recommended path for teams that need metrics today.
For teams looking to export FoundationDB metrics via OpenTelemetry specifically, the community fdb-otel-exporter project provides a useful starting point. It tails FoundationDB trace logs and exports them as OTel metrics, making it a helpful reference for which metrics to track and how to structure OTel-based monitoring for FoundationDB clusters.
Future Benefits¶
Once native OTel metrics support is complete, FoundationDB will provide out-of-the-box metrics emission without requiring external tooling. This will be particularly valuable for:
- New FDB users who want monitoring with minimal setup
- Teams standardizing on OpenTelemetry across their infrastructure
- Environments where running sidecar exporters adds unwanted operational complexity
Native OTel support will enable direct integration with any OTel-compatible backend (such as Grafana, Datadog, or Jaeger) using FoundationDB's built-in instrumentation.
Grafana Dashboards¶
Recommended Dashboard Panels¶
- Cluster Health - Overall status indicator
- Throughput - Reads, writes, commits over time
- Latency - Commit and read latency percentiles
- Data Distribution - Storage per server, moving data
- Fault Tolerance - Current redundancy level
- Process Health - Memory, CPU, disk per process
Sample Grafana Query¶
# Commit latency p99
histogram_quantile(0.99, fdb_commit_latency_seconds_bucket)
# Conflict rate percentage
(fdb_workload_conflicted_hz / fdb_workload_committed_hz) * 100
Trace Files¶
FoundationDB servers write detailed trace files in XML format.
Trace File Location¶
| Platform | Default Path |
|---|---|
| Linux | /var/log/foundationdb/ |
| macOS | /usr/local/var/log/foundationdb/ |
Trace File Contents¶
Trace files contain:
- Transaction timing information
- Error events and stack traces
- Performance metrics
- Role transitions
- Network events
Analyzing Trace Files¶
# Find errors in trace files
grep -h "Severity=\"40\"" /var/log/foundationdb/trace*.xml
# Find warnings
grep -h "Severity=\"30\"" /var/log/foundationdb/trace*.xml
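The same filtering can be done in Python. The sketch below scans line by line rather than parsing the full XML tree, since an actively written trace file may not yet have a closing root tag:

```python
import re

SEVERITY_RE = re.compile(r'Severity="(\d+)"')

def high_severity_events(lines, min_severity=40):
    """Yield raw <Event .../> lines at or above the given severity."""
    for line in lines:
        if "<Event" not in line:
            continue
        m = SEVERITY_RE.search(line)
        if m and int(m.group(1)) >= min_severity:
            yield line
```

Pass `min_severity=30` to include warnings as well as errors.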
Next Steps¶
- Configure Backup & Recovery for data protection
- Review Troubleshooting for common issues
- See Configuration for cluster tuning