Monitoring¶
FoundationDB provides comprehensive built-in monitoring capabilities. This guide covers the status command, machine-readable status JSON, key metrics, and integration with external monitoring systems.
Status Command¶
The fdbcli status command provides human-readable cluster health information.
Basic Status¶
Run `status` in fdbcli. The output includes cluster configuration, health state, and key performance metrics.
Detailed Status¶
Run `status details` to see per-process information and detailed role assignments.
Minimal Status¶
Run `status minimal` to get only the cluster health state—useful for scripting.
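For scripting, the check can be wrapped by shelling out to fdbcli. A minimal sketch, assuming `fdbcli` is on the PATH and that a healthy cluster prints the usual "The database is available." line:

```python
import subprocess

def parse_minimal(output):
    """True when `status minimal` output reports the database as available."""
    return "The database is available." in output

def database_available(cluster_file="/etc/foundationdb/fdb.cluster"):
    """Run `fdbcli --exec 'status minimal'` and parse the result."""
    result = subprocess.run(
        ["fdbcli", "-C", cluster_file, "--exec", "status minimal"],
        capture_output=True, text=True, timeout=10,
    )
    return parse_minimal(result.stdout)
```

Keeping the parsing in its own function makes the availability check easy to unit-test without a running cluster.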
JSON Status¶
Run `status json` to get the machine-readable status in JSON format.
Example Status Output¶
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - triple
Storage engine - ssd-2
Coordinators - 5
Usable Regions - 1
Cluster:
FoundationDB processes - 15
Zones - 5
Machines - 5
Memory availability - 6.1 GB per process on machine with least available
Fault Tolerance - 2 machines
Server time - 02/03/25 09:32:01
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 234.5 GB
Disk space used - 456.2 GB
Operating space:
Storage server - 1.2 TB free on most full server
Log server - 967.3 GB free on most full server
Workload:
Read rate - 12543 Hz
Write rate - 3421 Hz
Transactions started - 8234 Hz
Transactions committed - 2156 Hz
Conflict rate - 12 Hz
Backup and DR:
Running backups - 1
Running DRs - 0
Machine-Readable Status¶
Access the complete cluster status programmatically using the special key \xFF\xFF/status/json.
Accessing via Client API¶
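A minimal sketch using the Python bindings (assumes the `foundationdb` package is installed and `fdb.api_version()` has been selected before `fdb.open()`):

```python
import json

def parse_status(raw):
    """Decode the JSON blob stored at the special status key."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)

def get_status(db):
    """Read cluster status through the client API."""
    import fdb  # pip install foundationdb

    @fdb.transactional
    def _read(tr):
        # \xff\xff/status/json lives outside the normal keyspace
        return parse_status(bytes(tr[b'\xff\xff/status/json']))

    return _read(db)
```

The helper returns a plain dict, so the health-check snippets below can be applied directly to its result.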
Key Status Fields¶
The JSON status contains these major sections:
| Section | Description |
|---|---|
| client | Client configuration and database status |
| cluster | Cluster-wide configuration and state |
| cluster.processes | Per-process details and roles |
| cluster.data | Data distribution and replication state |
| cluster.workload | Real-time performance metrics |
| cluster.qos | Quality of service metrics |
| cluster.latency_probe | Latency measurements |
| cluster.layers | Layer-specific status (backup, etc.) |
Checking Cluster Health¶
def is_cluster_healthy(status):
    """Check if cluster is operating normally."""
    return (
        status.get('client', {}).get('database_status', {}).get('healthy', False) and
        status.get('cluster', {}).get('data', {}).get('state', {}).get('healthy', False)
    )
Database Available States¶
| State | Description |
|---|---|
| available | Database is accepting reads and writes |
| read_only | Database only accepts reads (recovery mode) |
| unavailable | Database is not accepting connections |
Data State Values¶
| State | Meaning |
|---|---|
| healthy | All data replicated to desired level |
| healing | Recovering lost replicas |
| healthy_repartitioning | Healthy, redistributing data |
| healthy_removing_server | Healthy, removing excluded server |
| healthy_rebalancing | Healthy, balancing data across servers |
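Because all of the still-healthy variants in the table share the `healthy` prefix, a status check can treat them uniformly. A small sketch:

```python
def data_state_ok(status):
    """True when the data state is 'healthy' or one of its healthy_* variants."""
    name = (status.get("cluster", {})
                  .get("data", {})
                  .get("state", {})
                  .get("name", ""))
    return name.startswith("healthy")
```

States such as `healing` fail this check and should trigger closer inspection, even though the cluster is recovering automatically.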
Key Metrics¶
Performance Metrics¶
| Metric | Path in JSON | Target | Alert Threshold |
|---|---|---|---|
| Read rate | cluster.workload.operations.reads.hz | Varies | Baseline ±50% |
| Write rate | cluster.workload.operations.writes.hz | Varies | Baseline ±50% |
| Commit rate | cluster.workload.transactions.committed.hz | Varies | - |
| Conflict rate | cluster.workload.transactions.conflicted.hz | < 1% of commits | > 5% |
Latency Metrics¶
| Metric | Path in JSON | Target | Alert Threshold |
|---|---|---|---|
| Commit latency (p50) | cluster.latency_probe.commit_seconds | < 25ms | > 100ms |
| Read latency | cluster.latency_probe.read_seconds | < 5ms | > 50ms |
| Transaction start | cluster.latency_probe.transaction_start_seconds | < 5ms | > 25ms |
Capacity Metrics¶
| Metric | Path in JSON | Alert Threshold |
|---|---|---|
| Storage space free | cluster.data.total_disk_used_bytes vs capacity | < 20% free |
| Memory available | cluster.processes.*.memory.available_bytes | < 1GB per process |
| Moving data | cluster.data.moving_data.in_flight_bytes | Sustained > 10GB |
Server-Side Latency Bands¶
FoundationDB tracks latency distributions for read and commit operations. Access via:
cluster.latency_probe.batch_priority_transaction_start_seconds
cluster.latency_probe.immediate_priority_transaction_start_seconds
cluster.latency_probe.commit_seconds
cluster.latency_probe.read_seconds
The latency_statistics feature provides percentile breakdowns when enabled.
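The alert thresholds from the latency table above can be checked directly against the `cluster.latency_probe` fields; a minimal sketch:

```python
# Alert thresholds (seconds) mirroring the latency metrics table.
LATENCY_ALERTS = {
    "commit_seconds": 0.100,
    "read_seconds": 0.050,
    "transaction_start_seconds": 0.025,
}

def latency_warnings(status):
    """Return the names of latency probes exceeding their alert thresholds."""
    probe = status.get("cluster", {}).get("latency_probe", {})
    return [name for name, limit in LATENCY_ALERTS.items()
            if probe.get(name, 0.0) > limit]
```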
Process Monitoring¶
Process Roles¶
Each process reports its assigned roles:
| Role | Description |
|---|---|
| storage | Stores key-value data |
| log | Transaction log |
| commit_proxy | Handles commit requests |
| grv_proxy | Handles read version requests |
| resolver | Conflict resolution |
| master | Cluster coordination |
| cluster_controller | Manages role assignments |
| coordinator | Coordination service |
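To see how roles are spread across the fleet, the per-process `roles` lists in `cluster.processes` can be tallied. A sketch, assuming each role entry carries a `role` field as in the status JSON:

```python
from collections import Counter

def count_roles(status):
    """Tally role assignments across all processes in the cluster."""
    counts = Counter()
    for proc in status.get("cluster", {}).get("processes", {}).values():
        for role in proc.get("roles", []):
            counts[role.get("role", "unknown")] += 1
    return counts
```

An unexpected count (e.g. fewer storage servers than configured) is a quick signal that recruitment or exclusion is in progress.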
Process Health Indicators¶
def check_process_health(process):
    """Check individual process health."""
    issues = []
    # Check memory
    mem = process.get('memory', {})
    if mem.get('available_bytes', 0) < 1_000_000_000:  # 1GB
        issues.append('low_memory')
    # Check disk
    disk = process.get('disk', {})
    if disk.get('free_bytes', 0) < 10_000_000_000:  # 10GB
        issues.append('low_disk')
    # Check CPU (each fdbserver process is single-threaded,
    # so usage_cores is a fraction of a single core)
    cpu = process.get('cpu', {})
    if cpu.get('usage_cores', 0) > 0.95:
        issues.append('high_cpu')
    return issues
Excluded Servers¶
Monitor excluded servers that are being drained:
excluded = status.get('cluster', {}).get('excluded_servers', [])
for server in excluded:
    print(f"Excluded: {server['address']}")
Fault Tolerance¶
Current Fault Tolerance¶
Check how many failures the cluster can survive:
fault_tolerance = status['cluster']['fault_tolerance']
print(f"Can lose {fault_tolerance['max_zone_failures_without_losing_data']} zones")
print(f"Can lose {fault_tolerance['max_zone_failures_without_losing_availability']} zones and stay available")
Recovery State¶
| State | Description |
|---|---|
| fully_recovered | Normal operation |
| waiting_for_new_tlogs | Waiting for transaction log servers |
| accepting_commits | Recovery accepting new commits |
| all_logs_recruited | Logs assigned, finalizing recovery |
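The current state name can be read from `cluster.recovery_state.name`; a small sketch:

```python
def recovery_state(status):
    """Return the current recovery state name, e.g. 'fully_recovered'."""
    return (status.get("cluster", {})
                  .get("recovery_state", {})
                  .get("name", "unknown"))
```

Alerting on any value other than `fully_recovered` that persists for more than a few minutes is a reasonable starting point.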
Monitoring Backup Status¶
Access backup status through the layers section:
backup_status = status.get('cluster', {}).get('layers', {}).get('backup', {})
if backup_status:
    for tag, info in backup_status.get('tags', {}).items():
        print(f"Backup {tag}: {info.get('current_status')}")
        print(f"  Last restorable: {info.get('last_restorable_version')}")
Alerting Recommendations¶
Critical Alerts¶
| Condition | Check | Action |
|---|---|---|
| Database unavailable | !client.database_status.available | Page on-call |
| Data not fully replicated | !cluster.data.state.healthy | Investigate immediately |
| No fault tolerance | fault_tolerance.max_zone_failures_without_losing_data == 0 | Add capacity |
Warning Alerts¶
| Condition | Check | Action |
|---|---|---|
| High conflict rate | conflicted.hz / committed.hz > 0.05 | Review application logic |
| Low disk space | < 20% free on any server | Add storage or clean up |
| High latency | commit_seconds > 0.1 | Investigate workload |
| Sustained data movement | moving_data.in_flight_bytes > 10GB for 30min | Check for excluded servers |
Example Alerting Script¶
#!/usr/bin/env python3
import fdb
import json
import sys

fdb.api_version(730)
db = fdb.open()

@fdb.transactional
def get_status(tr):
    return json.loads(tr[b'\xff\xff/status/json'])

status = get_status(db)

# Check critical conditions
if not status['client']['database_status']['available']:
    print("CRITICAL: Database unavailable")
    sys.exit(2)

if not status['cluster']['data']['state']['healthy']:
    print("CRITICAL: Data not healthy - " + status['cluster']['data']['state']['name'])
    sys.exit(2)

# Check warnings
fault_tolerance = status['cluster']['fault_tolerance']['max_zone_failures_without_losing_data']
if fault_tolerance == 0:
    print("WARNING: No fault tolerance")
    sys.exit(1)

workload = status['cluster']['workload']['transactions']
if workload['committed']['hz'] > 0:
    conflict_rate = workload['conflicted']['hz'] / workload['committed']['hz']
    if conflict_rate > 0.05:
        print(f"WARNING: High conflict rate {conflict_rate:.1%}")
        sys.exit(1)

print("OK: Cluster healthy")
sys.exit(0)
Stable Cluster Health Metric¶
Status: Upcoming Feature
The Stable Cluster Health Metric is under active development with an open pull request. It is expected to land in a near-future release. The details below reflect the current design.
Overview¶
FoundationDB will provide a single scalar cluster health score ranging from 0 to 100 that summarizes overall cluster state. This metric is designed to make fleet-wide monitoring straightforward—operators can alert on a single value instead of encoding complex FoundationDB-specific logic in PromQL or custom scripts.
Health Score Scale¶
| Value | Level | Description |
|---|---|---|
| 100 | HEALTHY | Cluster is fully operational with no issues |
| 75 | SELF_HEALING | Cluster has detected an issue and is automatically recovering |
| 50 | INTERVENTION_REQUIRED | Operator action is needed to restore full health |
| 25 | CRITICAL_INTERVENTION_REQUIRED | Urgent operator action is needed; risk of data loss or extended outage |
| 0 | OUTAGE | Cluster is unavailable |
Contributing Factors¶
The health score is derived from multiple signals:
- Sev40 events — Severity 40 trace events indicate critical internal errors
- Recovery state — Whether the cluster is in or has recently completed recovery
- Data Distributor replication — Whether data is fully replicated to the desired redundancy level
- Disk space — Available storage across cluster processes
- Ratekeeper throttling — Whether the cluster is actively throttling transactions due to load
- Coordinator availability — Reachability of coordination servers
The score reflects the worst contributing factor, so a single critical issue will drive the overall value down even if all other signals are healthy.
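Since the feature has not yet shipped, the sketch below only illustrates the worst-factor rule described above; the factor names are invented for illustration and do not reflect a released API:

```python
# Hypothetical sketch: the level names mirror the proposed scale above.
LEVELS = [
    (100, "HEALTHY"),
    (75, "SELF_HEALING"),
    (50, "INTERVENTION_REQUIRED"),
    (25, "CRITICAL_INTERVENTION_REQUIRED"),
    (0, "OUTAGE"),
]

def overall_score(factor_scores):
    """The overall value is the minimum (worst) contributing factor."""
    return min(factor_scores.values(), default=100)

def level_name(score):
    """Map a score to its level name, e.g. 75 -> SELF_HEALING."""
    for threshold, name in LEVELS:
        if score >= threshold:
            return name
    return "OUTAGE"
```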
Accessing the Health Score¶
The cluster health metric will be available through two interfaces:
- Status JSON — Accessible via `\xFF\xFF/status/json` in the client API or `status json` in fdbcli, under a new field in the cluster status object
- Trace events — Emitted periodically in server trace files, allowing integration with log-based monitoring pipelines
Why This Matters¶
Today, determining cluster health requires combining multiple status fields and encoding FoundationDB-specific knowledge into monitoring queries. The health score eliminates this complexity:
# Before: complex multi-signal alerting
fdb_cluster_available == 0 or fdb_data_state_healthy == 0
or fdb_fault_tolerance_zones == 0 or ...
# After: single metric
fdb_cluster_health_score < 50
This is especially valuable for teams managing many FoundationDB clusters, where a single dashboard can show the health of every cluster at a glance.
Prometheus Integration¶
Available Exporters¶
Several community-maintained Prometheus exporters are available for FoundationDB:
| Exporter | Language | Maintainer | Status | Link |
|---|---|---|---|---|
| fdbexporter | Rust | Clever Cloud | Actively maintained, production use, v2.3.1 (Jan 2026), Docker images, supports FDB 7.1 & 7.3 | CleverCloud/fdbexporter |
| fdb-exporter | Go | Tigris Data | Community | tigrisdata/fdb-exporter |
| foundationdb-exporter | TypeScript | @aikoven | Community | aikoven/foundationdb-exporter |
| fdb-prometheus-exporter | Go | @PierreZ | Legacy / unmaintained | PierreZ/fdb-prometheus-exporter |
Using fdbexporter (Recommended)¶
The CleverCloud/fdbexporter is actively maintained and recommended for production use.
Docker Usage:
docker run -d \
--name fdbexporter \
-p 9090:9090 \
-e FDB_CLUSTER_FILE=/etc/foundationdb/fdb.cluster \
-v /etc/foundationdb:/etc/foundationdb:ro \
clevercloud/fdbexporter:2.3.1-7.3.69
Prometheus Configuration:
# prometheus.yml
scrape_configs:
- job_name: 'foundationdb'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 15s
Key Prometheus Metrics¶
| Metric | Type | Description |
|---|---|---|
| fdb_cluster_available | Gauge | 1 if cluster available |
| fdb_cluster_healthy | Gauge | 1 if cluster healthy |
| fdb_workload_reads_hz | Gauge | Read operations per second |
| fdb_workload_writes_hz | Gauge | Write operations per second |
| fdb_workload_commits_hz | Gauge | Commits per second |
| fdb_latency_commit_seconds | Gauge | Commit latency |
| fdb_storage_used_bytes | Gauge | Total storage used |
Native OpenTelemetry Metrics¶
Status: In Development
Native OpenTelemetry (OTel) metrics support is under active development in the FoundationDB open-source codebase but is not yet complete. The information below describes the current state and recommended workarounds.
Current State¶
FoundationDB's codebase contains preliminary support for emitting metrics via the OpenTelemetry protocol, but this functionality is not yet fully implemented or production-ready in the open-source builds. Key gaps include incomplete metric coverage and limited configuration options.
Recommended Workaround¶
Most teams currently obtain metrics by scraping FoundationDB's trace events through external exporters — such as the Prometheus exporters listed in the Prometheus Integration section above. These exporters parse the JSON status output or trace files and expose the data in a format compatible with standard monitoring stacks.
This approach is well-proven in production and remains the recommended path for teams that need metrics today.
For teams looking to export FoundationDB metrics via OpenTelemetry specifically, the community fdb-otel-exporter project provides a useful starting point. It tails FoundationDB trace logs and exports them as OTel metrics, making it a helpful reference for which metrics to track and how to structure OTel-based monitoring for FoundationDB clusters.
Future Benefits¶
Once native OTel metrics support is complete, FoundationDB will provide out-of-the-box metrics emission without requiring external tooling. This will be particularly valuable for:
- New FDB users who want monitoring with minimal setup
- Teams standardizing on OpenTelemetry across their infrastructure
- Environments where running sidecar exporters adds unwanted operational complexity
Native OTel support will enable direct integration with any OTel-compatible backend (such as Grafana, Datadog, or Jaeger) using FoundationDB's built-in instrumentation.
Grafana Dashboards¶
Recommended Dashboard Panels¶
- Cluster Health - Overall status indicator
- Throughput - Reads, writes, commits over time
- Latency - Commit and read latency percentiles
- Data Distribution - Storage per server, moving data
- Fault Tolerance - Current redundancy level
- Process Health - Memory, CPU, disk per process
Sample Grafana Query¶
# Commit latency p99
histogram_quantile(0.99, fdb_commit_latency_seconds_bucket)
# Conflict rate percentage
(fdb_workload_conflicted_hz / fdb_workload_committed_hz) * 100
Trace Files¶
FoundationDB servers write detailed trace files in XML format.
Trace File Location¶
| Platform | Default Path |
|---|---|
| Linux | /var/log/foundationdb/ |
| macOS | /usr/local/var/log/foundationdb/ |
Trace File Contents¶
Trace files contain:
- Transaction timing information
- Error events and stack traces
- Performance metrics
- Role transitions
- Network events
Analyzing Trace Files¶
# Find errors in trace files
grep -h "Severity=\"40\"" /var/log/foundationdb/trace*.xml
# Find warnings
grep -h "Severity=\"30\"" /var/log/foundationdb/trace*.xml
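The same filtering can be done in Python. The sketch below scans line by line rather than parsing the full XML tree, since an actively written trace file may not yet have a closing root tag:

```python
import re

SEVERITY_RE = re.compile(r'Severity="(\d+)"')

def high_severity_events(lines, min_severity=40):
    """Yield raw <Event .../> lines at or above the given severity."""
    for line in lines:
        if "<Event" not in line:
            continue
        m = SEVERITY_RE.search(line)
        if m and int(m.group(1)) >= min_severity:
            yield line
```

Pass `min_severity=30` to include warnings as well as errors.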
Next Steps¶
- Configure Backup & Recovery for data protection
- Review Troubleshooting for common issues
- See Configuration for cluster tuning